From vlad at lists.openfabrics.org  Sat Aug  1 02:58:11 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  1 Aug 2009 02:58:11 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090801-0200 daily build status
Message-ID: <20090801095812.32630E61B3C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats'
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats':
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-128.el5
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats'
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats':
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-128.el5'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-93.el5
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats'
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats':
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-93.el5'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:375: warning: pointer targets in assignment differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats':
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:375: warning: pointer targets in assignment differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats':
/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From yossi.openib at gmail.com  Sat Aug  1 05:01:24 2009
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Sat, 01 Aug 2009 15:01:24 +0300
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>
References: <4A6DDFCE.9060009@voltaire.com>
	<4A70154F.7080300@gmail.com>	<f0e08f230907290330j777bb2f9j4063d497e66e305d@mail.gmail.com>	<4A703DA4.9080300@Voltaire.COM>	<f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>	<4A705B3A.7060404@Voltaire.COM>	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>	<4A731818.3060500@voltaire.com>	<f0e08f230907311050wa750cf2n497039acafdab3b4@mail.gmail.com>	<4A733D24.3040201@voltaire.com>
	<f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>
Message-ID: <4A742E94.2070002@gmail.com>

Hal Rosenstock wrote:
> 
>     Yes, but AFAIK the only "bad" case is if the LID stays the same but
>     LMC changes to a lower
>     value. In this case the path refresh will not happen when it is
>     supposed to.
> 
>  
> What's the impact of that ?
>  
> Also the LID can change at the same time as the LMC.
>  
> I can't tell if all the possible cases are handled properly. Are they ?
>  

Let's see:
Only LID changes - handled correctly.
LMC (and possibly LID) change - either we "catch" this, or we don't.
If we do, the path and LMC will be refreshed, so we will not keep refreshing
the path forever (as could happen if we didn't refresh the LMC).
If we don't - ipoib packets will not reach the neighbour, which is the same
situation as today.
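
To make the cases above concrete, here is a minimal sketch of the range
check they imply. It is not the patch itself: struct cached_path and both
function names are hypothetical, and the only fact relied on is that a port
with LMC n answers to the 2^n LIDs sharing its base LID's upper bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached view of a remote port; not the ipoib path struct. */
struct cached_path {
	uint16_t base_lid;	/* base LID recorded when the path was resolved */
	uint8_t lmc;		/* LMC recorded at the same time (0..7) */
};

/* True if lid falls in the 2^lmc LID range recorded in the cache. */
static bool lid_in_cached_range(const struct cached_path *p, uint16_t lid)
{
	uint16_t mask = (uint16_t)((1u << p->lmc) - 1u);
	uint16_t keep = (uint16_t)~mask;

	return (uint16_t)(lid & keep) == (uint16_t)(p->base_lid & keep);
}

/* A LID outside the cached range is the trigger for a path refresh. */
static bool path_needs_refresh(const struct cached_path *p, uint16_t seen_lid)
{
	return !lid_in_cached_range(p, seen_lid);
}
```

With a cached base LID of 0x10 and LMC 2, LIDs 0x10..0x13 look current and
0x14 triggers a refresh. If the remote LMC drops to 0 while the base LID
stays 0x10, a stale cache still accepts 0x10, which is the uncaught case.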


From vlad at lists.openfabrics.org  Sun Aug  2 03:00:55 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  2 Aug 2009 03:00:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090802-0200 daily build status
Message-ID: <20090802100055.BEDAAE61D78@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg':
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink'
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg':
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink'
/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Sun Aug  2 03:07:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:07:50 +0300
Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [4/6] fix libibnetdisc
 API consistency and bugs
In-Reply-To: <1248714771.16723.327.camel@auk31.llnl.gov>
References: <1248714771.16723.327.camel@auk31.llnl.gov>
Message-ID: <20090802100750.GA5287@me>

On 10:12 Mon 27 Jul     , Al Chu wrote:
> Make api more consistent and make struct ibnd_fabric a struct that
> represents just fabric data by removing the ibmad_port and making it a
> function parameter in appropriate functions.
> 
> Al
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From: Albert Chu <chu11 at llnl.gov>
> Date: Thu, 23 Jul 2009 14:14:57 -0700
> Subject: [PATCH] Make api more consistent and make struct ibnd_fabric a struct that represents just fabric data by removing the ibmad_port and making it a function parameter in appropriate functions.
> 
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 03:08:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:08:41 +0300
Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [5/6] fix libibnetdisc
 API consistency and bugs
In-Reply-To: <1248714779.16723.328.camel@auk31.llnl.gov>
References: <1248714779.16723.328.camel@auk31.llnl.gov>
Message-ID: <20090802100841.GB5287@me>

On 10:12 Mon 27 Jul     , Al Chu wrote:
> Check input parameters to libibnetdisc functions
> 
> Al
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From: Albert Chu <chu11 at llnl.gov>
> Date: Thu, 23 Jul 2009 14:15:25 -0700
> Subject: [PATCH] Check input parameters to libibnetdisc functions
> 
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 03:08:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:08:47 +0300
Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [6/6] fix libibnetdisc
 API consistency and bugs
In-Reply-To: <1248714781.16723.329.camel@auk31.llnl.gov>
References: <1248714781.16723.329.camel@auk31.llnl.gov>
Message-ID: <20090802100847.GC5287@me>

On 10:13 Mon 27 Jul     , Al Chu wrote:
> Remove timeout_ms parameter to ibnd_discover_fabric, timeout parameter
> should be specified via the ibmad_port.  Remove extraneous use of global
> timeout_ms in library.  Adjust ibnetdiscover, ibqueryerrors, iblinkinfo,
> and test code appropriately for adjustment.
> 
> Al
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From: Albert Chu <chu11 at llnl.gov>
> Date: Thu, 23 Jul 2009 14:16:14 -0700
> Subject: [PATCH] Remove timeout_ms parameter to ibnd_discover_fabric, timeout parameter should be specified via the ibmad_port.  Remove extraneous use of global timeout_ms in library.  Adjust ibnetdiscover, ibqueryerrors, iblinkinfo, and test code appropriately for adjustment.
> 
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 03:09:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:09:01 +0300
Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [0/6] fix libibnetdisc
 API consistency and bugs
In-Reply-To: <1248714723.16723.322.camel@auk31.llnl.gov>
References: <1248714723.16723.322.camel@auk31.llnl.gov>
Message-ID: <20090802100901.GD5287@me>

Hi Al,

On 10:12 Mon 27 Jul     , Al Chu wrote:
> 
> This is a redo of my previous patch series.  Ira or myself will instead
> write 1 big patch later on to make a lot of the structs more public.
> These are the patches to fix bugs and/or make things more consistent for
> what's already there.

I applied this series. Next time you send a patch series, please use a
different subject for each email message, so that the patch subjects
are more descriptive. You can look at
/usr/src/linux/Documentation/SubmittingPatches for more details about
the desirable patch and patch series format.

Sasha


From sashak at voltaire.com  Sun Aug  2 03:09:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:09:40 +0300
Subject: [ofa-general] [PATCHv2] opensm/mesh/lash: Fix use after free
	problem in osm_mesh_node_delete
In-Reply-To: <20090731135147.GA10365@comcast.net>
References: <20090731135147.GA10365@comcast.net>
Message-ID: <20090802100940.GE5287@me>

Hi Hal,

On 09:51 Fri 31 Jul     , Hal Rosenstock wrote:
> 
> When osm_mesh_node_delete is called, osm_switch_delete may already have
> been called so sw->p_sw is no longer valid to be used although it was
> being used to obtain num_ports.
> 
> Fix this by performing osm_mesh_delete_switches at the end of lash_process.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Rather than saving num_ports in the mesh node structure on creation and using
> this on deletion, mesh switches deletion should occur at end of the lash
> calculation as none of this state is needed after that 
> Approach proposed by Sasha
> 
> diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
> index 173fa86..89c07e5 100644
> --- a/opensm/include/opensm/osm_mesh.h
> +++ b/opensm/include/opensm/osm_mesh.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2088      System Fabric Works, Inc.
> + * Copyright (c) 2009      HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -70,6 +71,7 @@ typedef struct _mesh_node {
>  } mesh_node_t;
>  
>  void osm_mesh_node_delete(struct _lash *p_lash, struct _switch *sw);
> +void osm_mesh_delete_switches(struct _lash *p_lash);
>  int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw);
>  int osm_do_mesh_analysis(struct _lash *p_lash);
>  
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 23fad87..b22fe6e 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2008,2009      System Fabric Works, Inc. All rights reserved.
> + * Copyright (c) 2009           HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -1358,6 +1359,20 @@ void osm_mesh_node_delete(lash_t *p_lash, switch_t *sw)
>  }
>  
>  /*
> + * osm_mesh_delete_switches - cleanup switches resources
> + */
> +void osm_mesh_delete_switches(lash_t *p_lash)
> +{
> +	if (p_lash->switches) {
> +		unsigned id;
> +		for (id = 0; ((int)id) < p_lash->num_switches; id++)
> +			if (p_lash->switches[id])
> +				osm_mesh_node_delete(p_lash,
> +						     p_lash->switches[id]);
> +	}
> +}

Why should it be in osm_mesh.c? osm_mesh_node_create() and
osm_mesh_node_delete() are called in osm_ucast_lash.c now.

To me it looks like a more appropriate place for such cleanup is the
free_lash_structures() function in osm_ucast_lash.c.

Sasha


From sashak at voltaire.com  Sun Aug  2 03:32:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:32:56 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return
 success/failure status
In-Reply-To: <20090731135316.GB10365@comcast.net>
References: <20090731135316.GB10365@comcast.net>
Message-ID: <20090802103256.GF5287@me>

Hi Hal,

On 09:53 Fri 31 Jul     , Hal Rosenstock wrote:
> 
> based on valid/invalid hop count rather than relying on debug assert

When could an invalid hop count be passed to this function? And what
could happen?

ib_smp_init_new() is a simple structure fill-up helper (inlined and
defined in header file) and I don't think that we need to check
parameters there.

This patch also introduces a sort of inconsistency - the hop count is
checked and the other parameters aren't.

> Handle invalid status appropriately in callers of ib_smp_init_new
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
> index beb7492..6668d96 100644
> --- a/opensm/include/iba/ib_types.h
> +++ b/opensm/include/iba/ib_types.h
> @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN const ib_smp_t * const p_smp)
>  *
>  * TODO
>  *	This is too big for inlining, but leave it here for now
> -*	since there is not yet another convient spot.
> +*	since there is not yet another convenient spot.
>  *
>  * SYNOPSIS
>  */
> -static inline void OSM_API
> +static inline boolean_t OSM_API
>  ib_smp_init_new(IN ib_smp_t * const p_smp,
>  		IN const uint8_t method,
>  		IN const ib_net64_t trans_id,
> @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
>  		IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid)
>  {
>  	CL_ASSERT(p_smp);
> -	CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX);
> +
> +	if (hop_count >= IB_SUBNET_PATH_HOPS_MAX)
> +		return FALSE;
>  	p_smp->base_ver = 1;
>  	p_smp->mgmt_class = IB_MCLASS_SUBN_DIR;
>  	p_smp->class_ver = 1;
> @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
>  
>  	/* copy the path */
>  	memcpy(&p_smp->initial_path, path_out, sizeof(p_smp->initial_path));
> +	return TRUE;
>  }
>  
>  /*
> diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c
> index be9a92b..7934173 100644
> --- a/opensm/opensm/osm_req.c
> +++ b/opensm/opensm/osm_req.c
> @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm,
>  		ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id),
>  		cl_ntoh32(attr_mod), cl_ntoh64(tid));
>  
> -	ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
> -			IB_MAD_METHOD_GET,
> -			tid,
> -			attr_id,
> -			attr_mod,
> -			p_path->hop_count,
> -			sm->p_subn->opt.m_key,
> -			p_path->path, IB_LID_PERMISSIVE, IB_LID_PERMISSIVE);
> +	if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
> +			     IB_MAD_METHOD_GET,
> +			     tid,
> +			     attr_id,
> +			     attr_mod,
> +			     p_path->hop_count,
> +			     sm->p_subn->opt.m_key,
> +			     p_path->path,
> +			     IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) {
> +		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: "
> +			"ib_smp_init_new failed: hop count %d\n",
> +			p_path->hop_count);

This is an assumption about how ib_smp_init_new() is actually implemented -
not perfect.

Sasha


From hal.rosenstock at gmail.com  Sun Aug  2 03:50:56 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 06:50:56 -0400
Subject: [ofa-general] [PATCHv2] opensm/mesh/lash: Fix use after free 
	problem in osm_mesh_node_delete
In-Reply-To: <20090802100940.GE5287@me>
References: <20090731135147.GA10365@comcast.net> <20090802100940.GE5287@me>
Message-ID: <f0e08f230908020350r3f66a4aer1e6d0a6e226457b7@mail.gmail.com>

Hi Sasha,

On Sun, Aug 2, 2009 at 6:09 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Hal,
>
> On 09:51 Fri 31 Jul     , Hal Rosenstock wrote:
> >
> > When osm_mesh_node_delete is called, osm_switch_delete may already have
> > been called so sw->p_sw is no longer valid to be used although it was
> > being used to obtain num_ports.
> >
> > Fix this by performing osm_mesh_delete_switches at the end of lash_process.
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> > ---
> > Changes since v1:
> > Rather than saving num_ports in the mesh node structure on creation and using
> > this on deletion, mesh switches deletion should occur at end of the lash
> > calculation as none of this state is needed after that
> > Approach proposed by Sasha
> >
> > diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
> > index 173fa86..89c07e5 100644
> > --- a/opensm/include/opensm/osm_mesh.h
> > +++ b/opensm/include/opensm/osm_mesh.h
> > @@ -1,5 +1,6 @@
> >  /*
> >   * Copyright (c) 2088      System Fabric Works, Inc.
> > + * Copyright (c) 2009      HNR Consulting. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -70,6 +71,7 @@ typedef struct _mesh_node {
> >  } mesh_node_t;
> >
> >  void osm_mesh_node_delete(struct _lash *p_lash, struct _switch *sw);
> > +void osm_mesh_delete_switches(struct _lash *p_lash);
> >  int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw);
> >  int osm_do_mesh_analysis(struct _lash *p_lash);
> >
> > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> > index 23fad87..b22fe6e 100644
> > --- a/opensm/opensm/osm_mesh.c
> > +++ b/opensm/opensm/osm_mesh.c
> > @@ -1,5 +1,6 @@
> >  /*
> >   * Copyright (c) 2008,2009      System Fabric Works, Inc. All rights reserved.
> > + * Copyright (c) 2009           HNR Consulting. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -1358,6 +1359,20 @@ void osm_mesh_node_delete(lash_t *p_lash, switch_t *sw)
> >  }
> >
> >  /*
> > + * osm_mesh_delete_switches - cleanup switches resources
> > + */
> > +void osm_mesh_delete_switches(lash_t *p_lash)
> > +{
> > +     if (p_lash->switches) {
> > +             unsigned id;
> > +             for (id = 0; ((int)id) < p_lash->num_switches; id++)
> > +                     if (p_lash->switches[id])
> > +                             osm_mesh_node_delete(p_lash,
> > +                                                  p_lash->switches[id]);
> > +     }
> > +}
>
> Why should it be in osm_mesh.c? osm_mesh_node_create() and
> osm_mesh_node_delete() are called in osm_ucast_lash.c now.
>
> To me it looks like a more appropriate place for such cleanup is the
> free_lash_structures() function in osm_ucast_lash.c.


Not quite as it cannot be cleaned up until after discover_network_properties
is called and succeeds.

-- Hal


>
>
> Sasha

From hnrose at comcast.net  Sun Aug  2 03:53:31 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 06:53:31 -0400
Subject: [ofa-general] [PATCHv3] opensm/osm_lash: Fix use after free problem
	in osm_mesh_node_delete
Message-ID: <20090802105331.GA26002@comcast.net>


When osm_mesh_node_delete is called, osm_switch_delete may already have
been called so sw->p_sw is no longer valid to be used although it was
being used to obtain num_ports.

Fix this by performing delete_switches at the end of lash_process.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v2:
Moved mesh switches deletion into lash

Changes since v1:
Rather than saving num_ports in the mesh node structure on creation and using
this on deletion, mesh switches deletion should occur at end of the lash
calculation as none of this state is needed after that 
Approach proposed by Sasha

diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 1c55a90..cf8e793 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -5,6 +5,7 @@
  * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
  * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved.
+ * Copyright (c) 2009      HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw)
 	free(sw);
 }
 
+static void delete_switches(lash_t *p_lash)
+{
+	if (p_lash->switches) {
+		unsigned id;
+		for (id = 0; ((int)id) < p_lash->num_switches; id++)
+			if (p_lash->switches[id])
+				osm_mesh_node_delete(p_lash,
+						     p_lash->switches[id]);
+	}
+}
+
+
 static void free_lash_structures(lash_t * p_lash)
 {
 	unsigned int i, j, k;
@@ -1219,7 +1232,7 @@ static int lash_process(void *context)
 
 	return_status = discover_network_properties(p_lash);
 	if (return_status != IB_SUCCESS)
-		goto Exit;
+		goto Exit2;
 
 	return_status = init_lash_structures(p_lash);
 	if (return_status != IB_SUCCESS)
@@ -1234,6 +1247,9 @@ static int lash_process(void *context)
 	populate_fwd_tbls(p_lash);
 
 Exit:
+	delete_switches(p_lash);
+
+Exit2:
 	if (p_lash->vl_min)
 		free_lash_structures(p_lash);
 	OSM_LOG_EXIT(p_log);
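
For readers following the thread, the ordering constraint behind this fix
can be sketched outside OpenSM. The types and names below (struct owner,
struct node, teardown) are hypothetical stand-ins, not the OpenSM
structures; the only point carried over is that the per-node cleanup reads
a field of the owning object, so it must run while the owner is still alive.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the object that sw->p_sw points at. */
struct owner {
	unsigned num_ports;
};

/* Hypothetical stand-in for a mesh node with per-port allocations. */
struct node {
	struct owner *owner;
	int **per_port;
};

static struct node *node_create(struct owner *o)
{
	struct node *n = calloc(1, sizeof(*n));
	unsigned i;

	if (!n)
		return NULL;
	n->owner = o;
	n->per_port = calloc(o->num_ports, sizeof(*n->per_port));
	if (!n->per_port) {
		free(n);
		return NULL;
	}
	for (i = 0; i < o->num_ports; i++)
		n->per_port[i] = calloc(1, sizeof(int));
	return n;
}

/* Reads owner->num_ports, so the owner must not have been freed yet. */
static unsigned node_delete(struct node *n)
{
	unsigned i, freed = 0;

	assert(n && n->owner);
	for (i = 0; i < n->owner->num_ports; i++) {
		free(n->per_port[i]);
		freed++;
	}
	free(n->per_port);
	free(n);
	return freed;
}

/* Safe order: dependent state first, owner last. */
static unsigned teardown(struct node *n, struct owner *o)
{
	unsigned freed = node_delete(n);

	free(o);
	return freed;
}
```

Swapping the two calls in teardown() reproduces the class of bug described
in the commit message: node_delete() would then dereference a freed owner.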


From hal.rosenstock at gmail.com  Sun Aug  2 03:59:27 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 06:59:27 -0400
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to 
	return success/failure status
In-Reply-To: <20090802103256.GF5287@me>
References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me>
Message-ID: <f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>

Hi Sasha,

On Sun, Aug 2, 2009 at 6:32 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Hal,
>
> On 09:53 Fri 31 Jul     , Hal Rosenstock wrote:
> >
> > based on valid/invalid hop count rather than relying on debug assert
>
> When could an invalid hop count be passed to this function?


Some out of tree user.


> And what could happen?


It writes past the end of the path array.


>
> ib_smp_init_new() is a simple structure fill-up helper (inlined and
> defined in header file) and I don't think that we need to check
> parameters there.
>
> This patch also introduces sort of inconsistency - a hop count is checked
> and other parameters aren't.


It's to protect against writing past the end of the array. Do any other
parameters need checking? I think bad values there just result in a
timeout.

-- Hal


>
>
> > Handle invalid status appropriate in callers of ib_smp_init_new
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> > ---
> > diff --git a/opensm/include/iba/ib_types.h
> b/opensm/include/iba/ib_types.h
> > index beb7492..6668d96 100644
> > --- a/opensm/include/iba/ib_types.h
> > +++ b/opensm/include/iba/ib_types.h
> > @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN
> const ib_smp_t * const p_smp)
> >  *
> >  * TODO
> >  *    This is too big for inlining, but leave it here for now
> > -*    since there is not yet another convient spot.
> > +*    since there is not yet another convenient spot.
> >  *
> >  * SYNOPSIS
> >  */
> > -static inline void OSM_API
> > +static inline boolean_t OSM_API
> >  ib_smp_init_new(IN ib_smp_t * const p_smp,
> >               IN const uint8_t method,
> >               IN const ib_net64_t trans_id,
> > @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
> >               IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid)
> >  {
> >       CL_ASSERT(p_smp);
> > -     CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX);
> > +
> > +     if (hop_count >= IB_SUBNET_PATH_HOPS_MAX)
> > +             return FALSE;
> >       p_smp->base_ver = 1;
> >       p_smp->mgmt_class = IB_MCLASS_SUBN_DIR;
> >       p_smp->class_ver = 1;
> > @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
> >
> >       /* copy the path */
> >       memcpy(&p_smp->initial_path, path_out,
> sizeof(p_smp->initial_path));
> > +     return TRUE;
> >  }
> >
> >  /*
> > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c
> > index be9a92b..7934173 100644
> > --- a/opensm/opensm/osm_req.c
> > +++ b/opensm/opensm/osm_req.c
> > @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm,
> >               ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id),
> >               cl_ntoh32(attr_mod), cl_ntoh64(tid));
> >
> > -     ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
> > -                     IB_MAD_METHOD_GET,
> > -                     tid,
> > -                     attr_id,
> > -                     attr_mod,
> > -                     p_path->hop_count,
> > -                     sm->p_subn->opt.m_key,
> > -                     p_path->path, IB_LID_PERMISSIVE,
> IB_LID_PERMISSIVE);
> > +     if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
> > +                          IB_MAD_METHOD_GET,
> > +                          tid,
> > +                          attr_id,
> > +                          attr_mod,
> > +                          p_path->hop_count,
> > +                          sm->p_subn->opt.m_key,
> > +                          p_path->path,
> > +                          IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) {
> > +             OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: "
> > +                     "ib_smp_init_new failed: hop count %d\n",
> > +                     p_path->hop_count);
>
> This is assumption on how ib_smp_init_new() is actually implemented -
> not perfect.
>
> Sasha
>  _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From sashak at voltaire.com  Sun Aug  2 04:16:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 14:16:01 +0300
Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after free
 problem in osm_mesh_node_delete
In-Reply-To: <20090802105331.GA26002@comcast.net>
References: <20090802105331.GA26002@comcast.net>
Message-ID: <20090802111601.GI5287@me>

On 06:53 Sun 02 Aug     , Hal Rosenstock wrote:
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index 1c55a90..cf8e793 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -5,6 +5,7 @@
>   * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
>   * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
>   * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved.
> + * Copyright (c) 2009      HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw)
>  	free(sw);
>  }
>  
> +static void delete_switches(lash_t *p_lash)

Would delete_mesh_switches() (or cleanup_mesh*()) be a better name? It
doesn't delete lash's switches, only mesh nodes.

> +{
> +	if (p_lash->switches) {
> +		unsigned id;
> +		for (id = 0; ((int)id) < p_lash->num_switches; id++)
> +			if (p_lash->switches[id])
> +				osm_mesh_node_delete(p_lash,
> +						     p_lash->switches[id]);
> +	}
> +}
> +
> +
>  static void free_lash_structures(lash_t * p_lash)
>  {
>  	unsigned int i, j, k;
> @@ -1219,7 +1232,7 @@ static int lash_process(void *context)
>  
>  	return_status = discover_network_properties(p_lash);

discover_network_properties() can fail in the middle of allocations, and
a full cleanup is desired anyway. It should be safe to 'goto Exit' below
since mesh node deletion is protected against not-yet-initialized input.

Sasha

>  	if (return_status != IB_SUCCESS)
> -		goto Exit;
> +		goto Exit2;
>  
>  	return_status = init_lash_structures(p_lash);
>  	if (return_status != IB_SUCCESS)
> @@ -1234,6 +1247,9 @@ static int lash_process(void *context)
>  	populate_fwd_tbls(p_lash);
>  
>  Exit:
> +	delete_switches(p_lash);
> +
> +Exit2:
>  	if (p_lash->vl_min)
>  		free_lash_structures(p_lash);
>  	OSM_LOG_EXIT(p_log);
> 


From hal.rosenstock at gmail.com  Sun Aug  2 04:17:21 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 07:17:21 -0400
Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after free 
	problem in osm_mesh_node_delete
In-Reply-To: <20090802111601.GI5287@me>
References: <20090802105331.GA26002@comcast.net> <20090802111601.GI5287@me>
Message-ID: <f0e08f230908020417n6e7688eayd39e5731c6231ba3@mail.gmail.com>

On Sun, Aug 2, 2009 at 7:16 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:53 Sun 02 Aug     , Hal Rosenstock wrote:
> > diff --git a/opensm/opensm/osm_ucast_lash.c
> b/opensm/opensm/osm_ucast_lash.c
> > index 1c55a90..cf8e793 100644
> > --- a/opensm/opensm/osm_ucast_lash.c
> > +++ b/opensm/opensm/osm_ucast_lash.c
> > @@ -5,6 +5,7 @@
> >   * Copyright (c) 2007      Simula Research Laboratory. All rights
> reserved.
> >   * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
> >   * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights
> reserved.
> > + * Copyright (c) 2009      HNR Consulting. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t *
> sw)
> >       free(sw);
> >  }
> >
> > +static void delete_switches(lash_t *p_lash)
>
> Would delete_mesh_switches() (or cleanup_mesh*()) be a better name? It
> doesn't delete lash's switches, only mesh nodes.


Sure.


>
>
> > +{
> > +     if (p_lash->switches) {
> > +             unsigned id;
> > +             for (id = 0; ((int)id) < p_lash->num_switches; id++)
> > +                     if (p_lash->switches[id])
> > +                             osm_mesh_node_delete(p_lash,
> > +                                                  p_lash->switches[id]);
> > +     }
> > +}
> > +
> > +
> >  static void free_lash_structures(lash_t * p_lash)
> >  {
> >       unsigned int i, j, k;
> > @@ -1219,7 +1232,7 @@ static int lash_process(void *context)
> >
> >       return_status = discover_network_properties(p_lash);
>
> discover_network_properties() can fail in the middle of allocations, and
> a full cleanup is desired anyway. It should be safe to 'goto Exit' below
> since mesh node deletion is protected against not-yet-initialized input.


It's not; I had tried doing that.

-- Hal


>
>
> Sasha
>
> >       if (return_status != IB_SUCCESS)
> > -             goto Exit;
> > +             goto Exit2;
> >
> >       return_status = init_lash_structures(p_lash);
> >       if (return_status != IB_SUCCESS)
> > @@ -1234,6 +1247,9 @@ static int lash_process(void *context)
> >       populate_fwd_tbls(p_lash);
> >
> >  Exit:
> > +     delete_switches(p_lash);
> > +
> > +Exit2:
> >       if (p_lash->vl_min)
> >               free_lash_structures(p_lash);
> >       OSM_LOG_EXIT(p_log);
> >

From hal.rosenstock at gmail.com  Sun Aug  2 04:18:22 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 07:18:22 -0400
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to 
	return success/failure status
In-Reply-To: <f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me>
	<f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
Message-ID: <f0e08f230908020418q7f023091jea80e8c5bbd9e89@mail.gmail.com>

On Sun, Aug 2, 2009 at 6:59 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> Hi Sasha,
>
> On Sun, Aug 2, 2009 at 6:32 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>> Hi Hal,
>>
>> On 09:53 Fri 31 Jul     , Hal Rosenstock wrote:
>> >
>> > based on valid/invalid hop count rather than relying on debug assert
>>
>> When could an invalid hop count be passed to this function?
>
>
> Some out of tree user.
>

Also, I think opensm itself can do this now with such topologies (without
some other changes that are in the pipeline).

-- Hal


>
>
>> And what could happen?
>
>
> It writes past the end of the path array.
>
>
>>
>> ib_smp_init_new() is a simple structure fill-up helper (inlined and
>> defined in header file) and I don't think that we need to check
>> parameters there.
>>
>> This patch also introduces sort of inconsistency - a hop count is checked
>> and other parameters aren't.
>
>
> It's to protect against writing past the end of the array. Do any other
> parameters need checking? I think bad values there just result in a
> timeout.
>
> -- Hal
>
>
>>
>>
>> > Handle invalid status appropriate in callers of ib_smp_init_new
>> >
>> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> > ---
>> > diff --git a/opensm/include/iba/ib_types.h
>> b/opensm/include/iba/ib_types.h
>> > index beb7492..6668d96 100644
>> > --- a/opensm/include/iba/ib_types.h
>> > +++ b/opensm/include/iba/ib_types.h
>> > @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN
>> const ib_smp_t * const p_smp)
>> >  *
>> >  * TODO
>> >  *    This is too big for inlining, but leave it here for now
>> > -*    since there is not yet another convient spot.
>> > +*    since there is not yet another convenient spot.
>> >  *
>> >  * SYNOPSIS
>> >  */
>> > -static inline void OSM_API
>> > +static inline boolean_t OSM_API
>> >  ib_smp_init_new(IN ib_smp_t * const p_smp,
>> >               IN const uint8_t method,
>> >               IN const ib_net64_t trans_id,
>> > @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
>> >               IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid)
>> >  {
>> >       CL_ASSERT(p_smp);
>> > -     CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX);
>> > +
>> > +     if (hop_count >= IB_SUBNET_PATH_HOPS_MAX)
>> > +             return FALSE;
>> >       p_smp->base_ver = 1;
>> >       p_smp->mgmt_class = IB_MCLASS_SUBN_DIR;
>> >       p_smp->class_ver = 1;
>> > @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp,
>> >
>> >       /* copy the path */
>> >       memcpy(&p_smp->initial_path, path_out,
>> sizeof(p_smp->initial_path));
>> > +     return TRUE;
>> >  }
>> >
>> >  /*
>> > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c
>> > index be9a92b..7934173 100644
>> > --- a/opensm/opensm/osm_req.c
>> > +++ b/opensm/opensm/osm_req.c
>> > @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm,
>> >               ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id),
>> >               cl_ntoh32(attr_mod), cl_ntoh64(tid));
>> >
>> > -     ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
>> > -                     IB_MAD_METHOD_GET,
>> > -                     tid,
>> > -                     attr_id,
>> > -                     attr_mod,
>> > -                     p_path->hop_count,
>> > -                     sm->p_subn->opt.m_key,
>> > -                     p_path->path, IB_LID_PERMISSIVE,
>> IB_LID_PERMISSIVE);
>> > +     if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw),
>> > +                          IB_MAD_METHOD_GET,
>> > +                          tid,
>> > +                          attr_id,
>> > +                          attr_mod,
>> > +                          p_path->hop_count,
>> > +                          sm->p_subn->opt.m_key,
>> > +                          p_path->path,
>> > +                          IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) {
>> > +             OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: "
>> > +                     "ib_smp_init_new failed: hop count %d\n",
>> > +                     p_path->hop_count);
>>
>> This is assumption on how ib_smp_init_new() is actually implemented -
>> not perfect.
>>
>> Sasha

From hal.rosenstock at gmail.com  Sun Aug  2 04:26:37 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 07:26:37 -0400
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <4A742E94.2070002@gmail.com>
References: <4A6DDFCE.9060009@voltaire.com> <4A703DA4.9080300@Voltaire.COM>
	<f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>
	<4A705B3A.7060404@Voltaire.COM>
	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>
	<4A731818.3060500@voltaire.com>
	<f0e08f230907311050wa750cf2n497039acafdab3b4@mail.gmail.com>
	<4A733D24.3040201@voltaire.com>
	<f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>
	<4A742E94.2070002@gmail.com>
Message-ID: <f0e08f230908020426i2331cf0fg3bc3a21f1e86d1b5@mail.gmail.com>

On Sat, Aug 1, 2009 at 8:01 AM, Yossi Etigin <yossi.openib at gmail.com> wrote:
>
> Hal Rosenstock wrote:
> >
> >     Yes, but AFAIK the only "bad" case is if the LID stays the same but
> >     LMC changes to a lower
> >     value. In this case the path refresh will not happen when it is
> >     supposed to.
> >
> >
> > What's the impact of that ?
> >
> > Also the LID can change at the same time as the LMC.
> >
> > I can't tell if all the possible cases are handled properly. Are they ?
> >
>
> Let's see:
> Only LID changes - handled correctly.
> LMC (and possibly LID) change - either we "catch" this, or we don't.
> If we do, the path and LMC will be refreshed so we will not keep refreshing
> the path forever (like it could have been if we didn't refresh the LMC).
> If we don't - ipoib packets will not reach the neighbour, which is the same
> situation there is today.

By handled correctly, you mean that the ARP request gets to the remote
node, is responded to, and the response makes it back and is treated
as a valid path indication, right?

If so, is the original ARP request unicast or broadcast?

If the request is unicast, couldn't it be sent using the wrong static
rate, since it would still be using the original path parameters?

Even if it is broadcast, if the original path parameters (rate, etc.)
are still used at the local node, doesn't this assume a homogeneous
subnet?
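
For reference on the LMC case raised above: in InfiniBand, a port with
LMC value n answers to 2^n consecutive LIDs starting at its base LID.
The sketch below shows why a destination LID that was valid under a
larger LMC can fall outside the port's range once the LMC is lowered,
even though the base LID is unchanged. The function name is hypothetical
and this is not IPoIB code; it only illustrates the range arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if dlid lies in the 2^lmc LID range that
 * starts at base_lid. lmc is expected to be in the range 0..7.
 */
static bool demo_lid_in_port_range(uint16_t base_lid, uint8_t lmc,
				   uint16_t dlid)
{
	/* Mask off the low lmc bits, which select a path within the range. */
	uint16_t mask = (uint16_t)~((1u << lmc) - 1u);

	return (dlid & mask) == (base_lid & mask);
}
```

With base LID 0x10, a cached destination LID of 0x13 is inside the range
for LMC 2 but outside it for LMC 1, so traffic sent to 0x13 would no
longer reach the port.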


From sashak at voltaire.com  Sun Aug  2 04:49:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 14:49:32 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to
	return success/failure status
In-Reply-To: <f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me>
	<f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
Message-ID: <20090802114932.GK5287@me>

On 06:59 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Some out of tree user.

They should take care to pass valid data - ib_smp_init_new() is a simple
helper.

Sasha
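
To make the trade-off in this thread concrete, here is a minimal sketch
of a fill-up helper that rejects an out-of-range hop count at run time
instead of relying on a debug-only assert. All names and the bound of 64
are hypothetical stand-ins for this illustration; this is not the
ib_smp_init_new() implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; not the OpenSM definitions. */
#define PATH_HOPS_MAX 64

struct demo_smp {
	uint8_t hop_count;
	uint8_t initial_path[PATH_HOPS_MAX];
};

/*
 * Runtime-checked variant: an out-of-range hop count is rejected in
 * every build, whereas a debug-only assert disappears in release builds.
 * On failure the destination structure is left untouched.
 */
static bool demo_smp_init(struct demo_smp *smp, uint8_t hop_count,
			  const uint8_t *path_out)
{
	if (hop_count >= PATH_HOPS_MAX)
		return false;	/* the caller decides how to report this */

	memset(smp, 0, sizeof(*smp));
	smp->hop_count = hop_count;
	memcpy(smp->initial_path, path_out, sizeof(smp->initial_path));
	return true;
}
```

The cost of this shape is that every caller has to handle the return
value, which is the objection raised against it in the thread.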


From hnrose at comcast.net  Sun Aug  2 04:50:11 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 07:50:11 -0400
Subject: [ofa-general] [PATCHv4] opensm/osm_lash: Fix use after free problem
	in osm_mesh_node_delete
Message-ID: <20090802115011.GA9345@comcast.net>


When osm_mesh_node_delete is called, osm_switch_delete may already have
been called so sw->p_sw is no longer valid to be used although it was
being used to obtain num_ports.

Fix this by performing delete_mesh_switches at the end of lash_process.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v3:
Changed name of delete_switches to delete_mesh_switches

Changes since v2:
Moved mesh switches deletion into lash

Changes since v1:
Rather than saving num_ports in the mesh node structure on creation and using
this on deletion, mesh switches deletion should occur at end of the lash
calculation as none of this state is needed after that 
Approach proposed by Sasha

diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 1c55a90..841c0fd 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -5,6 +5,7 @@
  * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
  * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved.
+ * Copyright (c) 2009      HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw)
 	free(sw);
 }
 
+static void delete_mesh_switches(lash_t *p_lash)
+{
+	if (p_lash->switches) {
+		unsigned id;
+		for (id = 0; ((int)id) < p_lash->num_switches; id++)
+			if (p_lash->switches[id])
+				osm_mesh_node_delete(p_lash,
+						     p_lash->switches[id]);
+	}
+}
+
+
 static void free_lash_structures(lash_t * p_lash)
 {
 	unsigned int i, j, k;
@@ -1219,7 +1232,7 @@ static int lash_process(void *context)
 
 	return_status = discover_network_properties(p_lash);
 	if (return_status != IB_SUCCESS)
-		goto Exit;
+		goto Exit2;
 
 	return_status = init_lash_structures(p_lash);
 	if (return_status != IB_SUCCESS)
@@ -1234,6 +1247,9 @@ static int lash_process(void *context)
 	populate_fwd_tbls(p_lash);
 
 Exit:
+	delete_mesh_switches(p_lash);
+
+Exit2:
 	if (p_lash->vl_min)
 		free_lash_structures(p_lash);
 	OSM_LOG_EXIT(p_log);


From hal.rosenstock at gmail.com  Sun Aug  2 04:54:01 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 2 Aug 2009 07:54:01 -0400
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to 
	return success/failure status
In-Reply-To: <20090802114932.GK5287@me>
References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me>
	<f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
	<20090802114932.GK5287@me>
Message-ID: <f0e08f230908020454h5aecbc17m561728d35fc9a62@mail.gmail.com>

On Sun, Aug 2, 2009 at 7:49 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:59 Sun 02 Aug     , Hal Rosenstock wrote:
> >
> > Some out of tree user.
>
> They should take care to pass valid data - ib_smp_init_new() is a simple
> helper.


opensm too ?

Why replicate this simple check all over the place ?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Sun Aug  2 04:57:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 14:57:35 +0300
Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after
	free problem in osm_mesh_node_delete
In-Reply-To: <f0e08f230908020417n6e7688eayd39e5731c6231ba3@mail.gmail.com>
References: <20090802105331.GA26002@comcast.net> <20090802111601.GI5287@me>
	<f0e08f230908020417n6e7688eayd39e5731c6231ba3@mail.gmail.com>
Message-ID: <20090802115735.GL5287@me>

On 07:17 Sun 02 Aug     , Hal Rosenstock wrote:
> >
> > > +{
> > > +     if (p_lash->switches) {
> > > +             unsigned id;
> > > +             for (id = 0; ((int)id) < p_lash->num_switches; id++)
> > > +                     if (p_lash->switches[id])
> > > +                             osm_mesh_node_delete(p_lash,
> > > +                                                  p_lash->switches[id]);
> > > +     }
> > > +}
> > > +
> > > +
> > >  static void free_lash_structures(lash_t * p_lash)
> > >  {
> > >       unsigned int i, j, k;
> > > @@ -1219,7 +1232,7 @@ static int lash_process(void *context)
> > >
> > >       return_status = discover_network_properties(p_lash);
> >
> > discover_network_properties() can fail in the middle of allocations, and
> > a full cleanup is desired anyway. It should be safe to 'goto Exit' below
> > since mesh node deletion is protected against not-yet-initialized input.
> 
> 
> It's not;

Could you elaborate?

Sasha
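
The question here is whether cleanup can be called safely on state that
was only partly set up. A minimal sketch of a cleanup routine that
tolerates missing pieces at every level might look like the following.
The types and function names are hypothetical and much simpler than the
lash and mesh structures; the sketch does not claim that
osm_mesh_node_delete() behaves this way.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the structures in the thread. */
struct demo_node { int *links; };
struct demo_switch { struct demo_node *node; };
struct demo_lash {
	struct demo_switch **switches;
	int num_switches;
};

/* Frees one node if present; returns 1 if something was freed. */
static int demo_node_delete(struct demo_switch *sw)
{
	if (!sw || !sw->node)
		return 0;	/* never initialized, or already deleted */
	free(sw->node->links);
	free(sw->node);
	sw->node = NULL;	/* makes a second call harmless */
	return 1;
}

/* Safe to call at any point of a failed setup; returns nodes freed. */
static int demo_delete_nodes(struct demo_lash *l)
{
	int i, n = 0;

	if (!l->switches)
		return 0;
	for (i = 0; i < l->num_switches; i++)
		n += demo_node_delete(l->switches[i]);
	return n;
}
```

A routine of this shape can sit on a single exit path regardless of how
far initialization got, provided the structures are zeroed on allocation.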


From hnrose at comcast.net  Sun Aug  2 05:40:52 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 08:40:52 -0400
Subject: [ofa-general] [PATCHv5] opensm/osm_lash: Fix use after free problem
	in osm_mesh_node_delete
Message-ID: <20090802124052.GA18247@comcast.net>


When osm_mesh_node_delete is called, osm_switch_delete may already have
been called so sw->p_sw is no longer valid to be used although it was
being used to obtain num_ports.

Fix this by performing delete_mesh_switches in free_lash_structures.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v4:
Moved call of delete_mesh_switches into free_lash_structures

Changes since v3:
Changed name of delete_switches to delete_mesh_switches

Changes since v2:
Moved mesh switches deletion into lash

Changes since v1:
Rather than saving num_ports in the mesh node structure on creation and using
this on deletion, mesh switches deletion should occur at end of the lash
calculation as none of this state is needed after that 
Approach proposed by Sasha

diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 1c55a90..a62cb3d 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -5,6 +5,7 @@
  * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
  * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved.
+ * Copyright (c) 2009      HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw)
 	free(sw);
 }
 
+static void delete_mesh_switches(lash_t *p_lash)
+{
+	if (p_lash->switches) {
+		unsigned id;
+		for (id = 0; ((int)id) < p_lash->num_switches; id++)
+			if (p_lash->switches[id])
+				osm_mesh_node_delete(p_lash,
+						     p_lash->switches[id]);
+	}
+}
+
+
 static void free_lash_structures(lash_t * p_lash)
 {
 	unsigned int i, j, k;
@@ -667,6 +680,8 @@ static void free_lash_structures(lash_t * p_lash)
 
 	OSM_LOG_ENTER(p_log);
 
+	delete_mesh_switches(p_lash);
+
 	/* free cdg_vertex_matrix */
 	for (i = 0; i < p_lash->vl_min; i++) {
 		for (j = 0; j < num_switches; j++) {
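
For comparison, the v1 approach mentioned in the changelog (saving the
port count in the mesh node at creation and using it on deletion) can be
sketched as below. The types and names are hypothetical and this is not
the code from any version of the patch; it only shows how a cached count
avoids dereferencing an owner that may already have been freed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the switch object that owns the port count. */
struct demo_osm_switch { unsigned num_ports; };

struct demo_mesh_node {
	unsigned num_links;	/* cached copy of the owner's port count */
	int **links;
};

/* Sketch only: a zero port count is treated as a failure here. */
static struct demo_mesh_node *
demo_mesh_node_create(const struct demo_osm_switch *owner)
{
	struct demo_mesh_node *node = calloc(1, sizeof(*node));
	unsigned i;

	if (!node)
		return NULL;
	node->links = calloc(owner->num_ports, sizeof(*node->links));
	if (!node->links) {
		free(node);
		return NULL;
	}
	node->num_links = owner->num_ports;	/* remember it now */
	for (i = 0; i < node->num_links; i++)
		node->links[i] = calloc(1, sizeof(**node->links));
	return node;
}

/* Uses only the cached count; returns the number of link slots released. */
static unsigned demo_mesh_node_delete(struct demo_mesh_node *node)
{
	unsigned i, freed;

	if (!node)
		return 0;
	freed = node->num_links;
	for (i = 0; i < node->num_links; i++)
		free(node->links[i]);
	free(node->links);
	free(node);
	return freed;
}
```

The applied fix instead deletes the mesh state at the end of the lash
calculation, which avoids carrying a duplicate field.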


From sashak at voltaire.com  Sun Aug  2 06:16:16 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 16:16:16 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to
	return success/failure status
In-Reply-To: <f0e08f230908020454h5aecbc17m561728d35fc9a62@mail.gmail.com>
References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me>
	<f0e08f230908020359h2987a0c4h2755a701b7dcc19d@mail.gmail.com>
	<20090802114932.GK5287@me>
	<f0e08f230908020454h5aecbc17m561728d35fc9a62@mail.gmail.com>
Message-ID: <20090802131616.GM5287@me>

On 07:54 Sun 02 Aug     , Hal Rosenstock wrote:
> >
> > They should take care to pass valid data - ib_smp_init_new() is a simple
> > helper.
> 
> 
> opensm too ?

OK, path overflow is theoretically possible only when a path is extended
(a wrong extension will by itself overflow the osm_dr_path_t path buffer).
So it should be quite enough to check for overflow in only three places:

requery_dup_node_info() in osm_node_info_rcv.c
pi_rcv_process_switch_port() in osm_port_info_rcv.c
state_mgr_get_remote_port_info() in osm_state_mgr.c

> Why replicate this simple check all over the place ?

And if you wish to make a single-point check, then I guess
osm_dr_path_extend() is the place (it is also called less frequently
than ib_smp_init_new()).

Sasha
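
A single-point check of the kind suggested above could be sketched as
follows. The names and the bound of 64 are hypothetical; this is not
osm_dr_path_extend(). One design choice is assumed here: the bound is
tested before the hop count is incremented, so a failed call leaves the
path unchanged for the caller to log or reuse.

```c
#include <stdbool.h>
#include <stdint.h>

#define PATH_HOPS_MAX 64	/* hypothetical bound for this sketch */

struct demo_dr_path {
	uint8_t hop_count;
	uint8_t path[PATH_HOPS_MAX];	/* slot 0 is reserved */
};

/*
 * Appends one hop to the path. Returns false, without modifying the
 * path, when the extension would write past the end of the array.
 */
static bool demo_dr_path_extend(struct demo_dr_path *p, uint8_t port_num)
{
	if (p->hop_count + 1 >= PATH_HOPS_MAX)
		return false;

	p->hop_count++;
	p->path[p->hop_count] = port_num;
	return true;
}
```

Starting from an empty path, PATH_HOPS_MAX - 1 extensions succeed and
the next one is refused with the hop count left as it was.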


From hnrose at comcast.net  Sun Aug  2 08:03:18 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 11:03:18 -0400
Subject: [ofa-general] [PATCH] opensm: osm_dr_path_extend can fail due to
	invalid hop count
Message-ID: <20090802150318.GA20037@comcast.net>


Change routine to return success/failure status rather than
depend on debug assert
Also, fix callers of this routine to handle this return status

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h
index 8d65d2c..7ef0fc5 100644
--- a/opensm/include/opensm/osm_path.h
+++ b/opensm/include/opensm/osm_path.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path,
 *
 * SYNOPSIS
 */
-static inline void
+static inline boolean_t
 osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num)
 {
 	p_path->hop_count++;
-	CL_ASSERT(p_path->hop_count < IB_SUBNET_PATH_HOPS_MAX);
+
+	if (p_path->hop_count >= IB_SUBNET_PATH_HOPS_MAX)
+		return FALSE;
 	/*
 	   Location 0 in the path array is reserved per IB spec.
 	 */
 	p_path->path[p_path->hop_count] = port_num;
+	return TRUE;
 }
 
 /*
@@ -208,7 +212,7 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num)
 *		[in] Additional port to add to the DR path.
 *
 * RETURN VALUE
-*	None.
+*	Boolean indicating whether or not path was extended.
 *
 * NOTES
 *
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index bfa5b1f..f5ef1ac 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -85,7 +86,10 @@ static void report_duplicated_guid(IN osm_sm_t * sm, osm_physp_t * p_physp,
 			 OSM_LOG_ERROR);
 
 	path = *osm_physp_get_dr_path_ptr(p_new);
-	osm_dr_path_extend(&path, port_num);
+	if (!osm_dr_path_extend(&path, port_num))
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D05: "
+			"DR path with hop count %d couldn't be extended\n",
+			path.hop_count);
 	osm_dump_dr_path(sm->p_log, &path, OSM_LOG_ERROR);
 
 	osm_log(sm->p_log, OSM_LOG_SYS,
@@ -100,7 +104,12 @@ static void requery_dup_node_info(IN osm_sm_t * sm, osm_physp_t * p_physp,
 	cl_status_t status;
 
 	path = *osm_physp_get_dr_path_ptr(p_physp->p_remote_physp);
-	osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num);
+	if (!osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D08: "
+			"DR path with hop count %d couldn't be extended\n",
+			path.hop_count);
+		return;
+	}
 
 	context.ni_context.node_guid =
 	    p_physp->p_remote_physp->p_node->node_info.port_guid;
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 7b6fb1a..a451de7 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -246,9 +247,15 @@ static void pi_rcv_process_switch_port(IN osm_sm_t * sm, IN osm_node_t * p_node,
 			    osm_physp_get_port_num(p_physp)) {
 				path = *osm_physp_get_dr_path_ptr(p_physp);
 
-				osm_dr_path_extend(&path,
-						   osm_physp_get_port_num
-						   (p_physp));
+				if (!osm_dr_path_extend(&path,
+							osm_physp_get_port_num
+							(p_physp))) {
+					OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+						"ERR 0F08: "
+						"DR path with hop count %d couldn't be extended\n",
+						path.hop_count);
+					break;
+				}
 
 				memset(&context, 0, sizeof(context));
 				context.ni_context.node_guid =
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index adc39a0..44b0f6c 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -166,7 +167,13 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm,
 	/* generate a dr path leaving on the physp to the remote node */
 	p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
 	memcpy(&rem_node_dr_path, p_dr_path, sizeof(osm_dr_path_t));
-	osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp));
+	if (!osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp))) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332D: "
+			"DR path with hop count %d couldn't be extended "
+			"so skipping PortInfo query\n",
+			p_dr_path->hop_count);
+		goto Exit;
+	}
 
 	memset(&mad_context, 0, sizeof(mad_context));
 
@@ -187,6 +194,7 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm,
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332E: "
 			"Request for PortInfo failed\n");
 
+Exit:
 	OSM_LOG_EXIT(sm->p_log);
 }
 

From sashak at voltaire.com  Sun Aug  2 08:07:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 18:07:12 +0300
Subject: [ofa-general] Re: [PATCHv5] opensm/osm_lash: Fix use after free
 problem in osm_mesh_node_delete
In-Reply-To: <20090802124052.GA18247@comcast.net>
References: <20090802124052.GA18247@comcast.net>
Message-ID: <20090802150712.GP5287@me>

On 08:40 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> When osm_mesh_node_delete is called, osm_switch_delete may already have
> been called so sw->p_sw is no longer valid to be used although it was
> being used to obtain num_ports.
> 
> Fix this by performing delete_mesh_switches in free_lash_structures.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hnrose at comcast.net  Sun Aug  2 08:16:12 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 11:16:12 -0400
Subject: [ofa-general] [PATCH] opensm/ib_types.h: Update ib_mad_is_response
	description
Message-ID: <20090802151612.GA27074@comcast.net>


Also, fix typo in ib_smp_init_new TODO

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index beb7492..fe3f051 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -3779,7 +3779,8 @@ ib_mad_init_response(IN const ib_mad_t * const p_req_mad,
 *	ib_mad_is_response
 *
 * DESCRIPTION
-*	Returns TRUE if the MAD is a response ('R' bit set),
+*	Returns TRUE if the MAD is a response ('R' bit set)
+*	or if the MAD is a TRAP REPRESS,
 *	FALSE otherwise.
 *
 * SYNOPSIS
@@ -4091,7 +4092,7 @@ static inline boolean_t OSM_API ib_smp_is_d(IN const ib_smp_t * const p_smp)
 *
 * TODO
 *	This is too big for inlining, but leave it here for now
-*	since there is not yet another convient spot.
+*	since there is not yet another convenient spot.
 *
 * SYNOPSIS
 */


From sashak at voltaire.com  Sun Aug  2 08:20:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 18:20:06 +0300
Subject: [ofa-general] Re: [PATCH] opensm: osm_dr_path_extend can fail due to
 invalid hop count
In-Reply-To: <20090802150318.GA20037@comcast.net>
References: <20090802150318.GA20037@comcast.net>
Message-ID: <20090802152006.GR5287@me>

On 11:03 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Change routine to return success/failure status rather than
> depend on debug assert
> Also, fix callers of this routine to handle this return status
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h
> index 8d65d2c..7ef0fc5 100644
> --- a/opensm/include/opensm/osm_path.h
> +++ b/opensm/include/opensm/osm_path.h
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path,
>  *
>  * SYNOPSIS
>  */
> -static inline void
> +static inline boolean_t

But why boolean? It is not a logical operation; what is wrong with just int?

Sasha


From sashak at voltaire.com  Sun Aug  2 08:21:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 18:21:26 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Update
 ib_mad_is_response description
In-Reply-To: <20090802151612.GA27074@comcast.net>
References: <20090802151612.GA27074@comcast.net>
Message-ID: <20090802152126.GS5287@me>

On 11:16 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Also, fix typo in ib_smp_init_new TODO
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hnrose at comcast.net  Sun Aug  2 08:22:04 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 11:22:04 -0400
Subject: [ofa-general] [PATCH] opensm/osm_path.h: In osm_dr_path_init,
	only copy needed part of path
Message-ID: <20090802152204.GA27199@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com> 
---
diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h
index 8d65d2c..7ef0fc5 100644
--- a/opensm/include/opensm/osm_path.h
+++ b/opensm/include/opensm/osm_path.h
@@ -155,7 +156,7 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path,
 	CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX);
 	p_path->h_bind = h_bind;
 	p_path->hop_count = hop_count;
-	memcpy(p_path->path, path, IB_SUBNET_PATH_HOPS_MAX);
+	memcpy(p_path->path, path, hop_count + 1);
 }
 
 /*


From hnrose at comcast.net  Sun Aug  2 08:31:33 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 11:31:33 -0400
Subject: [ofa-general] [PATCHv2] opensm: osm_dr_path_extend can fail due to
	invalid hop count
Message-ID: <20090802153133.GA29647@comcast.net>


Change routine to return success/failure status rather than
depend on debug assert
Also, fix callers of this routine to handle this return status

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Make osm_dr_path_extend return int rather than boolean
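
For reference, the checked-extend pattern can be sketched standalone as
below. This is only an illustration, not the patched code: the type
sketch_dr_path_t, the function sketch_dr_path_extend and the constant
SKETCH_PATH_HOPS_MAX are hypothetical stand-ins, and the value 64 is an
assumption for IB_SUBNET_PATH_HOPS_MAX. Unlike the patch below, the
sketch tests the bound before bumping hop_count, so a failed call leaves
the path untouched.

```c
#include <stdint.h>

/* Assumed value; the real constant is IB_SUBNET_PATH_HOPS_MAX in ib_types.h. */
#define SKETCH_PATH_HOPS_MAX 64

/* Hypothetical stand-in for osm_dr_path_t, reduced to the fields used here. */
typedef struct {
	uint8_t hop_count;
	uint8_t path[SKETCH_PATH_HOPS_MAX];
} sketch_dr_path_t;

/*
 * Append port_num to the path.
 * Returns 0 if the path was extended, -1 if it is already full.
 * The bound is checked before hop_count is incremented, so the
 * path is unchanged when -1 is returned.
 */
static int sketch_dr_path_extend(sketch_dr_path_t *p_path, uint8_t port_num)
{
	if (p_path->hop_count + 1 >= SKETCH_PATH_HOPS_MAX)
		return -1;
	p_path->hop_count++;
	/* Location 0 in the path array is reserved, so hops start at index 1. */
	p_path->path[p_path->hop_count] = port_num;
	return 0;
}
```

A caller would test the result for non-zero, log the error and skip the
query, as the callers in this patch do.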

diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h
index 8d65d2c..d02576b 100644
--- a/opensm/include/opensm/osm_path.h
+++ b/opensm/include/opensm/osm_path.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path,
 *
 * SYNOPSIS
 */
-static inline void
+static inline int
 osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num)
 {
 	p_path->hop_count++;
-	CL_ASSERT(p_path->hop_count < IB_SUBNET_PATH_HOPS_MAX);
+
+	if (p_path->hop_count >= IB_SUBNET_PATH_HOPS_MAX)
+		return -1;
 	/*
 	   Location 0 in the path array is reserved per IB spec.
 	 */
 	p_path->path[p_path->hop_count] = port_num;
+	return 0;
 }
 
 /*
@@ -208,7 +212,7 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num)
 *		[in] Additional port to add to the DR path.
 *
 * RETURN VALUE
-*	None.
+*	Boolean indicating whether or not path was extended.
 *
 * NOTES
 *
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index bfa5b1f..c454d02 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -85,7 +86,10 @@ static void report_duplicated_guid(IN osm_sm_t * sm, osm_physp_t * p_physp,
 			 OSM_LOG_ERROR);
 
 	path = *osm_physp_get_dr_path_ptr(p_new);
-	osm_dr_path_extend(&path, port_num);
+	if (osm_dr_path_extend(&path, port_num))
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D05: "
+			"DR path with hop count %d couldn't be extended\n",
+			path.hop_count);
 	osm_dump_dr_path(sm->p_log, &path, OSM_LOG_ERROR);
 
 	osm_log(sm->p_log, OSM_LOG_SYS,
@@ -100,7 +104,12 @@ static void requery_dup_node_info(IN osm_sm_t * sm, osm_physp_t * p_physp,
 	cl_status_t status;
 
 	path = *osm_physp_get_dr_path_ptr(p_physp->p_remote_physp);
-	osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num);
+	if (osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D08: "
+			"DR path with hop count %d couldn't be extended\n",
+			path.hop_count);
+		return;
+	}
 
 	context.ni_context.node_guid =
 	    p_physp->p_remote_physp->p_node->node_info.port_guid;
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 7b6fb1a..57cc494 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -246,9 +247,15 @@ static void pi_rcv_process_switch_port(IN osm_sm_t * sm, IN osm_node_t * p_node,
 			    osm_physp_get_port_num(p_physp)) {
 				path = *osm_physp_get_dr_path_ptr(p_physp);
 
-				osm_dr_path_extend(&path,
-						   osm_physp_get_port_num
-						   (p_physp));
+				if (osm_dr_path_extend(&path,
+						       osm_physp_get_port_num
+						       (p_physp))) {
+					OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+						"ERR 0F08: "
+						"DR path with hop count %d couldn't be extended\n",
+						path.hop_count);
+					break;
+				}
 
 				memset(&context, 0, sizeof(context));
 				context.ni_context.node_guid =
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index adc39a0..90bef87 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -166,7 +167,13 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm,
 	/* generate a dr path leaving on the physp to the remote node */
 	p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
 	memcpy(&rem_node_dr_path, p_dr_path, sizeof(osm_dr_path_t));
-	osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp));
+	if (osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp))) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332D: "
+			"DR path with hop count %d couldn't be extended "
+			"so skipping PortInfo query\n",
+			p_dr_path->hop_count);
+		goto Exit;
+	}
 
 	memset(&mad_context, 0, sizeof(mad_context));
 
@@ -187,6 +194,7 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm,
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332E: "
 			"Request for PortInfo failed\n");
 
+Exit:
 	OSM_LOG_EXIT(sm->p_log);
 }
 

From sashak at voltaire.com  Sun Aug  2 08:51:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 18:51:30 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: osm_dr_path_extend can fail due
 to invalid hop count
In-Reply-To: <20090802153133.GA29647@comcast.net>
References: <20090802153133.GA29647@comcast.net>
Message-ID: <20090802155130.GT5287@me>

On 11:31 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Change routine to return success/failure status rather than
> depend on debug assert
> Also, fix callers of this routine to handle this return status
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 08:54:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 18:54:10 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_path.h: In osm_dr_path_init,
 only copy needed part of path
In-Reply-To: <20090802152204.GA27199@comcast.net>
References: <20090802152204.GA27199@comcast.net>
Message-ID: <20090802155410.GU5287@me>

On 11:22 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com> 

Applied. Thanks.

Sasha


From bugzilla-daemon at bugzilla.kernel.org  Sun Aug  2 10:57:42 2009
From: bugzilla-daemon at bugzilla.kernel.org (bugzilla-daemon at bugzilla.kernel.org)
Date: Sun, 2 Aug 2009 17:57:42 GMT
Subject: [ofa-general] [Bug 13893] New: NULL pointer dereference by SRP
 initiator after
 restarting SRP target followed by SCSI reset of initiator
Message-ID: <bug-13893-11804@http.bugzilla.kernel.org/>

http://bugzilla.kernel.org/show_bug.cgi?id=13893

           Summary: NULL pointer dereference by SRP initiator after
                    restarting SRP target followed by SCSI reset of
                    initiator
           Product: Drivers
           Version: 2.5
    Kernel Version: 2.6.30.3
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
        AssignedTo: drivers_infiniband-rdma at kernel-bugs.osdl.org
        ReportedBy: bart.vanassche at gmail.com
        Regression: No


Setup of the target system:
- SCST revision 1000.
- Contents of /etc/scst.conf on the target:
[HANDLER vdisk]
DEVICE disk01,/dev/exported-block,NV_CACHE,512           
[HANDLER vcdrom]
[GROUP Default]
[ASSIGNMENT Default]
DEVICE disk01,0
[TARGETS enable]
[TARGETS disable]
- After having installed SCST, start it as follows:
dd if=/dev/zero of=/dev/exported-block bs=1M count=1000
/etc/init.d/scst restart

Setup of the initiator system:
- Vanilla 2.6.30.3 kernel.
- Once the target has been set up, import the SRP target as follows:
rmmod ib_srp; modprobe ib_srp; ibsrpdm -c | while read target_info; do echo
"${target_info}"; echo "${target_info}" >
/sys/class/infiniband_srp/srp-mlx4_0-1/add_target; done

How to reproduce the NULL pointer dereference:
- Run the following command on the target:
/etc/init.d/scst restart
- Run the following command on the initiator:
sg_reset -d /dev/sdb

Result:
scsi host7: SRP reset_device called                  
BUG: unable to handle kernel NULL pointer dereference at 0000000000000074
IP: [<ffffffffa03f2db2>] srp_send_tsk_mgmt+0xb4/0x130 [ib_srp]           
PGD 51e7067 PUD 48543067 PMD 0                                           
Oops: 0000 [1] SMP                                                       
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
CPU 0                                                                    
Modules linked in: ib_srp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack
iptable_filter ip_tables x_tables vboxnetflt(N) vboxdrv(N) snd_pcm_oss
snd_mixer_oss binfmt_misc snd_seq snd_seq_device rdma_ucm scsi_transport_srp
scsi_tgt ib_ipoib ib_uverbs ib_umad ib_iser rdma_cm ib_cm iw_cm mlx4_ib ib_sa
ipv6 ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi af_packet
cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq fuse loop
dm_mod coretemp(N) snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep
ohci1394 i2c_i801 snd rtc_cmos mlx4_core sr_mod serio_raw pcspkr ieee1394
i2c_core intel_agp pata_marvell rtc_core skge soundcore button rtc_lib sky2
cdrom sg floppy uhci_hcd ehci_hcd sd_mod crc_t10dif usbcore edd ext3 mbcache
jbd fan ide_pci_generic ide_core ata_generic ata_piix thermal processor
thermal_sys hwmon pata_jmicron ahci libata scsi_mod dock [last unloaded:
ib_srp]                                             
Supported: No                                                                   
Pid: 17736, comm: sg_reset Tainted: G          2.6.27.25-0.1-default #1         
RIP: 0010:[<ffffffffa03f2db2>]  [<ffffffffa03f2db2>]
srp_send_tsk_mgmt+0xb4/0x130 [ib_srp]                                           
RSP: 0018:ffff88005e4ddbc8  EFLAGS: 00010046                                    
RAX: 0000000000000000 RBX: ffff8800623d8620 RCX: 0000000000000000               
RDX: ffff8800778d2000 RSI: ffff88006f088d80 RDI: ffff8800623d8620               
RBP: ffff8800623d8b40 R08: ffffffff806e2c70 R09: 0000000100000000               
R10: 0000000000000046 R11: 0000000000000000 R12: ffff88006f088d80               
R13: 0000000000000008 R14: ffff8800623d8000 R15: ffff88007e7d3c00               
FS:  00007f3cab09f6f0(0000) GS:ffffffff80a43080(0000) knlGS:0000000000000000    
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b                               
CR2: 0000000000000074 CR3: 00000000069b6000 CR4: 00000000000006e0               
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000               
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400               
Process sg_reset (pid: 17736, threadinfo ffff88005e4dc000, task
ffff8800095ca0c0)                                                               
Stack:  ffff8800623d82a8 0000000000000000 ffff8800623d8620 ffff8800623d8000     
 ffff8800381fd380 ffffffffa03f2ea5 ffff88005e4ddc38 ffff8800381fd380            
 ffff8800623d8000 0000000000000000 00007fff39b51144 ffffffffa0008351            
Call Trace:
 [<ffffffffa03f2ea5>] srp_reset_device+0x77/0x101 [ib_srp]
 [<ffffffffa0008351>] scsi_reset_provider+0xc8/0x18d [scsi_mod]
 [<ffffffffa00069d8>] scsi_nonblockable_ioctl+0x90/0xb5 [scsi_mod]
 [<ffffffffa012a869>] sd_ioctl+0x61/0xc6 [sd_mod]
 [<ffffffff8033ec81>] blkdev_driver_ioctl+0x5d/0x72
 [<ffffffff8033f4ee>] blkdev_ioctl+0x1f5/0x217
 [<ffffffff802d71aa>] block_ioctl+0x1b/0x20
 [<ffffffff802bd275>] vfs_ioctl+0x21/0x6c
 [<ffffffff802bd4e2>] do_vfs_ioctl+0x222/0x231
 [<ffffffff802bd542>] sys_ioctl+0x51/0x73
 [<ffffffff8020bfbb>] system_call_fastpath+0x16/0x1b
 [<00007f3caac19b77>] 0x7f3caac19b77


Code: 00 4d 85 e4 0f 84 85 00 00 00 49 8b 54 24 08 31 c0 b9 0c 00 00 00 4c 89
e6 48 89 d7 f3 ab c6 02 01 48 89 df 48 8b 45 10 48 8b 00 <8b> 40 74 48 c1 e0 30
48 0f c8 48 89 42 14 8b 45 50 44 88 6a 1e
RIP  [<ffffffffa03f2db2>] srp_send_tsk_mgmt+0xb4/0x130 [ib_srp]
 RSP <ffff88005e4ddbc8>
CR2: 0000000000000074
---[ end trace 4cec2e39421a0374 ]---



From hnrose at comcast.net  Sun Aug  2 11:48:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 14:48:05 -0400
Subject: [ofa-general] [PATCH] opensm/osm_path.h: Fix osm_dr_path_extend
	return values comment
Message-ID: <20090802184805.GB15622@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h
index d02576b..da55aa8 100644
--- a/opensm/include/opensm/osm_path.h
+++ b/opensm/include/opensm/osm_path.h
@@ -211,8 +211,9 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num)
 *	port_num
 *		[in] Additional port to add to the DR path.
 *
-* RETURN VALUE
-*	Boolean indicating whether or not path was extended.
+* RETURN VALUES
+*	0 indicates path was extended.
+*	Other than 0 indicates path was not extended.
 *
 * NOTES
 *


From hnrose at comcast.net  Sun Aug  2 11:47:16 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 14:47:16 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Remove
	osm_mesh_node_delete call from switch_delete
Message-ID: <20090802184716.GA15622@comcast.net>


osm_mesh_node_delete now called from free_lash_structures
Mistakenly omitted from commit 46e56687e629cbd21cbca453bb088c90c20a38aa

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index a62cb3d..2715fe7 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -651,8 +651,6 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 
 static void switch_delete(lash_t *p_lash, switch_t * sw)
 {
-	osm_mesh_node_delete(p_lash, sw);
-
 	if (sw->dij_channels)
 		free(sw->dij_channels);
 	if (sw->p_sw)
@@ -671,7 +669,6 @@ static void delete_mesh_switches(lash_t *p_lash)
 	}
 }
 
-
 static void free_lash_structures(lash_t * p_lash)
 {
 	unsigned int i, j, k;


From hnrose at comcast.net  Sun Aug  2 11:48:49 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 14:48:49 -0400
Subject: [ofa-general] [PATCH] opensm/osm_helper.h: Fix some commentary typos
Message-ID: <20090802184849.GC15622@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_helper.h b/opensm/include/opensm/osm_helper.h
index 91c9f84..d76af8d 100644
--- a/opensm/include/opensm/osm_helper.h
+++ b/opensm/include/opensm/osm_helper.h
@@ -470,7 +470,7 @@ const char *osm_get_disp_msg_str(IN cl_disp_msgid_t msg);
 *		[in] Dispatcher message ID value.
 *
 * RETURN VALUES
-*	Pointer to the message discription string.
+*	Pointer to the message description string.
 *
 * NOTES
 *
@@ -509,7 +509,7 @@ const char *osm_get_sm_signal_str(IN osm_signal_t signal);
 *		[in] Signal value
 *
 * RETURN VALUES
-*	Pointer to the signal discription string.
+*	Pointer to the signal description string.
 *
 * NOTES
 *
@@ -548,7 +548,7 @@ const char *osm_get_sm_mgr_signal_str(IN osm_sm_signal_t signal);
 *		[in] SM manager signal
 *
 * RETURN VALUES
-*	Pointer to the signal discription string.
+*	Pointer to the signal description string.
 *
 * NOTES
 *
@@ -571,7 +571,7 @@ const char *osm_get_sm_mgr_state_str(IN uint16_t state);
 *		[in] SM manager state
 *
 * RETURN VALUES
-*	Pointer to the state discription string.
+*	Pointer to the state description string.
 *
 * NOTES
 *


From sashak at voltaire.com  Sun Aug  2 12:18:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 22:18:53 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Remove
 osm_mesh_node_delete call from switch_delete
In-Reply-To: <20090802184716.GA15622@comcast.net>
References: <20090802184716.GA15622@comcast.net>
Message-ID: <20090802191853.GV5287@me>

On 14:47 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> osm_mesh_node_delete now called from free_lash_structures
> Mistakenly omitted from commit 46e56687e629cbd21cbca453bb088c90c20a38aa
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 12:19:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 22:19:27 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_path.h: Fix osm_dr_path_extend
 return values comment
In-Reply-To: <20090802184805.GB15622@comcast.net>
References: <20090802184805.GB15622@comcast.net>
Message-ID: <20090802191927.GW5287@me>

On 14:48 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug  2 12:20:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 22:20:05 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.h: Fix some commentary
	typos
In-Reply-To: <20090802184849.GC15622@comcast.net>
References: <20090802184849.GC15622@comcast.net>
Message-ID: <20090802192005.GX5287@me>

On 14:48 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From todd.rimmer at qlogic.com  Sun Aug  2 14:45:21 2009
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Sun, 2 Aug 2009 16:45:21 -0500
Subject: [ofa-general] umad SLID and LMC
In-Reply-To: <F4251187-C5FA-42E8-A40A-F3C7B32E09EB@redhat.com>
References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com>
	<200907231121.00140.jackm@dev.mellanox.co.il>	<adaocrb43su.fsf@cisco.com>
	<F4251187-C5FA-42E8-A40A-F3C7B32E09EB@redhat.com>
Message-ID: <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org>

What is the proper way to control the SLID used for outgoing umad sends?

For example, when using LMC>0, the PathRecord returned from the SM for talking to a given remote node may have a SLID which is not the BaseLid for the sender.  How does the sender ensure the correct SLID is used for the outgoing MAD?

In reviewing the API it seems like the only way to do this is:
void *umad = umad_alloc(...);

// call various umad calls to initialize address and contents
umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID;

umad_send(..., umad, ...);

Was path_bits an intentional omission in the API?  A function which could update the ib_mad_addr in a umad given a path record would seem appropriate.
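
To make "lower LMC bits of SLID" concrete, here is a minimal sketch of the masking. It assumes the SLID is in host byte order and that the LMC (0..7) has been read from the port's PortInfo. The helper name sketch_slid_path_bits is hypothetical and not part of libibumad.

```c
#include <stdint.h>

/*
 * Return the low LMC bits of a source LID.
 * slid is in host byte order; lmc is the LID mask control value (0..7).
 * With lmc == 0 the mask is empty and the result is always 0.
 */
static uint8_t sketch_slid_path_bits(uint16_t slid, uint8_t lmc)
{
	return (uint8_t)(slid & ((1u << lmc) - 1u));
}
```

The result would then be stored through umad_get_mad_addr(umad)->path_bits before calling umad_send, as in the outline above.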

Todd Rimmer
Chief Architect 
QLogic Network Systems Group
Voice: 610-233-4852     Fax: 610-233-4777
Todd.Rimmer at QLogic.com  www.QLogic.com


From arlin.r.davis at intel.com  Sun Aug  2 16:26:52 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Sun, 2 Aug 2009 16:26:52 -0700
Subject: [ofa-general] [PATCH] uDAPL v2: CNO support broken in both CMA and
	SCM providers.
Message-ID: <7939FE16A13F4A7EA126666873A88141@amr.corp.intel.com>


CQ thread/callback mechanism was removed by mistake. Still
need indirect DTO callbacks when CNO is attached to EVD's.

Add CQ event channel to cma provider's thread and add
to select for rdma_cm and async channels.

For scm provider there is no easy way to add this channel
to the select across sockets on windows. So, for portability
reasons a 2nd thread is started to process the ASYNC and
CQ channels for events.

Must also disable EVD (evd_enabled=DAT_FALSE) during destroy
to prevent EVD events firing for CNOs and re-arming CQ while
CQ is being destroyed.

Slight modification to dtest to check EVD after CNO timeout.

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 dapl/common/dapl_evd_util.c         |    1 +
 dapl/openib_cma/dapl_ib_util.h      |    5 +-
 dapl/openib_cma/device.c            |  154 ++++---------
 dapl/openib_common/cq.c             |  192 +++++----------
 dapl/openib_common/dapl_ib_common.h |    2 +
 dapl/openib_common/util.c           |   98 ++++++++
 dapl/openib_scm/dapl_ib_util.h      |    5 +
 dapl/openib_scm/device.c            |  458 ++++++++++++++++++++++++++++++-----
 test/dtest/dtest.c                  |   54 +++--
 9 files changed, 649 insertions(+), 320 deletions(-)
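
The cma provider change amounts to one thread waiting on several
descriptors at once (rdma_cm, async, and now the CQ completion channel).
A minimal sketch of that multiplexing pattern follows. It uses poll() on
generic descriptors rather than the provider's actual code; the function
sketch_wait_any and its limit of 8 descriptors are hypothetical.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms for any of the nfds descriptors to become readable.
 * Returns the index of the first readable descriptor,
 * or -1 on timeout, error, or an unsupported nfds.
 */
static int sketch_wait_any(const int *fds, int nfds, int timeout_ms)
{
	struct pollfd pfd[8];
	int i;

	if (nfds <= 0 || nfds > 8)
		return -1;
	for (i = 0; i < nfds; i++) {
		pfd[i].fd = fds[i];
		pfd[i].events = POLLIN;
		pfd[i].revents = 0;
	}
	if (poll(pfd, (nfds_t)nfds, timeout_ms) <= 0)
		return -1;
	for (i = 0; i < nfds; i++)
		if (pfd[i].revents & POLLIN)
			return i;
	return -1;
}
```

On Linux the descriptors would be values such as the completion
channel's fd and the verbs context's async_fd, which this patch already
hands to dapls_config_fd.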

diff --git a/dapl/common/dapl_evd_util.c b/dapl/common/dapl_evd_util.c
index 88c3f8f..02909e9 100644
--- a/dapl/common/dapl_evd_util.c
+++ b/dapl/common/dapl_evd_util.c
@@ -469,6 +469,7 @@ DAT_RETURN dapls_evd_dealloc(IN DAPL_EVD * evd_ptr)
 	 * Destroy the CQ first, to keep any more callbacks from coming
 	 * up from it.
 	 */
+	evd_ptr->evd_enabled = DAT_FALSE;
 	if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) {
 		ia_ptr = evd_ptr->header.owner_ia;
 
diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index f466c06..c9ab4d6 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -84,7 +84,6 @@ typedef struct _ib_hca_transport
 { 
 	struct dapl_llist_entry	entry;
 	int			destroy;
-	struct dapl_hca		*d_hca;
 	struct rdma_cm_id 	*cm_id;
 	struct ibv_comp_channel *ib_cq;
 	ib_cq_handle_t		ib_cq_empty;
@@ -99,6 +98,7 @@ typedef struct _ib_hca_transport
 	/* device attributes */
 	int			rd_atom_in;
 	int			rd_atom_out;
+	struct	ibv_context	*ib_ctx;
 	struct	ibv_device	*ib_dev;
 	/* dapls_modify_qp_state */
 	uint16_t		lid;
@@ -119,7 +119,8 @@ void dapli_thread(void *arg);
 DAT_RETURN  dapli_ib_thread_init(void);
 void dapli_ib_thread_destroy(void);
 void dapli_cma_event_cb(void);
-void dapli_async_event_cb(struct _ib_hca_transport *hca);
+void dapli_async_event_cb(struct _ib_hca_transport *tp);
+void dapli_cq_event_cb(struct _ib_hca_transport *tp);
 dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep);
 void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep);
 DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
diff --git a/dapl/openib_cma/device.c b/dapl/openib_cma/device.c
index 81203bf..743e8fa 100644
--- a/dapl/openib_cma/device.c
+++ b/dapl/openib_cma/device.c
@@ -123,6 +123,12 @@ static int dapls_config_verbs(struct ibv_context *verbs)
 	return 0;
 }
 
+static int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	channel->comp_channel.Milliseconds = 0;
+	return 0;
+}
+
 static int dapls_thread_signal(void)
 {
 	CompManagerCancel(windata.comp_mgr);
@@ -205,6 +211,11 @@ static int dapls_config_verbs(struct ibv_context *verbs)
 	return dapls_config_fd(verbs->async_fd);
 }
 
+static int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	return dapls_config_fd(channel->fd);
+}
+
 static int dapls_thread_signal(void)
 {
 	return write(g_ib_pipe[1], "w", sizeof "w");
@@ -334,10 +345,6 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 		     " open_hca: RDMA channel created (%p)\n", g_cm_events);
 
-	dat_status = dapli_ib_thread_init();
-	if (dat_status != DAT_SUCCESS)
-		return dat_status;
-
 	/* HCA name will be hostname or IP address */
 	if (getipaddr((char *)hca_name,
 		      (char *)&hca_ptr->hca_address, 
@@ -357,6 +364,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " open_hca: rdma_bind ERR %s."
 			 " Is %s configured?\n", strerror(errno), hca_name);
+		rdma_destroy_id(cm_id);
 		return DAT_INVALID_ADDRESS;
 	}
 
@@ -366,6 +374,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 	dapls_config_verbs(cm_id->verbs);
 	hca_ptr->port_num = cm_id->port_num;
 	hca_ptr->ib_trans.ib_dev = cm_id->verbs->device;
+	hca_ptr->ib_trans.ib_ctx = cm_id->verbs;
 	gid = &cm_id->route.addr.addr.ibaddr.sgid;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
@@ -374,6 +383,21 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 		     (unsigned long long)ntohll(gid->global.subnet_prefix),
 		     (unsigned long long)ntohll(gid->global.interface_id));
 
+	/* support for EVD's with CNO's: one channel via thread */
+	hca_ptr->ib_trans.ib_cq =
+	    ibv_create_comp_channel(hca_ptr->ib_hca_handle);
+	if (hca_ptr->ib_trans.ib_cq == NULL) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: ibv_create_comp_channel ERR %s\n",
+			 strerror(errno));
+		rdma_destroy_id(cm_id);
+		return DAT_INTERNAL_ERROR;
+	}
+	if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) {
+		rdma_destroy_id(cm_id);
+		return DAT_INTERNAL_ERROR;
+	}
+
 	/* set inline max with env or default, get local lid and gid 0 */
 	if (hca_ptr->ib_hca_handle->device->transport_type
 	    == IBV_TRANSPORT_IWARP)
@@ -395,14 +419,17 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 	/* set default IB MTU */
 	hca_ptr->ib_trans.mtu = dapl_ib_mtu(2048);
 
+	dat_status = dapli_ib_thread_init();
+	if (dat_status != DAT_SUCCESS)
+		return dat_status;
 	/* 
 	 * Put new hca_transport on list for async and CQ event processing 
 	 * Wakeup work thread to add to polling list
 	 */
-	dapl_llist_init_entry((DAPL_LLIST_ENTRY *) & hca_ptr->ib_trans.entry);
+	dapl_llist_init_entry((DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry);
 	dapl_os_lock(&g_hca_lock);
 	dapl_llist_add_tail(&g_hca_list,
-			    (DAPL_LLIST_ENTRY *) & hca_ptr->ib_trans.entry,
+			    (DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry,
 			    &hca_ptr->ib_trans.entry);
 	if (dapls_thread_signal() == -1)
 		dapl_log(DAPL_DBG_TYPE_UTIL,
@@ -425,7 +452,6 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 		     &hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, 
 		     hca_ptr->ib_trans.max_inline_send);
 
-	hca_ptr->ib_trans.d_hca = hca_ptr;
 	return DAT_SUCCESS;
 }
 
@@ -574,105 +600,6 @@ bail:
 		     " ib_thread_destroy(%d) exit\n", dapl_os_getpid());
 }
 
-void dapli_async_event_cb(struct _ib_hca_transport *hca)
-{
-	struct ibv_async_event event;
-
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " async_event(%p)\n", hca);
-
-	if (hca->destroy)
-		return;
-
-	if (!ibv_get_async_event(hca->cm_id->verbs, &event)) {
-
-		switch (event.event_type) {
-		case IBV_EVENT_CQ_ERR:
-		{
-			struct dapl_ep *evd_ptr =
-				event.element.cq->cq_context;
-
-			dapl_log(DAPL_DBG_TYPE_ERR,
-				 "dapl async_event CQ (%p) ERR %d\n",
-				 evd_ptr, event.event_type);
-
-			/* report up if async callback still setup */
-			if (hca->async_cq_error)
-				hca->async_cq_error(hca->cm_id->verbs,
-							event.element.cq,
-							&event,
-							(void *)evd_ptr);
-			break;
-		}
-		case IBV_EVENT_COMM_EST:
-		{
-			/* Received msgs on connected QP before RTU */
-			dapl_log(DAPL_DBG_TYPE_UTIL,
-				 " async_event COMM_EST(%p) rdata beat RTU\n",
-				 event.element.qp);
-
-			break;
-		}
-		case IBV_EVENT_QP_FATAL:
-		case IBV_EVENT_QP_REQ_ERR:
-		case IBV_EVENT_QP_ACCESS_ERR:
-		case IBV_EVENT_QP_LAST_WQE_REACHED:
-		case IBV_EVENT_SRQ_ERR:
-		case IBV_EVENT_SRQ_LIMIT_REACHED:
-		case IBV_EVENT_SQ_DRAINED:
-		{
-			struct dapl_ep *ep_ptr =
-				event.element.qp->qp_context;
-
-			dapl_log(DAPL_DBG_TYPE_ERR,
-				 "dapl async_event QP (%p) ERR %d\n",
-				 ep_ptr, event.event_type);
-
-			/* report up if async callback still setup */
-			if (hca->async_qp_error)
-				hca->async_qp_error(hca->cm_id->verbs,
-						    ep_ptr->qp_handle,
-						    &event,
-						    (void *)ep_ptr);
-			break;
-		}
-		case IBV_EVENT_PATH_MIG:
-		case IBV_EVENT_PATH_MIG_ERR:
-		case IBV_EVENT_DEVICE_FATAL:
-		case IBV_EVENT_PORT_ACTIVE:
-		case IBV_EVENT_PORT_ERR:
-		case IBV_EVENT_LID_CHANGE:
-		case IBV_EVENT_PKEY_CHANGE:
-		case IBV_EVENT_SM_CHANGE:
-		{
-			dapl_log(DAPL_DBG_TYPE_WARN,
-				 "dapl async_event: DEV ERR %d\n",
-				 event.event_type);
-
-			/* report up if async callback still setup */
-			if (hca->async_unafiliated)
-				hca->async_unafiliated(hca->cm_id->
-							verbs, &event,
-							hca->
-							async_un_ctx);
-			break;
-		}
-		case IBV_EVENT_CLIENT_REREGISTER:
-			/* no need to report this event this time */
-			dapl_log(DAPL_DBG_TYPE_UTIL,
-				 " async_event: IBV_CLIENT_REREGISTER\n");
-			break;
-
-		default:
-			dapl_log(DAPL_DBG_TYPE_WARN,
-				 "dapl async_event: %d UNKNOWN\n",
-				 event.event_type);
-			break;
-
-		}
-		ibv_ack_async_event(&event);
-	}
-}
-
 #if defined(_WIN64) || defined(_WIN32)
 /* work thread for uAT, uCM, CQ, and async events */
 void dapli_thread(void *arg)
@@ -721,6 +648,7 @@ void dapli_thread(void *arg)
 				dapl_os_unlock(&g_hca_lock);
 				uhca[idx]->destroy = 2;
 			} else {
+				dapli_cq_event_cb(uhca[idx]);
 				dapli_async_event_cb(uhca[idx]);
 			}
 		}
@@ -732,6 +660,7 @@ void dapli_thread(void *arg)
 	dapl_os_unlock(&g_hca_lock);
 }
 #else				// _WIN64 || WIN32
+
 /* work thread for uAT, uCM, CQ, and async events */
 void dapli_thread(void *arg)
 {
@@ -771,7 +700,13 @@ void dapli_thread(void *arg)
 		while (hca) {
 
 			/* uASYNC events */
-			ufds[++idx].fd = hca->cm_id->verbs->async_fd;
+			ufds[++idx].fd = hca->ib_ctx->async_fd;
+			ufds[idx].events = POLLIN;
+			ufds[idx].revents = 0;
+			uhca[idx] = hca;
+
+			/* CQ events are non-direct with CNO's */
+			ufds[++idx].fd = hca->ib_cq->fd;
 			ufds[idx].events = POLLIN;
 			ufds[idx].revents = 0;
 			uhca[idx] = hca;
@@ -809,9 +744,10 @@ void dapli_thread(void *arg)
 		if (ufds[1].revents == POLLIN)
 			dapli_cma_event_cb();
 
-		/* check and process ASYNC events, per device */
+		/* check and process CQ and ASYNC events, per device */
 		for (idx = 2; idx < fds; idx++) {
 			if (ufds[idx].revents == POLLIN) {
+				dapli_cq_event_cb(uhca[idx]);
 				dapli_async_event_cb(uhca[idx]);
 			}
 		}
@@ -824,7 +760,7 @@ void dapli_thread(void *arg)
 					 strerror(errno));
 
 			/* cleanup any device on list marked for destroy */
-			for (idx = 2; idx < fds; idx++) {
+			for (idx = 3; idx < fds; idx++) {
 				if (uhca[idx] && uhca[idx]->destroy == 1) {
 					dapl_os_lock(&g_hca_lock);
 					dapl_llist_remove_entry(
diff --git a/dapl/openib_common/cq.c b/dapl/openib_common/cq.c
index 096167c..16d4f18 100644
--- a/dapl/openib_common/cq.c
+++ b/dapl/openib_common/cq.c
@@ -171,36 +171,32 @@ DAT_RETURN dapls_ib_get_async_event(IN ib_error_record_t * err_record,
  *	DAT_INSUFFICIENT_RESOURCES
  *
  */
-#if defined(_WIN32)
-
 DAT_RETURN
 dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr,
 		  IN DAPL_EVD * evd_ptr, IN DAT_COUNT * cqlen)
 {
-	OVERLAPPED *overlap;
+	struct ibv_comp_channel *channel;
 	DAT_RETURN ret;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 		     "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen);
 
-	evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle,
-					      *cqlen, evd_ptr, NULL, 0);
+	if (!evd_ptr->cno_ptr)
+		channel = ibv_create_comp_channel(ia_ptr->hca_ptr->ib_hca_handle);
+	else
+		channel = ia_ptr->hca_ptr->ib_trans.ib_cq;
 
-	if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE)
+	if (!channel)
 		return DAT_INSUFFICIENT_RESOURCES;
 
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     " cq_object_create: (%p)\n", evd_ptr);
+	evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle,
+					      *cqlen, evd_ptr, channel, 0);
 
-	overlap = &evd_ptr->ib_cq_handle->comp_entry.Overlap;
-	overlap->hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
-	if (!overlap->hEvent) {
+	if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) {
 		ret = DAT_INSUFFICIENT_RESOURCES;
 		goto err;
 	}
 
-	overlap->hEvent = (HANDLE) ((ULONG_PTR) overlap->hEvent | 1);
-
 	/* arm cq for events */
 	dapls_set_cq_notify(ia_ptr, evd_ptr);
 
@@ -214,7 +210,8 @@ dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr,
 	return DAT_SUCCESS;
 
 err:
-	ibv_destroy_cq(evd_ptr->ib_cq_handle);
+	if (!evd_ptr->cno_ptr)
+		ibv_destroy_comp_channel(channel);
 	return ret;
 }
 
@@ -239,18 +236,18 @@ DAT_RETURN dapls_ib_cq_free(IN DAPL_IA * ia_ptr, IN DAPL_EVD * evd_ptr)
 {
 	DAT_EVENT event;
 	ib_work_completion_t wc;
-	HANDLE hevent;
+	struct ibv_comp_channel *channel;
 
 	if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) {
 		/* pull off CQ and EVD entries and toss */
 		while (ibv_poll_cq(evd_ptr->ib_cq_handle, 1, &wc) == 1) ;
 		while (dapl_evd_dequeue(evd_ptr, &event) == DAT_SUCCESS) ;
 
-		hevent = evd_ptr->ib_cq_handle->comp_entry.Overlap.hEvent;
+		channel = evd_ptr->ib_cq_handle->channel;
 		if (ibv_destroy_cq(evd_ptr->ib_cq_handle))
 			return (dapl_convert_errno(errno, "ibv_destroy_cq"));
-
-		CloseHandle(hevent);
+		if (!evd_ptr->cno_ptr)
+			ibv_destroy_comp_channel(channel);
 		evd_ptr->ib_cq_handle = IB_INVALID_HANDLE;
 	}
 	return DAT_SUCCESS;
@@ -262,105 +259,42 @@ dapls_evd_dto_wakeup(IN DAPL_EVD * evd_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 		     " cq_object_wakeup: evd=%p\n", evd_ptr);
 
-	if (!SetEvent(evd_ptr->ib_cq_handle->comp_entry.Overlap.hEvent))
-		return DAT_INTERNAL_ERROR;
-
+	/* no wake up mechanism */
 	return DAT_SUCCESS;
 }
 
-DAT_RETURN
-dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout)
+#if defined(_WIN32)
+static int
+dapls_wait_comp_channel(IN struct ibv_comp_channel *channel, IN uint32_t timeout)
 {
-	int status;
-
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     " cq_object_wait: EVD %p time %d\n",
-		     evd_ptr, timeout);
-
-	status = WaitForSingleObject(evd_ptr->ib_cq_handle->
-				     comp_entry.Overlap.hEvent,
-				     timeout / 1000);
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     " cq_object_wait: EVD %p status 0x%x\n",
-		     evd_ptr, status);
-	if (status)
-		return DAT_TIMEOUT_EXPIRED;
-
-	InterlockedExchange(&evd_ptr->ib_cq_handle->comp_entry.Busy, 0);
-	return DAT_SUCCESS;
+	channel->comp_channel.Milliseconds =
+		(timeout == DAT_TIMEOUT_INFINITE) ? INFINITE : timeout / 1000;
+	return 0;
 }
 
 #else // WIN32
 
-DAT_RETURN
-dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr,
-		  IN DAPL_EVD * evd_ptr, IN DAT_COUNT * cqlen)
-{
-	struct ibv_comp_channel *channel;
-	DAT_RETURN ret;
-
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen);
-
-	channel = ibv_create_comp_channel(ia_ptr->hca_ptr->ib_hca_handle);
-	if (!channel)
-		return DAT_INSUFFICIENT_RESOURCES;
-
-	evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle,
-					      *cqlen, evd_ptr, channel, 0);
-
-	if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) {
-		ret = DAT_INSUFFICIENT_RESOURCES;
-		goto err;
-	}
-
-	/* arm cq for events */
-	dapls_set_cq_notify(ia_ptr, evd_ptr);
-
-	/* update with returned cq entry size */
-	*cqlen = evd_ptr->ib_cq_handle->cqe;
-
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     "dapls_ib_cq_alloc: new_cq %p cqlen=%d \n",
-		     evd_ptr->ib_cq_handle, *cqlen);
-
-	return DAT_SUCCESS;
-
-err:
-	ibv_destroy_comp_channel(channel);
-	return ret;
-}
-
-DAT_RETURN dapls_ib_cq_free(IN DAPL_IA * ia_ptr, IN DAPL_EVD * evd_ptr)
+static int
+dapls_wait_comp_channel(IN struct ibv_comp_channel *channel, IN uint32_t timeout)
 {
-	DAT_EVENT event;
-	ib_work_completion_t wc;
-	struct ibv_comp_channel *channel;
-
-	if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) {
-		/* pull off CQ and EVD entries and toss */
-		while (ibv_poll_cq(evd_ptr->ib_cq_handle, 1, &wc) == 1) ;
-		while (dapl_evd_dequeue(evd_ptr, &event) == DAT_SUCCESS) ;
-
-		channel = evd_ptr->ib_cq_handle->channel;
-		if (ibv_destroy_cq(evd_ptr->ib_cq_handle))
-			return (dapl_convert_errno(errno, "ibv_destroy_cq"));
-
-		ibv_destroy_comp_channel(channel);
-		evd_ptr->ib_cq_handle = IB_INVALID_HANDLE;
-	}
-	return DAT_SUCCESS;
-}
-
-DAT_RETURN
-dapls_evd_dto_wakeup(IN DAPL_EVD * evd_ptr)
-{
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     " cq_object_wakeup: evd=%p\n", evd_ptr);
+	int status, timeout_ms;
+	struct pollfd cq_fd = {
+		.fd = channel->fd,
+		.events = POLLIN,
+		.revents = 0
+	};
 
-	/* no wake up mechanism */
-	return DAT_SUCCESS;
+	/* uDAPL timeout values in usecs */
+	timeout_ms = (timeout == DAT_TIMEOUT_INFINITE) ? -1 : timeout / 1000;
+	status = poll(&cq_fd, 1, timeout_ms);
+	if (status > 0)
+		return 0;
+	else if (status == 0)
+		return ETIMEDOUT;
+	else
+		return errno;
 }
+#endif
 
 DAT_RETURN
 dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout)
@@ -368,43 +302,45 @@ dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout)
 	struct ibv_comp_channel *channel = evd_ptr->ib_cq_handle->channel;
 	struct ibv_cq *ibv_cq = NULL;
 	void *context;
-	int status = 0;
-	int timeout_ms = -1;
-	struct pollfd cq_fd = {
-		.fd = channel->fd,
-		.events = POLLIN,
-		.revents = 0
-	};
+	int status;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 		     " cq_object_wait: EVD %p time %d\n",
 		     evd_ptr, timeout);
 
-	/* uDAPL timeout values in usecs */
-	if (timeout != DAT_TIMEOUT_INFINITE)
-		timeout_ms = timeout / 1000;
-
-	status = poll(&cq_fd, 1, timeout_ms);
-
-	/* returned event */
-	if (status > 0) {
+	status = dapls_wait_comp_channel(channel, timeout);
+	if (!status) {
 		if (!ibv_get_cq_event(channel, &ibv_cq, &context)) {
 			ibv_ack_cq_events(ibv_cq, 1);
 		}
-		status = 0;
-
-		/* timeout */
-	} else if (status == 0)
-		status = ETIMEDOUT;
+	}
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 		     " cq_object_wait: RET evd %p ibv_cq %p %s\n",
 		     evd_ptr, ibv_cq, strerror(errno));
 
-	return (dapl_convert_errno(status, "cq_wait_object_wait"));
+	return dapl_convert_errno(status, "cq_wait_object_wait");
+}
 
+void dapli_cq_event_cb(struct _ib_hca_transport *tp)
+{
+	/* check all comp events on this device */
+	struct dapl_evd *evd = NULL;
+	struct ibv_cq   *ibv_cq = NULL;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " dapli_cq_event_cb(%p)\n", tp);
+
+	while (!ibv_get_cq_event(tp->ib_cq, &ibv_cq, (void **)&evd)) {
+
+		if (!DAPL_BAD_HANDLE(evd, DAPL_MAGIC_EVD)) {
+			/* EVD and EVD->CNO events both go via callback */
+			dapl_evd_dto_callback(tp->ib_ctx,
+					      evd->ib_cq_handle, (void*)evd);
+		}
+
+		ibv_ack_cq_events(ibv_cq, 1);
+	}
 }
-#endif
 
 /*
  * dapl_ib_cq_resize
diff --git a/dapl/openib_common/dapl_ib_common.h b/dapl/openib_common/dapl_ib_common.h
index 0b417b8..2195767 100644
--- a/dapl/openib_common/dapl_ib_common.h
+++ b/dapl/openib_common/dapl_ib_common.h
@@ -208,6 +208,8 @@ typedef uint32_t ib_shm_transport_t;
 /* prototypes */
 int32_t	dapls_ib_init(void);
 int32_t	dapls_ib_release(void);
+
+/* util.c */
 enum ibv_mtu dapl_ib_mtu(int mtu);
 char *dapl_ib_mtu_str(enum ibv_mtu mtu);
 DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len);
diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c
index da913c5..3963e1f 100644
--- a/dapl/openib_common/util.c
+++ b/dapl/openib_common/util.c
@@ -320,6 +320,104 @@ DAT_RETURN dapls_ib_setup_async_callback(IN DAPL_IA * ia_ptr,
 	return DAT_SUCCESS;
 }
 
+void dapli_async_event_cb(struct _ib_hca_transport *hca)
+{
+	struct ibv_async_event event;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " async_event(%p)\n", hca);
+
+	if (hca->destroy)
+		return;
+
+	if (!ibv_get_async_event(hca->ib_ctx, &event)) {
+
+		switch (event.event_type) {
+		case IBV_EVENT_CQ_ERR:
+		{
+			struct dapl_evd *evd_ptr =
+				event.element.cq->cq_context;
+
+			dapl_log(DAPL_DBG_TYPE_ERR,
+				 "dapl async_event CQ (%p) ERR %d\n",
+				 evd_ptr, event.event_type);
+
+			/* report up if async callback still setup */
+			if (hca->async_cq_error)
+				hca->async_cq_error(hca->ib_ctx,
+						    event.element.cq,
+						    &event,
+						    (void *)evd_ptr);
+			break;
+		}
+		case IBV_EVENT_COMM_EST:
+		{
+			/* Received msgs on connected QP before RTU */
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " async_event COMM_EST(%p) rdata beat RTU\n",
+				 event.element.qp);
+
+			break;
+		}
+		case IBV_EVENT_QP_FATAL:
+		case IBV_EVENT_QP_REQ_ERR:
+		case IBV_EVENT_QP_ACCESS_ERR:
+		case IBV_EVENT_QP_LAST_WQE_REACHED:
+		case IBV_EVENT_SRQ_ERR:
+		case IBV_EVENT_SRQ_LIMIT_REACHED:
+		case IBV_EVENT_SQ_DRAINED:
+		{
+			struct dapl_ep *ep_ptr =
+				event.element.qp->qp_context;
+
+			dapl_log(DAPL_DBG_TYPE_ERR,
+				 "dapl async_event QP (%p) ERR %d\n",
+				 ep_ptr, event.event_type);
+
+			/* report up if async callback still setup */
+			if (hca->async_qp_error)
+				hca->async_qp_error(hca->ib_ctx,
+						    ep_ptr->qp_handle,
+						    &event,
+						    (void *)ep_ptr);
+			break;
+		}
+		case IBV_EVENT_PATH_MIG:
+		case IBV_EVENT_PATH_MIG_ERR:
+		case IBV_EVENT_DEVICE_FATAL:
+		case IBV_EVENT_PORT_ACTIVE:
+		case IBV_EVENT_PORT_ERR:
+		case IBV_EVENT_LID_CHANGE:
+		case IBV_EVENT_PKEY_CHANGE:
+		case IBV_EVENT_SM_CHANGE:
+		{
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				 "dapl async_event: DEV ERR %d\n",
+				 event.event_type);
+
+			/* report up if async callback still setup */
+			if (hca->async_unafiliated)
+				hca->async_unafiliated(hca->ib_ctx,
+						       &event,
+						       hca->async_un_ctx);
+			break;
+		}
+		case IBV_EVENT_CLIENT_REREGISTER:
+			/* no need to report this event this time */
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " async_event: IBV_CLIENT_REREGISTER\n");
+			break;
+
+		default:
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				 "dapl async_event: %d UNKNOWN\n",
+				 event.event_type);
+			break;
+
+		}
+		ibv_ack_async_event(&event);
+	}
+}
+
 /*
  * dapls_set_provider_specific_attr
  *
diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h
index a5e734e..933364c 100644
--- a/dapl/openib_scm/dapl_ib_util.h
+++ b/dapl/openib_scm/dapl_ib_util.h
@@ -78,8 +78,11 @@ typedef dp_ib_cm_handle_t	ib_cm_srvc_handle_t;
 /* ib_hca_transport_t, specific to this implementation */
 typedef struct _ib_hca_transport
 { 
+	struct dapl_llist_entry	entry;
+	int			destroy;
 	union ibv_gid		gid;
 	struct	ibv_device	*ib_dev;
+	struct	ibv_context	*ib_ctx;
 	ib_cq_handle_t		ib_cq_empty;
 	DAPL_OS_LOCK		cq_lock;	
 	int			max_inline_send;
@@ -114,6 +117,8 @@ typedef struct _ib_hca_transport
 void cr_thread(void *arg);
 int dapli_cq_thread_init(struct dapl_hca *hca_ptr);
 void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr);
+void dapli_async_event_cb(struct _ib_hca_transport *tp);
+void dapli_cq_event_cb(struct _ib_hca_transport *tp);
 DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr);
 void dapls_print_cm_list(IN DAPL_IA *ia_ptr);
 dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep);
diff --git a/dapl/openib_scm/device.c b/dapl/openib_scm/device.c
index d5089aa..9c91b78 100644
--- a/dapl/openib_scm/device.c
+++ b/dapl/openib_scm/device.c
@@ -57,6 +57,96 @@ static const char rcsid[] = "$Id:  $";
 
 #include <stdlib.h>
 
+ib_thread_state_t g_ib_thread_state = 0;
+DAPL_OS_THREAD g_ib_thread;
+DAPL_OS_LOCK g_hca_lock;
+struct dapl_llist_entry *g_hca_list;
+
+void dapli_thread(void *arg);
+DAT_RETURN  dapli_ib_thread_init(void);
+void dapli_ib_thread_destroy(void);
+
+#if defined(_WIN64) || defined(_WIN32)
+#include "..\..\..\..\..\etc\user\comp_channel.cpp"
+#include <rdma\winverbs.h>
+
+struct ibvw_windata windata;
+
+static int dapls_os_init(void)
+{
+	return ibvw_get_windata(&windata, IBVW_WINDATA_VERSION);
+}
+
+static void dapls_os_release(void)
+{
+	if (windata.comp_mgr)
+		ibvw_release_windata(&windata, IBVW_WINDATA_VERSION);
+	windata.comp_mgr = NULL;
+}
+
+static int dapls_config_verbs(struct ibv_context *verbs)
+{
+	verbs->channel.Milliseconds = 0;
+	return 0;
+}
+
+static int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	channel->comp_channel.Milliseconds = 0;
+	return 0;
+}
+
+static int dapls_thread_signal(void)
+{
+	CompManagerCancel(windata.comp_mgr);
+	return 0;
+}
+#else				// _WIN64 || WIN32
+int g_ib_pipe[2];
+
+static int dapls_os_init(void)
+{
+	/* create pipe for waking up work thread */
+	return pipe(g_ib_pipe);
+}
+
+static void dapls_os_release(void)
+{
+	/* close pipe? */
+}
+
+static int dapls_config_fd(int fd)
+{
+	int opts;
+
+	opts = fcntl(fd, F_GETFL);
+	if (opts < 0 || fcntl(fd, F_SETFL, opts | O_NONBLOCK) < 0) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " dapls_config_fd: fcntl on fd %d ERR %d %s\n",
+			 fd, opts, strerror(errno));
+		return errno;
+	}
+
+	return 0;
+}
+
+static int dapls_config_verbs(struct ibv_context *verbs)
+{
+	return dapls_config_fd(verbs->async_fd);
+}
+
+static int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	return dapls_config_fd(channel->fd);
+}
+
+static int dapls_thread_signal(void)
+{
+	return write(g_ib_pipe[1], "w", sizeof "w");
+}
+#endif
+
+
 static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr)
 {
 	DAPL_SOCKET listen_socket;
@@ -130,35 +220,22 @@ static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr)
  */
 int32_t dapls_ib_init(void)
 {
-	return 0;
-}
+	/* initialize hca_list */
+	dapl_os_lock_init(&g_hca_lock);
+	dapl_llist_init_head(&g_hca_list);
 
-int32_t dapls_ib_release(void)
-{
-	return 0;
-}
+	if (dapls_os_init())
+		return 1;
 
-#if defined(_WIN64) || defined(_WIN32)
-int dapls_config_comp_channel(struct ibv_comp_channel *channel)
-{
 	return 0;
 }
-#else				// _WIN64 || WIN32
-int dapls_config_comp_channel(struct ibv_comp_channel *channel)
-{
-	int opts;
-
-	opts = fcntl(channel->fd, F_GETFL);	/* uCQ */
-	if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) {
-		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " dapls_create_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n",
-			 channel->fd, opts, strerror(errno));
-		return errno;
-	}
 
+int32_t dapls_ib_release(void)
+{
+	dapli_ib_thread_destroy();
+	dapls_os_release();
 	return 0;
 }
-#endif
 
 /*
  * dapls_ib_open_hca
@@ -213,7 +290,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 		 " open_hca: device %s not found\n", hca_name);
 	goto err;
 
-      found:
+found:
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " open_hca: Found dev %s %016llx\n",
 		     ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
 		     (unsigned long long)
@@ -227,6 +304,8 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 			 strerror(errno));
 		goto err;
 	}
+	hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle;
+	dapls_config_verbs(hca_ptr->ib_hca_handle);
 
 	/* get lid for this hca-port, network order */
 	if (ibv_query_port(hca_ptr->ib_hca_handle,
@@ -271,15 +350,8 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 	hca_ptr->ib_trans.mtu =
 	    dapl_ib_mtu(dapl_os_get_env_val("DAPL_IB_MTU", SCM_IB_MTU));
 
-#ifndef CQ_WAIT_OBJECT
-	/* initialize cq_lock */
-	dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock);
-	if (dat_status != DAT_SUCCESS) {
-		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " open_hca: failed to init cq_lock\n");
-		goto bail;
-	}
-	/* EVD events without direct CQ channels, non-blocking */
+
+	/* EVD events without direct CQ channels, CNO support */
 	hca_ptr->ib_trans.ib_cq =
 	    ibv_create_comp_channel(hca_ptr->ib_hca_handle);
 	if (hca_ptr->ib_trans.ib_cq == NULL) {
@@ -288,18 +360,28 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 			 strerror(errno));
 		goto bail;
 	}
-
-	if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) {
-		goto bail;
-	}
-
-	if (dapli_cq_thread_init(hca_ptr)) {
+	dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq);
+
+	dat_status = dapli_ib_thread_init();
+	if (dat_status != DAT_SUCCESS) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " open_hca: cq_thread_init failed for %s\n",
-			 ibv_get_device_name(hca_ptr->ib_trans.ib_dev));
+			 " open_hca: failed to init cq thread lock\n");
 		goto bail;
 	}
-#endif				/* CQ_WAIT_OBJECT */
+	/* 
+	 * Put new hca_transport on list for async and CQ event processing 
+	 * Wakeup work thread to add to polling list
+	 */
+	dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&hca_ptr->ib_trans.entry);
+	dapl_os_lock(&g_hca_lock);
+	dapl_llist_add_tail(&g_hca_list,
+			    (DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry,
+			    &hca_ptr->ib_trans.entry);
+	if (dapls_thread_signal() == -1)
+		dapl_log(DAPL_DBG_TYPE_UTIL,
+			 " open_hca: thread wakeup error = %s\n",
+			 strerror(errno));
+	dapl_os_unlock(&g_hca_lock);
 
 	/* initialize cr_list lock */
 	dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock);
@@ -333,7 +415,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 
 	/* wait for thread */
 	while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
-		dapl_os_sleep_usec(2000);
+		dapl_os_sleep_usec(1000);
 	}
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
@@ -380,33 +462,297 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr)
 {
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p\n", hca_ptr);
 
-#ifndef CQ_WAIT_OBJECT
-	dapli_cq_thread_destroy(hca_ptr);
-	dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock);
-#endif				/* CQ_WAIT_OBJECT */
-
 	if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
 		if (ibv_close_device(hca_ptr->ib_hca_handle))
 			return (dapl_convert_errno(errno, "ib_close_device"));
 		hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
 	}
 
+	dapl_os_lock(&g_hca_lock);
+	if (g_ib_thread_state != IB_THREAD_RUN) {
+		dapl_os_unlock(&g_hca_lock);
+		return (DAT_SUCCESS);
+	}
+	dapl_os_unlock(&g_hca_lock);
+
 	/* destroy cr_thread and lock */
 	hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL;
-	if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
-		dapl_log(DAPL_DBG_TYPE_UTIL,
-			 " thread_destroy: thread wakeup err = %s\n",
-			 strerror(errno));
+	send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0);
 	while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) {
 		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 			     " close_hca: waiting for cr_thread\n");
-		if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
-			dapl_log(DAPL_DBG_TYPE_UTIL,
-				 " thread_destroy: thread wakeup err = %s\n",
-				 strerror(errno));
-		dapl_os_sleep_usec(2000);
+		send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0);
+		dapl_os_sleep_usec(1000);
 	}
 	dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
 	destroy_cr_pipe(hca_ptr); /* no longer need pipe */
+
+	/* 
+	 * Remove hca from async event processing list
+	 * Wakeup work thread to remove from polling list
+	 */
+	hca_ptr->ib_trans.destroy = 1;
+	if (dapls_thread_signal() == -1)
+		dapl_log(DAPL_DBG_TYPE_UTIL,
+			 " destroy: thread wakeup error = %s\n",
+			 strerror(errno));
+
+	/* wait for thread to remove HCA references */
+	while (hca_ptr->ib_trans.destroy != 2) {
+		if (dapls_thread_signal() == -1)
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " destroy: thread wakeup error = %s\n",
+				 strerror(errno));
+		dapl_os_sleep_usec(1000);
+	}
+
 	return (DAT_SUCCESS);
 }
+
+DAT_RETURN dapli_ib_thread_init(void)
+{
+	DAT_RETURN dat_status;
+
+	dapl_os_lock(&g_hca_lock);
+	if (g_ib_thread_state != IB_THREAD_INIT) {
+		dapl_os_unlock(&g_hca_lock);
+		return DAT_SUCCESS;
+	}
+
+	g_ib_thread_state = IB_THREAD_CREATE;
+	dapl_os_unlock(&g_hca_lock);
+
+	/* create work thread to process async and CQ events */
+	dat_status = dapl_os_thread_create(dapli_thread, NULL, &g_ib_thread);
+	if (dat_status != DAT_SUCCESS)
+		return (dapl_convert_errno(errno,
+					   "create_thread ERR:"
+					   " check resource limits"));
+
+	/* wait for thread to start */
+	dapl_os_lock(&g_hca_lock);
+	while (g_ib_thread_state != IB_THREAD_RUN) {
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+			     " ib_thread_init: waiting for ib_thread\n");
+		dapl_os_unlock(&g_hca_lock);
+		dapl_os_sleep_usec(1000);
+		dapl_os_lock(&g_hca_lock);
+	}
+	dapl_os_unlock(&g_hca_lock);
+
+	return DAT_SUCCESS;
+}
+
+void dapli_ib_thread_destroy(void)
+{
+	int retries = 10;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+		     " ib_thread_destroy(%d)\n", dapl_os_getpid());
+	/* 
+	 * wait for async thread to terminate. 
+	 * pthread_join would be the correct method
+	 * but some applications have some issues
+	 */
+
+	/* destroy ib_thread, wait for termination, if not already */
+	dapl_os_lock(&g_hca_lock);
+	if (g_ib_thread_state != IB_THREAD_RUN)
+		goto bail;
+
+	g_ib_thread_state = IB_THREAD_CANCEL;
+	if (dapls_thread_signal() == -1)
+		dapl_log(DAPL_DBG_TYPE_UTIL,
+			 " destroy: thread wakeup error = %s\n",
+			 strerror(errno));
+	while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) {
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+			     " ib_thread_destroy: waiting for ib_thread\n");
+		if (dapls_thread_signal() == -1)
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " destroy: thread wakeup error = %s\n",
+				 strerror(errno));
+		dapl_os_unlock(&g_hca_lock);
+		dapl_os_sleep_usec(2000);
+		dapl_os_lock(&g_hca_lock);
+	}
+bail:
+	dapl_os_unlock(&g_hca_lock);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+		     " ib_thread_destroy(%d) exit\n", dapl_os_getpid());
+}
+
+
+#if defined(_WIN64) || defined(_WIN32)
+/* work thread for uAT, uCM, CQ, and async events */
+void dapli_thread(void *arg)
+{
+	struct _ib_hca_transport *hca;
+	struct _ib_hca_transport *uhca[8];
+	COMP_CHANNEL *channel;
+	int ret, idx, cnt;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: \n",
+		     dapl_os_getpid(), g_ib_thread);
+
+	dapl_os_lock(&g_hca_lock);
+	for (g_ib_thread_state = IB_THREAD_RUN;
+	     g_ib_thread_state == IB_THREAD_RUN; 
+	     dapl_os_lock(&g_hca_lock)) {
+
+		idx = 0;
+		hca = dapl_llist_is_empty(&g_hca_list) ? NULL :
+		      dapl_llist_peek_head(&g_hca_list);
+
+		while (hca) {
+			uhca[idx++] = hca;
+			hca = dapl_llist_next_entry(&g_hca_list,
+						    (DAPL_LLIST_ENTRY *)
+						    &hca->entry);
+		}
+		cnt = idx;
+
+		dapl_os_unlock(&g_hca_lock);
+		ret = CompManagerPoll(windata.comp_mgr, INFINITE, &channel);
+
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+			     " ib_thread(%d) poll_event 0x%x\n",
+			     dapl_os_getpid(), ret);
+
+
+		/* check and process CQ and ASYNC events, per device */
+		for (idx = 0; idx < cnt; idx++) {
+			if (uhca[idx]->destroy == 1) {
+				dapl_os_lock(&g_hca_lock);
+				dapl_llist_remove_entry(&g_hca_list,
+							(DAPL_LLIST_ENTRY *)
+							&uhca[idx]->entry);
+				dapl_os_unlock(&g_hca_lock);
+				uhca[idx]->destroy = 2;
+			} else {
+				dapli_cq_event_cb(uhca[idx]);
+				dapli_async_event_cb(uhca[idx]);
+			}
+		}
+	}
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) EXIT\n",
+		     dapl_os_getpid());
+	g_ib_thread_state = IB_THREAD_EXIT;
+	dapl_os_unlock(&g_hca_lock);
+}
+#else				// _WIN64 || WIN32
+
+/* work thread for uAT, uCM, CQ, and async events */
+void dapli_thread(void *arg)
+{
+	struct pollfd ufds[__FD_SETSIZE];
+	struct _ib_hca_transport *uhca[__FD_SETSIZE] = { NULL };
+	struct _ib_hca_transport *hca;
+	int ret, idx, fds;
+	char rbuf[2];
+
+	dapl_dbg_log(DAPL_DBG_TYPE_THREAD,
+		     " ib_thread(%d,0x%x): ENTER: pipe %d \n",
+		     dapl_os_getpid(), g_ib_thread, g_ib_pipe[0]);
+
+	/* Poll across pipe, CM, AT never changes */
+	dapl_os_lock(&g_hca_lock);
+	g_ib_thread_state = IB_THREAD_RUN;
+
+	ufds[0].fd = g_ib_pipe[0];	/* pipe */
+	ufds[0].events = POLLIN;
+
+	while (g_ib_thread_state == IB_THREAD_RUN) {
+
+		/* build ufds after pipe and uCMA events */
+		ufds[0].revents = 0;
+		idx = 0;
+
+		/*  Walk HCA list and setup async and CQ events */
+		if (!dapl_llist_is_empty(&g_hca_list))
+			hca = dapl_llist_peek_head(&g_hca_list);
+		else
+			hca = NULL;
+
+		while (hca) {
+
+			/* uASYNC events */
+			ufds[++idx].fd = hca->ib_ctx->async_fd;
+			ufds[idx].events = POLLIN;
+			ufds[idx].revents = 0;
+			uhca[idx] = hca;
+
+			/* CQ events are non-direct with CNO's */
+			ufds[++idx].fd = hca->ib_cq->fd;
+			ufds[idx].events = POLLIN;
+			ufds[idx].revents = 0;
+			uhca[idx] = hca;
+
+			dapl_dbg_log(DAPL_DBG_TYPE_THREAD,
+				     " ib_thread(%d) poll_fd: hca=%p,"
+				     " async=%d pipe=%d \n",
+				     dapl_os_getpid(), hca, ufds[idx - 1].fd,
+				     ufds[0].fd);
+
+			hca = dapl_llist_next_entry(&g_hca_list,
+						    (DAPL_LLIST_ENTRY *)
+						    &hca->entry);
+		}
+
+		/* unlock, and setup poll */
+		fds = idx + 1;
+		dapl_os_unlock(&g_hca_lock);
+		ret = poll(ufds, fds, -1);
+		if (ret <= 0) {
+			dapl_dbg_log(DAPL_DBG_TYPE_THREAD,
+				     " ib_thread(%d): ERR %s poll\n",
+				     dapl_os_getpid(), strerror(errno));
+			dapl_os_lock(&g_hca_lock);
+			continue;
+		}
+
+		dapl_dbg_log(DAPL_DBG_TYPE_THREAD,
+			     " ib_thread(%d) poll_event: "
+			     " async=0x%x pipe=0x%x \n",
+			     dapl_os_getpid(), ufds[idx].revents,
+			     ufds[0].revents);
+
+		/* check and process CQ and ASYNC events, per device */
+		for (idx = 1; idx < fds; idx++) {
+			if (ufds[idx].revents == POLLIN) {
+				dapli_cq_event_cb(uhca[idx]);
+				dapli_async_event_cb(uhca[idx]);
+			}
+		}
+
+		/* check and process user events, PIPE */
+		if (ufds[0].revents == POLLIN) {
+			if (read(g_ib_pipe[0], rbuf, 2) == -1)
+				dapl_log(DAPL_DBG_TYPE_THREAD,
+					 " ib_thread: pipe rd err= %s\n",
+					 strerror(errno));
+
+			/* cleanup any device on list marked for destroy */
+			for (idx = 1; idx < fds; idx++) {
+				if (uhca[idx] && uhca[idx]->destroy == 1) {
+					dapl_os_lock(&g_hca_lock);
+					dapl_llist_remove_entry(
+						&g_hca_list,
+						(DAPL_LLIST_ENTRY*)
+						&uhca[idx]->entry);
+					dapl_os_unlock(&g_hca_lock);
+					uhca[idx]->destroy = 2;
+				}
+			}
+		}
+		dapl_os_lock(&g_hca_lock);
+	}
+
+	dapl_dbg_log(DAPL_DBG_TYPE_THREAD, " ib_thread(%d) EXIT\n",
+		     dapl_os_getpid());
+	g_ib_thread_state = IB_THREAD_EXIT;
+	dapl_os_unlock(&g_hca_lock);
+}
+#endif
diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index 77d78b2..739ccca 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -689,10 +689,9 @@ send_msg(void *data,
 				LOGPRINTF("%d cno wait return evd_handle=%p\n",
 					  getpid(), evd);
 				if (evd != h_dto_req_evd) {
-					fprintf(stderr,
-						"%d Error waiting on h_dto_cno: evd != h_dto_req_evd\n",
-						getpid());
-					return (DAT_ABORT);
+					/* CNO timeout, already on EVD */
+					if (evd != NULL)
+						return (ret);
 				}
 			}
 			/* use wait to dequeue */
@@ -1085,10 +1084,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 			LOGPRINTF("%d cno wait return evd_handle=%p\n",
 				  getpid(), evd);
 			if (evd != h_dto_rcv_evd) {
-				fprintf(stderr,
-					"%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n",
-					getpid());
-				return (DAT_ABORT);
+				/* CNO timeout, already on EVD */
+				if (evd != NULL)
+					return (ret);
 			}
 		}
 		/* use wait to dequeue */
@@ -1319,10 +1317,9 @@ DAT_RETURN do_rdma_write_with_msg(void)
 			LOGPRINTF("%d cno wait return evd_handle=%p\n",
 				  getpid(), evd);
 			if (evd != h_dto_rcv_evd) {
-				fprintf(stderr,
-					"%d Error waiting on h_dto_cno: "
-					"evd != h_dto_rcv_evd\n", getpid());
-				return (ret);
+				/* CNO timeout, already on EVD */
+				if (evd != NULL)
+					return (ret);
 			}
 		}
 		/* use wait to dequeue */
@@ -1446,10 +1443,9 @@ DAT_RETURN do_rdma_read_with_msg(void)
 				LOGPRINTF("%d cno wait return evd_handle=%p\n",
 					  getpid(), evd);
 				if (evd != h_dto_req_evd) {
-					fprintf(stderr,
-						"%d Error waiting on h_dto_cno: evd != h_dto_req_evd\n",
-						getpid());
-					return (DAT_ABORT);
+					/* CNO timeout, already on EVD */
+					if (evd != NULL)
+						return (ret);
 				}
 			}
 			/* use wait to dequeue */
@@ -1501,6 +1497,15 @@ DAT_RETURN do_rdma_read_with_msg(void)
 	 */
 	printf("%d Sending RDMA read completion message\n", getpid());
 
+	/* give remote chance to process read completes */
+	if (use_cno) {
+#if defined(_WIN32) || defined(_WIN64)
+		Sleep(1000);
+#else
+		sleep(1);
+#endif
+	}
+
 	ret = send_msg(&rmr_send_msg,
 		       sizeof(DAT_RMR_TRIPLET),
 		       lmr_context_send_msg,
@@ -1525,14 +1530,14 @@ DAT_RETURN do_rdma_read_with_msg(void)
 		LOGPRINTF("%d waiting for message receive event\n", getpid());
 		if (use_cno) {
 			DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-			ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
+			
+		        ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
 			LOGPRINTF("%d cno wait return evd_handle=%p\n",
 				  getpid(), evd);
 			if (evd != h_dto_rcv_evd) {
-				fprintf(stderr,
-					"%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n",
-					getpid());
-				return (ret);
+				/* CNO timeout, already on EVD */
+				if (evd != NULL)
+					return (ret);
 			}
 		}
 		/* use wait to dequeue */
@@ -1693,10 +1698,9 @@ DAT_RETURN do_ping_pong_msg()
 				LOGPRINTF("%d cno wait return evd_handle=%p\n",
 					  getpid(), evd);
 				if (evd != h_dto_rcv_evd) {
-					fprintf(stderr,
-						"%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n",
-						getpid());
-					return (ret);
+					/* CNO timeout, already on EVD */
+					if (evd != NULL)
+						return (ret);
 				}
 			}
 			/* use wait to dequeue */
-- 
1.5.2.5


From hnrose at comcast.net  Sun Aug  2 17:14:45 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 20:14:45 -0400
Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: In
	sm_mad_ctrl_send_err_cb, indicate failed attribute
Message-ID: <20090803001444.GA26324@comcast.net>


Display attribute name when appropriate
Also, cosmetic changes in other log messages

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c
index eeec51c..135c666 100644
--- a/opensm/opensm/osm_sa_mad_ctrl.c
+++ b/opensm/opensm/osm_sa_mad_ctrl.c
@@ -213,7 +213,7 @@ static void sa_mad_ctrl_process(IN osm_sa_mad_ctrl_t * p_ctrl,
 
 	default:
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A01: "
-			"Unsupported attribute = 0x%X\n",
+			"Unsupported attribute 0x%X\n",
 			cl_ntoh16(p_sa_mad->attr_id));
 		osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_ERROR);
 	}
@@ -233,9 +233,10 @@ static void sa_mad_ctrl_process(IN osm_sa_mad_ctrl_t * p_ctrl,
 
 		if (status != CL_SUCCESS) {
 			OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A02: "
-				"Dispatcher post message failed (%s) for attribute = 0x%X\n",
+				"Dispatcher post message failed (%s) for attribute 0x%X (%s)\n",
 				CL_STATUS_MSG(status),
-				cl_ntoh16(p_sa_mad->attr_id));
+				cl_ntoh16(p_sa_mad->attr_id),
+				ib_get_sa_attr_str(p_sa_mad->attr_id));
 
 			osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw);
 			goto Exit;
diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c
index f941748..791c848 100644
--- a/opensm/opensm/osm_sm_mad_ctrl.c
+++ b/opensm/opensm/osm_sm_mad_ctrl.c
@@ -254,7 +254,7 @@ static void sm_mad_ctrl_process_get_resp(IN osm_sm_mad_ctrl_t * p_ctrl,
 	default:
 		cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown);
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3103: "
-			"Unsupported attribute = 0x%X\n",
+			"Unsupported attribute 0x%X\n",
 			cl_ntoh16(p_smp->attr_id));
 		osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
 		goto Exit;
@@ -276,8 +276,9 @@ static void sm_mad_ctrl_process_get_resp(IN osm_sm_mad_ctrl_t * p_ctrl,
 
 	if (status != CL_SUCCESS) {
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3104: "
-			"Dispatcher post message failed (%s) for attribute = 0x%X\n",
-			CL_STATUS_MSG(status), cl_ntoh16(p_smp->attr_id));
+			"Dispatcher post message failed (%s) for attribute 0x%X (%s)\n",
+			CL_STATUS_MSG(status), cl_ntoh16(p_smp->attr_id),
+			ib_get_sm_attr_str(p_smp->attr_id));
 		goto Exit;
 	}
 
@@ -316,7 +317,7 @@ static void sm_mad_ctrl_process_get(IN osm_sm_mad_ctrl_t * p_ctrl,
 	default:
 		cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown);
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_VERBOSE,
-			"Ignoring SubnGet MAD - unsupported attribute = 0x%X\n",
+			"Ignoring SubnGet MAD - unsupported attribute 0x%X\n",
 			cl_ntoh16(p_smp->attr_id));
 		break;
 	}
@@ -393,7 +394,7 @@ static void sm_mad_ctrl_process_set(IN osm_sm_mad_ctrl_t * p_ctrl,
 	default:
 		cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown);
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3107: "
-			"Unsupported attribute = 0x%X\n",
+			"Unsupported attribute 0x%X\n",
 			cl_ntoh16(p_smp->attr_id));
 		osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
 		break;
@@ -480,7 +481,7 @@ static void sm_mad_ctrl_process_trap(IN osm_sm_mad_ctrl_t * p_ctrl,
 	default:
 		cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown);
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3109: "
-			"Unsupported attribute = 0x%X\n",
+			"Unsupported attribute 0x%X\n",
 			cl_ntoh16(p_smp->attr_id));
 		osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
 		break;
@@ -555,7 +556,7 @@ static void sm_mad_ctrl_process_trap_repress(IN osm_sm_mad_ctrl_t * p_ctrl,
 	default:
 		cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown);
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3105: "
-			"Unsupported attribute = 0x%X\n",
+			"Unsupported attribute 0x%X\n",
 			cl_ntoh16(p_smp->attr_id));
 		osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
 		break;
@@ -724,7 +725,9 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
 	     p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO ||
 	     p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL)) {
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3119: "
-			"Set method failed\n");
+			"Set method failed for attribute 0x%X (%s)\n",
+			cl_ntoh16(p_smp->attr_id),
+			ib_get_sm_attr_str(p_smp->attr_id));
 		p_ctrl->p_subn->subnet_initialization_error = TRUE;
 	}
 

From hnrose at comcast.net  Sun Aug  2 17:15:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 2 Aug 2009 20:15:32 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover.8: Add max hops
	option
Message-ID: <20090803001532.GB26324@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8
index 692994b..5841d8e 100644
--- a/infiniband-diags/man/ibnetdiscover.8
+++ b/infiniband-diags/man/ibnetdiscover.8
@@ -1,11 +1,11 @@
-.TH IBNETDISCOVER 8 "January 3, 2008" "OpenIB" "OpenIB Diagnostics"
+.TH IBNETDISCOVER 8 "May 13, 2009" "OpenIB" "OpenIB Diagnostics"
 
 .SH NAME
 ibnetdiscover \- discover InfiniBand topology
 
 .SH SYNOPSIS
 .B ibnetdiscover
-[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-h(elp)] [<topology-file>]
+[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-m(ax_hops)] [\-h(elp)] [<topology-file>]
 
 .SH DESCRIPTION
 .PP
@@ -47,6 +47,9 @@ names.  See file format below.
 Obtain a ports report which is a
 list of connected ports with relevant information (like LID, portnum,
 GUID, width, speed, and NodeDescription).
+.TP
+\fB\-m\fR, \fB\-\-max_hops\fR
+Report max hops discovered.
 
 .SH COMMON OPTIONS
 

From eli at mellanox.co.il  Mon Aug  3 02:25:29 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 3 Aug 2009 12:25:29 +0300
Subject: [ofa-general] [PATCH] cma: fix access to freed memory
Message-ID: <20090803092528.GA25528@mtls03>

rdma_join_multicast() allocates struct cma_multicast and then proceeds to join
to a multicast address. However, the join operation completes in another
context and the allocated struct could be released if the user destroys either
the rdma_id object or decides to leave the multicast group while the join is in
progress. This patch uses reference counting to avoid such a situation. It
also protects removal from id_priv->mc_list in cma_leave_mc_groups().
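
The lifetime rule described above can be illustrated outside the kernel. What
follows is only a user-space sketch using C11 atomics, not the cma.c code:
struct mc_entry, mc_get(), mc_put(), join_done() and
demo_leave_during_join() are hypothetical names chosen for the illustration.
It assumes one reference is held on behalf of the owner's list and a second
one is taken for the in-flight join, so that whichever side finishes last
frees the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cma_multicast; only the refcount matters. */
struct mc_entry {
	atomic_int refcount;
	int joined;
};

static int mc_freed;	/* counts frees so the behaviour can be observed */

static struct mc_entry *mc_alloc(void)
{
	struct mc_entry *mc = calloc(1, sizeof(*mc));

	if (mc)
		atomic_init(&mc->refcount, 1);	/* reference held for the list */
	return mc;
}

static void mc_get(struct mc_entry *mc)
{
	atomic_fetch_add(&mc->refcount, 1);
}

/* Drop one reference; only the last holder frees the object. */
static void mc_put(struct mc_entry *mc)
{
	if (atomic_fetch_sub(&mc->refcount, 1) == 1) {
		free(mc);
		mc_freed++;
	}
}

/* Stand-in for the join completion handler running in another context. */
static void join_done(struct mc_entry *mc)
{
	mc->joined = 1;		/* safe: the join still holds its reference */
	mc_put(mc);
}

/*
 * Simulate "leave while the join is still pending". Stores the number of
 * frees seen right after the leave and returns the number seen after the
 * completion has run.
 */
static int demo_leave_during_join(int *freed_after_leave)
{
	struct mc_entry *mc = mc_alloc();

	assert(mc != NULL);
	mc_freed = 0;
	mc_get(mc);			/* reference for the in-flight join */
	mc_put(mc);			/* user leaves: list reference dropped */
	*freed_after_leave = mc_freed;
	join_done(mc);			/* completion drops the last reference */
	return mc_freed;
}
```

In this sketch the object survives the early leave and is freed exactly once,
by the completion path. If the join itself fails to start, the reference
taken for it would have to be dropped on that error path as well.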

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/cma.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 851de83..8fee477 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -157,6 +157,7 @@ struct cma_multicast {
 	struct list_head	list;
 	void			*context;
 	struct sockaddr_storage	addr;
+	atomic_t		refcount;
 };
 
 struct cma_work {
@@ -290,6 +291,12 @@ static inline void cma_deref_dev(struct cma_device *cma_dev)
 		complete(&cma_dev->comp);
 }
 
+void cma_deref_mc(struct cma_multicast *mc)
+{
+	if (atomic_dec_and_test(&mc->refcount))
+		kfree(mc);
+}
+
 static void cma_detach_from_dev(struct rdma_id_private *id_priv)
 {
 	list_del(&id_priv->list);
@@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 {
 	struct cma_multicast *mc;
 
+	spin_lock_irq(&id_priv->lock);
 	while (!list_empty(&id_priv->mc_list)) {
 		mc = container_of(id_priv->mc_list.next,
 				  struct cma_multicast, list);
 		list_del(&mc->list);
+		spin_unlock_irq(&id_priv->lock);
 		ib_sa_free_multicast(mc->multicast.ib);
-		kfree(mc);
+		cma_deref_mc(mc);
+		spin_lock_irq(&id_priv->lock);
 	}
+	spin_unlock_irq(&id_priv->lock);
 }
 
 void rdma_destroy_id(struct rdma_cm_id *id)
@@ -2643,7 +2654,7 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 	id_priv = mc->id_priv;
 	if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) &&
 	    cma_disable_callback(id_priv, CMA_ADDR_RESOLVED))
-		return 0;
+		goto out;
 
 	mutex_lock(&id_priv->qp_mutex);
 	if (!status && id_priv->id.qp)
@@ -2669,10 +2680,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 		cma_exch(id_priv, CMA_DESTROYING);
 		mutex_unlock(&id_priv->handler_mutex);
 		rdma_destroy_id(&id_priv->id);
-		return 0;
+		goto out;
 	}
 
 	mutex_unlock(&id_priv->handler_mutex);
+out:
+	cma_deref_mc(mc);
 	return 0;
 }
 
@@ -2759,11 +2772,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	memcpy(&mc->addr, addr, ip_addr_size(addr));
 	mc->context = context;
 	mc->id_priv = id_priv;
+	atomic_set(&mc->refcount, 1);
 
 	spin_lock(&id_priv->lock);
 	list_add(&mc->list, &id_priv->mc_list);
 	spin_unlock(&id_priv->lock);
 
+	atomic_inc(&mc->refcount);
 	switch (rdma_node_get_transport(id->device->node_type)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
@@ -2800,7 +2815,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 						&mc->multicast.ib->rec.mgid,
 						mc->multicast.ib->rec.mlid);
 			ib_sa_free_multicast(mc->multicast.ib);
-			kfree(mc);
+			cma_deref_mc(mc);
 			return;
 		}
 	}
-- 
1.6.3.3


From vlad at lists.openfabrics.org  Mon Aug  3 03:23:45 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon,  3 Aug 2009 03:23:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090803-0200 daily build status
Message-ID: <20090803102346.185DC102020F@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg':
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink'
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg':
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink'
/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Mon Aug  3 05:04:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 3 Aug 2009 15:04:24 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_sm_mad_ctrl.c: In
 sm_mad_ctrl_send_err_cb, indicate failed attribute
In-Reply-To: <20090803001444.GA26324@comcast.net>
References: <20090803001444.GA26324@comcast.net>
Message-ID: <20090803120424.GY5287@me>

On 20:14 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Display attribute name when appropriate
> Also, cosmetic changes in other log messages
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Aug  3 05:05:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 3 Aug 2009 15:05:35 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover.8: Add max
	hops option
In-Reply-To: <20090803001532.GB26324@comcast.net>
References: <20090803001532.GB26324@comcast.net>
Message-ID: <20090803120535.GZ5287@me>

On 20:15 Sun 02 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From kliteyn at dev.mellanox.co.il  Mon Aug  3 06:04:29 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 03 Aug 2009 16:04:29 +0300
Subject: [ofa-general] [PATCH] opensm: do not configure MFTs when mcast
	support is disabled
Message-ID: <4A76E05D.9070705@dev.mellanox.co.il>

Hi Sasha,

I noticed that when MCast support in OSM is disabled (command line
option '-d3'), MFTs on the switches are still getting configured.

Turns out that MFT configuration was disabled only in heavy sweep,
but it was still working at idle time - the following patch fixes it.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_state_mgr.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 90bef87..185c700 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1377,8 +1377,10 @@ static void do_process_mgrp_queue(osm_sm_t * sm)
 {
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
 		return;
-	osm_mcast_mgr_process_mgroups(sm);
-	wait_for_pending_transactions(&sm->p_subn->p_osm->stats);
+	if (!sm->p_subn->opt.disable_multicast) {
+		osm_mcast_mgr_process_mgroups(sm);
+		wait_for_pending_transactions(&sm->p_subn->p_osm->stats);
+	}
 }

 void osm_state_mgr_process(IN osm_sm_t * sm, IN osm_signal_t signal)
-- 
1.5.1.4


From bart.vanassche at gmail.com  Mon Aug  3 06:21:21 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Mon, 3 Aug 2009 15:21:21 +0200
Subject: [ofa-general] [PATCH 2.6.30.4] Fix for NULL pointer dereference by
	SRP initiator 
	triggered by a SCSI reset after the SRP connection has been closed
Message-ID: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>

Issuing a SCSI reset command on an SRP initiator after the SRP connection has
been closed triggers a NULL pointer dereference. The patch below fixes this
NULL pointer dereference.

See also http://bugzilla.kernel.org/show_bug.cgi?id=13893.

Signed-off-by: Bart Van Assche <bart.vanassche at gmail.com>
Cc: Roland Dreier <rolandd at cisco.com>
Cc: Sean Hefty <sean.hefty at intel.com>
Cc: Hal Rosenstock <hal.rosenstock at gmail.com>

--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c	2009-08-03 12:13:11.000000000 +0200
+++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c	2009-08-03 14:58:36.000000000 +0200
@@ -1330,6 +1330,8 @@ static int srp_send_tsk_mgmt(struct srp_
 	struct srp_iu *iu;
 	struct srp_tsk_mgmt *tsk_mgmt;

+	BUG_ON(!req->scmnd->device);
+
 	spin_lock_irq(target->scsi_host->host_lock);

 	if (target->state == SRP_TARGET_DEAD ||
@@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_
 		return FAILED;
 	if (req->tsk_status)
 		return FAILED;
+	if (!req->scmnd->device)
+		return FAILED;

 	spin_lock_irq(target->scsi_host->host_lock);


From sebastien.dugue at bull.net  Mon Aug  3 06:40:01 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 3 Aug 2009 15:40:01 +0200
Subject: [ofa-general] [PATCH] libmlx4 - mmap needs some includes
Message-ID: <20090803154001.32fdab08@frecb007965>


  Hi Roland,

  Add errno.h and sys/mman.h includes in buf.c to get mmap() support, otherwise
we cannot build as is.

  Those includes were removed in your cleanup. Sorry for not noticing earlier.
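
  For reference, this is the kind of code that needs both headers. It is only
a sketch of an anonymous mmap() allocation, not the actual buf.c contents:
struct demo_buf, demo_alloc_buf() and demo_free_buf() are made-up names, and
the page rounding is an assumption about what such a helper would typically
do. sys/mman.h provides mmap(), munmap() and the PROT_*/MAP_* constants, and
errno.h is needed to report why the mapping failed.

```c
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS with strict -std= on glibc */

#include <assert.h>
#include <errno.h>	/* errno, EINVAL */
#include <stddef.h>
#include <sys/mman.h>	/* mmap(), munmap(), PROT_*, MAP_* */
#include <unistd.h>	/* sysconf() */

/* Hypothetical buffer descriptor used only for this illustration. */
struct demo_buf {
	void *addr;
	size_t length;
};

/*
 * Round size up to whole pages and map anonymous, zero-filled memory.
 * Returns 0 on success or an errno value on failure.
 */
static int demo_alloc_buf(struct demo_buf *buf, size_t size)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;

	assert(buf != NULL);
	if (page <= 0 || size == 0)
		return EINVAL;
	len = (size + (size_t)page - 1) / (size_t)page * (size_t)page;

	buf->addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf->addr == MAP_FAILED) {
		buf->addr = NULL;
		return errno;
	}
	buf->length = len;
	return 0;
}

/* Unmap the buffer, if any, and reset the descriptor. */
static void demo_free_buf(struct demo_buf *buf)
{
	if (buf->addr)
		munmap(buf->addr, buf->length);
	buf->addr = NULL;
	buf->length = 0;
}
```

  Without sys/mman.h the compiler has no declaration for mmap() or
MAP_FAILED, and without errno.h the error path above does not compile.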


Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 src/buf.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/src/buf.c b/src/buf.c
index bbaff12..a80bcb1 100644
--- a/src/buf.c
+++ b/src/buf.c
@@ -35,6 +35,8 @@
 #endif /* HAVE_CONFIG_H */
 
 #include <stdlib.h>
+#include <errno.h>
+#include <sys/mman.h>
 
 #include "mlx4.h"
 
-- 
1.6.3.1


From sebastien.dugue at bull.net  Mon Aug  3 07:37:36 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 3 Aug 2009 16:37:36 +0200
Subject: [ofa-general] [PATCH] libmlx4: use dynamic archive name when
	building an rpm
Message-ID: <20090803163736.1fde4c74@frecb007965>


  There is a discrepancy between the tar.gz source archive name and the library
version. rpmbuild then fails to find its source files.

  Fix this by dynamically setting the package version into the archive name.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 libmlx4.spec.in |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libmlx4.spec.in b/libmlx4.spec.in
index 869032f..7c6c841 100644
--- a/libmlx4.spec.in
+++ b/libmlx4.spec.in
@@ -6,7 +6,7 @@ Summary: Mellanox ConnectX InfiniBand HCA Userspace Driver
 Group: System Environment/Libraries
 License: GPLv2 or BSD
 Url: http://openfabrics.org/
-Source: http://openfabrics.org/downloads/mlx4/libmlx4-1.0.tar.gz
+Source: http://openfabrics.org/downloads/mlx4/libmlx4-%{version}.tar.gz
 BuildRoot: %(mktemp -ud %{_tmppath}/%{name}-%{version}-%{release}-XXXXXX)
 
 BuildRequires: libibverbs-devel >= 1.1-0.1.rc2
-- 
1.6.0.4


From yosefe at voltaire.com  Mon Aug  3 10:03:14 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Mon, 03 Aug 2009 20:03:14 +0300
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <f0e08f230908020426i2331cf0fg3bc3a21f1e86d1b5@mail.gmail.com>
References: <4A6DDFCE.9060009@voltaire.com> <4A703DA4.9080300@Voltaire.COM>	
	<f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>	
	<4A705B3A.7060404@Voltaire.COM>	
	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>	
	<4A731818.3060500@voltaire.com>	
	<f0e08f230907311050wa750cf2n497039acafdab3b4@mail.gmail.com>	
	<4A733D24.3040201@voltaire.com>	
	<f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>	
	<4A742E94.2070002@gmail.com>
	<f0e08f230908020426i2331cf0fg3bc3a21f1e86d1b5@mail.gmail.com>
Message-ID: <4A771852.1010606@voltaire.com>

On 02/08/09 14:26, Hal Rosenstock wrote:
> 
> By handled correctly, you mean that the ARP request gets to the remote
> node, is responded to, and the response makes it back and that is
> treated as valid path indication, right ?
> 
> If so, is the original ARP request unicast or broadcast ?
> 
> If the request is unicast, couldn't it be sent using the wrong static
> rate as isn't it using the original path parameters ?
> 
> Even if it is broadcast, if the original path parameters are still
> used (like rate, etc.) at the local node, doesn't this assume a
> homogeneous subnet ?

By handled correctly, I mean that:
- If the LID is not changed, the mechanism will not trigger path refresh.
      (The first patch without any LMC handling does not satisfy this)
- If the LID is changed, the mechanism will trigger a path refresh (eventually)

The ARP stuff works this way: the remote LID changes. At some point, either the remote
node will send an ARP reply (gratuitous), or (more likely) the local network stack
will start sending solicited ARPs, unicast, using the invalid path. They will fail,
so the stack will send broadcast ARP. Then, the remote node will answer, and IPoIB
will see a different slid than expected. This will trigger path refresh.

The broadcast ARP is sent with the AH of the broadcast group (which is joined when
IPoIB interface goes up), and not the parameters of the path to any specific node.
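
To make the two requirements above concrete, here is a rough sketch of the
comparison, assuming the remote port's LMC was recorded when the path was
resolved. It is not the patch itself: struct cached_path, lid_base() and
path_needs_refresh() are hypothetical names. The only IB fact it relies on is
that a port with LMC = n answers to 2^n consecutive LIDs whose low n bits
vary, so those bits have to be masked off before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask off the low 'lmc' bits so all 2^lmc LIDs of one port compare equal. */
static uint16_t lid_base(uint16_t lid, uint8_t lmc)
{
	assert(lmc <= 7);	/* LMC is a 3-bit field */
	return (uint16_t)(lid & ~((1u << lmc) - 1u));
}

/* Hypothetical cached path entry; not the IPoIB driver's structure. */
struct cached_path {
	uint16_t dlid;	/* LID recorded when the path was resolved */
	uint8_t lmc;	/* remote port LMC recorded at the same time */
};

/* True when a packet's source LID no longer belongs to the cached port. */
static bool path_needs_refresh(const struct cached_path *p, uint16_t slid)
{
	return lid_base(slid, p->lmc) != lid_base(p->dlid, p->lmc);
}
```

With dlid 0x10 and LMC 2, source LIDs 0x10 through 0x13 are treated as the
same port and do not trigger a refresh, while 0x14 does. With LMC 0 the check
degenerates to a plain LID comparison.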

--Yossi


From yosefe at voltaire.com  Mon Aug  3 10:10:07 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Mon, 03 Aug 2009 20:10:07 +0300
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <20090731194003.GV30626@obsidianresearch.com>
References: <20090727192938.GD5794@obsidianresearch.com>	<4A6ECF6F.4000008@Voltaire.COM>	<f0e08f230907280427h23f9ea0did86293dae80314c1@mail.gmail.com>	<4A70154F.7080300@gmail.com>	<f0e08f230907290330j777bb2f9j4063d497e66e305d@mail.gmail.com>	<4A703DA4.9080300@Voltaire.COM>	<f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>	<4A705B3A.7060404@Voltaire.COM>	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>	<4A731818.3060500@voltaire.com>
	<20090731194003.GV30626@obsidianresearch.com>
Message-ID: <4A7719EF.4080902@voltaire.com>

On 31/07/09 22:40, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2009 at 07:13:12PM +0300, Yossi Etigin wrote:
> 
>> What if we query the remote port LMC once, when the path is
>> resolved, and then use it to mask the LID until the path is
>> refreshed again?
> 
> What are you trying to fix here? Most SMs have a persistent LID
> stability feature, so why would the LID change very often anyhow?
> 
> Jason
>

We have customers with large fabrics and different machines/operating systems,
where the LID does not always stay the same. They are experiencing loss of
IPoIB connectivity. The patch above solved that. Besides, according to the
IB spec, LIDs are not persistent and can change (although most SMs today do
try to keep them persistent).

Regarding LMC, that is less likely to change, so if we handle a constant LMC
correctly, and the behaviour when the LMC changes is the same as it was before
the patch, I think it can be OK.

What do you think?

--Yossi


From hnrose at comcast.net  Mon Aug  3 10:59:46 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 3 Aug 2009 13:59:46 -0400
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: Use proper flag name in
	comment
Message-ID: <20090803175946.GA5981@comcast.net>


Change force_single_heavy_sweep to force_heavy_sweep
Other cosmetic commentary changes

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 4578ebc..4a6d0ff 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -564,12 +564,12 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 
 	/* do a sweep if we received a trap */
 	if (sm->p_subn->opt.sweep_on_trap) {
-		/* if this is trap number 128 or run_heavy_sweep is TRUE - update the
-		   force_single_heavy_sweep flag of the subnet.
-		   Sweep also on traps 144/145 - these traps signal a change of a certain
-		   port capability/system image guid.
-		   TODO: In the future we can change this to just getting PortInfo on
-		   this port instead of sweeping the entire subnet. */
+		/* if this is trap number 128 or run_heavy_sweep is TRUE -
+		   update the force_heavy_sweep flag of the subnet.
+		   Sweep also on traps 144/145 - these traps signal a change of 
+		   certain port capabilities/system image guid.
+		   TODO: In the future this can be changed to just getting
+		   PortInfo on this port instead of sweeping the entire subnet. */
 		if (ib_notice_is_generic(p_ntci) &&
 		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
 		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||


From rdreier at cisco.com  Mon Aug  3 13:31:37 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Aug 2009 13:31:37 -0700
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To: <20090803092528.GA25528@mtls03> (Eli Cohen's message of "Mon, 3
	Aug 2009 12:25:29 +0300")
References: <20090803092528.GA25528@mtls03>
Message-ID: <adak51kod06.fsf@cisco.com>


 > rdma_join_multicast() allocates struct cma_multicast and then proceeds to join
 > to a multicast address. However, the join operation completes in another
 > context and the allocated struct could be released if the user destroys either
 > the rdma_id object or decides to leave the multicast group while the join is in
 > progress. This patch uses reference counting to avoid such a situation. It
 > also protects removal from id_priv->mc_list in cma_leave_mc_groups().

Is this all in response to problems seen in practice, or just from
reading over the code?

 > +	atomic_t		refcount;

I think this would be clearer if you used struct kref here.

 > @@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 >  {
 >  	struct cma_multicast *mc;
 >  
 > +	spin_lock_irq(&id_priv->lock);

I didn't follow how this change is connected to the reference counting.
What is this synchronizing against?  Is it an independent change of the
reference counting?

 - R.


From rdreier at cisco.com  Mon Aug  3 13:36:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Aug 2009 13:36:02 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP initiator triggered by a SCSI reset after the SRP
	connection has been closed
In-Reply-To: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	(Bart Van Assche's message of "Mon, 3 Aug 2009 15:21:21 +0200")
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
Message-ID: <adafxc8ocst.fsf@cisco.com>


 > Issuing a SCSI reset command on an SRP initiator after the SRP connection has
 > been closed triggers a NULL pointer dereference. The patch below fixes this
 > NULL pointer dereference.
 > 
 > See also http://bugzilla.kernel.org/show_bug.cgi?id=13893.

Thanks for debugging this... a couple of questions:

 > +	BUG_ON(!req->scmnd->device);

Why BUG_ON() here?  Can we return failure or something, rather than
crashing the whole system?

 > +	if (!req->scmnd->device)
 > +		return FAILED;

How do we end up in srp_reset_device() with req->scmnd->device == NULL?
Presumably req->scmnd should match scmnd if I am understanding the code
properly -- and then scmnd->device == NULL??

 - R.


From jgunthorpe at obsidianresearch.com  Mon Aug  3 13:35:59 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 3 Aug 2009 14:35:59 -0600
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <4A7719EF.4080902@voltaire.com>
References: <f0e08f230907280427h23f9ea0did86293dae80314c1@mail.gmail.com>
	<4A70154F.7080300@gmail.com>
	<f0e08f230907290330j777bb2f9j4063d497e66e305d@mail.gmail.com>
	<4A703DA4.9080300@Voltaire.COM>
	<f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>
	<4A705B3A.7060404@Voltaire.COM>
	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>
	<4A731818.3060500@voltaire.com>
	<20090731194003.GV30626@obsidianresearch.com>
	<4A7719EF.4080902@voltaire.com>
Message-ID: <20090803203559.GJ24282@obsidianresearch.com>

On Mon, Aug 03, 2009 at 08:10:07PM +0300, Yossi Etigin wrote:

> We have customers with large fabrics and different machines/operating systems,
> where the LID does not always stay the same. They are experiencing loss of
> IPoIB connectivity. The patch above solved that. Besides, according to the
> IB spec, LIDs are not persistent and can change (although most SMs today do
> try to keep them persistent).

Hmm, have you considered changing the IPoIB QPN when the LID changes?
This would provide a clear signal to anyone with a cached ARP entry
that it is wrong.

But even so, IPoIB implicitly assumes that the LID doesn't change, by
design. The SA really has to try to make that true when IPoIB is used.

Jason


From hnrose at comcast.net  Mon Aug  3 13:39:57 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 3 Aug 2009 16:39:57 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibsendtrap.c: Fill in
	capability mask on trap 144
Message-ID: <20090803203957.GA23640@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ac8dcf4..38305a2 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -63,6 +63,16 @@ static uint16_t get_node_type(ib_portid_t *port)
 	return node_type;
 }
 
+static uint32_t get_cap_mask(ib_portid_t *port)
+{
+	uint8_t data[IB_SMP_DATA_SIZE];
+	uint32_t cap_mask = 0;
+
+	if (smp_query_via(data, port, IB_ATTR_PORT_INFO, 0, 0, srcport))
+		cap_mask = (uint32_t)mad_get_field(data, 0, IB_PORT_CAPMASK_F);
+	return cap_mask;
+}
+
 static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
@@ -70,6 +80,7 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 	n->g_or_v.generic.trap_num = cl_hton16(144);
 	n->issuer_lid = cl_hton16((uint16_t) port->lid);
 	n->data_details.ntc_144.lid = n->issuer_lid;
+	n->data_details.ntc_144.new_cap_mask = cl_hton32(get_cap_mask(port));
 	n->data_details.ntc_144.local_changes =
 	    TRAP_144_MASK_OTHER_LOCAL_CHANGES;
 	n->data_details.ntc_144.change_flgs =


From rdreier at cisco.com  Mon Aug  3 13:46:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Aug 2009 13:46:22 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4 - mmap needs some includes
In-Reply-To: <20090803154001.32fdab08@frecb007965> (sebastien dugue's message
	of "Mon, 3 Aug 2009 15:40:01 +0200")
References: <20090803154001.32fdab08@frecb007965>
Message-ID: <adabpmwocbl.fsf@cisco.com>

thanks ... actually they weren't removed as part of my cleanups, but as
part of the incompetent way I applied the patch.  same end result anyway.


From rdreier at cisco.com  Mon Aug  3 13:49:35 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Aug 2009 13:49:35 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: use dynamic archive name when
	building an rpm
In-Reply-To: <20090803163736.1fde4c74@frecb007965> (sebastien dugue's message
	of "Mon, 3 Aug 2009 16:37:36 +0200")
References: <20090803163736.1fde4c74@frecb007965>
Message-ID: <ada7hxkoc68.fsf@cisco.com>


 >   There is a discrepancy between the tar.gz source archive name and the library
 > version. rpmbuild then fails to find its source files.

 >   Fix this by dynamically setting the package version into the archive name.

Thanks, good catch.  I fixed this by just changing to 1.0.1, since
otherwise things run into trouble if the version number is 1.0.2-rc1 or
something like that (the RPM version should be different from 1.0.2-rc1 in
that case).


From abenjamin at sgi.com  Mon Aug  3 19:49:23 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Mon, 03 Aug 2009 19:49:23 -0700
Subject: [ofa-general] [PATCH v3] mthca: Distinguish multiple IB cards in
	/proc/interrupts
Message-ID: <4A77A1B3.3020603@sgi.com>

When the mthca driver calls request_irq() to allocate interrupt
resources, it uses the fixed device name string "ib_mthca".
When multiple IB cards are present in the system, every instance of
the resource is named "ib_mthca" in /proc/interrupts.
This can make it very confusing trying to work out exactly where IB
interrupts are going and why.

The mthca driver has been modified to use the PCI name of the IB
card for the purpose of allocating interrupt resources.

Signed-off-by: Arputham Benjamin <abenjamin at sgi.com>
---
diff -rup a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
--- a/drivers/infiniband/hw/mthca/mthca_dev.h	2009-08-03 15:44:44.408580749 -0700
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h	2009-08-03 15:45:25.451110249 -0700
@@ -357,6 +357,7 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+	char                  irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX];
 };
 
 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
diff -rup a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c
--- a/drivers/infiniband/hw/mthca/mthca_eq.c	2009-08-03 15:44:44.416581242 -0700
+++ b/drivers/infiniband/hw/mthca/mthca_eq.c	2009-08-03 15:45:11.098225651 -0700
@@ -835,21 +835,27 @@ int mthca_init_eq_table(struct mthca_dev
 		};
 
 		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+			snprintf(dev->irq_name[i], IB_DEVICE_NAME_MAX,
+				 "%s@pci:%s", eq_name[i],
+				 pci_name(dev->pdev));
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
 					  mthca_is_memfree(dev) ?
 					  mthca_arbel_msi_x_interrupt :
 					  mthca_tavor_msi_x_interrupt,
-					  0, eq_name[i], dev->eq_table.eq + i);
+					  0, dev->irq_name[i],
+					  dev->eq_table.eq + i);
 			if (err)
 				goto err_out_cmd;
 			dev->eq_table.eq[i].have_irq = 1;
 		}
 	} else {
+		snprintf(dev->irq_name[0], IB_DEVICE_NAME_MAX,
+			 DRV_NAME "@pci:%s", pci_name(dev->pdev));
 		err = request_irq(dev->pdev->irq,
 				  mthca_is_memfree(dev) ?
 				  mthca_arbel_interrupt :
 				  mthca_tavor_interrupt,
-				  IRQF_SHARED, DRV_NAME, dev);
+				  IRQF_SHARED, dev->irq_name[0], dev);
 		if (err)
 			goto err_out_cmd;
 		dev->eq_table.have_irq = 1;
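
The naming scheme from the patch above can be tried out in userspace
with the sketch below. It only demonstrates building one
"<eq name>@pci:<PCI name>" string per vector into storage owned by the
device; struct demo_dev, the DEMO_* constants, the base names in
eq_name[] and the PCI address are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NUM_EQ	3	/* hypothetical; plays the role of MTHCA_NUM_EQ */
#define DEMO_NAME_MAX	64	/* hypothetical; plays the role of IB_DEVICE_NAME_MAX */

/* Hypothetical device: the name buffers live as long as the device does. */
struct demo_dev {
	const char *pci_name;
	char irq_name[DEMO_NUM_EQ][DEMO_NAME_MAX];
};

void demo_build_irq_names(struct demo_dev *dev)
{
	/* Illustrative base names only; not copied from the driver. */
	static const char *const eq_name[DEMO_NUM_EQ] = {
		"ib_mthca-comp", "ib_mthca-async", "ib_mthca-cmd"
	};
	int i;

	/* snprintf() truncates safely if a name does not fit the slot. */
	for (i = 0; i < DEMO_NUM_EQ; ++i)
		snprintf(dev->irq_name[i], sizeof(dev->irq_name[i]),
			 "%s@pci:%s", eq_name[i], dev->pci_name);
}

/* Returns 0 if the first generated name has the expected form. */
int demo_irq_names(void)
{
	struct demo_dev dev = { .pci_name = "0000:0b:00.0" };

	demo_build_irq_names(&dev);
	return strcmp(dev.irq_name[0], "ib_mthca-comp@pci:0000:0b:00.0");
}
```

request_irq() keeps the name pointer it is given rather than copying
the string, which is presumably why the patch stores the names in
struct mthca_dev instead of a local buffer.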


From abenjamin at sgi.com  Mon Aug  3 20:00:00 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Mon, 03 Aug 2009 20:00:00 -0700
Subject: [ofa-general] [PATCH v2] mlx4_core: Distinguish multiple IB cards in
	/proc/interrupts
Message-ID: <4A77A430.2020106@sgi.com>

When the mlx4_core driver calls request_irq() to allocate interrupt
resources, it uses the fixed device name string "mlx4_core".
When multiple IB cards are present in the system, every instance of
the resource is named "mlx4_core" in /proc/interrupts.
This can make it very confusing trying to work out exactly where IB
interrupts are going and why.

The mlx4_core driver has been modified to use the PCI name of the IB
card for the purpose of allocating interrupt resources.

Signed-off-by: Arputham Benjamin <abenjamin at sgi.com>
---
diff -rup a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
--- a/drivers/net/mlx4/eq.c	2009-08-03 19:42:18.737707766 -0700
+++ b/drivers/net/mlx4/eq.c	2009-08-03 19:42:48.175515414 -0700
@@ -615,7 +615,8 @@ int mlx4_init_eq_table(struct mlx4_dev *
 	priv->eq_table.clr_int  = priv->clr_base +
 		(priv->eq_table.inta_pin < 32 ? 4 : 0);
 
-	priv->eq_table.irq_names = kmalloc(16 * dev->caps.num_comp_vectors, GFP_KERNEL);
+	priv->eq_table.irq_names = kmalloc(DEVICE_NAME_MAX *
+					   (dev->caps.num_comp_vectors + 1), GFP_KERNEL);
 	if (!priv->eq_table.irq_names) {
 		err = -ENOMEM;
 		goto err_out_bitmap;
@@ -638,17 +639,25 @@ int mlx4_init_eq_table(struct mlx4_dev *
 		goto err_out_comp;
 
 	if (dev->flags & MLX4_FLAG_MSI_X) {
-		static const char async_eq_name[] = "mlx4-async";
 		const char *eq_name;
 
 		for (i = 0; i < dev->caps.num_comp_vectors + 1; ++i) {
 			if (i < dev->caps.num_comp_vectors) {
-				snprintf(priv->eq_table.irq_names + i * 16, 16,
-					 "mlx4-comp-%d", i);
-				eq_name = priv->eq_table.irq_names + i * 16;
-			} else
-				eq_name = async_eq_name;
+				snprintf(priv->eq_table.irq_names +
+					 i * DEVICE_NAME_MAX,
+					 DEVICE_NAME_MAX,
+					 "mlx4-comp-%d@pci:%s", i,
+					 pci_name(dev->pdev));
+			} else {
+				snprintf(priv->eq_table.irq_names +
+					 i * DEVICE_NAME_MAX,
+					 DEVICE_NAME_MAX,
+					 "mlx4-async@pci:%s",
+					 pci_name(dev->pdev));
+			}
 
+			eq_name = priv->eq_table.irq_names +
+				  i * DEVICE_NAME_MAX;
 			err = request_irq(priv->eq_table.eq[i].irq,
 					  mlx4_msi_x_interrupt, 0, eq_name,
 					  priv->eq_table.eq + i);
@@ -658,8 +667,12 @@ int mlx4_init_eq_table(struct mlx4_dev *
 			priv->eq_table.eq[i].have_irq = 1;
 		}
 	} else {
+		snprintf(priv->eq_table.irq_names,
+			 DEVICE_NAME_MAX,
+			 DRV_NAME "@pci:%s",
+			 pci_name(dev->pdev));
 		err = request_irq(dev->pdev->irq, mlx4_interrupt,
-				  IRQF_SHARED, DRV_NAME, dev);
+				  IRQF_SHARED, priv->eq_table.irq_names, dev);
 		if (err)
 			goto err_out_async;
 
diff -rup a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
--- a/drivers/net/mlx4/mlx4.h	2009-08-03 19:42:18.737707766 -0700
+++ b/drivers/net/mlx4/mlx4.h	2009-08-03 19:43:01.532335625 -0700
@@ -198,6 +198,8 @@ struct mlx4_cq_table {
 	struct mlx4_icm_table	cmpt_table;
 };
 
+#define DEVICE_NAME_MAX 64
+
 struct mlx4_eq_table {
 	struct mlx4_bitmap	bitmap;
 	char		       *irq_names;
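
The mlx4 patch above keeps all names in one flat allocation and finds
each slot by pointer arithmetic. A minimal userspace sketch of that
layout follows, with malloc() standing in for kmalloc();
demo_alloc_irq_names(), DEMO_DEVICE_NAME_MAX and the PCI address used
in the note below are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DEVICE_NAME_MAX 64	/* hypothetical; mirrors DEVICE_NAME_MAX */

/*
 * Allocate num_comp_vectors + 1 fixed-size slots: one per completion
 * vector plus a final slot for the async event queue. Slot i starts at
 * names + i * DEMO_DEVICE_NAME_MAX. The caller frees the result.
 */
char *demo_alloc_irq_names(int num_comp_vectors, const char *pci_name)
{
	int n = num_comp_vectors + 1;
	char *names = malloc((size_t)DEMO_DEVICE_NAME_MAX * n);
	int i;

	if (!names)
		return NULL;

	for (i = 0; i < n; ++i) {
		char *slot = names + (size_t)i * DEMO_DEVICE_NAME_MAX;

		if (i < num_comp_vectors)
			snprintf(slot, DEMO_DEVICE_NAME_MAX,
				 "mlx4-comp-%d@pci:%s", i, pci_name);
		else
			snprintf(slot, DEMO_DEVICE_NAME_MAX,
				 "mlx4-async@pci:%s", pci_name);
	}
	return names;
}
```

Sizing the buffer for num_comp_vectors + 1 entries matters because the
async name now needs its own slot instead of pointing at a static
string.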


From eli at dev.mellanox.co.il  Mon Aug  3 20:32:21 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Tue, 4 Aug 2009 06:32:21 +0300
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To: <adak51kod06.fsf@cisco.com>
References: <20090803092528.GA25528@mtls03> <adak51kod06.fsf@cisco.com>
Message-ID: <20090804033221.GA30949@mtls03>

On Mon, Aug 03, 2009 at 01:31:37PM -0700, Roland Dreier wrote:
> 
> Is this all in response to problems seen in practice, or just from
> reading over the code?

I did not see a problem in practice with the current code, but while
playing around with rdma_join_multicast(), adding another case to the
switch statement revealed this problem, which I think also exists in the
current code.

> 
>  > +	atomic_t		refcount;
> 
> I think this would be clearer if you used struct kref here.
> 
Certainly. I will post another patch.


>  > @@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
>  >  {
>  >  	struct cma_multicast *mc;
>  >  
>  > +	spin_lock_irq(&id_priv->lock);
> 
> I didn't follow how this change is connected to the reference counting.
> What is this synchronizing against?  Is it an independent change of the
> reference counting?
> 

Maybe it's only a loose connection, but it still seems to me that
operations on id_priv->mc_list should be protected. Should I send a
separate patch?


From jgunthorpe at obsidianresearch.com  Mon Aug  3 21:56:47 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 3 Aug 2009 22:56:47 -0600
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <4A771852.1010606@voltaire.com>
References: <f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>
	<4A705B3A.7060404@Voltaire.COM>
	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>
	<4A731818.3060500@voltaire.com>
	<f0e08f230907311050wa750cf2n497039acafdab3b4@mail.gmail.com>
	<4A733D24.3040201@voltaire.com>
	<f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>
	<4A742E94.2070002@gmail.com>
	<f0e08f230908020426i2331cf0fg3bc3a21f1e86d1b5@mail.gmail.com>
	<4A771852.1010606@voltaire.com>
Message-ID: <20090804045647.GK24282@obsidianresearch.com>

On Mon, Aug 03, 2009 at 08:03:14PM +0300, Yossi Etigin wrote:

> The ARP stuff works this way: Remote LID changes. At some point, either the remote
> node will send an ARP reply (gratuitous), or (more likely) the local network stack
> will start sending solicited ARPs, unicast, using the invalid path. They will fail,
> so the stack will send broadcast ARP.

Erm.. Maybe a little tighter integration with the ARP/ND layer is in
order. If it knows unicast isn't working that's a pretty damn good clue
to discard the PR.

Jason


From sebastien.dugue at bull.net  Mon Aug  3 23:49:43 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Tue, 4 Aug 2009 08:49:43 +0200
Subject: [ofa-general] Re: [PATCH] libmlx4: use dynamic archive name when
	building an rpm
In-Reply-To: <ada7hxkoc68.fsf@cisco.com>
References: <20090803163736.1fde4c74@frecb007965> <ada7hxkoc68.fsf@cisco.com>
Message-ID: <20090804084943.4bcc38cd@frecb007965>

On Mon, 03 Aug 2009 13:49:35 -0700
Roland Dreier <rdreier at cisco.com> wrote:

> 
>  >   There is a discrepancy between the tar.gz source archive name and the library
>  > version. rpmbuild then fails to find its source files.
> 
>  >   Fix this by dynamically setting the package version into the archive name.
> 
> Thanks, good catch.  I fixed this by just changing to 1.0.1, since
> otherwise things run into trouble if the version number is 1.0.2-rc1 or
> something like that (the RPM version should be different from 1.0.2-rc1 in
> that case).
> 
  Thanks, I hadn't thought of the -rc issue.

  Sebastien.


From bart.vanassche at gmail.com  Tue Aug  4 00:48:22 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 09:48:22 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adafxc8ocst.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
Message-ID: <e2e108260908040048o46f66c64s91fb8368dbfb9f24@mail.gmail.com>

On Mon, Aug 3, 2009 at 10:36 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  > Issuing a SCSI reset command on an SRP initiator after the SRP connection has
>  > been closed triggers a NULL pointer dereference. The patch below fixes this
>  > NULL pointer dereference.
>  >
>  > See also http://bugzilla.kernel.org/show_bug.cgi?id=13893.
>
> Thanks for debugging this... a couple of questions:
>
>  > +    BUG_ON(!req->scmnd->device);
>
> Why BUG_ON() here?  Can we return failure or something, rather than
> crashing the whole system?

Among other things, the function srp_send_tsk_mgmt() contains the
following statement: "tsk_mgmt->lun = cpu_to_be64((u64)
req->scmnd->device->lun << 48);". This is the statement that triggered
the NULL pointer dereference. Whether or not a BUG_ON() is appropriate
here depends on which of the following two alternatives is preferred:
should the caller guarantee that req->scmnd->device != NULL, or should
srp_send_tsk_mgmt() handle the condition req->scmnd->device == NULL
itself?
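
To make the second alternative concrete, here is a small userspace
sketch in which the callee checks for a missing device and reports
failure instead of dereferencing NULL. All types, names and return
codes (the demo_ and DEMO_ identifiers) are hypothetical stand-ins,
not the kernel's definitions, and the byte-order conversion is left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes and types; not the kernel's definitions. */
enum demo_rc { DEMO_SUCCESS = 0, DEMO_FAILED = 1 };

struct demo_device { unsigned int lun; };
struct demo_scmnd  { struct demo_device *device; };
struct demo_req    { struct demo_scmnd *scmnd; };

/*
 * The callee validates the pointer chain itself and returns a failure
 * code, so a stale request cannot bring the whole system down.
 */
enum demo_rc demo_send_tsk_mgmt(const struct demo_req *req, uint64_t *lun_out)
{
	if (!req || !req->scmnd || !req->scmnd->device)
		return DEMO_FAILED;

	/* Same shift as in the quoted statement, minus cpu_to_be64(). */
	*lun_out = (uint64_t)req->scmnd->device->lun << 48;
	return DEMO_SUCCESS;
}
```

The first alternative would instead keep the callee unchecked and move
the test into every caller, which is only safe if all callers are
known.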

>  > +    if (!req->scmnd->device)
>  > +            return FAILED;
>
> How do we end up in srp_reset_device() with req->scmnd->device == NULL?
> Presumably req->scmnd should match scmnd if I am understanding the code
> properly -- and then scmnd->device == NULL??

Good question. I did not yet analyze why this happens. But before I
started developing a patch I had first verified that scmnd->device is
NULL at that point by inserting the statement WARN_ON(!scmnd->device).

A clue might be that without the above patch the BUG message on the
initiator system is triggered just after the "SRP reset_device called"
message has been logged.

Bart.


From kliteyn at dev.mellanox.co.il  Tue Aug  4 01:29:06 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 04 Aug 2009 11:29:06 +0300
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: fix printing trap 258
	details
Message-ID: <4A77F152.2030506@dev.mellanox.co.il>

Hi Sasha,

Fixing some issues with printing trap 258 details.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_helper.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 57de0d4..07b1e5a 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -1828,10 +1828,11 @@ void osm_dump_notice(IN osm_log_t * p_log,
 					  lid2),
 				cl_ntoh32(p_ntci->data_details.ntc_257_258.key),
 				cl_ntoh32(p_ntci->data_details.ntc_257_258.
-					  qp1) >> 24,
+					  qp1) >> 28,
 				cl_ntoh32(p_ntci->data_details.ntc_257_258.
 					  qp1) & 0xffffff,
-				cl_ntoh32(p_ntci->data_details.ntc_257_258.qp2),
+				cl_ntoh32(p_ntci->data_details.ntc_257_258.
+					  qp2) & 0xffffff,
 				inet_ntop(AF_INET6, p_ntci->data_details.
 					  ntc_257_258.gid1.raw, gid_str,
 					  sizeof gid_str),
-- 
1.5.1.4
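
The effect of the new shift and mask can be checked with the sketch
below. It assumes, as the diff above implies, that after conversion to
host order the 32-bit word carries the SL in its top 4 bits, 4 reserved
bits, and the QP number in its low 24 bits; demo_sl() and demo_qp() are
hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Input is a host-order word, i.e. what cl_ntoh32() would return. */

/* SL: bits 31..28 (assumed layout). */
unsigned int demo_sl(uint32_t host_word)
{
	return host_word >> 28;
}

/* QP number: bits 23..0 (assumed layout). */
uint32_t demo_qp(uint32_t host_word)
{
	return host_word & 0xffffff;
}
```

With a word such as 0x5f123456, shifting by 24 would yield 0x5f because
it drags the reserved bits along, whereas shifting by 28 yields 5.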


From vlad at lists.openfabrics.org  Tue Aug  4 02:59:47 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue,  4 Aug 2009 02:59:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090804-0200 daily build status
Message-ID: <20090804095948.2C1D3E61D1B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From Robert at saq.co.uk  Tue Aug  4 02:58:24 2009
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 4 Aug 2009 10:58:24 +0100
Subject: [ofa-general] OFED on Centos with 2.6.30.4 generic kernel
Message-ID: <C1EAC9C5E752D24C968FF091D446D823458F4A@ALTERNATEREALIT>

I'm a bit of a newbie to kernel building but work on my first custom
kernel seems to be going well so far. 

The issue I have is the systems this kernel is destined for are using
Mellanox infiniband cards, IPOIB (CM), RDMA and Subnet Manager (Systems
are direct cabled to each other). I noticed support seems to be built-in
to the kernel for all but the subnet manager. 

Should I use the built-in kernel support and install the Subnet manager
separately? Or build a kernel with no infiniband support and then try to
install OFED? Will OFED even likely install with this sort of setup?


Thanks,

Rob

The SAQ Group

Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SAQ is the trading name of SEMTEC Limited. Registered in England & Wales
Company Number: 06481952

http://www.saqnet.co.uk AS29219

SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.

Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.

ISPA Member


From ofedrnicuser at yahoo.com  Tue Aug  4 03:08:43 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Tue, 4 Aug 2009 03:08:43 -0700 (PDT)
Subject: [ofa-general] perftest for Chelsio RNIC adapters
Message-ID: <351317.84709.qm@web111212.mail.gq1.yahoo.com>

Hi,

Are the performance tests in perftest-1.2 supported for Chelsio and other RNIC adapters?

Regards,
Bill


From jackm at dev.mellanox.co.il  Tue Aug  4 03:25:05 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 4 Aug 2009 13:25:05 +0300
Subject: [ofa-general] OFED on Centos with 2.6.30.4 generic kernel
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D823458F4A@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D823458F4A@ALTERNATEREALIT>
Message-ID: <200908041325.05450.jackm@dev.mellanox.co.il>

On Tuesday 04 August 2009 12:58, Robert Dunkley wrote:
> I'm a bit of a newbie to kernel building but work on my first custom
> kernel seems to be going well so far. 
> 
> The issue I have is the systems this kernel is destined for are using
> Mellanox infiniband cards, IPOIB (CM), RDMA and Subnet Manager (Systems
> are direct cabled to each other). I noticed support seems to be built-in
> to the kernel for all but the subnet manager. 
> 
> Should I use the built-in kernel support and install the Subnet manager
> separately? Or build a kernel with no infiniband support and then try to
> install OFED? Will OFED even likely install with this sort of setup?

OFED 1.4 will not install (it supports up to kernel 2.6.27 only).
OFED 1.5 is currently under development, supporting up to kernel 2.6.30.
An alpha version (lightly tested only) is currently available.
I do not know how much testing the built-in kernel support has undergone.

-Jack
> 
> Thanks,
> 
> Rob
> 


From sashak at voltaire.com  Tue Aug  4 05:32:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 15:32:53 +0300
Subject: [ofa-general] Re: [PATCH] opensm: do not configure MFTs when mcast
 support is disabled
In-Reply-To: <4A76E05D.9070705@dev.mellanox.co.il>
References: <4A76E05D.9070705@dev.mellanox.co.il>
Message-ID: <20090804123253.GA7993@me>

On 16:04 Mon 03 Aug     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> I noticed that when MCast support in OSM is disabled (command line
> option '-d3'), MFTs on the switches are still getting configured.
> 
> Turns out that MFTs configuration was disabled only in heavy sweep,
> but it was still working at idle time - the following patch fixes it.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Aug  4 05:33:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 15:33:08 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Use proper flag
	name in comment
In-Reply-To: <20090803175946.GA5981@comcast.net>
References: <20090803175946.GA5981@comcast.net>
Message-ID: <20090804123308.GB7993@me>

On 13:59 Mon 03 Aug     , Hal Rosenstock wrote:
> 
> Change force_single_heavy_sweep to force_heavy_sweep
> Other cosmetic commentary changes
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Aug  4 05:35:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 15:35:27 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsendtrap.c: Fill in
 capability mask on trap 144
In-Reply-To: <20090803203957.GA23640@comcast.net>
References: <20090803203957.GA23640@comcast.net>
Message-ID: <20090804123527.GC7993@me>

On 16:39 Mon 03 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Aug  4 05:38:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 15:38:21 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: fix printing trap 258
	details
In-Reply-To: <4A77F152.2030506@dev.mellanox.co.il>
References: <4A77F152.2030506@dev.mellanox.co.il>
Message-ID: <20090804123821.GD7993@me>

On 11:29 Tue 04 Aug     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Fixing some issues with printing trap 258 details.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From hnrose at comcast.net  Tue Aug  4 05:47:17 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 08:47:17 -0400
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: Validate trap is 144
	before checking for NodeDescription changed
Message-ID: <20090804124717.GA12236@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index bf39926..925cb27 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 		}
 	}
 
-	/* Check for node description update. IB Spec v1.2.1 pg 823 */
-	if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
-	    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
-		OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n");
-
-		if (p_physp) {
-			CL_PLOCK_ACQUIRE(sm->p_lock);
-			osm_req_get_node_desc(sm, p_physp);
-			CL_PLOCK_RELEASE(sm->p_lock);
-		} else {
-			OSM_LOG(sm->p_log, OSM_LOG_ERROR,
-				"ERR 3812: No physical port found for "
-				"trap 144: \"node description update\"\n");
+	if (ib_notice_is_generic(p_ntci)) {
+		/* Check for node description update. IB Spec v1.2.1 pg 823 */
+		if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
+			if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
+			    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
+				OSM_LOG(sm->p_log, OSM_LOG_INFO,
+					"Trap 144 Node description update\n");
+
+				if (p_physp) {
+					CL_PLOCK_ACQUIRE(sm->p_lock);
+					osm_req_get_node_desc(sm, p_physp);
+					CL_PLOCK_RELEASE(sm->p_lock);
+				} else
+					OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+						"ERR 3812: No physical port found for "
+						"trap 144: \"node description update\"\n");
+			}
 		}
-	}
 
-	/* do a sweep if we received a trap */
-	if (sm->p_subn->opt.sweep_on_trap) {
-		/* if this is trap number 128 or run_heavy_sweep is TRUE -
-		   update the force_heavy_sweep flag of the subnet.
-		   Sweep also on traps 144/145 - these traps signal a change of
-		   certain port capabilities/system image guid.
-		   TODO: In the future this can be changed to just getting
-		   PortInfo on this port instead of sweeping the entire subnet. */
-		if (ib_notice_is_generic(p_ntci) &&
-		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
-		     run_heavy_sweep)) {
-			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-				"Forcing heavy sweep. Received trap:%u\n",
-				cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
+		/* do a sweep if we received a trap */
+		if (sm->p_subn->opt.sweep_on_trap) {
+			/* if this is trap number 128 or run_heavy_sweep is
+			   TRUE - update the force_heavy_sweep flag of the
+			   subnet. Also, sweep also on traps 144/145 -
+			   these traps signal a change of certain port
+			   capabilities/system image guid.
+			   TODO: In the future this can be changed to just
+			   getting PortInfo on this port instead of sweeping
+			   the entire subnet. */
+			if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
+			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
+			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
+			    run_heavy_sweep) {
+				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
+					"Forcing heavy sweep. Received trap:%u\n",
+					cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
 
-			sm->p_subn->force_heavy_sweep = TRUE;
+				sm->p_subn->force_heavy_sweep = TRUE;
+			}
+			osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
 		}
-		osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
 	}
 
 	/* If we reached here due to trap 129/130/131 - do not need to do


From hnrose at comcast.net  Tue Aug  4 05:50:09 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 08:50:09 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibsendtrap.c: Add support for
	link_speed_enabled_change trap
Message-ID: <20090804125009.GB12236@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 38305a2..c8c7ee8 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -87,6 +87,20 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 	    TRAP_144_MASK_NODE_DESCRIPTION_CHANGE;
 }
 
+static void build_trap144_2(ib_mad_notice_attr_t * n, ib_portid_t *port)
+{
+	n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
+	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+	n->g_or_v.generic.trap_num = cl_hton16(144);
+	n->issuer_lid = cl_hton16((uint16_t) port->lid);
+	n->data_details.ntc_144.lid = n->issuer_lid;
+	n->data_details.ntc_144.new_cap_mask = cl_hton32(get_cap_mask(port));
+	n->data_details.ntc_144.local_changes =
+	    TRAP_144_MASK_OTHER_LOCAL_CHANGES;
+	n->data_details.ntc_144.change_flgs =
+	    TRAP_144_MASK_LINK_SPEED_ENABLE_CHANGE;
+}
+
 static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT;
@@ -134,6 +148,7 @@ typedef struct _trap_def {
 
 trap_def_t traps[3] = {
 	{"node_desc_change", build_trap144},
+	{"link_speed_enabled_change", build_trap144_2},
 	{"local_link_integrity", build_trap129},
 	{NULL, NULL}
 };


From hnrose at comcast.net  Tue Aug  4 06:18:36 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 09:18:36 -0400
Subject: [ofa-general] [PATCH] opensm: Add initial support for optimized
	SLtoVLMappingTable programming
Message-ID: <20090804131836.GA15226@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 6c20de8..8443763 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -4,6 +4,7 @@
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -204,6 +205,7 @@ typedef struct osm_subn_opt {
 	boolean_t daemon;
 	boolean_t sm_inactive;
 	boolean_t babbling_port_policy;
+	boolean_t use_optimized_slvl;
 	osm_qos_options_t qos_options;
 	osm_qos_options_t qos_ca_options;
 	osm_qos_options_t qos_sw0_options;
@@ -428,6 +430,10 @@ typedef struct osm_subn_opt {
 *	babbling_port_policy
 *		OpenSM will enforce its "babbling" port policy.
 *
+*	use_optimized_slvl
+*		Use optimized SLtoVLMappingTable programming if
+*		device indicates it supports this.
+*
 *	perfmgr
 *		Enable or disable the performance manager
 *
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index e3dfb58..592e082 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -150,7 +151,7 @@ static ib_api_status_t vlarb_update(osm_sm_t * sm, osm_physp_t * p,
 
 static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 					  uint8_t in_port, uint8_t out_port,
-					  unsigned force_update,
+					  unsigned optimize, unsigned force_update,
 					  const ib_slvl_table_t * sl2vl_table)
 {
 	osm_madw_context_t context;
@@ -177,10 +178,18 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 	    !memcmp(p_tbl, &tbl, sizeof(tbl)))
 		return IB_SUCCESS;
 
+	/* both input port and output port wildcarded */
+	if (optimize && (in_port != 1 || out_port != 1))
+		return IB_SUCCESS;
+
 	context.slvl_context.node_guid = osm_node_get_node_guid(p_node);
 	context.slvl_context.port_guid = osm_physp_get_port_guid(p);
 	context.slvl_context.set_method = TRUE;
-	attr_mod = in_port << 8 | out_port;
+	if (optimize)
+		/* both input port and output port wildcarded */
+		attr_mod = 0x30000;
+	else
+		attr_mod = in_port << 8 | out_port;
 	return osm_req_set(sm, osm_physp_get_dr_path_ptr(p),
 			   (uint8_t *) & tbl, sizeof(tbl),
 			   IB_MAD_ATTR_SLVL_TABLE, cl_hton32(attr_mod),
@@ -189,14 +198,17 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 
 static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port,
 				    osm_physp_t * p, uint8_t port_num,
-				    unsigned force_update,
+				    unsigned optimize, unsigned force_update,
 				    const struct qos_config *qcfg)
 {
 	ib_api_status_t status;
 	uint8_t i, num_ports;
 	osm_physp_t *p_physp;
+	osm_node_t *p_node;
+	unsigned optimizesl2vl = 0;
 
-	if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) {
+	p_node = osm_physp_get_node_ptr(p);
+	if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) {
 		if (ib_port_info_get_vl_cap(&p->port_info) == 1) {
 			/* Check port 0's capability mask */
 			p_physp = p_port->p_physp;
@@ -205,7 +217,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port,
 			     capability_mask & IB_PORT_CAP_HAS_SL_MAP))
 				return IB_SUCCESS;
 		}
-		num_ports = osm_node_get_num_physp(osm_physp_get_node_ptr(p));
+		num_ports = osm_node_get_num_physp(p_node);
+		optimizesl2vl = ib_switch_info_get_opt_sl2vlmapping(&p_node->sw->switch_info) & optimize;
 	} else {
 		if (!(p->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP))
 			return IB_SUCCESS;
@@ -213,8 +226,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port,
 	}
 
 	for (i = 0; i < num_ports; i++) {
-		status = sl2vl_update_table(sm, p, i, port_num, force_update,
-					    &qcfg->sl2vl);
+		status = sl2vl_update_table(sm, p, i, port_num, optimizesl2vl,
+					    force_update, &qcfg->sl2vl);
 		if (status != IB_SUCCESS)
 			return status;
 	}
@@ -224,7 +237,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port,
 
 static int qos_physp_setup(osm_log_t * p_log, osm_sm_t * sm,
 			   osm_port_t * p_port, osm_physp_t * p,
-			   uint8_t port_num, unsigned force_update,
+			   uint8_t port_num, unsigned optimize,
+			   unsigned force_update,
 			   const struct qos_config *qcfg)
 {
 	ib_api_status_t status;
@@ -245,7 +259,8 @@ static int qos_physp_setup(osm_log_t * p_log, osm_sm_t * sm,
 	}
 
 	/* setup SL2VL tables */
-	status = sl2vl_update(sm, p_port, p, port_num, force_update, qcfg);
+	status = sl2vl_update(sm, p_port, p, port_num, optimize, force_update,
+			      qcfg);
 	if (status != IB_SUCCESS) {
 		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 6203 : "
 			"failed to update SL2VLMapping tables "
@@ -307,6 +322,7 @@ int osm_qos_setup(osm_opensm_t * p_osm)
 				    p_osm->subn.need_update;
 				if (qos_physp_setup(&p_osm->log, &p_osm->sm,
 						    p_port, p_physp, i,
+						    p_osm->subn.opt.use_optimized_slvl,
 						    force_update, &swe_config))
 					ret = -1;
 			}
@@ -327,7 +343,7 @@ int osm_qos_setup(osm_opensm_t * p_osm)
 
 		force_update = p_physp->need_update || p_osm->subn.need_update;
 		if (qos_physp_setup(&p_osm->log, &p_osm->sm, p_port, p_physp,
-				    0, force_update, cfg))
+				    0, 0, force_update, cfg))
 			ret = -1;
 	}
 
diff --git a/opensm/opensm/osm_slvl_map_rcv.c b/opensm/opensm/osm_slvl_map_rcv.c
index 9c37442..67c71bd 100644
--- a/opensm/opensm/osm_slvl_map_rcv.c
+++ b/opensm/opensm/osm_slvl_map_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -72,7 +73,9 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
 	osm_slvl_context_t *p_context;
 	ib_net64_t port_guid;
 	ib_net64_t node_guid;
-	uint8_t out_port_num, in_port_num;
+	uint32_t attr_mod;
+	uint8_t out_port_num, in_port_num, startinport, startoutport,
+		endinport, endoutport;
 
 	CL_ASSERT(sm);
 
@@ -111,6 +114,9 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
 		    (uint8_t) cl_ntoh32(p_smp->attr_mod & 0xFF000000);
 		in_port_num =
 		    (uint8_t) cl_ntoh32((p_smp->attr_mod & 0x00FF0000) << 8);
+		attr_mod = cl_ntoh32(p_smp->attr_mod);
+		if (attr_mod & 0x30000)
+			goto opt_sl2vl;
 		p_physp = osm_node_get_physp_ptr(p_node, out_port_num);
 	} else {
 		p_physp = p_port->p_physp;
@@ -123,7 +129,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
 	   all we want is to update the subnet.
 	 */
 	OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-		"Got SLtoVL get response in_port_num %u out_port_num %u with "
+		"Received SLtoVL GetResp in_port_num %u out_port_num %u with "
 		"GUID 0x%" PRIx64 " for parent node GUID 0x%" PRIx64 ", TID 0x%"
 		PRIx64 "\n", in_port_num, out_port_num, cl_ntoh64(port_guid),
 		cl_ntoh64(node_guid), cl_ntoh64(p_smp->trans_id));
@@ -142,6 +148,39 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
 				out_port_num, p_slvl_tbl, OSM_LOG_DEBUG);
 
 	osm_physp_set_slvl_tbl(p_physp, p_slvl_tbl, in_port_num);
+	goto Exit;
+
+opt_sl2vl:
+	OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
+		"Got optimized SLtoVL get response in_port_num %u out_port_num "
+		"%u with GUID 0x%" PRIx64 " for parent node GUID 0x%" PRIx64
+		", TID 0x%" PRIx64 "\n", in_port_num, out_port_num,
+		cl_ntoh64(port_guid), cl_ntoh64(node_guid),
+		cl_ntoh64(p_smp->trans_id));
+
+	osm_dump_slvl_map_table(sm->p_log, port_guid, in_port_num,
+				out_port_num, p_slvl_tbl, OSM_LOG_DEBUG);
+
+	if (attr_mod & 0x10000) {
+		startoutport = ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info) ? 0 : 1;
+		endoutport = osm_node_get_num_physp(p_node);
+	} else
+		endoutport = startoutport = out_port_num;
+	if (attr_mod & 0x20000) {
+		startinport = ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info) ? 0 : 1;
+		endinport = osm_node_get_num_physp(p_node);
+	} else
+		endinport = startinport = in_port_num;
+
+	for (out_port_num = startoutport; out_port_num < endoutport;
+	     out_port_num++) {
+		p_physp = osm_node_get_physp_ptr(p_node, out_port_num);
+		if (!p_physp)
+			continue;
+		for (in_port_num = startinport; in_port_num < endinport;
+		     in_port_num++)
+			osm_physp_set_slvl_tbl(p_physp, p_slvl_tbl, in_port_num);
+	}
 
 Exit:
 	cl_plock_release(sm->p_lock);
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 0d11811..540165a 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -4,6 +4,7 @@
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -352,6 +353,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "daemon", OPT_OFFSET(daemon), opts_parse_boolean, NULL, 0 },
 	{ "sm_inactive", OPT_OFFSET(sm_inactive), opts_parse_boolean, NULL, 1 },
 	{ "babbling_port_policy", OPT_OFFSET(babbling_port_policy), opts_parse_boolean, NULL, 1 },
+	{"use_optimized_slvl", OPT_OFFSET(use_optimized_slvl), opts_parse_boolean, NULL, 1 },
 #ifdef ENABLE_OSM_PERF_MGR
 	{ "perfmgr", OPT_OFFSET(perfmgr), opts_parse_boolean, NULL, 0 },
 	{ "perfmgr_redir", OPT_OFFSET(perfmgr_redir), opts_parse_boolean, NULL, 0 },
@@ -715,6 +717,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->daemon = FALSE;
 	p_opt->sm_inactive = FALSE;
 	p_opt->babbling_port_policy = FALSE;
+	p_opt->use_optimized_slvl = FALSE;
 #ifdef ENABLE_OSM_PERF_MGR
 	p_opt->perfmgr = FALSE;
 	p_opt->perfmgr_redir = TRUE;
@@ -1501,10 +1504,13 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# SM Inactive\n"
 		"sm_inactive %s\n\n"
 		"# Babbling Port Policy\n"
-		"babbling_port_policy %s\n\n",
+		"babbling_port_policy %s\n\n"
+		"# Use Optimized SLtoVLMapping programming if supported by device\n"
+		"use_optimized_slvl %s\n\n",
 		p_opts->daemon ? "TRUE" : "FALSE",
 		p_opts->sm_inactive ? "TRUE" : "FALSE",
-		p_opts->babbling_port_policy ? "TRUE" : "FALSE");
+		p_opts->babbling_port_policy ? "TRUE" : "FALSE",
+		p_opts->use_optimized_slvl ? "TRUE" : "FALSE");
 
 #ifdef ENABLE_OSM_PERF_MGR
 	fprintf(out,
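
As a reading aid for the patch above, here is a minimal standalone sketch of how the SLtoVLMappingTable attribute modifier is interpreted once it has been converted to host byte order. The bit values (0x10000 for "all output ports", 0x20000 for "all input ports", input port in bits 8-15, output port in bits 0-7) are taken from the patch itself; the struct, macro, and function names are hypothetical and are not part of OpenSM.

```c
#include <stdint.h>

/* Hypothetical names for illustration; the values mirror the patch. */
#define SLVL_ALL_OUT_PORTS 0x10000u
#define SLVL_ALL_IN_PORTS  0x20000u

struct slvl_attr_mod {
	uint8_t in_port;	/* bits 8-15 of the attribute modifier */
	uint8_t out_port;	/* bits 0-7 of the attribute modifier */
	int all_in_ports;	/* nonzero if the input port is wildcarded */
	int all_out_ports;	/* nonzero if the output port is wildcarded */
};

/* Decode a host-byte-order attribute modifier into its fields. */
static struct slvl_attr_mod slvl_decode_attr_mod(uint32_t attr_mod)
{
	struct slvl_attr_mod d;

	d.out_port = (uint8_t)(attr_mod & 0xFF);
	d.in_port = (uint8_t)((attr_mod >> 8) & 0xFF);
	d.all_out_ports = (attr_mod & SLVL_ALL_OUT_PORTS) != 0;
	d.all_in_ports = (attr_mod & SLVL_ALL_IN_PORTS) != 0;
	return d;
}
```

With both wildcard bits set (0x30000, the value the patch sends), a single Set covers every input/output port pair, which is why the receive path loops over all ports when it sees those bits.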


From eli at mellanox.co.il  Tue Aug  4 06:24:08 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 4 Aug 2009 16:24:08 +0300
Subject: [ofa-general] [PATCH v2] cma: fix access to freed memory
Message-ID: <20090804132408.GA11545@mtls03>

rdma_join_multicast() allocates a struct cma_multicast and then proceeds to
join a multicast address. However, the join operation completes in another
context, and the allocated struct could be released if the user either destroys
the rdma_id object or leaves the multicast group while the join operation is
in progress. This patch uses a kref object to reference-count the struct and
avoid this situation.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

Changes from previous version: I removed the protection of mc list
manipulation using spinlocks because:
a. It should be split out into a separate patch.
b. I have doubts about the necessity of this protection.

 drivers/infiniband/core/cma.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 851de83..aa62101 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -157,6 +157,7 @@ struct cma_multicast {
 	struct list_head	list;
 	void			*context;
 	struct sockaddr_storage	addr;
+	struct kref		mcref;
 };
 
 struct cma_work {
@@ -290,6 +291,13 @@ static inline void cma_deref_dev(struct cma_device *cma_dev)
 		complete(&cma_dev->comp);
 }
 
+void release_mc(struct kref *kref)
+{
+	struct cma_multicast *mc = container_of(kref, struct cma_multicast, mcref);
+
+	kfree(mc);
+}
+
 static void cma_detach_from_dev(struct rdma_id_private *id_priv)
 {
 	list_del(&id_priv->list);
@@ -827,7 +835,7 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 				  struct cma_multicast, list);
 		list_del(&mc->list);
 		ib_sa_free_multicast(mc->multicast.ib);
-		kfree(mc);
+		kref_put(&mc->mcref, release_mc);
 	}
 }
 
@@ -2643,7 +2651,7 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 	id_priv = mc->id_priv;
 	if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) &&
 	    cma_disable_callback(id_priv, CMA_ADDR_RESOLVED))
-		return 0;
+		goto out;
 
 	mutex_lock(&id_priv->qp_mutex);
 	if (!status && id_priv->id.qp)
@@ -2669,10 +2677,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 		cma_exch(id_priv, CMA_DESTROYING);
 		mutex_unlock(&id_priv->handler_mutex);
 		rdma_destroy_id(&id_priv->id);
-		return 0;
+		goto out;
 	}
 
 	mutex_unlock(&id_priv->handler_mutex);
+out:
+	kref_put(&mc->mcref, release_mc);
 	return 0;
 }
 
@@ -2759,11 +2769,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	memcpy(&mc->addr, addr, ip_addr_size(addr));
 	mc->context = context;
 	mc->id_priv = id_priv;
+	kref_init(&mc->mcref);
 
 	spin_lock(&id_priv->lock);
 	list_add(&mc->list, &id_priv->mc_list);
 	spin_unlock(&id_priv->lock);
 
+	kref_get(&mc->mcref);
 	switch (rdma_node_get_transport(id->device->node_type)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
@@ -2800,7 +2812,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 						&mc->multicast.ib->rec.mgid,
 						mc->multicast.ib->rec.mlid);
 			ib_sa_free_multicast(mc->multicast.ib);
-			kfree(mc);
+			kref_put(&mc->mcref, release_mc);
 			return;
 		}
 	}
-- 
1.6.3.3
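
The reference-counting scheme in the patch above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct ref`, `ref_init`, `ref_get`, `ref_put`, `struct mc_entry`, and `mc_join_start` are hypothetical stand-ins for the kernel's `struct kref`, `kref_init`, `kref_get`, `kref_put`, `struct cma_multicast`, and the relevant part of `rdma_join_multicast()`; the `mc_released` counter exists only so the behavior can be observed.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's struct kref (illustrative only). */
struct ref {
	atomic_int count;
};

/* Hypothetical stand-in for struct cma_multicast. */
struct mc_entry {
	struct ref ref;		/* kept first so a cast recovers the container */
	void *context;
};

static int mc_released;		/* counts how many entries have been freed */

static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

static void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

/* Returns 1 if this call dropped the last reference and ran release(). */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub(&r->count, 1) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

static void mc_release(struct ref *r)
{
	/* The kernel code uses container_of(); the cast works here only
	   because ref is the first member of struct mc_entry. */
	struct mc_entry *mc = (struct mc_entry *)r;

	free(mc);
	mc_released++;
}

/* Allocate an entry holding two references: one for the owner's list and
   one for the join callback that completes in another context. */
static struct mc_entry *mc_join_start(void)
{
	struct mc_entry *mc = calloc(1, sizeof(*mc));

	if (!mc)
		return NULL;
	ref_init(&mc->ref);	/* owner's reference */
	ref_get(&mc->ref);	/* pending callback's reference */
	return mc;
}
```

Whichever side finishes last, the leave path or the join callback, performs the final put and frees the entry, so neither can touch freed memory.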


From hnrose at comcast.net  Tue Aug  4 06:54:26 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 09:54:26 -0400
Subject: [ofa-general] [PATCH] perftest/README: Add SL option
Message-ID: <20090804135426.GA15784@comcast.net>

perftest/README: Add SL option

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/README b/README
index 8c0d558..e0acf2d 100755
--- a/README
+++ b/README
@@ -124,6 +124,7 @@ Common Options to all tests:
   -a, --all                    run sizes from 2 till 2^23
   -t, --tx-depth=<dep>         size of tx queue (default: 50)
   -n, --iters=<iters>          number of exchanges (at least 100, default: 1000)
+  -S, --sl=<sl>                SL (default 0)
   -C, --report-cycles          report times in cpu cycle units
 					(default: microseconds)
   -H, --report-histogram       print out all results


From hal.rosenstock at gmail.com  Tue Aug  4 07:00:20 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 10:00:20 -0400
Subject: [ofa-general] perftest for Chelsio RNIC adapters
In-Reply-To: <351317.84709.qm@web111212.mail.gq1.yahoo.com>
References: <351317.84709.qm@web111212.mail.gq1.yahoo.com>
Message-ID: <f0e08f230908040700s5afbf00fsf711e00ce6b9f9de@mail.gmail.com>

On Tue, Aug 4, 2009 at 6:08 AM, Bill N <ofedrnicuser at yahoo.com> wrote:

> Hi,
>
> Are the performance tests of perftest-1.2 supported for Chelsio and other
> RNIC adapters?


I'm not sure what 1.2 is exactly, but I'm pretty sure the answer is currently
no, although it shouldn't be much work to add. I recently saw a patch for
this supporting a gid option, but it looked to me like it was implemented as
IBxOE specific rather than also accommodating IB/iWARP.

-- Hal


>
>
> Regards,
> Bill
>
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com  Tue Aug  4 07:03:17 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 10:03:17 -0400
Subject: [ofa-general] umad SLID and LMC
In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org>
References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com>
	<200907231121.00140.jackm@dev.mellanox.co.il>
	<adaocrb43su.fsf@cisco.com>
	<F4251187-C5FA-42E8-A40A-F3C7B32E09EB@redhat.com>
	<5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org>
Message-ID: <f0e08f230908040703t28335fabib18e795d3d3a35a2@mail.gmail.com>

On Sun, Aug 2, 2009 at 5:45 PM, Todd Rimmer <todd.rimmer at qlogic.com> wrote:

> What is the proper way to control the SLID used for outgoing umad sends?
>
> For example, when using LMC>0, the PathRecord returned from the SM for
> talking to a given remote node may have a SLID which is not the BaseLid for
> the sender.  How does the sender ensure the correct SLID is used for the
> outgoing mad?
>
> In reviewing the API it seems like the only way to do this is:
> void *umad = umad_alloc(...);
>
> // call various umad calls to initialize address and contents
> umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID;
>
> umad_send(..., umad, ...);
>
> Was path_bits an intentional omission in the API?


No; it was an unintentional omission AFAICT.


> It seems that a function which could update the ib_mad_addr in a umad
> given a path record would be appropriate.


Seems reasonable to me. Care to supply a patch?

-- Hal


>
> Todd Rimmer
> Chief Architect
> QLogic Network Systems Group
> Voice: 610-233-4852     Fax: 610-233-4777
> Todd.Rimmer at QLogic.com  www.QLogic.com <http://www.qlogic.com/>
>
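
For the umad snippet quoted above, the "lower LMC bits of SLID" can be computed with a small helper like the one below. This is a sketch, not libibumad API: the function name `slid_to_path_bits` is hypothetical, and the result is meant to be stored in the `path_bits` field exactly as the quoted code does.

```c
#include <stdint.h>

/* Hypothetical helper: extract the path bits (the low LMC bits) of a
   source LID. */
static uint8_t slid_to_path_bits(uint16_t slid, uint8_t lmc)
{
	/* LMC is a 3-bit field, so at most 7 low bits select the path. */
	return (uint8_t)(slid & ((1u << (lmc & 0x7)) - 1));
}
```

For example, with LMC 2 a port owns four consecutive LIDs, and a SLID of 0x13 yields path bits 3, the offset from the base LID 0x10.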

From chien.tin.tung at intel.com  Tue Aug  4 07:25:00 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Tue, 4 Aug 2009 07:25:00 -0700
Subject: [ofa-general] perftest for Chelsio RNIC adapters
In-Reply-To: <351317.84709.qm@web111212.mail.gq1.yahoo.com>
References: <351317.84709.qm@web111212.mail.gq1.yahoo.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383035F7A95D5@azsmsx501.amr.corp.intel.com>

 
>Are the performance tests of perftest-1.2 supported for Chelsio 
>and other RNIC adapters?

You can run ib_rdma_bw and ib_rdma_lat over iWARP adapters with the -c flag (use RDMA CM).

Chien

From kliteyn at dev.mellanox.co.il  Tue Aug  4 07:32:56 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 04 Aug 2009 17:32:56 +0300
Subject: [ofa-general] [PATCH] opensm: fixing handling of opt.max_wire_smps
Message-ID: <4A784698.10803@dev.mellanox.co.il>

opt.max_wire_smps is a uint32, but when it is propagated
into the VL15 poller it is cast to int32. Fix the
parameter handling to protect against out-of-range values.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/main.c       |    2 +-
 opensm/opensm/osm_subnet.c |    7 +++++++
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 296d5d5..9cb9990 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -722,7 +722,7 @@ int main(int argc, char *argv[])

 		case 'n':
 			opt.max_wire_smps = strtol(optarg, NULL, 0);
-			if (opt.max_wire_smps <= 0)
+			if (opt.max_wire_smps > 0x7FFFFFFF)
 				opt.max_wire_smps = 0x7FFFFFFF;
 			printf(" Max wire smp's = %d\n", opt.max_wire_smps);
 			break;
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..c07d823 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
 		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
 	}

+	if (p_opts->max_wire_smps > 0x7FFFFFFF) {
+		log_report(" Invalid Cached Option Value: max_wire_smps = %u,"
+			   " Using Default: %u\n",
+			   p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE);
+		p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
+	}
+
 	if (strcmp(p_opts->console, OSM_DISABLE_CONSOLE)
 	    && strcmp(p_opts->console, OSM_LOCAL_CONSOLE)
 #ifdef ENABLE_OSM_CONSOLE_SOCKET
-- 
1.5.1.4


From hnrose at comcast.net  Tue Aug  4 08:13:37 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 11:13:37 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Added error numbers
	to some error log messages
Message-ID: <20090804151337.GA6037@comcast.net>


Also, made a routine static that didn't need to be global

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 2715fe7..6210477 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -811,7 +811,7 @@ static int lash_core(lash_t * p_lash)
 	OSM_LOG_ENTER(p_log);
 
 	if (p_lash->p_osm->subn.opt.do_mesh_analysis && osm_do_mesh_analysis(p_lash)) {
-		OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n");
+		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D05: Mesh analysis failed\n");
 		goto Exit;
 	}
 
@@ -820,7 +820,7 @@ static int lash_core(lash_t * p_lash)
 		shortest_path(p_lash, i);
 		if (generate_routing_func_for_mst(p_lash, i, &dests)) {
 			status = -1;
-			OSM_LOG(p_log, OSM_LOG_ERROR,
+			OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06: "
 				"generate_routing_func_for_mst failed\n");
 			goto Exit;
 		}
@@ -969,7 +969,7 @@ static unsigned get_lash_id(osm_switch_t * p_sw)
 	return ((switch_t *) p_sw->priv)->id;
 }
 
-int get_next_port(switch_t *sw, int link)
+static int get_next_port(switch_t *sw, int link)
 {
 	link_t *l = sw->node->links[link];
 	int port = l->next_port++;


From sashak at voltaire.com  Tue Aug  4 08:27:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 18:27:00 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090730232848.GA22660@comcast.net>
References: <20090730232848.GA22660@comcast.net>
Message-ID: <20090804152700.GF7993@me>

Hi,

On 19:28 Thu 30 Jul     , Hal Rosenstock wrote:
> 
> Currently, MADs are pipelined to a single switch at a time which
> effectively serializes these requests due to processing at the SMA.
> This patch pipelines (stripes) them across the switches first before
> proceeding with successive blocks. As a result of this striping,
> multiple switches can process the set and respond concurrently
> which results in an improvement to the subnet initialization time.

The idea is nice. However, I have some initial comments about the
implementation.

BTW, is there a reason to have an option to preserve the current
behavior? (I don't know, just asking)
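
To make the difference between the two send orders in the quoted description concrete, here is a small sketch. It is not code from the patch: `struct lft_send`, `order_per_switch`, and `order_striped` are hypothetical names, and the sketch assumes every switch has the same number of LFT blocks, which need not hold on a real fabric.

```c
#include <stddef.h>

/* Hypothetical record of one LFT block Set sent to one switch. */
struct lft_send {
	unsigned sw;
	unsigned block;
};

/* Current behavior: all blocks of switch 0, then all blocks of switch 1,
   and so on. The caller provides room for num_sw * num_blocks entries. */
static size_t order_per_switch(unsigned num_sw, unsigned num_blocks,
			       struct lft_send *out)
{
	size_t n = 0;

	for (unsigned sw = 0; sw < num_sw; sw++)
		for (unsigned b = 0; b < num_blocks; b++)
			out[n++] = (struct lft_send){ sw, b };
	return n;
}

/* Striped behavior: block 0 to every switch, then block 1 to every
   switch, so that several SMAs can work on their Sets concurrently. */
static size_t order_striped(unsigned num_sw, unsigned num_blocks,
			    struct lft_send *out)
{
	size_t n = 0;

	for (unsigned b = 0; b < num_blocks; b++)
		for (unsigned sw = 0; sw < num_sw; sw++)
			out[n++] = (struct lft_send){ sw, b };
	return n;
}
```

With two switches and two blocks, the per-switch order is (0,0) (0,1) (1,0) (1,1), while the striped order is (0,0) (1,0) (0,1) (1,1).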

> This patch also introduces a new config option (max_smps_per_node)
> which indicates how deep the per node pipeline is (current default is 4).
> This also has the effect of limiting the number of times that the switch
> list is traversed. Maybe this embellishment is unnecessary.

Then why is it needed?

> All unicast routing protocols are updated for this with the exception
> of file.
> 
> A similar subsequent change will do this for MFTs.
> 
> Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> 
> With a small cluster of 17 IS4 switches and 11 HCAs and
> to artificially increase the cluster, LMC of 7 was used
> including EnhancedSwitchPort 0 LMC.
> 
> With the new code, LFT configuration is more than twice as
> fast as with the old code :)
> Current ucast manager ran on average for ~250msec, with the
> new code - 110-120msec.
> 
> Routing calculation phase of the ucast manager took ~1200 usec,
> the rest was sending the blocks and waiting for no more pending
> transactions.
> 
> No noticeable difference between various max_smps_per_node values
> was observed.

What is the reason? And what was the value of 'max_wire_smps'?

> Here are some detailed results of different executions (the
> number on the left is timer value in usec):
> 
> Current ucast manager (w/o the optimization):
> 
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
> 
> New code, max_smps_per_node=0:
> 
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
> 
> New code, max_smps_per_node=1:
> 
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
> 
> New code, max_smps_per_node=4:
> 
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() - END
> 037439 [LFT]: osm_ucast_pipeline_tbl() - START
> 037495 [LFT]: osm_ucast_pipeline_tbl() - END
> 037506 [LFT]: osm_ucast_mgr_process() - END
> 114983 [LFT]: Done wait_for_pending_transactions()
> 
> 
> With IS3 based Qlogic switches, which do not handle DR packets forwarding
> in HW, with a fabric of ~1100 HCAs, ~280 switches:
> 
> Current OSM configures LFTs in ~2 seconds.
> New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up),
> depending on the max_smps_per_node value.
> 
> As in case of IS4 switches, the shortest config time was obtained with
> max_smps_per_node=0, which is unlimited pipeline.
> 
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added Yevgeny's performance data to patch description above
> No change to actual patch
> 
> diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
> index 0537002..617e8a9 100644
> --- a/opensm/include/opensm/osm_base.h
> +++ b/opensm/include/opensm/osm_base.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
>   *
> @@ -449,6 +449,18 @@ BEGIN_C_DECLS
>  */
>  #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
>  /***********/
> +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE
> +* NAME
> +*	OSM_DEFAULT_SMP_MAX_PER_NODE
> +*
> +* DESCRIPTION
> +*	Specifies the default number of VL15 SMP MADs allowed
> +*	per node for certain attributes.
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4
> +/***********/
>  /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
>  * NAME
>  *	OSM_SM_DEFAULT_QP0_RCV_SIZE
> diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
> index cc8321d..1776380 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -130,6 +130,7 @@ typedef struct osm_sm {
>  	osm_sm_mad_ctrl_t mad_ctrl;
>  	osm_lid_mgr_t lid_mgr;
>  	osm_ucast_mgr_t ucast_mgr;
> +	boolean_t lfts_updated;

The name is unclear - actually this means "update in progress".

>  	cl_disp_reg_handle_t sweep_fail_disp_h;
>  	cl_disp_reg_handle_t ni_disp_h;
>  	cl_disp_reg_handle_t pi_disp_h;
> @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm,
>  *
>  *********/
>  
> +/****f* OpenSM: SM/osm_sm_set_next_lft_block
> +* NAME
> +*	osm_sm_set_next_lft_block
> +*
> +* DESCRIPTION
> +*	Set the next LFT (LinearForwardingTable) block in the indicated switch.
> +*
> +* SYNOPSIS
> +*/
> +void
> +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> +			  IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> +			  IN osm_madw_context_t *p_context);

Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seems
a much more appropriate place for this.

> +/*
> +* PARAMETERS
> +*	p_sm
> +*		[in] Pointer to an osm_sm_t object.
> +*
> +*	p_switch
> +*		[in] Pointer to the switch object.
> +*
> +*	p_block
> +*		[in] Pointer to the forwarding table block.
> +*
> +*	p_path
> +*		[in] Pointer to a directed route path object.
> +*
> +*	p_context
> +*		[in] Mad wrapper context structure to be copied into the wrapper
> +*		context, and thus visible to the recipient of the response.
> +*
> +* RETURN VALUES
> +*	None
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +
>  /****f* OpenSM: SM/osm_sm_mcgrp_join
>  * NAME
>  *	osm_sm_mcgrp_join
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index 59a32ad..f12afae 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
>   *
> @@ -147,6 +147,7 @@ typedef struct osm_subn_opt {
>  	uint32_t sweep_interval;
>  	uint32_t max_wire_smps;
>  	uint32_t transaction_timeout;
> +	uint32_t max_smps_per_node;
>  	uint8_t sm_priority;
>  	uint8_t lmc;
>  	boolean_t lmc_esp0;
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index 7ce28c5..e12113f 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -102,6 +102,7 @@ typedef struct osm_switch {
>  	osm_port_profile_t *p_prof;
>  	uint8_t *lft;
>  	uint8_t *new_lft;
> +	uint16_t lft_block_id_ho;
>  	osm_mcast_tbl_t mcast_tbl;
>  	unsigned endport_links;
>  	unsigned need_update;
> diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
> index a040476..fdea49a 100644
> --- a/opensm/include/opensm/osm_ucast_mgr.h
> +++ b/opensm/include/opensm/osm_ucast_mgr.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN struct osm_sm * sm);
>  *	osm_ucast_mgr_destroy
>  *********/
>  
> -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table
> +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl
>  * NAME
> -*	osm_ucast_mgr_set_fwd_table
> +*	osm_ucast_pipeline_tbl
>  *
>  * DESCRIPTION
> -*	Setup forwarding table for the switch (from prepared new_lft).
> +*	The osm_ucast_pipeline_tbl function pipelines the LFT
> +*	(LinearForwardingTable) sets across the switches
> +*	(from prepared new_lft).
>  *
>  * SYNOPSIS
>  */
> -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
> -				IN osm_switch_t * const p_sw);
> +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr);
> +/*
> +* PARAMETERS
> +*	p_mgr
> +*		[in] Pointer to an osm_ucast_mgr_t object.
> +*
> +* RETURN VALUES
> +*	None.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +
> +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top
> +* NAME
> +*	osm_ucast_mgr_set_fwd_tbl_top
> +*
> +* DESCRIPTION
> +*	Setup LinearFDBTop for the switch.
> +*
> +* SYNOPSIS
> +*/
> +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr,
> +				  IN osm_switch_t * const p_sw);

I don't really like this separation (osm_ucast_mgr_set_fwd_tbl_top and
osm_ucast_pipeline_tbl). Why not use a single function and update all
routing engines appropriately (you need to do it anyway), so that this
will only fill up the new_lft tables?

>  /*
>  * PARAMETERS
>  *	p_mgr
> diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c
> index 2edb8d3..cb131b4 100644
> --- a/opensm/opensm/osm_lin_fwd_rcv.c
> +++ b/opensm/opensm/osm_lin_fwd_rcv.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -36,7 +36,7 @@
>  /*
>   * Abstract:
>   *    Implementation of osm_lft_rcv_t.
> - * This object represents the NodeDescription Receiver object.
> + * This object represents the Linear Forwarding Table Receiver object.
>   * This object is part of the opensm family of objects.
>   */
>  
> @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
>  {
>  	osm_sm_t *sm = context;
>  	osm_madw_t *p_madw = data;
> +	osm_dr_path_t *p_path;
>  	ib_smp_t *p_smp;
>  	uint32_t block_num;
>  	osm_switch_t *p_sw;
> @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
>  	uint8_t *p_block;
>  	ib_net64_t node_guid;
>  	ib_api_status_t status;
> +	uint8_t block[IB_SMP_DATA_SIZE];
> +	osm_madw_context_t mad_context;
>  
>  	CL_ASSERT(sm);
>  
> @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
>  				"\n\t\t\t\tSwitch 0x%" PRIx64 "\n",
>  				ib_get_err_str(status), cl_ntoh64(node_guid));
>  		}
> +
> +		p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> +
> +		mad_context.lft_context.node_guid = node_guid;
> +		mad_context.lft_context.set_method = TRUE;
> +
> +		osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> +					  &mad_context);
> +
> +		p_sw->lft_block_id_ho++;

Wouldn't it be simpler to encode block_id in a mad context?

>  	}
>  
>  	CL_PLOCK_RELEASE(sm->p_lock);
> diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> index daa60ff..4e0fd2a 100644
> --- a/opensm/opensm/osm_sm.c
> +++ b/opensm/opensm/osm_sm.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
>   *
> @@ -441,6 +441,45 @@ Exit:
>  
>  /**********************************************************************
>   **********************************************************************/
> +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> +			       IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> +			       IN osm_madw_context_t *context)
> +{
> +	ib_api_status_t status;
> +
> +	for (;
> +	     osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho, p_block);
> +	     p_sw->lft_block_id_ho++) {
> +		if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> +		    !memcmp(p_block,
> +			    p_sw->new_lft + p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE,
> +			    IB_SMP_DATA_SIZE))
> +			continue;
> +
> +		p_sm->lfts_updated = 1;
> +
> +		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> +			"Writing FT block %u to switch 0x%" PRIx64 "\n",
> +			p_sw->lft_block_id_ho,
> +			cl_ntoh64(context->lft_context.node_guid));
> +
> +		status = osm_req_set(p_sm, p_path,
> +				     p_sw->new_lft +
> +				     p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE,
> +				     IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
> +				     cl_hton32(p_sw->lft_block_id_ho),
> +				     CL_DISP_MSGID_NONE, context);
> +
> +		if (status != IB_SUCCESS)
> +			OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: "
> +				"Sending linear fwd. tbl. block failed (%s)\n",
> +				ib_get_err_str(status));
> +		break;
> +	}
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
>  static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
>  				       IN osm_mgrp_t * p_mgrp)
>  {
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..1964b7f 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
>   *
> @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = {
>  	{ "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_parse_net16, NULL, 1 },
>  	{ "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32, NULL, 1 },
>  	{ "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32, NULL, 1 },
> +	{ "max_smps_per_node", OPT_OFFSET(max_smps_per_node), opts_parse_uint32, NULL, 1 },
>  	{ "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 },
>  	{ "console_port", OPT_OFFSET(console_port), opts_parse_uint16, NULL, 0 },
>  	{ "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_parse_uint32, NULL, 1 },
> @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
>  	p_opt->m_key_lease_period = 0;
>  	p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
>  	p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> +	p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE;
>  	p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
>  	p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
>  	p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>  		"max_wire_smps %u\n\n"
>  		"# The maximum time in [msec] allowed for a transaction to complete\n"
>  		"transaction_timeout %u\n\n"
> +		"# Maximum number of SMPs per node sent in parallel\n"
> +		"# (0 means unlimited)\n"
> +		"# Only applies to certain attributes\n"
> +		"max_smps_per_node %u\n\n"
>  		"# Maximal time in [msec] a message can stay in the incoming message queue.\n"
>  		"# If there is more than one message in the queue and the last message\n"
>  		"# stayed in the queue more than this value, any SA request will be\n"
> @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>  		"single_thread %s\n\n",
>  		p_opts->max_wire_smps,
>  		p_opts->transaction_timeout,
> +		p_opts->max_smps_per_node,
>  		p_opts->max_msg_fifo_timeout,
>  		p_opts->single_thread ? "TRUE" : "FALSE");
>  
> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
> index 216b496..31c930b 100644
> --- a/opensm/opensm/osm_ucast_cache.c
> +++ b/opensm/opensm/osm_ucast_cache.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2008      Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr)
>  			memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>  		}
>  
> -		osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> +		osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
>  	}
>  
> +	osm_ucast_pipeline_tbl(p_mgr);
> +
>  	return 0;
>  }
>  
> diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c
> index 2505c46..099e8ba 100644
> --- a/opensm/opensm/osm_ucast_file.c
> +++ b/opensm/opensm/osm_ucast_file.c
> @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context)
>  				"routing algorithm\n");
>  		} else if (!strncmp(p, "Unicast lids", 12)) {
>  			if (p_sw)
> -				osm_ucast_mgr_set_fwd_table(&p_osm->sm.
> -							    ucast_mgr, p_sw);
> +				osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.
> +							      ucast_mgr, p_sw);
>  			q = strstr(p, " guid 0x");
>  			if (!q) {
>  				OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
> @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context)
>  	}
>  
>  	if (p_sw)
> -		osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> +		osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
>  
>  	fclose(file);
>  	return 0;

I suppose that this breaks the 'file' routing engine (did you test it?) -
instead of setting up the switch LFTs, this will only update their TOPs.

> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index bde6dbd..d65c685 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -2,7 +2,7 @@
>   * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
>   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
>  	ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
>  
>  	p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
> -	osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
> -				    p_sw->p_osm_sw);
> +	osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr,
> +				      p_sw->p_osm_sw);
>  }
>  
>  /***************************************************
> @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context)
>  	/* for each switch, set its fwd table */
>  	cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void *)p_ftree);
>  
> +	osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr);
> +
>  	/* write out hca ordering file */
>  	fabric_dump_hca_ordering(p_ftree);
>  
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index 12b5e34..adf5f6c 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
>   * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
> @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash)
>  					physical_egress_port);
>  			}
>  		}		/* for */
> -		osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> +		osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
>  	}
> +
> +	osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr);
> +
>  	OSM_LOG_EXIT(p_log);
>  }
>  
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 78a7031..86d1c98 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -315,16 +315,14 @@ Exit:
>  
>  /**********************************************************************
>   **********************************************************************/
> -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> -				IN osm_switch_t * p_sw)
> +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> +				  IN osm_switch_t * p_sw)
>  {
>  	osm_node_t *p_node;
>  	osm_dr_path_t *p_path;
>  	osm_madw_context_t context;
>  	ib_api_status_t status;
>  	ib_switch_info_t si;
> -	uint16_t block_id_ho = 0;
> -	uint8_t block[IB_SMP_DATA_SIZE];
>  	boolean_t set_swinfo_require = FALSE;
>  	uint16_t lin_top;
>  	uint8_t life_state;
> @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
>  				ib_get_err_str(status));
>  	}
>  
> -	/*
> -	   Send linear forwarding table blocks to the switch
> -	   as long as the switch indicates it has blocks needing
> -	   configuration.
> -	 */
> -
> -	context.lft_context.node_guid = osm_node_get_node_guid(p_node);
> -	context.lft_context.set_method = TRUE;
> -
> -	if (!p_sw->new_lft) {
> -		/* any routing should provide the new_lft */
> -		CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> -			  p_mgr->cache_valid && !p_sw->need_update);
> -		goto Exit;
> -	}
> -
> -	for (block_id_ho = 0;
> -	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
> -	     block_id_ho++) {
> -		if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> -		    !memcmp(block,
> -			    p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> -			    IB_SMP_DATA_SIZE))
> -			continue;
> -
> -		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> -			"Writing FT block %u\n", block_id_ho);
> -
> -		status = osm_req_set(p_mgr->sm, p_path,
> -				     p_sw->new_lft +
> -				     block_id_ho * IB_SMP_DATA_SIZE,
> -				     sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL,
> -				     cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
> -				     &context);
> +	p_sw->lft_block_id_ho = 0;
>  
> -		if (status != IB_SUCCESS)
> -			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> -				"Sending linear fwd. tbl. block failed (%s)\n",
> -				ib_get_err_str(status));
> -	}
> -
> -Exit:
>  	OSM_LOG_EXIT(p_mgr->p_log);
>  	return 0;
>  }
> @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
>  		}
>  	}
>  
> -	osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> +	osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
>  
>  	if (p_mgr->p_subn->opt.lmc)
>  		free_ports_priv(p_mgr);
> @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
>  	OSM_LOG_EXIT(p_mgr->p_log);
>  }
>  
> +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw,
> +				   IN osm_ucast_mgr_t *p_mgr)
> +{
> +	osm_dr_path_t *p_path;
> +	osm_madw_context_t mad_context;
> +	uint8_t block[IB_SMP_DATA_SIZE];
> +
> +	OSM_LOG_ENTER(p_mgr->p_log);
> +
> +	CL_ASSERT(p_sw && p_sw->p_node);
> +
> +	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> +		"Processing switch 0x%" PRIx64 "\n",
> +		cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> +
> +	/*
> +	   Send linear forwarding table blocks to the switch
> +	   as long as the switch indicates it has blocks needing
> +	   configuration.
> +	 */
> +	if (!p_sw->new_lft) {
> +		/* any routing should provide the new_lft */
> +		CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> +			  p_mgr->cache_valid && !p_sw->need_update);
> +		goto Exit;
> +	}
> +
> +	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> +
> +	mad_context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node);
> +	mad_context.lft_context.set_method = TRUE;
> +
> +	osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path,
> +				  &mad_context);
> +
> +	p_sw->lft_block_id_ho++;
> +
> +Exit:
> +	OSM_LOG_EXIT(p_mgr->p_log);
> +}
> +
>  /**********************************************************************
>   **********************************************************************/
>  static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
> @@ -870,6 +869,28 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m)
>  		add_sw_endports_to_order_list(s[i], m);
>  }
>  
> +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr)
> +{
> +	cl_qmap_t *p_sw_tbl;
> +	osm_switch_t *p_sw;
> +	int i;
> +
> +	for (i = 0;
> +	     !p_mgr->p_subn->opt.max_smps_per_node ||
> +	     i < p_mgr->p_subn->opt.max_smps_per_node;
> +	     i++) {
> +		p_mgr->sm->lfts_updated = 0;
> +		p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> +		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> +		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> +			ucast_mgr_pipeline_tbl(p_sw, p_mgr);
> +			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
> +		}
> +		if (!p_mgr->sm->lfts_updated)
> +			break;
> +	}
> +}

Is it possible (for example in case of send errors) that sending only
"partial" LFT blocks will trigger wait_for_pending_transactions() completion?

Sasha

> +
>  static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
>  {
>  	cl_qlist_init(&p_mgr->port_order_list);
> @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
>  	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
>  			   p_mgr);
>  
> +	osm_ucast_pipeline_tbl(p_mgr);
> +
>  	cl_qlist_remove_all(&p_mgr->port_order_list);
>  
>  	return 0;
> 
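
The striping loop in osm_ucast_pipeline_tbl() can be pictured with a toy
model. The sketch below is not OpenSM code: struct toy_switch and
prime_pipeline() are hypothetical names, and incrementing a cursor stands in
for the osm_req_set() call. It only illustrates the idea from the patch that
each pass hands one LFT block to every switch that still has work, with at
most max_smps_per_node passes (0 meaning unlimited).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a switch: only the LFT block cursor is modelled. */
struct toy_switch {
	unsigned next_block;	/* next LFT block to send */
	unsigned num_blocks;	/* number of blocks that need writing */
};

/*
 * Prime the pipeline: each pass sends at most one block to every switch,
 * and at most max_per_node passes are made (0 means unlimited).
 * Returns the number of blocks "sent".
 */
static unsigned prime_pipeline(struct toy_switch *sw, size_t n_sw,
			       unsigned max_per_node)
{
	unsigned sent = 0, pass;
	size_t i;

	assert(sw != NULL || n_sw == 0);

	for (pass = 0; !max_per_node || pass < max_per_node; pass++) {
		int updated = 0;

		for (i = 0; i < n_sw; i++) {
			if (sw[i].next_block >= sw[i].num_blocks)
				continue;	/* nothing left for this switch */
			sw[i].next_block++;	/* stands in for osm_req_set() */
			sent++;
			updated = 1;
		}
		if (!updated)
			break;	/* no switch had anything left to send */
	}
	return sent;
}
```

In the patch itself the remaining blocks are sent from the LFT receive
handler as responses arrive; the toy model covers only the initial priming.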


From sashak at voltaire.com  Tue Aug  4 08:30:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 18:30:20 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Added error
 numbers to some error log messages
In-Reply-To: <20090804151337.GA6037@comcast.net>
References: <20090804151337.GA6037@comcast.net>
Message-ID: <20090804153020.GG7993@me>

On 11:13 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> Also, made routine local which didn't need to be global
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Aug  4 08:35:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 18:35:09 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <4A784698.10803@dev.mellanox.co.il>
References: <4A784698.10803@dev.mellanox.co.il>
Message-ID: <20090804153509.GH7993@me>

On 17:32 Tue 04 Aug     , Yevgeny Kliteynik wrote:
> opt.max_wire_smps is uint32, but when it is propagated
> into the VL15 poller it is cast to int32. Fixing the
> parameter handling to protect it from wrong values.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/main.c       |    2 +-
>  opensm/opensm/osm_subnet.c |    7 +++++++
>  2 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 296d5d5..9cb9990 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -722,7 +722,7 @@ int main(int argc, char *argv[])
> 
>  		case 'n':
>  			opt.max_wire_smps = strtol(optarg, NULL, 0);

Then you likely want to use strtoul().

> -			if (opt.max_wire_smps <= 0)
> +			if (opt.max_wire_smps > 0x7FFFFFFF)
>  				opt.max_wire_smps = 0x7FFFFFFF;

What about opt.max_wire_smps == 0?

Sasha

>  			printf(" Max wire smp's = %d\n", opt.max_wire_smps);
>  			break;
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..c07d823 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
>  		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
>  	}
> 
> +	if (p_opts->max_wire_smps > 0x7FFFFFFF) {
> +		log_report(" Invalid Cached Option Value: max_wire_smps = %u,"
> +			   " Using Default: %u\n",
> +			   p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE);
> +		p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> +	}

Ditto.

Sasha
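
One possible way to cover both review points (strtoul() instead of strtol(),
and an explicit decision for 0) is sketched below. parse_max_wire_smps() and
MAX_WIRE_SMPS_LIMIT are hypothetical names, not OpenSM symbols, and mapping 0
to the limit is an assumption that simply mirrors what the old "<= 0" check
did; the patch author may prefer a different policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest value that survives the later cast to int32. */
#define MAX_WIRE_SMPS_LIMIT 0x7FFFFFFFu

/* Hypothetical helper, not part of OpenSM. */
static uint32_t parse_max_wire_smps(const char *arg)
{
	char *end;
	unsigned long val;

	assert(arg != NULL);

	errno = 0;
	val = strtoul(arg, &end, 0);

	/*
	 * No digits, overflow, zero, or a value too large for int32 all
	 * fall back to the limit. A negative input such as "-1" is
	 * negated as unsigned by strtoul() and lands in the "too large"
	 * case.
	 */
	if (end == arg || errno == ERANGE || val == 0 ||
	    val > MAX_WIRE_SMPS_LIMIT)
		return MAX_WIRE_SMPS_LIMIT;

	return (uint32_t)val;
}
```

The same range check could be reused in osm_subn_verify_config() so that the
command line and the cached options file agree on what 0 means.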


From rdreier at cisco.com  Tue Aug  4 09:05:13 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Aug 2009 09:05:13 -0700
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To: <20090804033221.GA30949@mtls03> (Eli Cohen's message of "Tue, 4
	Aug 2009 06:32:21 +0300")
References: <20090803092528.GA25528@mtls03> <adak51kod06.fsf@cisco.com>
	<20090804033221.GA30949@mtls03>
Message-ID: <adak51jmuo6.fsf@cisco.com>


 > Maybe it's just a loose connection but yet, it seems to me that
 > operations on id_priv->mc_list should be protected. Should I send a
 > different patch?

"seems ... should be" is very weak justification for locking.  What
should they be protected from?

 - R.


From bart.vanassche at gmail.com  Tue Aug  4 09:07:31 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 18:07:31 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adafxc8ocst.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
Message-ID: <e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>

On Mon, Aug 3, 2009 at 10:36 PM, Roland Dreier<rdreier at cisco.com> wrote:
> How do we end up in srp_reset_device() with req->scmnd->device == NULL?
> Presumably req->scmnd should match scmnd if I am understanding the code
> properly -- and then scmnd->device == NULL??

An update: apparently it is possible to trigger scmnd->device == NULL even
without triggering a prior IB CM disconnect. The following shell commands
are sufficient to trigger the WARN_ON statement in the patch below:

rmmod ib_srp
modprobe ib_srp
ibsrpdm -c | while read target_info; do
  echo "${target_info}"
  echo "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target
done
sg_reset -d ${srp_device}

So it should be analyzed why scmnd->device can be NULL before applying any
patches to fix the NULL pointer dereference.

Bart.

--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c	2009-08-03 12:13:11.000000000 +0200
+++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c	2009-08-04 17:25:27.000000000 +0200
@@ -1330,6 +1330,8 @@ static int srp_send_tsk_mgmt(struct srp_
 	struct srp_iu *iu;
 	struct srp_tsk_mgmt *tsk_mgmt;

+	BUG_ON(!req->scmnd->device);
+
 	spin_lock_irq(target->scsi_host->host_lock);

 	if (target->state == SRP_TARGET_DEAD ||
@@ -1425,6 +1427,8 @@ static int srp_reset_device(struct scsi_
 		return FAILED;
 	if (srp_find_req(target, scmnd, &req))
 		return FAILED;
+	if (WARN_ON(!req->scmnd->device))
+		return FAILED;
 	if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET))
 		return FAILED;
 	if (req->tsk_status)


From rdreier at cisco.com  Tue Aug  4 09:27:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Aug 2009 09:27:23 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP initiator triggered by a SCSI reset after the SRP
	connection has been closed
In-Reply-To: <e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	(Bart Van Assche's message of "Tue, 4 Aug 2009 18:07:31 +0200")
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
Message-ID: <adafxc7mtn8.fsf@cisco.com>


 > An update: apparently it is possible to trigger scmnd->device == NULL even
 > without triggering a prior IB CM disconnect. The following shell commands
 > are sufficient to trigger the WARN_ON statement in the patch below:

 > rmmod ib_srp
 > modprobe ib_srp
 > ibsrpdm -c | while read target_info; do
 >   echo "${target_info}"
 >   echo "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target
 > done
 > sg_reset -d ${srp_device}

So in other words, just sg_reset on an SRP device triggers the warning?


From hal.rosenstock at gmail.com  Tue Aug  4 09:29:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 12:29:08 -0400
Subject: [ofa-general] umad SLID and LMC
In-Reply-To: <f0e08f230908040703t28335fabib18e795d3d3a35a2@mail.gmail.com>
References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com>
	<200907231121.00140.jackm@dev.mellanox.co.il>
	<adaocrb43su.fsf@cisco.com>
	<F4251187-C5FA-42E8-A40A-F3C7B32E09EB@redhat.com>
	<5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org>
	<f0e08f230908040703t28335fabib18e795d3d3a35a2@mail.gmail.com>
Message-ID: <f0e08f230908040929h2708a6d3neb1b551a3f6ea80f@mail.gmail.com>

On Tue, Aug 4, 2009 at 10:03 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

>
>
>  On Sun, Aug 2, 2009 at 5:45 PM, Todd Rimmer <todd.rimmer at qlogic.com> wrote:
>
>> What is the proper way to control the SLID used for outgoing umad sends?
>>
>> For example, when using LMC>0, the PathRecord returned from the SM for
>> talking to a given remote node may have a SLID which is not the BaseLid for
>> the sender.  How does the sender ensure the correct SLID is used for the
>> outgoing mad?
>>
>> In reviewing the API it seems like the only way to do this is:
>> void *umad = umad_alloc(...);
>>
>> // call various umad calls to initialize address and contents
>> umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID;
>>
>> umad_send(..., umad, ...);
>>
>> Was path_bits an intentional omission in the API?
>
>
> No; it was an unintentional omission AFAIT.
>
>
>>  It would seem that a function which could update the ib_mad_addr in a
>> umad given a path record would seem appropriate.
>
>
> Seems reasonable to me.
>

On second thought, umad is lower level than this and knows nothing of path
records (those live at a higher level). An API such as umad_set_addr that
handles path bits would be another alternative for this.

-- Hal
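
For reference, the "lower LMC bits of SLID" that Todd mentions can be computed
as below. This is a minimal sketch: slid_to_path_bits() is a hypothetical
helper, not a libibumad function, and it assumes the SLID is already in host
byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the low LMC bits of a host-order SLID. */
static uint8_t slid_to_path_bits(uint16_t slid, uint8_t lmc)
{
	assert(lmc <= 7);	/* LMC is a 3-bit field */
	return (uint8_t)(slid & ((1u << lmc) - 1));
}
```

The result would then be stored in umad_get_mad_addr(umad)->path_bits before
umad_send(), as in the snippet quoted below.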


>  Care to supply a patch ?
>
> -- Hal
>
>
>>
>> Todd Rimmer
>> Chief Architect
>> QLogic Network Systems Group
>> Voice: 610-233-4852     Fax: 610-233-4777
>> Todd.Rimmer at QLogic.com  www.QLogic.com <http://www.qlogic.com/>
>>
>
>

From bart.vanassche at gmail.com  Tue Aug  4 09:30:18 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 18:30:18 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adafxc7mtn8.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
Message-ID: <e2e108260908040930s726686adqecc693f40717f207@mail.gmail.com>

On Tue, Aug 4, 2009 at 6:27 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > An update: apparently it is possible to trigger scmnd->device == NULL even
>  > without triggering a prior IB CM disconnect. The following shell commands
>  > are sufficient to trigger the WARN_ON statement in the patch below:
>
>  > rmmod ib_srp
>  > modprobe ib_srp
>  > ibsrpdm -c | while read target_info; do echo "${target_info}"; echo
>  > "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target;
>  > done
>  > sg_reset -d ${srp_device}
>
> So in other words, just sg_reset on an SRP device triggers the warning?

Yes, but only if no I/O has been performed after the ${srp_device} has
been created and before the sg_reset has been issued. When e.g. the
command dd if=${srp_device} of=/dev/null iflag=direct bs=1M is
inserted just before the sg_reset command, the kernel warning is not
triggered.

Bart.


From hal.rosenstock at gmail.com  Tue Aug  4 09:45:05 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 12:45:05 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090804152700.GF7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
Message-ID: <f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>

On Tue, Aug 4, 2009 at 11:27 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi,
>
> On 19:28 Thu 30 Jul     , Hal Rosenstock wrote:
> >
> > Currently, MADs are pipelined to a single switch at a time which
> > effectively serializes these requests due to processing at the SMA.
> > This patch pipelines (stripes) them across the switches first before
> > proceeding with successive blocks. As a result of this striping,
> > multiple switches can process the set and respond concurrently
> > which results in an improvement to the subnet initialization time.
>
> The idea is nice. However I have some initial comments about the
> implementation.
>
> BTW should there be a reason for an option to preserve the current
> behavior? (I don't know, just asking)


I asked this in an email earlier in this thread. It's up to you. I don't see
a need but if we want to be conservative, it can be added.


>
>
> > This patch also introduces a new config option (max_smps_per_node)
> > which indicates how deep the per node pipeline is (current default is 4).
> > This also has the effect of limiting the number of times that the switch
> > list is traversed. Maybe this embellishment is unnecessary.
>
> Then why is it needed?


Also, as was discussed in the thread on this, it gives a way to control
possible VL15 overflow.


>
>
> > All unicast routing protocols are updated for this with the exception
> > of file.
> >
> > A similar subsequent change will do this for MFTs.
> >
> > Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> >
> > With a small cluster of 17 IS4 switches and 11 HCAs and
> > to artificially increase the cluster, LMC of 7 was used
> > including EnhancedSwitchPort 0 LMC.
> >
> > With the new code, LFT configuration is more than twice as
> > fast as with the old code :)
> > Current ucast manager ran on average for ~250msec, with the
> > new code - 110-120msec.
> >
> > Routing calculation phase of the ucast manager took ~1200 usec,
> > the rest was sending the blocks and waiting for no more pending
> > transactions.
> >
> > No noticeable difference between various max_smps_per_node values
> > was observed.
>
> What is the reason?


I think the reason was that max_wire_smps may have kicked in, but Yevgeny is
best placed to elaborate on this.


> And what was the value of 'max_wire_smps'?
>

> Here are some detailed results of different executions (the
> number on the left is timer value in usec):
>
> Current ucast manager (w/o the optimization):
>
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=0:
>
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=1:
>
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=4:
>
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() - END
> 037439 [LFT]: osm_ucast_pipeline_tbl() - START
> 037495 [LFT]: osm_ucast_pipeline_tbl() - END
> 037506 [LFT]: osm_ucast_mgr_process() - END
> 114983 [LFT]: Done wait_for_pending_transactions()
>
>
> With IS3 based Qlogic switches, which do not handle DR packets forwarding
> in HW, with a fabric of ~1100 HCAs, ~280 switches:
>
> Current OSM configures LFTs in ~2 seconds.
> New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up),
> depending on the max_smps_per_node value.
>
> As in case of IS4 switches, the shortest config time was obtained with
> max_smps_per_node=0, which is unlimited pipeline.
>
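
For readers who want to check the quoted percentages, a small sketch of the
arithmetic follows. percent_reduction is a hypothetical helper written for
this note, not OpenSM code, and it rounds down to a whole percent.

```c
#include <assert.h>

/* Percentage by which new_time is shorter than old_time, rounded down.
 * Both arguments must be in the same unit (usec, msec, ...). */
unsigned percent_reduction(unsigned long old_time, unsigned long new_time)
{
	assert(old_time > 0 && new_time <= old_time);
	return (unsigned)((old_time - new_time) * 100UL / old_time);
}
```

Taking the IS3 figures in msec, percent_reduction(2000, 1400) gives 30 and
percent_reduction(2000, 1600) gives 20, which matches the quoted range. The
same helper can be applied to the final timestamps of the IS4 runs, bearing
in mind that each run starts at a slightly different timer value.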
>
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added Yevgeny's performance data to patch description above
> No change to actual patch
>
> diff --git a/opensm/include/opensm/osm_base.h
b/opensm/include/opensm/osm_base.h
> index 0537002..617e8a9 100644
> --- a/opensm/include/opensm/osm_base.h
> +++ b/opensm/include/opensm/osm_base.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
>   *
> @@ -449,6 +449,18 @@ BEGIN_C_DECLS
>  */
>  #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
>  /***********/
> +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE
> +* NAME
> +*    OSM_DEFAULT_SMP_MAX_PER_NODE
> +*
> +* DESCRIPTION
> +*    Specifies the default number of VL15 SMP MADs allowed
> +*    per node for certain attributes.
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4
> +/***********/
>  /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
>  * NAME
>  *    OSM_SM_DEFAULT_QP0_RCV_SIZE
> diff --git a/opensm/include/opensm/osm_sm.h
b/opensm/include/opensm/osm_sm.h
> index cc8321d..1776380 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -130,6 +130,7 @@ typedef struct osm_sm {
>       osm_sm_mad_ctrl_t mad_ctrl;
>       osm_lid_mgr_t lid_mgr;
>       osm_ucast_mgr_t ucast_mgr;
> +     boolean_t lfts_updated;

> The name is unclear - actually this means "update in progress".


OK.


>
>
> >       cl_disp_reg_handle_t sweep_fail_disp_h;
> >       cl_disp_reg_handle_t ni_disp_h;
> >       cl_disp_reg_handle_t pi_disp_h;
> > @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm,
> >  *
> >  *********/
> >
> > +/****f* OpenSM: SM/osm_sm_set_next_lft_block
> > +* NAME
> > +*    osm_sm_set_next_lft_block
> > +*
> > +* DESCRIPTION
> > +*    Set the next LFT (LinearForwardingTable) block in the indicated
> switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +void
> > +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > +                       IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> > +                       IN osm_madw_context_t *p_context);
>
> Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seem
> much more appropriate place for this.


OK.


>
>
> > +/*
> > +* PARAMETERS
> > +*    p_sm
> > +*            [in] Pointer to an osm_sm_t object.
> > +*
> > +*    p_switch
> > +*            [in] Pointer to the switch object.
> > +*
> > +*    p_block
> > +*            [in] Pointer to the forwarding table block.
> > +*
> > +*    p_path
> > +*            [in] Pointer to a directed route path object.
> > +*
> > +*    p_context
> > +*            [in] Mad wrapper context structure to be copied into the
> wrapper
> > +*            context, and thus visible to the recipient of the response.
> > +*
> > +* RETURN VALUES
> > +*    None
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> >  /****f* OpenSM: SM/osm_sm_mcgrp_join
> >  * NAME
> >  *    osm_sm_mcgrp_join
> > diff --git a/opensm/include/opensm/osm_subnet.h
> b/opensm/include/opensm/osm_subnet.h
> > index 59a32ad..f12afae 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -147,6 +147,7 @@ typedef struct osm_subn_opt {
> >       uint32_t sweep_interval;
> >       uint32_t max_wire_smps;
> >       uint32_t transaction_timeout;
> > +     uint32_t max_smps_per_node;
> >       uint8_t sm_priority;
> >       uint8_t lmc;
> >       boolean_t lmc_esp0;
> > diff --git a/opensm/include/opensm/osm_switch.h
> b/opensm/include/opensm/osm_switch.h
> > index 7ce28c5..e12113f 100644
> > --- a/opensm/include/opensm/osm_switch.h
> > +++ b/opensm/include/opensm/osm_switch.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -102,6 +102,7 @@ typedef struct osm_switch {
> >       osm_port_profile_t *p_prof;
> >       uint8_t *lft;
> >       uint8_t *new_lft;
> > +     uint16_t lft_block_id_ho;
> >       osm_mcast_tbl_t mcast_tbl;
> >       unsigned endport_links;
> >       unsigned need_update;
> > diff --git a/opensm/include/opensm/osm_ucast_mgr.h
> b/opensm/include/opensm/osm_ucast_mgr.h
> > index a040476..fdea49a 100644
> > --- a/opensm/include/opensm/osm_ucast_mgr.h
> > +++ b/opensm/include/opensm/osm_ucast_mgr.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const
> p_mgr, IN struct osm_sm * sm);
> >  *    osm_ucast_mgr_destroy
> >  *********/
> >
> > -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table
> > +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl
> >  * NAME
> > -*    osm_ucast_mgr_set_fwd_table
> > +*    osm_ucast_pipeline_tbl
> >  *
> >  * DESCRIPTION
> > -*    Setup forwarding table for the switch (from prepared new_lft).
> > +*    The osm_ucast_pipeline_tbl function pipelines the LFT
> > +*    (LinearForwardingTable) sets across the switches
> > +*    (from prepared new_lft).
> >  *
> >  * SYNOPSIS
> >  */
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
> > -                             IN osm_switch_t * const p_sw);
> > +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr);
> > +/*
> > +* PARAMETERS
> > +*    p_mgr
> > +*            [in] Pointer to an osm_ucast_mgr_t object.
> > +*
> > +* RETURN VALUES
> > +*    None.
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> > +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top
> > +* NAME
> > +*    osm_ucast_mgr_set_fwd_tbl_top
> > +*
> > +* DESCRIPTION
> > +*    Setup LinearFDBTop for the switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr,
> > +                               IN osm_switch_t * const p_sw);
>
> I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and
> osm_ucast_pipeline_tbl).


Why not ? What's the matter with doing this ?


> Why not use a single function and update all
> routing engines appropriately (you need to do it anyway), so that this
> will only fill up new_lfts table?


I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop
whereas pipeline_tbl starts the cascade of LFT sets based on
max_smps_per_node.
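
To make the cascade concrete, here is a toy model of the initial striping
pass only. It is a sketch under assumptions, not the patch itself:
stripe_initial_sends, pending[] and order[] are invented for illustration,
and the per-switch block counts stand in for the real LFT comparison. In the
patch, the remaining blocks are sent later from osm_lft_rcv_process as each
response arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: pending[i] is the number of LFT blocks switch i still needs.
 * Fills order[] with switch indices in the order the initial sends go out
 * and returns how many sends were queued. max_per_node == 0 means no limit. */
size_t stripe_initial_sends(unsigned *pending, size_t n_sw,
			    unsigned max_per_node,
			    size_t *order, size_t order_cap)
{
	size_t sent = 0;
	unsigned round;

	assert(pending && order);

	for (round = 0; !max_per_node || round < max_per_node; round++) {
		int progressed = 0;
		size_t i;

		/* One pass over all switches: at most one send per switch. */
		for (i = 0; i < n_sw; i++) {
			if (!pending[i] || sent == order_cap)
				continue;
			pending[i]--;
			order[sent++] = i;
			progressed = 1;
		}
		/* Stop early once a full pass queued nothing. */
		if (!progressed)
			break;
	}
	return sent;
}
```

The outer loop mirrors the max_smps_per_node bound (with 0 meaning
unlimited), and the early exit plays the role of the lfts_updated check.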


>
>
> >  /*
> >  * PARAMETERS
> >  *    p_mgr
> > diff --git a/opensm/opensm/osm_lin_fwd_rcv.c
> b/opensm/opensm/osm_lin_fwd_rcv.c
> > index 2edb8d3..cb131b4 100644
> > --- a/opensm/opensm/osm_lin_fwd_rcv.c
> > +++ b/opensm/opensm/osm_lin_fwd_rcv.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -36,7 +36,7 @@
> >  /*
> >   * Abstract:
> >   *    Implementation of osm_lft_rcv_t.
> > - * This object represents the NodeDescription Receiver object.
> > + * This object represents the Linear Forwarding Table Receiver object.
> >   * This object is part of the opensm family of objects.
> >   */
> >
> > @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >  {
> >       osm_sm_t *sm = context;
> >       osm_madw_t *p_madw = data;
> > +     osm_dr_path_t *p_path;
> >       ib_smp_t *p_smp;
> >       uint32_t block_num;
> >       osm_switch_t *p_sw;
> > @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >       uint8_t *p_block;
> >       ib_net64_t node_guid;
> >       ib_api_status_t status;
> > +     uint8_t block[IB_SMP_DATA_SIZE];
> > +     osm_madw_context_t mad_context;
> >
> >       CL_ASSERT(sm);
> >
> > @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >                               "\n\t\t\t\tSwitch 0x%" PRIx64 "\n",
> >                               ib_get_err_str(status),
> cl_ntoh64(node_guid));
> >               }
> > +
> > +             p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > +             mad_context.lft_context.node_guid = node_guid;
> > +             mad_context.lft_context.set_method = TRUE;
> > +
> > +             osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> > +                                       &mad_context);
> > +
> > +             p_sw->lft_block_id_ho++;
>
> Wouldn't it be simpler to encode block_id in a mad context?


Why simpler ? I think it complicates the receiver code to do that (assuming
max_smps_per_node remains).
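
For reference, the per-switch cursor idea can be sketched in isolation as
below. This is not the patch's code: next_changed_block and LFT_BLOCK_SIZE
are hypothetical names, and the sketch assumes both tables are flat byte
arrays of n_blocks * 64 bytes (64 being the value of IB_SMP_DATA_SIZE).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE 64	/* bytes per LFT block; same as IB_SMP_DATA_SIZE */

/* Hypothetical helper: starting at block `start`, return the index of the
 * first block where new_lft differs from cur_lft, or n_blocks if none does. */
unsigned next_changed_block(const uint8_t *cur_lft, const uint8_t *new_lft,
			    unsigned n_blocks, unsigned start)
{
	unsigned b;

	assert(cur_lft && new_lft);

	for (b = start; b < n_blocks; b++)
		if (memcmp(cur_lft + b * LFT_BLOCK_SIZE,
			   new_lft + b * LFT_BLOCK_SIZE, LFT_BLOCK_SIZE))
			return b;
	return n_blocks;
}
```

A receiver that keeps the cursor in the switch object would call this with
start set to the block after the one just acknowledged; carrying the block
id in the MAD context instead would supply the same start value from the
response.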


>
>
> >       }
> >
> >       CL_PLOCK_RELEASE(sm->p_lock);
> > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> > index daa60ff..4e0fd2a 100644
> > --- a/opensm/opensm/osm_sm.c
> > +++ b/opensm/opensm/osm_sm.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -441,6 +441,45 @@ Exit:
> >
> >  /**********************************************************************
> >   **********************************************************************/
> > +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > +                            IN uint8_t *p_block, IN osm_dr_path_t
> *p_path,
> > +                            IN osm_madw_context_t *context)
> > +{
> > +     ib_api_status_t status;
> > +
> > +     for (;
> > +          osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho,
> p_block);
> > +          p_sw->lft_block_id_ho++) {
> > +             if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> > +                 !memcmp(p_block,
> > +                         p_sw->new_lft + p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > +                         IB_SMP_DATA_SIZE))
> > +                     continue;
> > +
> > +             p_sm->lfts_updated = 1;
> > +
> > +             OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> > +                     "Writing FT block %u to switch 0x%" PRIx64 "\n",
> > +                     p_sw->lft_block_id_ho,
> > +                     cl_ntoh64(context->lft_context.node_guid));
> > +
> > +             status = osm_req_set(p_sm, p_path,
> > +                                  p_sw->new_lft +
> > +                                  p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > +                                  IB_SMP_DATA_SIZE,
> IB_MAD_ATTR_LIN_FWD_TBL,
> > +                                  cl_hton32(p_sw->lft_block_id_ho),
> > +                                  CL_DISP_MSGID_NONE, context);
> > +
> > +             if (status != IB_SUCCESS)
> > +                     OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: "
> > +                             "Sending linear fwd. tbl. block failed
> (%s)\n",
> > +                             ib_get_err_str(status));
> > +             break;
> > +     }
> > +}
> > +
> > +/**********************************************************************
> > + **********************************************************************/
> >  static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
> >                                      IN osm_mgrp_t * p_mgrp)
> >  {
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index ec15f8a..1964b7f 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = {
> >       { "m_key_lease_period", OPT_OFFSET(m_key_lease_period),
> opts_parse_net16, NULL, 1 },
> >       { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32,
> NULL, 1 },
> >       { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32,
> NULL, 1 },
> > +     { "max_smps_per_node", OPT_OFFSET(max_smps_per_node),
> opts_parse_uint32, NULL, 1 },
> >       { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 },
> >       { "console_port", OPT_OFFSET(console_port), opts_parse_uint16,
> NULL, 0 },
> >       { "transaction_timeout", OPT_OFFSET(transaction_timeout),
> opts_parse_uint32, NULL, 1 },
> > @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t *
> const p_opt)
> >       p_opt->m_key_lease_period = 0;
> >       p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
> >       p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> > +     p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE;
> >       p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
> >       p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
> >       p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> > @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> >               "max_wire_smps %u\n\n"
> >               "# The maximum time in [msec] allowed for a transaction to
> complete\n"
> >               "transaction_timeout %u\n\n"
> > +             "# Maximum number of SMPs per node sent in parallel\n"
> > +             "# (0 means unlimited)\n"
> > +             "# Only applies to certain attributes\n"
> > +             "max_smps_per_node %u\n\n"
> >               "# Maximal time in [msec] a message can stay in the
> incoming message queue.\n"
> >               "# If there is more than one message in the queue and the
> last message\n"
> >               "# stayed in the queue more than this value, any SA request
> will be\n"
> > @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> >               "single_thread %s\n\n",
> >               p_opts->max_wire_smps,
> >               p_opts->transaction_timeout,
> > +             p_opts->max_smps_per_node,
> >               p_opts->max_msg_fifo_timeout,
> >               p_opts->single_thread ? "TRUE" : "FALSE");
> >
> > diff --git a/opensm/opensm/osm_ucast_cache.c
> b/opensm/opensm/osm_ucast_cache.c
> > index 216b496..31c930b 100644
> > --- a/opensm/opensm/osm_ucast_cache.c
> > +++ b/opensm/opensm/osm_ucast_cache.c
> > @@ -1,5 +1,5 @@
> >  /*
> > - * Copyright (c) 2008      Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights
> reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t *
> p_mgr)
> >                       memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO
> + 1);
> >               }
> >
> > -             osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> >       }
> >
> > +     osm_ucast_pipeline_tbl(p_mgr);
> > +
> >       return 0;
> >  }
> >
> > diff --git a/opensm/opensm/osm_ucast_file.c
> b/opensm/opensm/osm_ucast_file.c
> > index 2505c46..099e8ba 100644
> > --- a/opensm/opensm/osm_ucast_file.c
> > +++ b/opensm/opensm/osm_ucast_file.c
> > @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context)
> >                               "routing algorithm\n");
> >               } else if (!strncmp(p, "Unicast lids", 12)) {
> >                       if (p_sw)
> > -                             osm_ucast_mgr_set_fwd_table(&p_osm->sm.
> > -                                                         ucast_mgr,
> p_sw);
> > +                             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.
> > +                                                           ucast_mgr,
> p_sw);
> >                       q = strstr(p, " guid 0x");
> >                       if (!q) {
> >                               OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
> > @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context)
> >       }
> >
> >       if (p_sw)
> > -             osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> >
> >       fclose(file);
> >       return 0;
>
> I suppose that this breaks 'file' routing engine (did you test it?) -
> instead of setting up the switch LFTs, this will only update their TOPs.


At this point, I don't recall.


>
>
> > diff --git a/opensm/opensm/osm_ucast_ftree.c
> b/opensm/opensm/osm_ucast_ftree.c
> > index bde6dbd..d65c685 100644
> > --- a/opensm/opensm/osm_ucast_ftree.c
> > +++ b/opensm/opensm/osm_ucast_ftree.c
> > @@ -2,7 +2,7 @@
> >   * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
> >   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t *
> const p_map_item,
> >       ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
> >
> >       p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
> > -     osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
> > -                                 p_sw->p_osm_sw);
> > +     osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr,
> > +                                   p_sw->p_osm_sw);
> >  }
> >
> >  /***************************************************
> > @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context)
> >       /* for each switch, set its fwd table */
> >       cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void
> *)p_ftree);
> >
> > +     osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr);
> > +
> >       /* write out hca ordering file */
> >       fabric_dump_hca_ordering(p_ftree);
> >
> > diff --git a/opensm/opensm/osm_ucast_lash.c
> b/opensm/opensm/osm_ucast_lash.c
> > index 12b5e34..adf5f6c 100644
> > --- a/opensm/opensm/osm_ucast_lash.c
> > +++ b/opensm/opensm/osm_ucast_lash.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2007      Simula Research Laboratory. All rights
> reserved.
> >   * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
> > @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash)
> >                                       physical_egress_port);
> >                       }
> >               }               /* for */
> > -             osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> >       }
> > +
> > +     osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr);
> > +
> >       OSM_LOG_EXIT(p_log);
> >  }
> >
> > diff --git a/opensm/opensm/osm_ucast_mgr.c
> b/opensm/opensm/osm_ucast_mgr.c
> > index 78a7031..86d1c98 100644
> > --- a/opensm/opensm/osm_ucast_mgr.c
> > +++ b/opensm/opensm/osm_ucast_mgr.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -315,16 +315,14 @@ Exit:
> >
> >  /**********************************************************************
> >   **********************************************************************/
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> > -                             IN osm_switch_t * p_sw)
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> > +                               IN osm_switch_t * p_sw)
> >  {
> >       osm_node_t *p_node;
> >       osm_dr_path_t *p_path;
> >       osm_madw_context_t context;
> >       ib_api_status_t status;
> >       ib_switch_info_t si;
> > -     uint16_t block_id_ho = 0;
> > -     uint8_t block[IB_SMP_DATA_SIZE];
> >       boolean_t set_swinfo_require = FALSE;
> >       uint16_t lin_top;
> >       uint8_t life_state;
> > @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t *
> p_mgr,
> >                               ib_get_err_str(status));
> >       }
> >
> > -     /*
> > -        Send linear forwarding table blocks to the switch
> > -        as long as the switch indicates it has blocks needing
> > -        configuration.
> > -      */
> > -
> > -     context.lft_context.node_guid = osm_node_get_node_guid(p_node);
> > -     context.lft_context.set_method = TRUE;
> > -
> > -     if (!p_sw->new_lft) {
> > -             /* any routing should provide the new_lft */
> > -             CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > -                       p_mgr->cache_valid && !p_sw->need_update);
> > -             goto Exit;
> > -     }
> > -
> > -     for (block_id_ho = 0;
> > -          osm_switch_get_lft_block(p_sw, block_id_ho, block);
> > -          block_id_ho++) {
> > -             if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> > -                 !memcmp(block,
> > -                         p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> > -                         IB_SMP_DATA_SIZE))
> > -                     continue;
> > -
> > -             OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > -                     "Writing FT block %u\n", block_id_ho);
> > -
> > -             status = osm_req_set(p_mgr->sm, p_path,
> > -                                  p_sw->new_lft +
> > -                                  block_id_ho * IB_SMP_DATA_SIZE,
> > -                                  sizeof(block),
> IB_MAD_ATTR_LIN_FWD_TBL,
> > -                                  cl_hton32(block_id_ho),
> CL_DISP_MSGID_NONE,
> > -                                  &context);
> > +     p_sw->lft_block_id_ho = 0;
> >
> > -             if (status != IB_SUCCESS)
> > -                     OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> > -                             "Sending linear fwd. tbl. block failed
> (%s)\n",
> > -                             ib_get_err_str(status));
> > -     }
> > -
> > -Exit:
> >       OSM_LOG_EXIT(p_mgr->p_log);
> >       return 0;
> >  }
> > @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> >               }
> >       }
> >
> > -     osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > +     osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> >
> >       if (p_mgr->p_subn->opt.lmc)
> >               free_ports_priv(p_mgr);
> > @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> >       OSM_LOG_EXIT(p_mgr->p_log);
> >  }
> >
> > +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw,
> > +                                IN osm_ucast_mgr_t *p_mgr)
> > +{
> > +     osm_dr_path_t *p_path;
> > +     osm_madw_context_t mad_context;
> > +     uint8_t block[IB_SMP_DATA_SIZE];
> > +
> > +     OSM_LOG_ENTER(p_mgr->p_log);
> > +
> > +     CL_ASSERT(p_sw && p_sw->p_node);
> > +
> > +     OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > +             "Processing switch 0x%" PRIx64 "\n",
> > +             cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> > +
> > +     /*
> > +        Send linear forwarding table blocks to the switch
> > +        as long as the switch indicates it has blocks needing
> > +        configuration.
> > +      */
> > +     if (!p_sw->new_lft) {
> > +             /* any routing should provide the new_lft */
> > +             CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > +                       p_mgr->cache_valid && !p_sw->need_update);
> > +             goto Exit;
> > +     }
> > +
> > +     p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > +     mad_context.lft_context.node_guid =
> osm_node_get_node_guid(p_sw->p_node);
> > +     mad_context.lft_context.set_method = TRUE;
> > +
> > +     osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path,
> > +                               &mad_context);
> > +
> > +     p_sw->lft_block_id_ho++;
> > +
> > +Exit:
> > +     OSM_LOG_EXIT(p_mgr->p_log);
> > +}
> > +
> >  /**********************************************************************
> >   **********************************************************************/
> >  static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
> > @@ -870,6 +869,28 @@ static void
> sort_ports_by_switch_load(osm_ucast_mgr_t * m)
> >               add_sw_endports_to_order_list(s[i], m);
> >  }
> >
> > +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr)
> > +{
> > +     cl_qmap_t *p_sw_tbl;
> > +     osm_switch_t *p_sw;
> > +     int i;
> > +
> > +     for (i = 0;
> > +          !p_mgr->p_subn->opt.max_smps_per_node ||
> > +          i < p_mgr->p_subn->opt.max_smps_per_node;
> > +          i++) {
> > +             p_mgr->sm->lfts_updated = 0;
> > +             p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> > +             p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> > +             while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> > +                     ucast_mgr_pipeline_tbl(p_sw, p_mgr);
> > +                     p_sw = (osm_switch_t *)
> cl_qmap_next(&p_sw->map_item);
> > +             }
> > +             if (!p_mgr->sm->lfts_updated)
> > +                     break;
> > +     }
> > +}
>
> Is it possible (for example in case of send errors) that "partial" LFT
> blocks sending will trigger wait_for_pending_transaction() completion?


I don't know. Is this different from the original algorithm in the case of
send errors ?

-- Hal


>
>
> Sasha
>
> > +
> >  static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> >  {
> >       cl_qlist_init(&p_mgr->port_order_list);
> > @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
> >       cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
> ucast_mgr_process_tbl,
> >                          p_mgr);
> >
> > +     osm_ucast_pipeline_tbl(p_mgr);
> > +
> >       cl_qlist_remove_all(&p_mgr->port_order_list);
> >
> >       return 0;
> >

From ofedrnicuser at yahoo.com  Tue Aug  4 10:46:07 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Tue, 4 Aug 2009 10:46:07 -0700 (PDT)
Subject: [ofa-general] perftest for Chelsio RNIC adapters
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383035F7A95D5@azsmsx501.amr.corp.intel.com>
Message-ID: <641544.10718.qm@web111213.mail.gq1.yahoo.com>

yes. I am able to run them.
Thanks a lot.
Bill

--- On Tue, 8/4/09, Tung, Chien Tin <chien.tin.tung at intel.com> wrote:

> From: Tung, Chien Tin <chien.tin.tung at intel.com>
> Subject: RE: [ofa-general] perftest for Chelsio RNIC adapters
> To: "Bill N" <ofedrnicuser at yahoo.com>, "OFED General" <general at lists.openfabrics.org>
> Date: Tuesday, August 4, 2009, 2:25 PM
>  
> >Is performance tests of the perftest-1.2 supported for
> Chelsio 
> >and other RNIC adapters?
> 
> You can run ib_rdma_bw and ib_rdma_lat over iWarp adapters
> with -c flag (use RDMA CM).
> 
> Chien


From bart.vanassche at gmail.com  Tue Aug  4 11:25:35 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 20:25:35 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adafxc7mtn8.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
Message-ID: <e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>

On Tue, Aug 4, 2009 at 6:27 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  > An update: apparently it is possible to trigger scmnd->device == NULL even
>  > without triggering a prior IB CM disconnect. The following shell commands
>  > are sufficient to trigger the WARN_ON statement in the patch below:
>
>  > rmmod ib_srp
>  > modprobe ib_srp
>  > ibsrpdm -c | while read target_info; do echo "${target_info}"; echo
>  > "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target;
>  > done
>  > sg_reset -d ${srp_device}
>
> So in other words, just sg_reset on an SRP device triggers the warning?

By the way, Vladislav Bolkhovitin was so kind to inform me that this
issue is not specific to the SRP initiator. For more information, see
also http://thread.gmane.org/gmane.linux.scsi/26166.

Bart.


From eli at dev.mellanox.co.il  Tue Aug  4 12:41:25 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Tue, 4 Aug 2009 22:41:25 +0300
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To: <adak51jmuo6.fsf@cisco.com>
References: <20090803092528.GA25528@mtls03> <adak51kod06.fsf@cisco.com>
	<20090804033221.GA30949@mtls03> <adak51jmuo6.fsf@cisco.com>
Message-ID: <20090804194125.GA29370@mtls03>

On Tue, Aug 04, 2009 at 09:05:13AM -0700, Roland Dreier wrote:
> 
>  > Maybe it's just a loose connection but yet, it seems to me that
>  > operations on id_priv->mc_list should be protected. Should I send a
>  > different patch?
> 
> "seems ... should be" is very weak justification for locking.  What
> should they be protected from?
> 

What if rdma_join_multicast() is called while rdma_destroy_id() is
running - for example from cma_ib_handler() due to an error returned
from the handler? In this case list_add(&mc->list, &id_priv->mc_list)
in rdma_join_multicast() may be executed concurrently with the list
manipulation done in cma_leave_mc_groups().

Generally, it looks strange that in some places list handling is
protected with a spinlock and in other places not.
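
A minimal userspace sketch of the kind of protection being asked about,
assuming a single lock covers both the join path and the teardown path.
The names id_ctx, mc_entry, id_join and id_leave_all are hypothetical
stand-ins and are not the CMA code; a pthread mutex is used here in place
of a kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the multicast entry and the id private data. */
struct mc_entry {
	struct mc_entry *next;
	int group;
};

struct id_ctx {
	pthread_mutex_t lock;      /* guards mc_list and destroying */
	struct mc_entry *mc_list;
	int destroying;            /* set once teardown has started */
};

void id_ctx_init(struct id_ctx *id)
{
	pthread_mutex_init(&id->lock, NULL);
	id->mc_list = NULL;
	id->destroying = 0;
}

/* Join path: returns 0 on success, -1 on OOM or if teardown already began. */
int id_join(struct id_ctx *id, int group)
{
	struct mc_entry *mc = malloc(sizeof(*mc));

	if (!mc)
		return -1;
	mc->group = group;

	pthread_mutex_lock(&id->lock);
	if (id->destroying) {
		pthread_mutex_unlock(&id->lock);
		free(mc);
		return -1;
	}
	mc->next = id->mc_list;    /* list update happens under the lock */
	id->mc_list = mc;
	pthread_mutex_unlock(&id->lock);
	return 0;
}

/* Teardown path: detach the list under the lock, free it outside the lock.
 * Returns the number of entries released. */
int id_leave_all(struct id_ctx *id)
{
	struct mc_entry *mc, *next;
	int n = 0;

	pthread_mutex_lock(&id->lock);
	id->destroying = 1;
	mc = id->mc_list;
	id->mc_list = NULL;
	pthread_mutex_unlock(&id->lock);

	for (; mc; mc = next) {
		next = mc->next;
		free(mc);
		n++;
	}
	return n;
}
```

With this shape a join that races with teardown either lands on the list
before it is detached, or sees the destroying flag and fails; it never
modifies the list while teardown is walking it.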


From sashak at voltaire.com  Tue Aug  4 13:15:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 23:15:05 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
Message-ID: <20090804201505.GI7993@me>

On 12:45 Tue 04 Aug     , Hal Rosenstock wrote:
> >
> > > This patch also introduces a new config option (max_smps_per_node)
> > > which indicates how deep the per node pipeline is (current default is 4).
> > > This also has the effect of limiting the number of times that the switch
> > > list is traversed. Maybe this embellishment is unnecessary.
> >
> > Then why is it needed?
> 
> 
> Also, as was discussed in the thread on this, it gives a way to control
> possible VL15 overflow.

VL15 overflow is controlled by max_wire_smps not by max_smps_per_node.

> > I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and
> > osm_ucast_pipeline_tbl).
> 
> 
> Why not ? What's the matter with doing this ?

To not expose this (LFTs setup) algorithm to routing engines. And to
eliminate duplicated function calls.

> > Why to not use a single function and update all
> > routing engines appropriately (you need to do it anyway), so that this
> > will only fill up new_lfts table?
> 
> 
> I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop
> whereas pipeline_tbl starts the cascade of LFT sets based on
> max_smps_per_node.

You can set up the new_lfts arrays in the routing engines and, at the end
of the cycle, call a single osm_*setup*_lfts() which will do everything:
set up the TOPs and start the LFT block updates.

> > > +
> > > +             p_path =
> > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > > +
> > > +             mad_context.lft_context.node_guid = node_guid;
> > > +             mad_context.lft_context.set_method = TRUE;
> > > +
> > > +             osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> > > +                                       &mad_context);
> > > +
> > > +             p_sw->lft_block_id_ho++;
> >
> > Wouldn't it be simpler to encode block_id in a mad context?
> 
> 
> Why simpler ? I think it complicates the receiver code to do that (assuming
> max_smps_per_node remains).

Ok.

> > I suppose that this breaks 'file' routing engine (did you test it?) -
> > instead of switch LFTs setup this will only update its TOPs.
> 
> At this point, I don't recall.

You removed osm_ucast_mgr_set_fwd_table() calls and placed
osm_ucast_mgr_set_fwd_tbl_top() instead - obviously nothing will run an
actual LFT blocks setup.

> > Is it possible (for example in case of send errors) that "partial" LFT
> > blocks sending will trigger wait_for_pending_transaction() completion?
> 
> 
> I don't know. Is this different from the original algorithm in the case of
> send errors ?

Yes, it is different: unlike the original code, it leaves the ucast mgr
(and goes to wait in wait_for_pending()) before all required LFT block
update requests have been sent.

Sasha
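
For readers following the thread, the priming loop under discussion can be
sketched as a small standalone simulation. This is only an illustration
under stated assumptions: sw_state and prime_pipeline are hypothetical
names, "sending" a block is modeled as a counter increment, and neither MAD
responses nor max_wire_smps are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-switch state; not the osm_switch_t layout. */
struct sw_state {
	unsigned next_block;   /* next LFT block to send */
	unsigned num_blocks;   /* total blocks this switch needs */
};

/*
 * Prime the pipeline: each pass over the switch list sends at most one
 * block per switch. depth plays the role of max_smps_per_node; 0 means
 * no per-switch limit. Returns the number of sends issued.
 */
unsigned prime_pipeline(struct sw_state *sw, size_t n, unsigned depth)
{
	unsigned sent = 0;
	unsigned pass;

	for (pass = 0; !depth || pass < depth; pass++) {
		unsigned updated = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			if (sw[i].next_block < sw[i].num_blocks) {
				sw[i].next_block++;   /* "send" one block */
				updated++;
			}
		}
		if (!updated)
			break;   /* no switch had anything left to send */
		sent += updated;
	}
	return sent;
}
```

With a nonzero depth the loop can return while some switches still have
unsent blocks, which is the situation raised in the question about
wait_for_pending_transaction() above.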


From hal.rosenstock at gmail.com  Tue Aug  4 13:44:06 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 16:44:06 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090804201505.GI7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
Message-ID: <f0e08f230908041344h2f3304ay78aa9221918fc035@mail.gmail.com>

On Tue, Aug 4, 2009 at 4:15 PM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 12:45 Tue 04 Aug     , Hal Rosenstock wrote:
> > >
> > > > This patch also introduces a new config option (max_smps_per_node)
> > > > which indicates how deep the per node pipeline is (current default is
> 4).
> > > > This also has the effect of limiting the number of times that the
> switch
> > > > list is traversed. Maybe this embellishment is unnecessary.
> > >
> > > Then why is it needed?
> >
> >
> > Also, as was discussed in the thread on this, it gives a way to control
> > possible VL15 overflow.
>
> VL15 overflow is controlled by max_wire_smps not by max_smps_per_node.


It's a different control on VL15 overflow. It can easily be eliminated if
that's what you want. There's actually some minor simplification with doing
this.


>
>
> > > I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and
> > > osm_ucast_pipeline_tbl).
> >
> >
> > Why not ? What's the matter with doing this ?
>
> To not expose this (LFTs setup) algorithm to routing engines. And to
> eliminate duplicated function calls.
>
> > > Why to not use a single function and update all
> > > routing engines appropriately (you need to do it anyway), so that this
> > > will only fill up new_lfts table?
> >
> >
> > I'm not following what you're describing. set_fwd_tbl_top sets
> LinearFDBTop
> > whereas pipeline_tbl starts the cascade of LFT sets based on
> > max_smps_per_node.
>
> You can setup new_lfts arrays in routing engines and at the end of cycle
> call single osm_*setup*_lfts() which will do everything - setup TOPs and
> start to run LFT blocks update.
>
> > > > +
> > > > +             p_path =
> > > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > > > +
> > > > +             mad_context.lft_context.node_guid = node_guid;
> > > > +             mad_context.lft_context.set_method = TRUE;
> > > > +
> > > > +             osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> > > > +                                       &mad_context);
> > > > +
> > > > +             p_sw->lft_block_id_ho++;
> > >
> > > Wouldn't it be simpler to encode block_id in a mad context?
> >
> >
> > Why simpler ? I think it complicates the receiver code to do that
> (assuming
> > max_smps_per_node remains).
>
> Ok.
>
> > > I suppose that this breaks 'file' routing engine (did you test it?) -
> > > instead of switch LFTs setup this will only update its TOPs.
> >
> > At this point, I don't recall.
>
> You removed osm_ucast_mgr_set_fwd_table() calls and placed
> osm_ucast_mgr_set_fwd_tbl_top() instead - obviously nothing will run an
> actual LFT blocks setup.


>
> > > Is it possible (for example in case of send errors) that "partial" LFT
> > > blocks sending will trigger wait_for_pending_transaction() completion?
> >
> >
> > I don't know. Is this different from the original algorithm in the case
> of
> > send errors ?
>
> Yes, it is different - unlike the original code it leaves ucast mgr (and
> go to wait in wait_for_pending()) before all required LFT blocks update
> requests were sent.


This goes away if there is no max_smps_per_node support.

So do you want to also preserve the original behavior/algorithm or you have
no preference ?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Tue Aug  4 13:59:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 4 Aug 2009 23:59:50 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets  across switches
In-Reply-To: <f0e08f230908041344h2f3304ay78aa9221918fc035@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908041344h2f3304ay78aa9221918fc035@mail.gmail.com>
Message-ID: <20090804205950.GJ7993@me>

On 16:44 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> So do you want to also preserve the original behavior/algorithm or you have
> no preference ?

No need unless there is a reason for doing this.

Sasha


From rdreier at cisco.com  Tue Aug  4 14:39:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Aug 2009 14:39:01 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP initiator triggered by a SCSI reset after the SRP
	connection has been closed
In-Reply-To: <e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	(Bart Van Assche's message of "Tue, 4 Aug 2009 20:25:35 +0200")
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
Message-ID: <adaljlzl0ne.fsf@cisco.com>


 > By the way, Vladislav Bolkhovitin was so kind to inform me that this
 > issue is not specific to the SRP initiator. For more information, see
 > also http://thread.gmane.org/gmane.linux.scsi/26166.

I'm not sure I follow this exactly -- the idea is that sg_reset
generates SCSI commands that are somehow different?  What does the LLD
have to do to handle them?

Is the problem that we get a command with bogus host_scribble (since SRP
never saw it before) and so srp_find_req() gets confused?

 - R.


From hnrose at comcast.net  Tue Aug  4 14:39:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 17:39:05 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Directly call
	calloc/free rather than create/delete_cdg
Message-ID: <20090804213905.GA23497@comcast.net>


Reduce call stack by one call level

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 6210477..168a758 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -62,20 +62,6 @@ typedef struct _reachable_dest {
 	struct _reachable_dest *next;
 } reachable_dest_t;
 
-static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
-{
-	cdg_vertex_t *v;
-
-	v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
-
-	return v;
-}
-
-static void delete_cdg_vertex(cdg_vertex_t *v)
-{
-	free(v);
-}
-
 static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
@@ -207,7 +193,7 @@ static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
 
 			cdg_vertex_matrix[lane][sw][i_next_switch] = NULL;
 
-			delete_cdg_vertex(v);
+			free(v);
 		} else {
 			v->num_using_vertex--;
 			if (i_next_switch != dest_switch) {
@@ -352,7 +338,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
 	while (sw != dest_switch) {
 
 		if (cdg_vertex_matrix[lane][sw][next_switch] == NULL) {
-			v = create_cdg_vertex(num_switches);
+			v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
 			v->from = sw;
 			v->to = next_switch;
 			v->temp = 1;
@@ -442,7 +428,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch,
 
 		if (v->temp == 1) {
 			cdg_vertex_matrix[lane][sw][next_switch] = NULL;
-			delete_cdg_vertex(v);
+			free(v);
 		} else {
 			CL_ASSERT(v->num_temp_depend <= v->num_deps);
 			v->num_deps = v->num_deps - v->num_temp_depend;
@@ -684,7 +670,7 @@ static void free_lash_structures(lash_t * p_lash)
 		for (j = 0; j < num_switches; j++) {
 			for (k = 0; k < num_switches; k++)
 				if (p_lash->cdg_vertex_matrix[i][j][k])
-					delete_cdg_vertex(p_lash->cdg_vertex_matrix[i][j][k]);
+					free(p_lash->cdg_vertex_matrix[i][j][k]);
 			if (p_lash->cdg_vertex_matrix[i][j])
 				free(p_lash->cdg_vertex_matrix[i][j]);
 		}
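
A note on the allocation pattern in the hunks above: the size expression
allocates a struct with a trailing array sized by the number of switches,
and the generate_cdg_for_sp() hunk as posted assigns v->from right after
the calloc. A minimal sketch of the same pattern with a NULL return the
caller can test, assuming a hypothetical struct vertex (this is not the
cdg_vertex_t definition) and a hypothetical helper vertex_alloc:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical vertex with a one-element trailing array, matching the
 * shape implied by "(num_switches - 1) * sizeof(v->deps[0])". */
struct vertex {
	int from;
	int to;
	int num_deps;
	int deps[1];   /* storage for num_switches entries is allocated */
};

/* Returns a zeroed vertex with room for num_switches deps, or NULL. */
struct vertex *vertex_alloc(unsigned num_switches)
{
	struct vertex *v;

	if (num_switches == 0)
		return NULL;   /* avoid unsigned wrap in the size below */
	v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
	return v;              /* caller must check for NULL before use */
}
```

Whether the allocation is wrapped in a helper or written inline, the
caller still has to test the result before dereferencing it.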


From hnrose at comcast.net  Tue Aug  4 14:44:13 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 4 Aug 2009 17:44:13 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_lin_fwd_rcv.c: Commentary
	change
Message-ID: <20090804214413.GA24878@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c
index 2edb8d3..ae40b0d 100644
--- a/opensm/opensm/osm_lin_fwd_rcv.c
+++ b/opensm/opensm/osm_lin_fwd_rcv.c
@@ -36,7 +36,7 @@
 /*
  * Abstract:
  *    Implementation of osm_lft_rcv_t.
- * This object represents the NodeDescription Receiver object.
+ * This object represents the Linear Forwarding Table Receiver object.
  * This object is part of the opensm family of objects.
  */
 

From arlin.r.davis at intel.com  Tue Aug  4 22:32:03 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Tue, 4 Aug 2009 22:32:03 -0700
Subject: [ofa-general] [PATCH] uDAPL v2: CNO pre-triggered events not
	delivered during cno_wait
Message-ID: <53ED9F5E1BB14E13BDA0BE594E3F43B6@amr.corp.intel.com>

CNO events that are triggered before the wait starts are not returned by
cno_wait. Check for the triggered state before going to sleep in cno_wait,
and reset the triggered EVD reference after reporting it.

diff --git a/dapl/udapl/dapl_cno_wait.c b/dapl/udapl/dapl_cno_wait.c
index e89317d..6bbd249 100644
--- a/dapl/udapl/dapl_cno_wait.c
+++ b/dapl/udapl/dapl_cno_wait.c
@@ -82,6 +82,14 @@ DAT_RETURN DAT_API dapl_cno_wait(IN DAT_CNO_HANDLE cno_handle,	/* cno_handle */
 	}
 
 	dapl_os_lock(&cno_ptr->header.lock);
+	if (cno_ptr->cno_state == DAPL_CNO_STATE_TRIGGERED) {
+		cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED;
+		*evd_handle = cno_ptr->cno_evd_triggered;
+		cno_ptr->cno_evd_triggered = NULL;
+		dapl_os_unlock(&cno_ptr->header.lock);
+		goto bail;
+	}
+
 	while (cno_ptr->cno_state == DAPL_CNO_STATE_UNTRIGGERED
 	       && DAT_GET_TYPE(dat_status) != DAT_TIMEOUT_EXPIRED) {
 		cno_ptr->cno_waiters++;
@@ -104,6 +112,7 @@ DAT_RETURN DAT_API dapl_cno_wait(IN DAT_CNO_HANDLE cno_handle,	/* cno_handle */
 		dapl_os_assert(cno_ptr->cno_state == DAPL_CNO_STATE_TRIGGERED);
 		cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED;
 		*evd_handle = cno_ptr->cno_evd_triggered;
+		cno_ptr->cno_evd_triggered = NULL;
 	} else if (DAT_GET_TYPE(dat_status) == DAT_TIMEOUT_EXPIRED) {
 		cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED;
 		*evd_handle = NULL;
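
The check-before-sleep idea in this patch can be sketched generically with
a pthread condition variable. This is an illustration only, assuming a
hypothetical struct notifier with notifier_trigger and notifier_wait; it is
not the uDAPL CNO implementation and it omits the timeout handling.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical notifier, loosely modeled on the two CNO states. */
struct notifier {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int triggered;   /* 1 once an event source has fired */
	void *source;    /* which source fired; NULL when none */
};

void notifier_init(struct notifier *n)
{
	pthread_mutex_init(&n->lock, NULL);
	pthread_cond_init(&n->cond, NULL);
	n->triggered = 0;
	n->source = NULL;
}

void notifier_trigger(struct notifier *n, void *source)
{
	pthread_mutex_lock(&n->lock);
	n->triggered = 1;
	n->source = source;
	pthread_cond_signal(&n->cond);
	pthread_mutex_unlock(&n->lock);
}

/* Returns the source that fired. The state is tested before sleeping,
 * so a trigger that arrived before the call is delivered immediately. */
void *notifier_wait(struct notifier *n)
{
	void *source;

	pthread_mutex_lock(&n->lock);
	while (!n->triggered)
		pthread_cond_wait(&n->cond, &n->lock);
	source = n->source;
	n->triggered = 0;
	n->source = NULL;   /* do not report the same source twice */
	pthread_mutex_unlock(&n->lock);
	return source;
}
```

Calling notifier_trigger() first and notifier_wait() afterwards returns the
stored source without blocking, which is the pre-triggered case the patch
description refers to.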


From arlin.r.davis at intel.com  Tue Aug  4 22:32:06 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 4 Aug 2009 22:32:06 -0700
Subject: [ofa-general] [PATCH] uDAPL v2: fix dtest to handle CNO events
	properly
Message-ID: <E3280858FA94444CA49D2BA02341C983567FA159@orsmsx506.amr.corp.intel.com>


Modify dtest.c to clean up the CNO wait code and consolidate it into a
collect_event() call. After waking up from a CNO wait the consumer
must check all EVDs: notifications for EVDs under the CNO could be
dropped if the CNO is already triggered, or could arrive in any order.
DT_RetToString is renamed to DT_RetToStr and DT_EventToSTr is
renamed to DT_EventToStr for consistency.

diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index 739ccca..d868490 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -104,6 +104,7 @@
 /* definitions */
 #define SERVER_CONN_QUAL  45248
 #define DTO_TIMEOUT       (1000*1000*5)
+#define CNO_TIMEOUT       (1000*1000*1)
 #define DTO_FLUSH_TIMEOUT (1000*1000*2)
 #define CONN_TIMEOUT      (1000*1000*10)
 #define SERVER_TIMEOUT    DAT_TIMEOUT_INFINITE
@@ -208,8 +209,8 @@ static int burst_msg_posted = 0;
 static int burst_msg_index = 0;

 /* forward prototypes */
-const char *DT_RetToString(DAT_RETURN ret_value);
-const char *DT_EventToSTr(DAT_EVENT_NUMBER event_code);
+const char *DT_RetToStr(DAT_RETURN ret_value);
+const char *DT_EventToStr(DAT_EVENT_NUMBER event_code);
 void print_usage(void);
 double get_time(void);
 void init_data(void);
@@ -262,6 +263,51 @@ void flush_evds(void)
        }
 }

+
+static inline DAT_RETURN
+collect_event(DAT_EVD_HANDLE dto_evd,
+             DAT_EVENT *event,
+             DAT_TIMEOUT timeout,
+             int *counter)
+{
+       DAT_EVD_HANDLE  evd = DAT_HANDLE_NULL;
+       DAT_COUNT       nmore;
+       DAT_RETURN      ret = DAT_SUCCESS;
+
+       if (use_cno) {
+retry:
+               /* CNO wait could return EVD's in any order and
+                * may drop some EVD notification's if already
+                * triggered. Once woken, simply dequeue the
+                * Evd the caller wants to collect and return.
+                * If notification without EVD, retry.
+                */
+               ret = dat_cno_wait(h_dto_cno, CNO_TIMEOUT, &evd);
+               if (dat_evd_dequeue(dto_evd, event) != DAT_SUCCESS) {
+                       if (ret == DAT_SUCCESS)
+                               printf(" WARNING: CNO notification:"
+                                      " without EVD?\n");
+                       goto retry;
+               }
+               ret = DAT_SUCCESS; /* cno timed out, but EVD dequeued */
+
+       } else if (!polling) {
+
+               /* use wait to dequeue */
+               ret = dat_evd_wait(dto_evd, timeout, 1, event, &nmore);
+               if (ret != DAT_SUCCESS)
+                       fprintf(stderr,
+                               "Error waiting on h_dto_evd %p: %s\n",
+                               dto_evd, DT_RetToStr(ret));
+
+       } else {
+               while (dat_evd_dequeue(dto_evd, event) == DAT_QUEUE_EMPTY)
+                       if (counter)
+                               (*counter)++;
+       }
+       return (ret);
+}
+
 int main(int argc, char **argv)
 {
        int i, c;
@@ -355,7 +401,7 @@ int main(int argc, char **argv)
        time.open += ((stop - start) * 1.0e6);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d: Error Adaptor open: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                exit(1);
        } else
                LOGPRINTF("%d Opened Interface Adaptor\n", getpid());
@@ -368,7 +414,7 @@ int main(int argc, char **argv)
        time.pzc += ((stop - start) * 1.0e6);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error creating Protection Zone: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                exit(1);
        } else
                LOGPRINTF("%d Created Protection Zone\n", getpid());
@@ -378,7 +424,7 @@ int main(int argc, char **argv)
        ret = register_rdma_memory();
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error registering RDMA memory: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d Register RDMA memory done\n", getpid());
@@ -387,7 +433,7 @@ int main(int argc, char **argv)
        ret = create_events();
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error creating events: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else {
                LOGPRINTF("%d Create events done\n", getpid());
@@ -419,7 +465,7 @@ int main(int argc, char **argv)
        time.total += time.epc;
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_ep_create: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d EP created %p \n", getpid(), h_ep);
@@ -431,7 +477,7 @@ int main(int argc, char **argv)
        ret = connect_ep(hostname, SERVER_CONN_QUAL);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error connect_ep: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d connect_ep complete\n", getpid());
@@ -440,7 +486,7 @@ int main(int argc, char **argv)
        ret = dat_ep_query(h_ep, DAT_EP_FIELD_ALL, &ep_param);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_ep_query: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d EP queried %p \n", getpid(), h_ep);
@@ -483,7 +529,7 @@ int main(int argc, char **argv)
        ret = do_rdma_write_with_msg();
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error do_rdma_write_with_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d do_rdma_write_with_msg complete\n", getpid());
@@ -492,7 +538,7 @@ int main(int argc, char **argv)
        ret = do_rdma_read_with_msg();
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error do_rdma_read_with_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else
                LOGPRINTF("%d do_rdma_read_with_msg complete\n", getpid());
@@ -501,7 +547,7 @@ int main(int argc, char **argv)
        ret = do_ping_pong_msg();
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error do_ping_pong_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                goto cleanup;
        } else {
                LOGPRINTF("%d do_ping_pong_msg complete\n", getpid());
@@ -528,7 +574,7 @@ complete:
                time.total += time.epf;
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing EP: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                } else {
                        LOGPRINTF("%d Freed EP\n", getpid());
                        h_ep = DAT_HANDLE_NULL;
@@ -540,7 +586,7 @@ complete:
        ret = destroy_events();
        if (ret != DAT_SUCCESS)
                fprintf(stderr, "%d Error destroy_events: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
        else
                LOGPRINTF("%d destroy events done\n", getpid());

@@ -548,7 +594,7 @@ complete:
        LOGPRINTF("%d unregister_rdma_memory \n", getpid());
        if (ret != DAT_SUCCESS)
                fprintf(stderr, "%d Error unregister_rdma_memory: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
        else
                LOGPRINTF("%d unregister_rdma_memory done\n", getpid());

@@ -560,7 +606,7 @@ complete:
        time.pzf += ((stop - start) * 1.0e6);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error freeing PZ: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
        } else {
                LOGPRINTF("%d Freed pz\n", getpid());
                h_pz = NULL;
@@ -574,7 +620,7 @@ complete:
        time.close += ((stop - start) * 1.0e6);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d: Error Adaptor close: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
        } else
                LOGPRINTF("%d Closed Interface Adaptor\n", getpid());

@@ -652,7 +698,6 @@ send_msg(void *data,
 {
        DAT_LMR_TRIPLET iov;
        DAT_EVENT event;
-       DAT_COUNT nmore;
        DAT_RETURN ret;

        iov.lmr_context = context;
@@ -669,47 +714,23 @@ send_msg(void *data,

        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d: ERROR: dat_ep_post_send() %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return ret;
        }

        if (!(flags & DAT_COMPLETION_SUPPRESS_FLAG)) {
-               if (polling) {
-                       printf("%d Polling post send completion...\n",
-                              getpid());
-                       while (dat_evd_dequeue(h_dto_req_evd, &event) ==
-                              DAT_QUEUE_EMPTY) ;
-               } else {
-                       LOGPRINTF("%d waiting for post_send completion event\n",
-                                 getpid());
-                       if (use_cno) {
-                               DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-                               ret =
-                                   dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                               LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                         getpid(), evd);
-                               if (evd != h_dto_req_evd) {
-                                       /* CNO timeout, already on EVD */
-                                       if (evd != NULL)
-                                               return (ret);
-                               }
-                       }
-                       /* use wait to dequeue */
-                       ret =
-                           dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event,
-                                        &nmore);
-                       if (ret != DAT_SUCCESS) {
-                               fprintf(stderr,
-                                       "%d: ERROR: DTO dat_evd_wait() %s\n",
-                                       getpid(), DT_RetToString(ret));
-                               return ret;
-                       }
-               }
+
+               if (collect_event(h_dto_req_evd,
+                                 &event,
+                                 DTO_TIMEOUT,
+                                 &poll_count) != DAT_SUCCESS)
+                       return (DAT_ABORT);

                /* validate event number, len, cookie, and status */
                if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                        fprintf(stderr, "%d: ERROR: DTO event number %s\n",
-                               getpid(), DT_EventToSTr(event.event_number));
+                               getpid(),
+                               DT_EventToStr(event.event_number));
                        return (DAT_ABORT);
                }

@@ -730,7 +751,7 @@ send_msg(void *data,
                if (event.event_data.dto_completion_event_data.status !=
                    DAT_SUCCESS) {
                        fprintf(stderr, "%d: ERROR: DTO event status %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (DAT_ABORT);
                }
        }
@@ -772,7 +793,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)

        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error registering send msg buffer: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else
                LOGPRINTF("%d Registered send Message Buffer %p \n",
@@ -796,7 +817,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                             &registered_addr_recv_msg);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error registering recv msg buffer: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else
                LOGPRINTF("%d Registered Receive Message Buffer %p\n",
@@ -823,7 +844,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d Error registering recv msg buffer: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else
                        LOGPRINTF("%d Registered Receive Message Buffer %p\n",
@@ -846,7 +867,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                                     h_cr_evd, DAT_PSP_CONSUMER_FLAG, &h_psp);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_psp_create: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else
                        LOGPRINTF("%d dat_psp_created for server listen\n",
@@ -858,7 +879,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                ret = dat_evd_wait(h_cr_evd, SERVER_TIMEOUT, 1, &event, &nmore);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_evd_wait: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else
                        LOGPRINTF("%d dat_evd_wait for cr_evd completed\n",
@@ -866,7 +887,8 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)

                if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) {
                        fprintf(stderr, "%d Error unexpected cr event : %s\n",
-                               getpid(), DT_EventToSTr(event.event_number));
+                               getpid(),
+                               DT_EventToStr(event.event_number));
                        return (DAT_ABORT);
                }
                if ((event.event_data.cr_arrival_event_data.conn_qual !=
@@ -874,7 +896,8 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                    || (event.event_data.cr_arrival_event_data.sp_handle.
                        psp_handle != h_psp)) {
                        fprintf(stderr, "%d Error wrong cr event data : %s\n",
-                               getpid(), DT_EventToSTr(event.event_number));
+                               getpid(),
+                               DT_EventToStr(event.event_number));
                        return (DAT_ABORT);
                }

@@ -922,7 +945,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)

                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_cr_accept: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else
                        LOGPRINTF("%d dat_cr_accept completed\n", getpid());
@@ -966,7 +989,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
                                     0, DAT_CONNECT_DEFAULT_FLAG);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_ep_connect: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else
                        LOGPRINTF("%d dat_ep_connect completed\n", getpid());
@@ -990,7 +1013,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 #ifdef TEST_REJECT_WITH_PRIVATE_DATA
        if (event.event_number != DAT_CONNECTION_EVENT_PEER_REJECTED) {
                fprintf(stderr, "%d expected conn reject event : %s\n",
-                       getpid(), DT_EventToSTr(event.event_number));
+                       getpid(), DT_EventToStr(event.event_number));
                return (DAT_ABORT);
        }
        /* get the reject private data and validate */
@@ -1013,7 +1036,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
        if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) {
                fprintf(stderr, "%d Error unexpected conn event : 0x%x %s\n",
                        getpid(), event.event_number,
-                       DT_EventToSTr(event.event_number));
+                       DT_EventToStr(event.event_number));
                return (DAT_ABORT);
        }

@@ -1064,7 +1087,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)

        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error send_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else
                LOGPRINTF("%d send_msg completed\n", getpid());
@@ -1072,42 +1095,17 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
        /*
         *  Wait for remote RMR information for RDMA
         */
-       if (polling) {
-               printf("%d Polling for remote to send RMR data\n", getpid());
-               while (dat_evd_dequeue(h_dto_rcv_evd, &event) ==
-                      DAT_QUEUE_EMPTY) ;
-       } else {
-               printf("%d Waiting for remote to send RMR data\n", getpid());
-               if (use_cno) {
-                       DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-                       ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                       LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                 getpid(), evd);
-                       if (evd != h_dto_rcv_evd) {
-                               /* CNO timeout, already on EVD */
-                               if (evd != NULL)
-                                       return (ret);
-                       }
-               }
-               /* use wait to dequeue */
-               ret =
-                   dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore);
-               if (ret != DAT_SUCCESS) {
-                       fprintf(stderr,
-                               "%d Error waiting on h_dto_rcv_evd: %s\n",
-                               getpid(), DT_RetToString(ret));
-                       return (ret);
-               } else {
-                       LOGPRINTF("%d dat_evd_wait h_dto_rcv_evd completed\n",
-                                 getpid());
-               }
-       }
-
+       if (collect_event(h_dto_rcv_evd,
+                         &event,
+                         DTO_TIMEOUT,
+                         &poll_count) != DAT_SUCCESS)
+               return (DAT_ABORT);
+
        printf("%d remote RMR data arrived!\n", getpid());

        if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                fprintf(stderr, "%d Error unexpected DTO event : %s\n",
-                       getpid(), DT_EventToSTr(event.event_number));
+                       getpid(), DT_EventToStr(event.event_number));
                return (DAT_ABORT);
        }
        if ((event.event_data.dto_completion_event_data.transfered_length !=
@@ -1162,7 +1160,7 @@ void disconnect_ep(void)
                        if (ret != DAT_SUCCESS) {
                                fprintf(stderr,
                                        "%d Error dat_ep_disconnect: %s\n",
-                                       getpid(), DT_RetToString(ret));
+                                       getpid(), DT_RetToStr(ret));
                        } else {
                                LOGPRINTF("%d dat_ep_disconnect completed\n",
                                          getpid());
@@ -1177,7 +1175,7 @@ void disconnect_ep(void)
                                 &nmore);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_evd_wait: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                } else {
                        LOGPRINTF("%d dat_evd_wait for h_conn_evd completed\n",
                                  getpid());
@@ -1189,7 +1187,7 @@ void disconnect_ep(void)
                ret = dat_psp_free(h_psp);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_psp_free: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                } else {
                        LOGPRINTF("%d dat_psp_free completed\n", getpid());
                }
@@ -1203,7 +1201,7 @@ void disconnect_ep(void)
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d Error deregistering send msg mr: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                } else {
                        LOGPRINTF("%d Unregistered send message Buffer\n",
                                  getpid());
@@ -1219,7 +1217,7 @@ void disconnect_ep(void)
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d Error deregistering recv msg mr: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                } else {
                        LOGPRINTF("%d Unregistered recv message Buffer\n",
                                  getpid());
@@ -1232,7 +1230,6 @@ void disconnect_ep(void)
 DAT_RETURN do_rdma_write_with_msg(void)
 {
        DAT_EVENT event;
-       DAT_COUNT nmore;
        DAT_LMR_TRIPLET l_iov[MSG_IOV_COUNT];
        DAT_RMR_TRIPLET r_iov;
        DAT_DTO_COOKIE cookie;
@@ -1277,7 +1274,7 @@ DAT_RETURN do_rdma_write_with_msg(void)
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d: ERROR: dat_ep_post_rdma_write() %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (DAT_ABORT);
                }
                LOGPRINTF("%d rdma_write # %d completed\n", getpid(), i + 1);
@@ -1296,41 +1293,19 @@ DAT_RETURN do_rdma_write_with_msg(void)

        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error send_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d send_msg completed\n", getpid());
        }

-       /*
-        *  Collect first event, write completion or the inbound recv
-        */
-       if (polling) {
-               while (dat_evd_dequeue(h_dto_rcv_evd, &event) ==
-                      DAT_QUEUE_EMPTY)
-                       rdma_wr_poll_count++;
-       } else {
-               LOGPRINTF("%d waiting for message receive event\n", getpid());
-               if (use_cno) {
-                       DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-                       ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                       LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                 getpid(), evd);
-                       if (evd != h_dto_rcv_evd) {
-                               /* CNO timeout, already on EVD */
-                               if (evd != NULL)
-                                       return (ret);
-                       }
-               }
-               /* use wait to dequeue */
-               ret =
-                   dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore);
-               if (ret != DAT_SUCCESS) {
-                       fprintf(stderr, "%d: ERROR: DTO dat_evd_wait() %s\n",
-                               getpid(), DT_RetToString(ret));
-                       return (ret);
-               }
-       }
+       /* inbound recv event, send completion's suppressed */
+       if (collect_event(h_dto_rcv_evd,
+                         &event,
+                         DTO_TIMEOUT,
+                         &rdma_wr_poll_count) != DAT_SUCCESS)
+               return (DAT_ABORT);
+
        stop = get_time();
        time.rdma_wr = ((stop - start) * 1.0e6);

@@ -1338,7 +1313,7 @@ DAT_RETURN do_rdma_write_with_msg(void)
        printf("%d inbound rdma_write; send message arrived!\n", getpid());
        if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                fprintf(stderr, "%d Error unexpected DTO event : %s\n",
-                       getpid(), DT_EventToSTr(event.event_number));
+                       getpid(), DT_EventToStr(event.event_number));
                return (DAT_ABORT);
        }

@@ -1386,7 +1361,6 @@ DAT_RETURN do_rdma_write_with_msg(void)
 DAT_RETURN do_rdma_read_with_msg(void)
 {
        DAT_EVENT event;
-       DAT_COUNT nmore;
        DAT_LMR_TRIPLET l_iov;
        DAT_RMR_TRIPLET r_iov;
        DAT_DTO_COOKIE cookie;
@@ -1425,44 +1399,21 @@ DAT_RETURN do_rdma_read_with_msg(void)
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d: ERROR: dat_ep_post_rdma_read() %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (DAT_ABORT);
                }

-               if (polling) {
-                       while (dat_evd_dequeue(h_dto_req_evd, &event) ==
-                              DAT_QUEUE_EMPTY)
-                               rdma_rd_poll_count[i]++;
-               } else {
-                       LOGPRINTF("%d waiting for rdma_read completion event\n",
-                                 getpid());
-                       if (use_cno) {
-                               DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-                               ret =
-                                   dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                               LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                         getpid(), evd);
-                               if (evd != h_dto_req_evd) {
-                                       /* CNO timeout, already on EVD */
-                                       if (evd != NULL)
-                                               return (ret);
-                               }
-                       }
-                       /* use wait to dequeue */
-                       ret =
-                           dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event,
-                                        &nmore);
-                       if (ret != DAT_SUCCESS) {
-                               fprintf(stderr,
-                                       "%d: ERROR: DTO dat_evd_wait() %s\n",
-                                       getpid(), DT_RetToString(ret));
-                               return ret;
-                       }
-               }
+               /* RDMA read completion event */
+               if (collect_event(h_dto_req_evd,
+                                 &event,
+                                 DTO_TIMEOUT,
+                                 &rdma_rd_poll_count[i]) != DAT_SUCCESS)
+                       return (DAT_ABORT);
+
                /* validate event number, len, cookie, and status */
                if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                        fprintf(stderr, "%d: ERROR: DTO event number %s\n",
-                               getpid(), DT_EventToSTr(event.event_number));
+                               getpid(), DT_EventToStr(event.event_number));
                        return (DAT_ABORT);
                }
                if ((event.event_data.dto_completion_event_data.
@@ -1481,7 +1432,7 @@ DAT_RETURN do_rdma_read_with_msg(void)
                if (event.event_data.dto_completion_event_data.status !=
                    DAT_SUCCESS) {
                        fprintf(stderr, "%d: ERROR: DTO event status %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (DAT_ABORT);
                }
                stop = get_time();
@@ -1513,48 +1464,25 @@ DAT_RETURN do_rdma_read_with_msg(void)

        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error send_msg: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d send_msg completed\n", getpid());
        }

-       /*
-        *  Collect first event, write completion or the inbound recv with immed
-        */
        printf("%d Waiting for inbound message....\n", getpid());
-       if (polling) {
-               while (dat_evd_dequeue(h_dto_rcv_evd, &event) ==
-                      DAT_QUEUE_EMPTY) ;
-       } else {
-               LOGPRINTF("%d waiting for message receive event\n", getpid());
-               if (use_cno) {
-                       DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-
-                       ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                       LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                 getpid(), evd);
-                       if (evd != h_dto_rcv_evd) {
-                               /* CNO timeout, already on EVD */
-                               if (evd != NULL)
-                                       return (ret);
-                       }
-               }
-               /* use wait to dequeue */
-               ret =
-                   dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore);
-               if (ret != DAT_SUCCESS) {
-                       fprintf(stderr, "%d: ERROR: DTO dat_evd_wait() %s\n",
-                               getpid(), DT_RetToString(ret));
-                       return (ret);
-               }
-       }
+
+       if (collect_event(h_dto_rcv_evd,
+                         &event,
+                         DTO_TIMEOUT,
+                         &poll_count) != DAT_SUCCESS)
+               return (DAT_ABORT);

        /* validate event number and status */
        printf("%d inbound rdma_read; send message arrived!\n", getpid());
        if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                fprintf(stderr, "%d Error unexpected DTO event : %s\n",
-                       getpid(), DT_EventToSTr(event.event_number));
+                       getpid(), DT_EventToStr(event.event_number));
                return (DAT_ABORT);
        }

@@ -1603,7 +1531,6 @@ DAT_RETURN do_rdma_read_with_msg(void)
 DAT_RETURN do_ping_pong_msg()
 {
        DAT_EVENT event;
-       DAT_COUNT nmore;
        DAT_DTO_COOKIE cookie;
        DAT_LMR_TRIPLET l_iov;
        DAT_RETURN ret;
@@ -1635,7 +1562,7 @@ DAT_RETURN do_ping_pong_msg()
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr,
                                "%d Error posting recv msg buffer: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Posted Receive Message Buffer %p\n",
@@ -1673,47 +1600,21 @@ DAT_RETURN do_ping_pong_msg()

                        if (ret != DAT_SUCCESS) {
                                fprintf(stderr, "%d Error send_msg: %s\n",
-                                       getpid(), DT_RetToString(ret));
+                                       getpid(), DT_RetToStr(ret));
                                return (ret);
                        } else {
                                LOGPRINTF("%d send_msg completed\n", getpid());
                        }
                }

-               /* Wait for recv message */
-               if (polling) {
-                       poll_count = 0;
-                       LOGPRINTF("%d Polling for message receive event\n",
-                                 getpid());
-                       while (dat_evd_dequeue(h_dto_rcv_evd, &event) ==
-                              DAT_QUEUE_EMPTY)
-                               poll_count++;
-               } else {
-                       LOGPRINTF("%d waiting for message receive event\n",
-                                 getpid());
-                       if (use_cno) {
-                               DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
-                               ret =
-                                   dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd);
-                               LOGPRINTF("%d cno wait return evd_handle=%p\n",
-                                         getpid(), evd);
-                               if (evd != h_dto_rcv_evd) {
-                                       /* CNO timeout, already on EVD */
-                                       if (evd != NULL)
-                                               return (ret);
-                               }
-                       }
-                       /* use wait to dequeue */
-                       ret =
-                           dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event,
-                                        &nmore);
-                       if (ret != DAT_SUCCESS) {
-                               fprintf(stderr,
-                                       "%d: ERROR: DTO dat_evd_wait() %s\n",
-                                       getpid(), DT_RetToString(ret));
-                               return (ret);
-                       }
-               }
+               /* recv message, send completions suppressed */
+               if (collect_event(h_dto_rcv_evd,
+                                 &event,
+                                 DTO_TIMEOUT,
+                                 &poll_count) != DAT_SUCCESS)
+                       return (DAT_ABORT);
+
+
                /* start timer after first message arrives on server */
                if (i == 0) {
                        start = get_time();
@@ -1722,7 +1623,7 @@ DAT_RETURN do_ping_pong_msg()
                LOGPRINTF("%d inbound message; message arrived!\n", getpid());
                if (event.event_number != DAT_DTO_COMPLETION_EVENT) {
                        fprintf(stderr, "%d Error unexpected DTO event : %s\n",
-                               getpid(), DT_EventToSTr(event.event_number));
+                               getpid(), DT_EventToStr(event.event_number));
                        return (DAT_ABORT);
                }
                if ((event.event_data.dto_completion_event_data.
@@ -1762,7 +1663,7 @@ DAT_RETURN do_ping_pong_msg()

                        if (ret != DAT_SUCCESS) {
                                fprintf(stderr, "%d Error send_msg: %s\n",
-                                       getpid(), DT_RetToString(ret));
+                                       getpid(), DT_RetToStr(ret));
                                return (ret);
                        } else {
                                LOGPRINTF("%d send_msg completed\n", getpid());
@@ -1805,7 +1706,7 @@ DAT_RETURN register_rdma_memory(void)
        if (ret != DAT_SUCCESS) {
                fprintf(stderr,
                        "%d Error registering Receive RDMA buffer: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d Registered Receive RDMA Buffer %p\n",
@@ -1827,7 +1728,7 @@ DAT_RETURN register_rdma_memory(void)
                             &registered_size_send, &registered_addr_send);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error registering send RDMA buffer: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d Registered Send RDMA Buffer %p\n",
@@ -1854,7 +1755,7 @@ DAT_RETURN unregister_rdma_memory(void)
                time.total += time.unreg;
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error deregistering recv mr: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Unregistered Recv Buffer\n", getpid());
@@ -1868,7 +1769,7 @@ DAT_RETURN unregister_rdma_memory(void)
                ret = dat_lmr_free(h_lmr_send);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error deregistering send mr: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Unregistered send Buffer\n", getpid());
@@ -1904,7 +1805,7 @@ DAT_RETURN create_events(void)
                time.total += time.cnoc;
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error dat_cno_create: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d cr_evd created, %p\n", getpid(),
@@ -1922,7 +1823,7 @@ DAT_RETURN create_events(void)
        time.total += time.evdc;
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_evd_create: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d cr_evd created %p\n", getpid(), h_cr_evd);
@@ -1935,7 +1836,7 @@ DAT_RETURN create_events(void)
                             DAT_EVD_CONNECTION_FLAG, &h_conn_evd);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_evd_create: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d con_evd created %p\n", getpid(), h_conn_evd);
@@ -1947,7 +1848,7 @@ DAT_RETURN create_events(void)
                             h_dto_cno, DAT_EVD_DTO_FLAG, &h_dto_req_evd);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_evd_create REQ: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d dto_req_evd created %p\n", getpid(),
@@ -1960,7 +1861,7 @@ DAT_RETURN create_events(void)
                             h_dto_cno, DAT_EVD_DTO_FLAG, &h_dto_rcv_evd);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_evd_create RCV: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else {
                LOGPRINTF("%d dto_rcv_evd created %p\n", getpid(),
@@ -1971,7 +1872,7 @@ DAT_RETURN create_events(void)
        ret = dat_evd_query(h_dto_req_evd, DAT_EVD_FIELD_EVD_QLEN, &param);
        if (ret != DAT_SUCCESS) {
                fprintf(stderr, "%d Error dat_evd_query request evd: %s\n",
-                       getpid(), DT_RetToString(ret));
+                       getpid(), DT_RetToStr(ret));
                return (ret);
        } else if (param.evd_qlen < (MSG_BUF_COUNT + MAX_RDMA_RD + burst) * 2) {
                fprintf(stderr, "%d Error dat_evd qsize too small: %d < %d\n",
@@ -2001,7 +1902,7 @@ DAT_RETURN destroy_events(void)
                ret = dat_evd_free(h_cr_evd);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing cr EVD: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Freed cr EVD\n", getpid());
@@ -2015,7 +1916,7 @@ DAT_RETURN destroy_events(void)
                ret = dat_evd_free(h_conn_evd);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing conn EVD: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Freed conn EVD\n", getpid());
@@ -2033,7 +1934,7 @@ DAT_RETURN destroy_events(void)
                time.total += time.evdf;
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing dto EVD: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Freed dto EVD\n", getpid());
@@ -2047,7 +1948,7 @@ DAT_RETURN destroy_events(void)
                ret = dat_evd_free(h_dto_req_evd);
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing dto EVD: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Freed dto EVD\n", getpid());
@@ -2065,7 +1966,7 @@ DAT_RETURN destroy_events(void)
                time.total += time.cnof;
                if (ret != DAT_SUCCESS) {
                        fprintf(stderr, "%d Error freeing dto CNO: %s\n",
-                               getpid(), DT_RetToString(ret));
+                               getpid(), DT_RetToStr(ret));
                        return (ret);
                } else {
                        LOGPRINTF("%d Freed dto CNO\n", getpid());
@@ -2080,7 +1981,7 @@ DAT_RETURN destroy_events(void)
  * but don't assume the values are zero-based or contiguous.
  */
 char errmsg[512] = { 0 };
-const char *DT_RetToString(DAT_RETURN ret_value)
+const char *DT_RetToStr(DAT_RETURN ret_value)
 {
        const char *major_msg, *minor_msg;

@@ -2096,7 +1997,7 @@ const char *DT_RetToString(DAT_RETURN ret_value)
 /*
  * Map DAT_EVENT_CODE values to readable strings
  */
-const char *DT_EventToSTr(DAT_EVENT_NUMBER event_code)
+const char *DT_EventToStr(DAT_EVENT_NUMBER event_code)
 {
        unsigned int i;
        static struct {


From arlin.r.davis at intel.com  Tue Aug  4 22:40:18 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 4 Aug 2009 22:40:18 -0700
Subject: [ofa-general] [PATCH] uDAPL v2: scm: transition QP to error state
 when disconnecting instead of reset/init.
Message-ID: <E3280858FA94444CA49D2BA02341C983567FA15C@orsmsx506.amr.corp.intel.com>


SCM: Fix disconnect. QPs need to move to the ERROR state in
order to flush work requests and notify the consumer. Moving to
RESET removed all requests but did not notify the consumer.
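
To illustrate the difference described above, here is a toy model,
not the uDAPL or libibverbs code. All names (toy_qp, toy_post,
toy_modify_state) are hypothetical. It only assumes the usual verbs
behaviour: a move to ERROR completes every outstanding work request
with a flush status, while a move to RESET drops them silently.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the uDAPL or libibverbs API. */
enum toy_qp_state { TOY_QPS_RESET, TOY_QPS_RTS, TOY_QPS_ERR };

struct toy_qp {
	enum toy_qp_state state;
	size_t pending;      /* work requests posted but not yet completed */
	size_t flushed_cqes; /* flush-error completions the consumer can see */
};

/* Post one work request; only legal while the QP is operational. */
static void toy_post(struct toy_qp *qp)
{
	assert(qp->state == TOY_QPS_RTS);
	qp->pending++;
}

/* Change the QP state and apply the side effects on outstanding requests. */
static void toy_modify_state(struct toy_qp *qp, enum toy_qp_state next)
{
	/* ERROR: each outstanding request is completed with a flush status. */
	if (next == TOY_QPS_ERR)
		qp->flushed_cqes += qp->pending;

	/* ERROR and RESET both empty the queues; RESET adds no completions. */
	if (next == TOY_QPS_ERR || next == TOY_QPS_RESET)
		qp->pending = 0;

	qp->state = next;
}
```

In this model a consumer waiting on completions sees one flushed
entry per posted request after the ERROR transition, and nothing
at all after RESET, which is the behaviour the patch below changes.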

diff --git a/dapl/openib_scm/cm.c b/dapl/openib_scm/cm.c
index 164cc4e..416ee71 100644
--- a/dapl/openib_scm/cm.c
+++ b/dapl/openib_scm/cm.c
@@ -773,7 +773,7 @@ ud_bail:
 
 bail:
 	/* close socket, and post error event */
-	dapls_ib_reinit_ep(ep_ptr);	/* reset QP state */
+	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
 	closesocket(cm_ptr->socket);
 	cm_ptr->socket = DAPL_INVALID_SOCKET;
 	dapl_evd_connection_callback(NULL, event, cm_ptr->p_data, ep_ptr);
@@ -1107,7 +1107,7 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 	return DAT_SUCCESS;
       bail:
 	dapls_ib_cm_free(cm_ptr, cm_ptr->ep);
-	dapls_ib_reinit_ep(ep_ptr);	/* reset QP state */
+	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
 	return DAT_INTERNAL_ERROR;
 }
 
@@ -1169,7 +1169,7 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr)
 	return;
       
 bail:
-	dapls_ib_reinit_ep(cm_ptr->ep);	/* reset QP state */
+	dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0);
 	dapls_ib_cm_free(cm_ptr, cm_ptr->ep);
 	dapls_cr_callback(cm_ptr, IB_CME_DESTINATION_REJECT, NULL, cm_ptr->sp);
 }
@@ -1236,9 +1236,9 @@ dapls_ib_disconnect(IN DAPL_EP * ep_ptr, IN DAT_CLOSE_FLAGS close_flags)
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
 		     "dapls_ib_disconnect(ep_handle %p ....)\n", ep_ptr);
 
-	/* reinit to modify QP state */
-	dapls_ib_reinit_ep(ep_ptr);
-
+	/* Transition to error state to flush queue */
+        dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
+	
 	if (ep_ptr->cm_handle == NULL ||
 	    ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED)
 		return DAT_SUCCESS;


From yosefe at voltaire.com  Tue Aug  4 23:06:18 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Wed, 05 Aug 2009 09:06:18 +0300
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid	changes
In-Reply-To: <20090804045647.GK24282@obsidianresearch.com>
References: <f0e08f230907290715q49fe595j7e1f2be78f050878@mail.gmail.com>	<4A705B3A.7060404@Voltaire.COM>	<f0e08f230907290935k28a90ffkc4f39436f1e1460b@mail.gmail.com>	<4A731818.3060500@voltaire.com>	<f0e08f230907311050wa750cf2n497039acafdab3b4@mail.gmail.com>	<4A733D24.3040201@voltaire.com>	<f0e08f230907311205s239eb1afk36c6a8f3cefd90e7@mail.gmail.com>	<4A742E94.2070002@gmail.com>	<f0e08f230908020426i2331cf0fg3bc3a21f1e86d1b5@mail.gmail.com>	<4A771852.1010606@voltaire.com>
	<20090804045647.GK24282@obsidianresearch.com>
Message-ID: <4A79215A.3030807@voltaire.com>

On 04/08/09 07:56, Jason Gunthorpe wrote:
> On Mon, Aug 03, 2009 at 08:03:14PM +0300, Yossi Etigin wrote:
> 
>> The ARP stuff works this way: Remote LID changes. In some point, either the remote
>> node will send an ARP reply (gratuitous), or (more likely) the local network stack
>> will start sending solicited ARPs, unicast, using the invalid path. They will fail,
>> so the stack will send broadcast ARP.
> 
> Erm.. Maybe a little tighter integration with the ARP/ND layer is in
> order. If it knows unicast isn't working thats a pretty damn good clue
> to discard the PR.
> 
> Jason

I agree with that. If the network stack told ipoib when a neighbour became
unreachable, life would be a lot easier. Unfortunately, the closest thing
right now is neigh_cleanup, and this is only called when the neighbour entry
is removed from the table (which may be quite some time after it becomes
unreachable).

--Yossi


From sashak at voltaire.com  Wed Aug  5 00:17:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 10:17:04 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Directly call
 calloc/free rather than create/delete_cdg
In-Reply-To: <20090804213905.GA23497@comcast.net>
References: <20090804213905.GA23497@comcast.net>
Message-ID: <20090805071704.GK7993@me>

On 17:39 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> Reduce call stack by one call level
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Aug  5 00:17:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 10:17:33 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_lin_fwd_rcv.c:
	Commentary change
In-Reply-To: <20090804214413.GA24878@comcast.net>
References: <20090804214413.GA24878@comcast.net>
Message-ID: <20090805071733.GL7993@me>

On 17:44 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Aug  5 00:18:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 10:18:57 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsendtrap.c: Add support
 for link_speed_enabled_change trap
In-Reply-To: <20090804125009.GB12236@comcast.net>
References: <20090804125009.GB12236@comcast.net>
Message-ID: <20090805071857.GM7993@me>

On 08:50 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From kliteyn at dev.mellanox.co.il  Wed Aug  5 00:25:00 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 10:25:00 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets 	across switches
In-Reply-To: <f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
Message-ID: <4A7933CC.6080503@dev.mellanox.co.il>

Hal Rosenstock wrote:
> 
>      > Routing calculation phase of the ucast manager took ~1200 usec,
>      > the rest was sending the blocks and waiting for no more pending
>      > transactions.
>      >
>      > No noticeable difference between various max_smps_per_node values
>      > was observed.
> 
>     What is the reason?
> 
>  
> I think the reason was max_wire_smps may have kicked in but Yevgeny is 
> best to elaborate on this.
>  

Correct, this was because of max_wire_smps

-- Yevgeny 


From sashak at voltaire.com  Wed Aug  5 00:33:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 10:33:30 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <4A7933CC.6080503@dev.mellanox.co.il>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<4A7933CC.6080503@dev.mellanox.co.il>
Message-ID: <20090805073330.GN7993@me>

On 10:25 Wed 05 Aug     , Yevgeny Kliteynik wrote:
> >      >
> >      > No noticeable difference between various max_smps_per_node values
> >      > was observed.
> > 
> >     What is the reason?
> > 
> >  
> > I think the reason was max_wire_smps may have kicked in but Yevgeny is 
> > best to elaborate on this.
> >  
> 
> Correct, this was because of max_wire_smps

What was 'max_wire_smps' value used in the tests?

Sasha


From kliteyn at dev.mellanox.co.il  Wed Aug  5 00:37:00 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 10:37:00 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <20090805073330.GN7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<4A7933CC.6080503@dev.mellanox.co.il> <20090805073330.GN7993@me>
Message-ID: <4A79369C.4090108@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 10:25 Wed 05 Aug     , Yevgeny Kliteynik wrote:
>>>      >
>>>      > No noticeable difference between various max_smps_per_node values
>>>      > was observed.
>>>
>>>     What is the reason?
>>>
>>>  
>>> I think the reason was max_wire_smps may have kicked in but Yevgeny is 
>>> best to elaborate on this.
>>>  
>> Correct, this was because of max_wire_smps
> 
> What was 'max_wire_smps' value used in the tests?

The numbers that I posted refer to default max_wire_smps, which is 4.
I didn't try to bump it up, though I guess that it might improve LFT
config time.

-- Yevgeny
 
> Sasha
> 


From eli at dev.mellanox.co.il  Wed Aug  5 01:27:51 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:27:51 +0300
Subject: [ofa-general] [PATCHv4 0/10]  RDMAoE support
Message-ID: <20090805082751.GA5599@mtls03>

RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using
Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet
fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned
Ethertype, a GRH, unmodified IB transport headers and payload.  IB subnet
management and SA services are not required for RDMAoE operation; Ethernet
management practices are used instead. RDMAoE encodes IP addresses into its
GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs,
standard IP to MAC mappings apply.

To support RDMAoE, a new transport protocol was added to the IB core. An RDMA
device can have ports with different transports, which are identified by a port
transport attribute.  The RDMA Verbs API is syntactically unmodified. When
referring to RDMAoE ports, Address handles are required to contain GIDs while
LID fields are ignored. The Ethernet L2 information is subsequently obtained by
the vendor-specific driver (both in kernel- and user-space) while modifying QPs
to RTR and creating address handles.  As there is no SA in RDMAoE, the CMA code
is modified to fill the necessary path record attributes locally before sending
CM packets. Similarly, the CMA provides to the user the required address handle
attributes when processing SIDR requests and joining multicast groups.

In this patch set, an RDMAoE port is currently assigned a single GID, encoding
the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code
temporarily uses IPv6 link-local addresses as GIDs instead of the IP address
provided by the user, thereby supporting any IP address. In addition, multicast
packets currently use the broadcast MAC.
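
As a rough sketch of the GID encoding described above, assuming the
netdev's link-local address is the standard modified EUI-64 form
(fe80::/64 prefix, universal/local bit flipped, ff:fe inserted in the
middle of the MAC). The helper names mac_to_ll_gid and ll_gid_to_mac
are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not names from the patch set. */

/* Build a 16-byte link-local GID from a 6-byte Ethernet MAC address. */
static void mac_to_ll_gid(const uint8_t mac[6], uint8_t gid[16])
{
	assert(mac && gid);
	memset(gid, 0, 16);
	gid[0] = 0xfe;           /* fe80::/64 link-local prefix */
	gid[1] = 0x80;
	gid[8] = mac[0] ^ 0x02;  /* flip the universal/local bit */
	gid[9] = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;          /* EUI-64 filler bytes */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}

/* Recover the MAC from a link-local GID; returns -1 for any other GID. */
static int ll_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	int i;

	assert(mac && gid);
	if (gid[0] != 0xfe || gid[1] != 0x80)
		return -1;
	for (i = 2; i < 8; i++)
		if (gid[i] != 0)
			return -1;
	if (gid[11] != 0xff || gid[12] != 0xfe)
		return -1;

	memcpy(mac, &gid[8], 3);
	memcpy(mac + 3, &gid[13], 3);
	mac[0] ^= 0x02;          /* undo the universal/local bit flip */
	return 0;
}
```

Under this assumption the reverse mapping needs no lookup at all,
which is why restricting GID to MAC resolution to link-local
addresses keeps the code short.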

To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib
drivers must be loaded, and the netdevice for the corresponding RDMAoE port
must be running. Individual ports of a multi port HCA can be independently
configured as Ethernet (with support for RDMAoE) or IB, as is already the case.
We have successfully tested MPI, SDP, RDS, and native Verbs applications over
RDMAoE.

Following is a series of 10 patches based on version 2.6.30 of the Linux
kernel. This new series reflects changes based on feedback from the community
on the previous set of patches, and is tagged v4.

Changes from v3:
1. RDMA transport is determined on a per-port basis instead of the link-type
notion.
2. SA services are not provided for RDMAoE clients. CMA code is modified to
support RDMAoE transport types.
3. For brevity, GID to MAC resolution is currently restricted to link local
addresses.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

 b/drivers/infiniband/core/agent.c           |   12 -
 b/drivers/infiniband/core/cm.c              |   26 +-
 b/drivers/infiniband/core/cma.c             |   54 ++--
 b/drivers/infiniband/core/mad.c             |   42 ++-
 b/drivers/infiniband/core/multicast.c       |    5 
 b/drivers/infiniband/core/sa_query.c        |   40 ++-
 b/drivers/infiniband/core/ucm.c             |    8 
 b/drivers/infiniband/core/ucma.c            |    2 
 b/drivers/infiniband/core/ud_header.c       |  111 ++++++++++
 b/drivers/infiniband/core/user_mad.c        |    7 
 b/drivers/infiniband/core/uverbs.h          |    1 
 b/drivers/infiniband/core/uverbs_cmd.c      |   32 ++
 b/drivers/infiniband/core/uverbs_main.c     |    1 
 b/drivers/infiniband/core/verbs.c           |   12 -
 b/drivers/infiniband/hw/mlx4/ah.c           |  187 +++++++++++++---
 b/drivers/infiniband/hw/mlx4/main.c         |  309 +++++++++++++++++++++++++---
 b/drivers/infiniband/hw/mlx4/mlx4_ib.h      |   19 +
 b/drivers/infiniband/hw/mlx4/qp.c           |  172 ++++++++++-----
 b/drivers/infiniband/ulp/ipoib/ipoib_main.c |   12 -
 b/drivers/net/mlx4/en_main.c                |   15 +
 b/drivers/net/mlx4/en_port.c                |    4 
 b/drivers/net/mlx4/en_port.h                |    3 
 b/drivers/net/mlx4/fw.c                     |    3 
 b/drivers/net/mlx4/intf.c                   |   20 +
 b/drivers/net/mlx4/main.c                   |    6 
 b/drivers/net/mlx4/mlx4.h                   |    1 
 b/include/linux/mlx4/cmd.h                  |    1 
 b/include/linux/mlx4/device.h               |   31 ++
 b/include/linux/mlx4/driver.h               |   16 +
 b/include/linux/mlx4/qp.h                   |    8 
 b/include/rdma/ib_addr.h                    |   87 +++++++
 b/include/rdma/ib_pack.h                    |   26 ++
 b/include/rdma/ib_user_verbs.h              |   21 +
 b/include/rdma/ib_verbs.h                   |    9 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |    3 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c  |    2 
 drivers/infiniband/core/cm.c                |    2 
 drivers/infiniband/core/cma.c               |  150 +++++++++++++
 drivers/infiniband/core/mad.c               |   55 +++-
 drivers/infiniband/core/ucm.c               |   12 -
 drivers/infiniband/core/ucma.c              |   25 +-
 drivers/infiniband/core/user_mad.c          |   27 +-
 drivers/infiniband/core/verbs.c             |   10 
 include/rdma/ib_verbs.h                     |   15 +
 44 files changed, 1333 insertions(+), 271 deletions(-)


From eli at mellanox.co.il  Wed Aug  5 01:28:08 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:28:08 +0300
Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality
	from node type to port type
Message-ID: <20090805082808.GB5599@mtls03>

In preparation for devices that may support a different transport protocol
on each port, specifically RDMAoE, this patch defines a transport type for
each of a device's ports. As a result, rdma_node_get_transport() is no longer
exported and is used only internally by the implementation of the new API,
rdma_port_get_transport(), which returns the transport protocol of the queried
port. All callers of rdma_node_get_transport() are changed to use
rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum
rdma_transport_type.
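
The dispatch described above can be pictured with a small user-space
sketch. This is not the kernel code: the toy_* types and functions
are hypothetical stand-ins, and the only behaviour assumed is that a
device may supply an optional per-port hook, with the node type used
as the fallback when the hook is absent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; illustration only. */
enum toy_transport { TOY_TRANSPORT_IB, TOY_TRANSPORT_IWARP, TOY_TRANSPORT_RDMAOE };
enum toy_node_type { TOY_NODE_IB_CA, TOY_NODE_RNIC };

struct toy_device {
	enum toy_node_type node_type;
	unsigned int phys_port_cnt;
	/* Optional per-port hook; NULL means "derive from the node type". */
	enum toy_transport (*get_port_transport)(const struct toy_device *dev,
						 unsigned int port_num);
};

/* Legacy rule: one transport for the whole device. */
static enum toy_transport toy_node_get_transport(enum toy_node_type type)
{
	return type == TOY_NODE_RNIC ? TOY_TRANSPORT_IWARP : TOY_TRANSPORT_IB;
}

/* New rule: ask the driver per port, fall back to the node type. */
static enum toy_transport toy_port_get_transport(const struct toy_device *dev,
						 unsigned int port_num)
{
	assert(dev && port_num >= 1 && port_num <= dev->phys_port_cnt);
	return dev->get_port_transport ?
		dev->get_port_transport(dev, port_num) :
		toy_node_get_transport(dev->node_type);
}

/* Example driver hook: port 1 runs IB, port 2 runs RDMAoE. */
static enum toy_transport toy_mixed_hook(const struct toy_device *dev,
					 unsigned int port_num)
{
	(void)dev;
	return port_num == 2 ? TOY_TRANSPORT_RDMAOE : TOY_TRANSPORT_IB;
}

/* Count the ports an IB-only client would attach to, skipping the rest. */
static unsigned int toy_count_ib_ports(const struct toy_device *dev)
{
	unsigned int n = 0, p;

	for (p = 1; p <= dev->phys_port_cnt; p++)
		if (toy_port_get_transport(dev, p) == TOY_TRANSPORT_IB)
			n++;
	return n;
}
```

A device without the hook behaves exactly as before, so existing
drivers need no change; only clients that used to bail out per
device now have to skip individual ports instead.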

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/cm.c              |   26 ++++++++-----
 drivers/infiniband/core/cma.c             |   54 +++++++++++++++--------------
 drivers/infiniband/core/mad.c             |   42 +++++++++++++---------
 drivers/infiniband/core/multicast.c       |    5 +--
 drivers/infiniband/core/sa_query.c        |   40 ++++++++++++---------
 drivers/infiniband/core/ucm.c             |    8 +++-
 drivers/infiniband/core/ucma.c            |    2 +-
 drivers/infiniband/core/user_mad.c        |    7 ++--
 drivers/infiniband/core/verbs.c           |   12 +++++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   12 +++---
 include/rdma/ib_verbs.h                   |    9 +++--
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |    3 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c  |    2 +-
 13 files changed, 128 insertions(+), 94 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5130fc5..f930f1d 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device)
 	unsigned long flags;
 	int ret;
 	u8 i;
-
-	if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
-		return;
+	enum rdma_transport_type tt;
 
 	cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) *
 			 ib_device->phys_port_cnt, GFP_KERNEL);
@@ -3700,6 +3698,10 @@ static void cm_add_one(struct ib_device *ib_device)
 
 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
+		tt = rdma_port_get_transport(ib_device, i);
+		if (tt != RDMA_TRANSPORT_IB)
+			continue;
+
 		port = kzalloc(sizeof *port, GFP_KERNEL);
 		if (!port)
 			goto error1;
@@ -3742,9 +3744,11 @@ error1:
 	port_modify.clr_port_cap_mask = IB_PORT_CM_SUP;
 	while (--i) {
 		port = cm_dev->port[i-1];
-		ib_modify_port(ib_device, port->port_num, 0, &port_modify);
-		ib_unregister_mad_agent(port->mad_agent);
-		cm_remove_port_fs(port);
+		if (port) {
+			ib_modify_port(ib_device, port->port_num, 0, &port_modify);
+			ib_unregister_mad_agent(port->mad_agent);
+			cm_remove_port_fs(port);
+		}
 	}
 	device_unregister(cm_dev->device);
 	kfree(cm_dev);
@@ -3770,10 +3774,12 @@ static void cm_remove_one(struct ib_device *ib_device)
 
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
 		port = cm_dev->port[i-1];
-		ib_modify_port(ib_device, port->port_num, 0, &port_modify);
-		ib_unregister_mad_agent(port->mad_agent);
-		flush_workqueue(cm.wq);
-		cm_remove_port_fs(port);
+		if (port) {
+			ib_modify_port(ib_device, port->port_num, 0, &port_modify);
+			ib_unregister_mad_agent(port->mad_agent);
+			flush_workqueue(cm.wq);
+			cm_remove_port_fs(port);
+		}
 	}
 	device_unregister(cm_dev->device);
 	kfree(cm_dev);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index aa62101..866ff7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -337,24 +337,26 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 	struct cma_device *cma_dev;
 	union ib_gid gid;
 	int ret = -ENODEV;
-
-	switch (rdma_node_get_transport(dev_addr->dev_type)) {
-	case RDMA_TRANSPORT_IB:
-		ib_addr_get_sgid(dev_addr, &gid);
-		break;
-	case RDMA_TRANSPORT_IWARP:
-		iw_addr_get_sgid(dev_addr, &gid);
-		break;
-	default:
-		return -ENODEV;
-	}
+	int port;
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
-					 &id_priv->id.port_num, NULL);
-		if (!ret) {
-			cma_attach_to_dev(id_priv, cma_dev);
-			break;
+		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) {
+			switch (rdma_port_get_transport(cma_dev->device, port)) {
+			case RDMA_TRANSPORT_IB:
+				ib_addr_get_sgid(dev_addr, &gid);
+				break;
+			case RDMA_TRANSPORT_IWARP:
+				iw_addr_get_sgid(dev_addr, &gid);
+				break;
+			default:
+				return -ENODEV;
+			}
+			ret = ib_find_cached_gid(cma_dev->device, &gid,
+						 &id_priv->id.port_num, NULL);
+			if (!ret) {
+				cma_attach_to_dev(id_priv, cma_dev);
+				return ret;
+			}
 		}
 	}
 	return ret;
@@ -605,7 +607,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 	int ret = 0;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps))
 			ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
@@ -755,7 +757,7 @@ static inline int cma_user_data_offset(enum rdma_port_space ps)
 
 static void cma_cancel_route(struct rdma_id_private *id_priv)
 {
-	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (id_priv->query)
 			ib_sa_cancel_query(id_priv->query_id, id_priv->query);
@@ -851,7 +853,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 	mutex_lock(&lock);
 	if (id_priv->cma_dev) {
 		mutex_unlock(&lock);
-		switch (rdma_node_get_transport(id->device->node_type)) {
+		switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
@@ -1508,7 +1510,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
 
 	id_priv->backlog = backlog;
 	if (id->device) {
-		switch (rdma_node_get_transport(id->device->node_type)) {
+		switch (rdma_port_get_transport(id->device, id->port_num)) {
 		case RDMA_TRANSPORT_IB:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
@@ -1735,7 +1737,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 		return -EINVAL;
 
 	atomic_inc(&id_priv->refcount);
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_resolve_ib_route(id_priv, timeout_ms);
 		break;
@@ -2415,7 +2417,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 		id_priv->srq = conn_param->srq;
 	}
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_resolve_ib_udp(id_priv, conn_param);
@@ -2528,7 +2530,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 		id_priv->srq = conn_param->srq;
 	}
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
@@ -2589,7 +2591,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
 	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
@@ -2620,7 +2622,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_modify_qp_err(id_priv);
 		if (ret)
@@ -2776,7 +2778,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	spin_unlock(&id_priv->lock);
 
 	kref_get(&mc->mcref);
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
 		break;
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..7b737c4 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2905,9 +2905,7 @@ static int ib_mad_port_close(struct ib_device *device, int port_num)
 static void ib_mad_init_device(struct ib_device *device)
 {
 	int start, end, i;
-
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
+	enum rdma_transport_type tt;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		start = 0;
@@ -2918,6 +2916,10 @@ static void ib_mad_init_device(struct ib_device *device)
 	}
 
 	for (i = start; i <= end; i++) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt != RDMA_TRANSPORT_IB)
+			continue;
+
 		if (ib_mad_port_open(device, i)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d\n",
 			       device->name, i);
@@ -2941,13 +2943,15 @@ error:
 	i--;
 
 	while (i >= start) {
-		if (ib_agent_port_close(device, i))
-			printk(KERN_ERR PFX "Couldn't close %s port %d "
-			       "for agents\n",
-			       device->name, i);
-		if (ib_mad_port_close(device, i))
-			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
-			       device->name, i);
+		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) {
+			if (ib_agent_port_close(device, i))
+				printk(KERN_ERR PFX "Couldn't close %s port %d "
+				       "for agents\n",
+				       device->name, i);
+			if (ib_mad_port_close(device, i))
+				printk(KERN_ERR PFX "Couldn't close %s port %d\n",
+				       device->name, i);
+		}
 		i--;
 	}
 }
@@ -2955,6 +2959,7 @@ error:
 static void ib_mad_remove_device(struct ib_device *device)
 {
 	int i, num_ports, cur_port;
+	enum rdma_transport_type tt;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		num_ports = 1;
@@ -2964,13 +2969,16 @@ static void ib_mad_remove_device(struct ib_device *device)
 		cur_port = 1;
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		if (ib_agent_port_close(device, cur_port))
-			printk(KERN_ERR PFX "Couldn't close %s port %d "
-			       "for agents\n",
-			       device->name, cur_port);
-		if (ib_mad_port_close(device, cur_port))
-			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
-			       device->name, cur_port);
+		tt = rdma_port_get_transport(device, cur_port);
+		if (tt == RDMA_TRANSPORT_IB) {
+			if (ib_agent_port_close(device, cur_port))
+				printk(KERN_ERR PFX "Couldn't close %s port %d "
+				       "for agents\n",
+				       device->name, cur_port);
+			if (ib_mad_port_close(device, cur_port))
+				printk(KERN_ERR PFX "Couldn't close %s port %d\n",
+				       device->name, cur_port);
+		}
 	}
 }
 
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 107f170..3a4c6f8 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -788,10 +788,7 @@ static void mcast_add_one(struct ib_device *device)
 	struct mcast_port *port;
 	int i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
-	dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
+	dev = kzalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
 		      GFP_KERNEL);
 	if (!dev)
 		return;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 1865049..834ea14 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -416,14 +416,16 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
 		struct ib_sa_port *port =
 			&sa_dev->port[event->element.port_num - sa_dev->start_port];
 
-		spin_lock_irqsave(&port->ah_lock, flags);
-		if (port->sm_ah)
-			kref_put(&port->sm_ah->ref, free_sm_ah);
-		port->sm_ah = NULL;
-		spin_unlock_irqrestore(&port->ah_lock, flags);
-
-		schedule_work(&sa_dev->port[event->element.port_num -
-					    sa_dev->start_port].update_task);
+		if (rdma_port_get_transport(handler->device, port->port_num) == RDMA_TRANSPORT_IB) {
+			spin_lock_irqsave(&port->ah_lock, flags);
+			if (port->sm_ah)
+				kref_put(&port->sm_ah->ref, free_sm_ah);
+			port->sm_ah = NULL;
+			spin_unlock_irqrestore(&port->ah_lock, flags);
+
+			schedule_work(&sa_dev->port[event->element.port_num -
+						    sa_dev->start_port].update_task);
+		}
 	}
 }
 
@@ -991,9 +993,6 @@ static void ib_sa_add_one(struct ib_device *device)
 	struct ib_sa_device *sa_dev;
 	int s, e, i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	if (device->node_type == RDMA_NODE_IB_SWITCH)
 		s = e = 0;
 	else {
@@ -1001,7 +1000,7 @@ static void ib_sa_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	sa_dev = kmalloc(sizeof *sa_dev +
+	sa_dev = kzalloc(sizeof *sa_dev +
 			 (e - s + 1) * sizeof (struct ib_sa_port),
 			 GFP_KERNEL);
 	if (!sa_dev)
@@ -1011,6 +1010,9 @@ static void ib_sa_add_one(struct ib_device *device)
 	sa_dev->end_port   = e;
 
 	for (i = 0; i <= e - s; ++i) {
+		if (rdma_port_get_transport(device, i + 1) != RDMA_TRANSPORT_IB)
+			continue;
+
 		sa_dev->port[i].sm_ah    = NULL;
 		sa_dev->port[i].port_num = i + s;
 		spin_lock_init(&sa_dev->port[i].ah_lock);
@@ -1039,13 +1041,15 @@ static void ib_sa_add_one(struct ib_device *device)
 		goto err;
 
 	for (i = 0; i <= e - s; ++i)
-		update_sm_ah(&sa_dev->port[i].update_task);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB)
+			update_sm_ah(&sa_dev->port[i].update_task);
 
 	return;
 
 err:
 	while (--i >= 0)
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB)
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
 
 	kfree(sa_dev);
 
@@ -1065,9 +1069,11 @@ static void ib_sa_remove_one(struct ib_device *device)
 	flush_scheduled_work();
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
-		if (sa_dev->port[i].sm_ah)
-			kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) {
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
+			if (sa_dev->port[i].sm_ah)
+				kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
+		}
 	}
 
 	kfree(sa_dev);
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 51bd966..4f5096d 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1239,11 +1239,15 @@ static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL);
 static void ib_ucm_add_one(struct ib_device *device)
 {
 	struct ib_ucm_device *ucm_dev;
+	int i;
 
-	if (!device->alloc_ucontext ||
-	    rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH)
 		return;
 
+	for (i = 1; i <= device->phys_port_cnt; ++i)
+		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
+			return;
+
 	ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
 	if (!ucm_dev)
 		return;
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 4346a24..24d9510 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -614,7 +614,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 
 	resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
 	resp.port_num = ctx->cm_id->port_num;
-	switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
+	switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
 		break;
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 8c46f22..3e58fc0 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1113,9 +1113,6 @@ static void ib_umad_add_one(struct ib_device *device)
 	struct ib_umad_device *umad_dev;
 	int s, e, i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	if (device->node_type == RDMA_NODE_IB_SWITCH)
 		s = e = 0;
 	else {
@@ -1123,6 +1120,10 @@ static void ib_umad_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
+	for (i = s; i <= e; ++i)
+		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
+			return;
+
 	umad_dev = kzalloc(sizeof *umad_dev +
 			   (e - s + 1) * sizeof (struct ib_umad_port),
 			   GFP_KERNEL);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a7da9be..3b2f00b 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -77,7 +77,7 @@ enum ib_rate mult_to_ib_rate(int mult)
 }
 EXPORT_SYMBOL(mult_to_ib_rate);
 
-enum rdma_transport_type
+static enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type)
 {
 	switch (node_type) {
@@ -92,7 +92,15 @@ rdma_node_get_transport(enum rdma_node_type node_type)
 		return 0;
 	}
 }
-EXPORT_SYMBOL(rdma_node_get_transport);
+
+enum rdma_transport_type rdma_port_get_transport(struct ib_device *device,
+						 u8 port_num)
+{
+	return device->get_port_transport ?
+		device->get_port_transport(device, port_num) :
+		rdma_node_get_transport(device->node_type);
+}
+EXPORT_SYMBOL(rdma_port_get_transport);
 
 /* Protection domains */
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index ab2c192..39df0f7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1337,9 +1337,6 @@ static void ipoib_add_one(struct ib_device *device)
 	struct ipoib_dev_priv *priv;
 	int s, e, p;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL);
 	if (!dev_list)
 		return;
@@ -1355,6 +1352,9 @@ static void ipoib_add_one(struct ib_device *device)
 	}
 
 	for (p = s; p <= e; ++p) {
+		if (rdma_port_get_transport(device, p) != RDMA_TRANSPORT_IB)
+			continue;
+
 		dev = ipoib_add_port("ib%d", device, p);
 		if (!IS_ERR(dev)) {
 			priv = netdev_priv(dev);
@@ -1370,12 +1370,12 @@ static void ipoib_remove_one(struct ib_device *device)
 	struct ipoib_dev_priv *priv, *tmp;
 	struct list_head *dev_list;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	dev_list = ib_get_client_data(device, &ipoib_client);
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
+		if (rdma_port_get_transport(device, priv->port) != RDMA_TRANSPORT_IB)
+			continue;
+
 		ib_unregister_event_handler(&priv->event_handler);
 
 		rtnl_lock();
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c179318..b557129 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -72,9 +72,6 @@ enum rdma_transport_type {
 	RDMA_TRANSPORT_IWARP
 };
 
-enum rdma_transport_type
-rdma_node_get_transport(enum rdma_node_type node_type) __attribute_const__;
-
 enum ib_device_cap_flags {
 	IB_DEVICE_RESIZE_MAX_WR		= 1,
 	IB_DEVICE_BAD_PKEY_CNTR		= (1<<1),
@@ -298,6 +295,7 @@ struct ib_port_attr {
 	u8			active_width;
 	u8			active_speed;
 	u8                      phys_state;
+	enum rdma_transport_type	transport;
 };
 
 enum ib_device_modify_flags {
@@ -1003,6 +1001,8 @@ struct ib_device {
 	int		           (*query_port)(struct ib_device *device,
 						 u8 port_num,
 						 struct ib_port_attr *port_attr);
+	enum rdma_transport_type   (*get_port_transport)(struct ib_device *device,
+							 u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
@@ -1213,6 +1213,9 @@ int ib_query_device(struct ib_device *device,
 int ib_query_port(struct ib_device *device,
 		  u8 port_num, struct ib_port_attr *port_attr);
 
+enum rdma_transport_type rdma_port_get_transport(struct ib_device *device,
+						 u8 port_num);
+
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 42a6f9f..769dc18 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -338,8 +338,7 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
 static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
 {
 	if ((RDMA_TRANSPORT_IWARP ==
-	     rdma_node_get_transport(xprt->sc_cm_id->
-				     device->node_type))
+	     rdma_port_get_transport(xprt->sc_cm_id->device, xprt->sc_cm_id->port_num))
 	    && sge_count > 1)
 		return 1;
 	else
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5151f9f..a5a4162 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -976,7 +976,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	/*
 	 * Determine if a DMA MR is required and if so, what privs are required
 	 */
-	switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
+	switch (rdma_port_get_transport(newxprt->sc_cm_id->device, newxprt->sc_cm_id->port_num)) {
 	case RDMA_TRANSPORT_IWARP:
 		newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
 		if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
-- 
1.6.3.3
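
The per-port transport lookup added to verbs.c in the patch above can be modelled in user space roughly as follows. This is only a sketch: the struct, the enum values and the `mixed_hook` callback are simplified stand-ins invented for illustration, not the kernel definitions. It shows the fallback rule: use the driver's per-port hook when one is set, otherwise derive the transport from the node type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel enums (assumed values). */
enum node_type { NODE_IB_CA = 1, NODE_IB_SWITCH, NODE_IB_ROUTER, NODE_RNIC };
enum transport_type { TRANSPORT_IB, TRANSPORT_IWARP };

struct device {
	enum node_type node_type;
	/* Optional per-port hook; NULL means "derive from the node type". */
	enum transport_type (*get_port_transport)(struct device *dev,
						  uint8_t port_num);
};

/* Device-wide answer: RNICs speak iWARP, everything else is IB. */
static enum transport_type node_get_transport(enum node_type t)
{
	return t == NODE_RNIC ? TRANSPORT_IWARP : TRANSPORT_IB;
}

/* Per-port answer: prefer the driver hook, fall back to the node type. */
static enum transport_type port_get_transport(struct device *dev,
					      uint8_t port_num)
{
	return dev->get_port_transport ?
		dev->get_port_transport(dev, port_num) :
		node_get_transport(dev->node_type);
}

/* Hypothetical driver hook: port 1 is IB, every other port is iWARP. */
static enum transport_type mixed_hook(struct device *dev, uint8_t port_num)
{
	(void)dev;
	return port_num == 1 ? TRANSPORT_IB : TRANSPORT_IWARP;
}
```

A device without the hook reports the same transport for every port, which matches the behaviour callers saw before the per-port query existed.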


From eli at mellanox.co.il  Wed Aug  5 01:28:23 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:28:23 +0300
Subject: [ofa-general] [PATCHv4 02/10] ib_core: Add RDMAoE transport protocol
Message-ID: <20090805082823.GC5599@mtls03>

Add a new transport protocol, RDMAoE, used for transporting InfiniBand traffic
over Ethernet fabrics.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 include/rdma/ib_verbs.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index b557129..4eec70f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -69,7 +69,8 @@ enum rdma_node_type {
 
 enum rdma_transport_type {
 	RDMA_TRANSPORT_IB,
-	RDMA_TRANSPORT_IWARP
+	RDMA_TRANSPORT_IWARP,
+	RDMA_TRANSPORT_RDMAOE
 };
 
 enum ib_device_cap_flags {
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:28:54 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:28:54 +0300
Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1
Message-ID: <20090805082854.GD5599@mtls03>

Since RDMAoE is using Ethernet as its link layer, there is no need for QP0. QP1
is still needed since it handles communications between CM agents. This patch
will create only QP1 for RDMAoE ports.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/agent.c |   12 +++++---
 drivers/infiniband/core/mad.c   |   55 +++++++++++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index ae7c288..c3f2048 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -48,6 +48,8 @@
 struct ib_agent_port_private {
 	struct list_head port_list;
 	struct ib_mad_agent *agent[2];
+	struct ib_device    *device;
+	u8		     port_num;
 };
 
 static DEFINE_SPINLOCK(ib_agent_port_list_lock);
@@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int port_num)
 {
 	struct ib_agent_port_private *entry;
 
-	list_for_each_entry(entry, &ib_agent_port_list, port_list) {
-		if (entry->agent[0]->device == device &&
-		    entry->agent[0]->port_num == port_num)
+	list_for_each_entry(entry, &ib_agent_port_list, port_list)
+		if (entry->device == device && entry->port_num == port_num)
 			return entry;
-	}
+
 	return NULL;
 }
 
@@ -175,6 +176,9 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 		goto error3;
 	}
 
+	port_priv->device = device;
+	port_priv->port_num = port_num;
+
 	spin_lock_irqsave(&ib_agent_port_list_lock, flags);
 	list_add_tail(&port_priv->port_list, &ib_agent_port_list);
 	spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 7b737c4..de83c71 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	unsigned long flags;
 	u8 mgmt_class, vclass;
 
+	/* Validate device and port */
+	port_priv = ib_get_mad_port(device, port_num);
+	if (!port_priv) {
+		ret = ERR_PTR(-ENODEV);
+		goto error1;
+	}
+
+	if (!port_priv->qp_info[qp_type].qp)
+		return NULL;
+
 	/* Validate parameters */
 	qpn = get_spl_qp_index(qp_type);
 	if (qpn == -1)
@@ -260,13 +270,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 			goto error1;
 	}
 
-	/* Validate device and port */
-	port_priv = ib_get_mad_port(device, port_num);
-	if (!port_priv) {
-		ret = ERR_PTR(-ENODEV);
-		goto error1;
-	}
-
 	/* Allocate structures */
 	mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL);
 	if (!mad_agent_priv) {
@@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent)
 	struct ib_mad_agent_private *mad_agent_priv;
 	struct ib_mad_snoop_private *mad_snoop_priv;
 
+	if (!mad_agent)
+		return 0;
+
 	/* If the TID is zero, the agent can only snoop. */
 	if (mad_agent->hi_tid) {
 		mad_agent_priv = container_of(mad_agent,
@@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info)
 	struct ib_mad_private *recv;
 	struct ib_mad_list_head *mad_list;
 
+	if (!qp_info->qp)
+		return;
+
 	while (!list_empty(&qp_info->recv_queue.list)) {
 
 		mad_list = list_entry(qp_info->recv_queue.list.next,
@@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)
 
 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
 		qp = port_priv->qp_info[i].qp;
+		if (!qp)
+			continue;
+
 		/*
 		 * PKey index for QP1 is irrelevant but
 		 * one is needed for the Reset to Init transition
@@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)
 	}
 
 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
+		if (!port_priv->qp_info[i].qp)
+			continue;
+
 		ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL);
 		if (ret) {
 			printk(KERN_ERR PFX "Couldn't post receive WRs\n");
@@ -2762,6 +2777,9 @@ error:
 
 static void destroy_mad_qp(struct ib_mad_qp_info *qp_info)
 {
+	if (!qp_info->qp)
+		return;
+
 	ib_destroy_qp(qp_info->qp);
 	kfree(qp_info->snoop_table);
 }
@@ -2777,6 +2795,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	struct ib_mad_port_private *port_priv;
 	unsigned long flags;
 	char name[sizeof "ib_mad123"];
+	int has_smi;
 
 	/* Create new device info */
 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
@@ -2793,6 +2812,10 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB;
+	if (has_smi)
+		cq_size *= 2;
+
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2816,9 +2839,11 @@ static int ib_mad_port_open(struct ib_device *device,
 		goto error5;
 	}
 
-	ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
-	if (ret)
-		goto error6;
+	if (has_smi) {
+		ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
+		if (ret)
+			goto error6;
+	}
 	ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI);
 	if (ret)
 		goto error7;
@@ -2852,7 +2877,8 @@ error9:
 error8:
 	destroy_mad_qp(&port_priv->qp_info[1]);
 error7:
-	destroy_mad_qp(&port_priv->qp_info[0]);
+	if (has_smi)
+		destroy_mad_qp(&port_priv->qp_info[0]);
 error6:
 	ib_dereg_mr(port_priv->mr);
 error5:
@@ -2917,7 +2943,7 @@ static void ib_mad_init_device(struct ib_device *device)
 
 	for (i = start; i <= end; i++) {
 		tt = rdma_port_get_transport(device, i);
-		if (tt != RDMA_TRANSPORT_IB)
+		if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE)
 			continue;
 
 		if (ib_mad_port_open(device, i)) {
@@ -2943,7 +2969,8 @@ error:
 	i--;
 
 	while (i >= start) {
-		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) {
 			if (ib_agent_port_close(device, i))
 				printk(KERN_ERR PFX "Couldn't close %s port %d "
 				       "for agents\n",
@@ -2970,7 +2997,7 @@ static void ib_mad_remove_device(struct ib_device *device)
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
 		tt = rdma_port_get_transport(device, i);
-		if (tt == RDMA_TRANSPORT_IB) {
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) {
 			if (ib_agent_port_close(device, cur_port))
 				printk(KERN_ERR PFX "Couldn't close %s port %d "
 				       "for agents\n",
-- 
1.6.3.3
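
The rule the patch above applies when opening MAD ports can be summarised in a few lines. The sketch below uses made-up helper names and a local enum rather than the kernel's; it only restates which special QPs a port ends up with for each transport, as read from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for enum rdma_transport_type. */
enum transport { TRANSPORT_IB, TRANSPORT_IWARP, TRANSPORT_RDMAOE };

/* MAD services (QP1/GSI) are opened on IB and RDMAoE ports. */
static bool port_has_mad(enum transport tt)
{
	return tt == TRANSPORT_IB || tt == TRANSPORT_RDMAOE;
}

/* QP0/SMI is created only on native IB ports. */
static bool port_has_smi(enum transport tt)
{
	return tt == TRANSPORT_IB;
}

/* Number of special QPs a port gets: 0, 1 (QP1 only) or 2 (QP0 and QP1). */
static int special_qp_count(enum transport tt)
{
	if (!port_has_mad(tt))
		return 0;
	return port_has_smi(tt) ? 2 : 1;
}
```

Because an RDMAoE port has no QP0, every code path that walks `qp_info[]` has to tolerate a NULL `qp`, which is what the added checks in mad.c do.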


From eli at mellanox.co.il  Wed Aug  5 01:29:10 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:29:10 +0300
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE
	ports
Message-ID: <20090805082910.GE5599@mtls03>

Initialize the umad context for any device that has at least one IB or RDMAoE
port, so that user space apps can send and receive MADs on QP1.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/user_mad.c |   27 ++++++++++++++++++++-------
 1 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 3e58fc0..2189e65 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1112,6 +1112,7 @@ static void ib_umad_add_one(struct ib_device *device)
 {
 	struct ib_umad_device *umad_dev;
 	int s, e, i;
+	enum rdma_transport_type tt;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH)
 		s = e = 0;
@@ -1120,9 +1121,14 @@ static void ib_umad_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	for (i = s; i <= e; ++i)
-		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
-			return;
+	for (i = s; i <= e; ++i) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
+			break;
+	}
+
+	if (i > e)
+		return;
 
 	umad_dev = kzalloc(sizeof *umad_dev +
 			   (e - s + 1) * sizeof (struct ib_umad_port),
@@ -1147,8 +1153,11 @@ static void ib_umad_add_one(struct ib_device *device)
 	return;
 
 err:
-	while (--i >= s)
-		ib_umad_kill_port(&umad_dev->port[i - s]);
+	while (--i >= s) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
+			ib_umad_kill_port(&umad_dev->port[i - s]);
+	}
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
@@ -1157,12 +1166,16 @@ static void ib_umad_remove_one(struct ib_device *device)
 {
 	struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client);
 	int i;
+	enum rdma_transport_type tt;
 
 	if (!umad_dev)
 		return;
 
-	for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i)
-		ib_umad_kill_port(&umad_dev->port[i]);
+	for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) {
+		tt = rdma_port_get_transport(device, i + umad_dev->start_port);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
+			ib_umad_kill_port(&umad_dev->port[i]);
+	}
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
-- 
1.6.3.3
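
The umad patch above mixes two ways of walking ports: by port number (s..e) in the add path and by array index (0..end_port - start_port) in the remove path. The sketch below, built on a hypothetical `dev_model` struct that is not part of the kernel, shows both walks and the index-to-port-number conversion that the transport query needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for enum rdma_transport_type. */
enum transport { TRANSPORT_IB, TRANSPORT_IWARP, TRANSPORT_RDMAOE };

/* Hypothetical device model: transports[k] describes port (start_port + k). */
struct dev_model {
	int start_port;
	int end_port;
	const enum transport *transports;
};

/* Transport lookup keyed by port number, like rdma_port_get_transport(). */
static enum transport port_transport(const struct dev_model *d, int port_num)
{
	return d->transports[port_num - d->start_port];
}

static bool mad_capable(enum transport tt)
{
	return tt == TRANSPORT_IB || tt == TRANSPORT_RDMAOE;
}

/* Walk by port number: first MAD-capable port, or -1 if there is none. */
static int first_mad_port(const struct dev_model *d)
{
	int i;

	for (i = d->start_port; i <= d->end_port; ++i)
		if (mad_capable(port_transport(d, i)))
			return i;
	return -1;
}

/* Walk by array index k; the port number to query is start_port + k. */
static int count_mad_ports(const struct dev_model *d)
{
	int k, n = 0;

	for (k = 0; k <= d->end_port - d->start_port; ++k)
		if (mad_capable(port_transport(d, d->start_port + k)))
			++n;
	return n;
}
```

For an HCA, where ports are numbered from 1, passing the raw index to the lookup would query port 0, so the conversion matters in the index-based loop.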


From eli at mellanox.co.il  Wed Aug  5 01:29:19 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:29:19 +0300
Subject: [ofa-general] [PATCHv4 05/10] ib/cm: Enable CM support for RDMAoE
Message-ID: <20090805082919.GF5599@mtls03>

CM messages can be carried over RDMAoE ports, so enable CM support on those
ports here.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/cm.c  |    2 +-
 drivers/infiniband/core/ucm.c |   12 +++++++++---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index f930f1d..63d6de3 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3699,7 +3699,7 @@ static void cm_add_one(struct ib_device *ib_device)
 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
 		tt = rdma_port_get_transport(ib_device, i);
-		if (tt != RDMA_TRANSPORT_IB)
+		if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE)
 			continue;
 
 		port = kzalloc(sizeof *port, GFP_KERNEL);
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 4f5096d..21c78f5 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device)
 {
 	struct ib_ucm_device *ucm_dev;
 	int i;
+	enum rdma_transport_type tt;
 
 	if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH)
 		return;
 
-	for (i = 1; i <= device->phys_port_cnt; ++i)
-		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
-			return;
+	for (i = 1; i <= device->phys_port_cnt; ++i) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
+			break;
+	}
+
+	if (i > device->phys_port_cnt)
+		return;
 
 	ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
 	if (!ucm_dev)
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:29:29 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:29:29 +0300
Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding
Message-ID: <20090805082929.GG5599@mtls03>

Add support for RDMAoE device binding and IP->GID resolution. Path resolution
and multicast joining are implemented within cma.c by filling in the responses
and pushing the callbacks to the cma work queue. IP->GID resolution always
yields IPv6 link-local addresses: remote GIDs are derived from the destination
MAC address of the remote port. Multicast GIDs are always mapped to the
broadcast MAC (all FFs). Some helper functions are added to ib_addr.h.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/cma.c  |  150 ++++++++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/ucma.c |   25 +++++--
 include/rdma/ib_addr.h         |   87 +++++++++++++++++++++++
 3 files changed, 251 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 866ff7f..8f5675b 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 #define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
+#define RDMAOE_PACKET_LIFETIME 18
 
 static void cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device);
@@ -174,6 +175,12 @@ struct cma_ndev_work {
 	struct rdma_cm_event	event;
 };
 
+struct rdmaoe_mcast_work {
+	struct work_struct	 work;
+	struct rdma_id_private	*id;
+	struct cma_multicast	*mc;
+};
+
 union cma_ip_addr {
 	struct in6_addr ip6;
 	struct {
@@ -348,6 +355,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 			case RDMA_TRANSPORT_IWARP:
 				iw_addr_get_sgid(dev_addr, &gid);
 				break;
+			case RDMA_TRANSPORT_RDMAOE:
+				rdmaoe_addr_get_sgid(dev_addr, &gid);
+				break;
 			default:
 				return -ENODEV;
 			}
@@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
 {
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
+	u16 pkey;
+
+	if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) ==
+	    RDMA_TRANSPORT_IB)
+		pkey = ib_addr_get_pkey(dev_addr);
+	else
+		pkey = 0xffff;
 
 	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
-				  ib_addr_get_pkey(dev_addr),
-				  &qp_attr->pkey_index);
+				  pkey, &qp_attr->pkey_index);
 	if (ret)
 		return ret;
 
@@ -609,6 +625,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 	id_priv = container_of(id, struct rdma_id_private, id);
 	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps))
 			ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
 		else
@@ -836,7 +853,9 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 		mc = container_of(id_priv->mc_list.next,
 				  struct cma_multicast, list);
 		list_del(&mc->list);
-		ib_sa_free_multicast(mc->multicast.ib);
+		if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) ==
+		    RDMA_TRANSPORT_IB)
+			ib_sa_free_multicast(mc->multicast.ib);
 		kref_put(&mc->mcref, release_mc);
 	}
 }
@@ -855,6 +874,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 		mutex_unlock(&lock);
 		switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 		case RDMA_TRANSPORT_IB:
+		case RDMA_TRANSPORT_RDMAOE:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
@@ -1512,6 +1532,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
 	if (id->device) {
 		switch (rdma_port_get_transport(id->device, id->port_num)) {
 		case RDMA_TRANSPORT_IB:
+		case RDMA_TRANSPORT_RDMAOE:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
 				goto err;
@@ -1727,6 +1748,65 @@ static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
 	return 0;
 }
 
+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv)
+{
+	struct rdma_route *route = &id_priv->id.route;
+	struct rdma_addr *addr = &route->addr;
+	struct cma_work *work;
+	int ret;
+	struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr;
+	struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr;
+
+	if (src_addr->sin_family != dst_addr->sin_family)
+		return -EINVAL;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->id = id_priv;
+	INIT_WORK(&work->work, cma_work_handler);
+
+	route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL);
+	if (!route->path_rec) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	route->num_paths = 1;
+
+	rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr);
+	rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr);
+
+	route->path_rec->hop_limit = 2;
+	route->path_rec->reversible = 1;
+	route->path_rec->pkey = cpu_to_be16(0xffff);
+	route->path_rec->mtu_selector = 2;
+	route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu);
+	route->path_rec->rate_selector = 2;
+	route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev);
+	route->path_rec->packet_life_time_selector = 2;
+	route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME;
+
+	work->old_state = CMA_ROUTE_QUERY;
+	work->new_state = CMA_ROUTE_RESOLVED;
+	if (!route->path_rec->mtu || !route->path_rec->rate) {
+		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
+		work->event.status = -1;
+	} else {
+		work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
+		work->event.status = 0;
+	}
+
+	queue_work(cma_wq, &work->work);
+
+	return 0;
+
+err:
+	kfree(work);
+	return ret;
+}
+
 int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 {
 	struct rdma_id_private *id_priv;
@@ -1744,6 +1824,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 	case RDMA_TRANSPORT_IWARP:
 		ret = cma_resolve_iw_route(id_priv, timeout_ms);
 		break;
+	case RDMA_TRANSPORT_RDMAOE:
+		ret = cma_resolve_rdmaoe_route(id_priv);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_resolve_ib_udp(id_priv, conn_param);
 		else
@@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
 						conn_param->private_data,
@@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
 						private_data, private_data_len);
@@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		ret = cma_modify_qp_err(id_priv);
 		if (ret)
 			goto out;
@@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct rdma_id_private *id_priv,
 	return 0;
 }
 
+
+static void rdmaoe_mcast_work_handler(struct work_struct *work)
+{
+	struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work);
+	struct cma_multicast *mc = mw->mc;
+	struct ib_sa_multicast *m = mc->multicast.ib;
+
+	mc->multicast.ib->context = mc;
+	cma_ib_mc_handler(0, m);
+	kfree(m);
+	kfree(mw);
+}
+
+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv,
+				     struct cma_multicast *mc)
+{
+	struct rdmaoe_mcast_work *work;
+	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
+
+	if (cma_zero_addr((struct sockaddr *)&mc->addr))
+		return -EINVAL;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL);
+	if (!mc->multicast.ib) {
+		kfree(work);
+		return -ENOMEM;
+	}
+
+	cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib->rec.mgid);
+	mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff);
+	if (id_priv->id.ps == RDMA_PS_UDP)
+		mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY);
+	mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev);
+	mc->multicast.ib->rec.hop_limit = 1;
+	mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu);
+	rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
+	work->id = id_priv;
+	work->mc = mc;
+	INIT_WORK(&work->work, rdmaoe_mcast_work_handler);
+
+	queue_work(cma_wq, &work->work);
+
+	return 0;
+}
+
 int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 			void *context)
 {
@@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
 		break;
+	case RDMA_TRANSPORT_RDMAOE:
+		ret = cma_rdmaoe_join_multicast(id_priv, mc);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 		spin_unlock_irq(&id_priv->lock);
 		kfree(mc);
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(rdma_join_multicast);
@@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 				ib_detach_mcast(id->qp,
 						&mc->multicast.ib->rec.mgid,
 						mc->multicast.ib->rec.mlid);
-			ib_sa_free_multicast(mc->multicast.ib);
+			if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) ==
+			    RDMA_TRANSPORT_IB)
+				ib_sa_free_multicast(mc->multicast.ib);
 			kref_put(&mc->mcref, release_mc);
 			return;
 		}
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 24d9510..c7c9e92 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file *file,
 }
 
 static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
-			       struct rdma_route *route)
+			       struct rdma_route *route,
+			       enum rdma_transport_type tt)
 {
 	struct rdma_dev_addr *dev_addr;
 
@@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 	switch (route->num_paths) {
 	case 0:
 		dev_addr = &route->addr.dev_addr;
-		ib_addr_get_dgid(dev_addr,
-				 (union ib_gid *) &resp->ib_route[0].dgid);
-		ib_addr_get_sgid(dev_addr,
-				 (union ib_gid *) &resp->ib_route[0].sgid);
+		if (tt == RDMA_TRANSPORT_IB) {
+			ib_addr_get_dgid(dev_addr,
+					 (union ib_gid *) &resp->ib_route[0].dgid);
+			ib_addr_get_sgid(dev_addr,
+					 (union ib_gid *) &resp->ib_route[0].sgid);
+		} else {
+			rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid,
+					 dev_addr->dst_dev_addr);
+			rdmaoe_addr_get_sgid(dev_addr,
+					 (union ib_gid *) &resp->ib_route[0].sgid);
+		}
 		resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr));
 		break;
 	case 2:
@@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 	struct ucma_context *ctx;
 	struct sockaddr *addr;
 	int ret = 0;
+	enum rdma_transport_type tt;
 
 	if (out_len < sizeof(resp))
 		return -ENOSPC;
@@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 
 	resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
 	resp.port_num = ctx->cm_id->port_num;
-	switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) {
+	tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num);
+	switch (tt) {
 	case RDMA_TRANSPORT_IB:
-		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
+	case RDMA_TRANSPORT_RDMAOE:
+		ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt);
 		break;
 	default:
 		break;
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 483057b..66a848e 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -39,6 +39,8 @@
 #include <linux/netdevice.h>
 #include <linux/socket.h>
 #include <rdma/ib_verbs.h>
+#include <linux/ethtool.h>
+#include <rdma/ib_pack.h>
 
 struct rdma_addr_client {
 	atomic_t refcount;
@@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr,
 	memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid);
 }
 
+static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac)
+{
+	memset(gid->raw, 0, 16);
+	*((u32 *)gid->raw) = cpu_to_be32(0xfe800000);
+	gid->raw[12] = 0xfe;
+	gid->raw[11] = 0xff;
+	memcpy(gid->raw + 13, mac + 3, 3);
+	memcpy(gid->raw + 8, mac, 3);
+	gid->raw[8] ^= 2;
+}
+
+static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr,
+					union ib_gid *gid)
+{
+	rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr);
+}
+
+static inline enum ib_mtu rdmaoe_get_mtu(int mtu)
+{
+	/*
+	 * Subtract the IB headers from the effective RDMAoE MTU. 28 is the
+	 * atomic header, the largest possible header after the BTH.
+	 */
+	mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28;
+
+	if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096))
+		return IB_MTU_4096;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048))
+		return IB_MTU_2048;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024))
+		return IB_MTU_1024;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512))
+		return IB_MTU_512;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256))
+		return IB_MTU_256;
+	else
+		return 0;
+}
+
+static inline int rdmaoe_get_rate(struct net_device *dev)
+{
+	struct ethtool_cmd cmd;
+
+	if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings ||
+	    dev->ethtool_ops->get_settings(dev, &cmd))
+		return IB_RATE_PORT_CURRENT;
+
+	if (cmd.speed >= 40000)
+		return IB_RATE_40_GBPS;
+	else if (cmd.speed >= 30000)
+		return IB_RATE_30_GBPS;
+	else if (cmd.speed >= 20000)
+		return IB_RATE_20_GBPS;
+	else if (cmd.speed >= 10000)
+		return IB_RATE_10_GBPS;
+	else
+		return IB_RATE_PORT_CURRENT;
+}
+
+static inline int rdma_link_local_addr(struct in6_addr *addr)
+{
+	if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) &&
+	    addr->s6_addr32[1] == 0)
+		return 1;
+	else
+		return 0;
+}
+
+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
+{
+	memcpy(mac, &addr->s6_addr[8], 3);
+	memcpy(mac + 3, &addr->s6_addr[13], 3);
+	mac[0] ^= 2;
+}
+
+static inline int rdma_is_multicast_addr(struct in6_addr *addr)
+{
+	return addr->s6_addr[0] == 0xff ? 1 : 0;
+}
+
+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
+{
+	memset(mac, 0xff, 6);
+}
+
 #endif /* IB_ADDR_H */
-- 
1.6.3.3
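
The address and MTU helpers added to ib_addr.h in the patch above are easy to try out in user space. The following is a sketch under stated assumptions: the function names are shortened, GIDs are plain 16-byte arrays instead of `union ib_gid`, the header sizes 40 (GRH) and 12 (BTH) are taken to be the values of IB_GRH_BYTES and IB_BTH_BYTES, and the MTU helper returns a byte count rather than an `enum ib_mtu`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build an IPv6 link-local GID (fe80::/64 + modified EUI-64) from a MAC. */
static void mac_to_ll(uint8_t gid[16], const uint8_t mac[6])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80:0000:0000:0000 prefix */
	gid[1] = 0x80;
	memcpy(gid + 8, mac, 3);	/* upper half of the MAC */
	gid[8] ^= 2;			/* flip the universal/local bit */
	gid[11] = 0xff;			/* ff:fe filler in the middle */
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);	/* lower half of the MAC */
}

/* Inverse mapping: recover the MAC from a link-local GID. */
static void ll_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	memcpy(mac, gid + 8, 3);
	memcpy(mac + 3, gid + 13, 3);
	mac[0] ^= 2;
}

enum { GRH_BYTES = 40, BTH_BYTES = 12, ATOMIC_HDR_BYTES = 28 };

/*
 * Largest IB MTU size in bytes (256..4096) that still fits once the
 * headers are subtracted from the Ethernet MTU, or 0 if none fits.
 */
static int rdmaoe_payload_mtu(int eth_mtu)
{
	static const int ib_mtus[] = { 4096, 2048, 1024, 512, 256 };
	int room = eth_mtu - GRH_BYTES - BTH_BYTES - ATOMIC_HDR_BYTES;
	size_t i;

	for (i = 0; i < sizeof ib_mtus / sizeof ib_mtus[0]; ++i)
		if (room >= ib_mtus[i])
			return ib_mtus[i];
	return 0;
}
```

With a standard 1500-byte Ethernet MTU the room left is 1500 - 40 - 12 - 28 = 1420 bytes, so the path MTU comes out as 1024; a 9000-byte jumbo MTU leaves 8920 bytes and yields 4096.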


From eli at mellanox.co.il  Wed Aug  5 01:29:37 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:29:37 +0300
Subject: [ofa-general] [PATCHv4 07/10] ib_core: RDMAoE UD packet packing
	support
Message-ID: <20090805082937.GH5599@mtls03>

Add support functions to aid in packing RDMAoE packets.
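
As a rough illustration of the arithmetic the new helpers perform, the
standalone sketch below recomputes the packed header length, the GRH payload
length and the BTH pad count in userspace C. The function names are
hypothetical and exist only for this example; the byte counts are the IB_*_BYTES
values used by the patch, and the 4-byte immediate and ICRC sizes are taken
from the code and its comments.

```c
#include <stdint.h>

enum {
	ETH_BYTES  = 14,	/* IB_ETH_BYTES in the patch */
	GRH_BYTES  = 40,	/* IB_GRH_BYTES */
	BTH_BYTES  = 12,	/* IB_BTH_BYTES */
	DETH_BYTES = 8,		/* IB_DETH_BYTES */
	IMM_BYTES  = 4,		/* sizeof immediate_data (__be32) */
	ICRC_BYTES = 4		/* ICRC, as noted in the patch comment */
};

/* Hypothetical helper: bytes a pack routine emits for this configuration. */
static int ud_header_len(int grh_present, int immediate_present)
{
	int len = ETH_BYTES + BTH_BYTES + DETH_BYTES;

	if (grh_present)
		len += GRH_BYTES;
	if (immediate_present)
		len += IMM_BYTES;
	return len;
}

/* Hypothetical helper: BTH + DETH + payload + ICRC, rounded up to 4 bytes. */
static uint16_t grh_payload_length(int payload_bytes)
{
	return (uint16_t)((BTH_BYTES + DETH_BYTES + payload_bytes +
			   ICRC_BYTES + 3) & ~3);
}

/* Hypothetical helper: pad bytes needed to make the payload 4-byte aligned. */
static int bth_pad_count(int payload_bytes)
{
	return (4 - payload_bytes) & 3;
}
```

For a 10-byte payload this gives a pad count of 2 and a GRH payload length of
36 (host byte order here; the patch stores it with cpu_to_be16()).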

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/ud_header.c |  111 +++++++++++++++++++++++++++++++++++
 include/rdma/ib_pack.h              |   26 ++++++++
 2 files changed, 137 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c
index 8ec7876..d04b6f2 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -80,6 +80,29 @@ static const struct ib_field lrh_table[]  = {
 	  .size_bits    = 16 }
 };
 
+static const struct ib_field eth_table[]  = {
+	{ STRUCT_FIELD(eth, dmac_h),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 32 },
+	{ STRUCT_FIELD(eth, dmac_l),
+	  .offset_words = 1,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ STRUCT_FIELD(eth, smac_h),
+	  .offset_words = 1,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ STRUCT_FIELD(eth, smac_l),
+	  .offset_words = 2,
+	  .offset_bits  = 0,
+	  .size_bits    = 32 },
+	{ STRUCT_FIELD(eth, type),
+	  .offset_words = 3,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 }
+};
+
 static const struct ib_field grh_table[]  = {
 	{ STRUCT_FIELD(grh, ip_version),
 	  .offset_words = 0,
@@ -241,6 +264,53 @@ void ib_ud_header_init(int     		    payload_bytes,
 EXPORT_SYMBOL(ib_ud_header_init);
 
 /**
+ * ib_rdmaoe_ud_header_init - Initialize UD header structure
+ * @payload_bytes:Length of packet payload
+ * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @header:Structure to initialize
+ *
+ * ib_rdmaoe_ud_header_init() initializes the grh.ip_version, grh.payload_length,
+ * grh.next_header, bth.opcode, bth.pad_count and
+ * bth.transport_header_version fields of a &struct eth_ud_header given
+ * the payload length and whether a GRH will be included.
+ */
+void ib_rdmaoe_ud_header_init(int     		    payload_bytes,
+			   int    		    grh_present,
+			   struct eth_ud_header    *header)
+{
+	int header_len;
+
+	memset(header, 0, sizeof *header);
+
+	header_len =
+		sizeof header->eth  +
+		IB_BTH_BYTES  +
+		IB_DETH_BYTES;
+	if (grh_present)
+		header_len += IB_GRH_BYTES;
+
+	header->grh_present          = grh_present;
+	if (grh_present) {
+		header->grh.ip_version      = 6;
+		header->grh.payload_length  =
+			cpu_to_be16((IB_BTH_BYTES     +
+				     IB_DETH_BYTES    +
+				     payload_bytes    +
+				     4                + /* ICRC     */
+				     3) & ~3);          /* round up */
+		header->grh.next_header     = 0x1b;
+	}
+
+	if (header->immediate_present)
+		header->bth.opcode           = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
+	else
+		header->bth.opcode           = IB_OPCODE_UD_SEND_ONLY;
+	header->bth.pad_count                = (4 - payload_bytes) & 3;
+	header->bth.transport_header_version = 0;
+}
+EXPORT_SYMBOL(ib_rdmaoe_ud_header_init);
+
+/**
  * ib_ud_header_pack - Pack UD header struct into wire format
  * @header:UD header struct
  * @buf:Buffer to pack into
@@ -281,6 +351,47 @@ int ib_ud_header_pack(struct ib_ud_header *header,
 EXPORT_SYMBOL(ib_ud_header_pack);
 
 /**
+ * rdmaoe_ud_header_pack - Pack UD header struct into eth wire format
+ * @header:UD header struct
+ * @buf:Buffer to pack into
+ *
+ * rdmaoe_ud_header_pack() packs the UD header structure @header into
+ * Ethernet wire format in the buffer @buf.
+ */
+int rdmaoe_ud_header_pack(struct eth_ud_header *header,
+		       void                 *buf)
+{
+	int len = 0;
+
+	ib_pack(eth_table, ARRAY_SIZE(eth_table),
+		&header->eth, buf);
+	len += IB_ETH_BYTES;
+
+	if (header->grh_present) {
+		ib_pack(grh_table, ARRAY_SIZE(grh_table),
+			&header->grh, buf + len);
+		len += IB_GRH_BYTES;
+	}
+
+	ib_pack(bth_table, ARRAY_SIZE(bth_table),
+		&header->bth, buf + len);
+	len += IB_BTH_BYTES;
+
+	ib_pack(deth_table, ARRAY_SIZE(deth_table),
+		&header->deth, buf + len);
+	len += IB_DETH_BYTES;
+
+	if (header->immediate_present) {
+		memcpy(buf + len, &header->immediate_data,
+		       sizeof header->immediate_data);
+		len += sizeof header->immediate_data;
+	}
+
+	return len;
+}
+EXPORT_SYMBOL(rdmaoe_ud_header_pack);
+
+/**
  * ib_ud_header_unpack - Unpack UD header struct from wire format
  * @header:UD header struct
  * @buf:Buffer to pack into
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index d7fc45c..bf199eb 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -37,6 +37,7 @@
 
 enum {
 	IB_LRH_BYTES  = 8,
+	IB_ETH_BYTES  = 14,
 	IB_GRH_BYTES  = 40,
 	IB_BTH_BYTES  = 12,
 	IB_DETH_BYTES = 8
@@ -210,6 +211,14 @@ struct ib_unpacked_deth {
 	__be32       source_qpn;
 };
 
+struct ib_unpacked_eth {
+	u8	dmac_h[4];
+	u8	dmac_l[2];
+	u8	smac_h[2];
+	u8	smac_l[4];
+	__be16	type;
+};
+
 struct ib_ud_header {
 	struct ib_unpacked_lrh  lrh;
 	int                     grh_present;
@@ -220,6 +229,16 @@ struct ib_ud_header {
 	__be32         		immediate_data;
 };
 
+struct eth_ud_header {
+	struct ib_unpacked_eth  eth;
+	int                     grh_present;
+	struct ib_unpacked_grh  grh;
+	struct ib_unpacked_bth  bth;
+	struct ib_unpacked_deth deth;
+	int            		immediate_present;
+	__be32         		immediate_data;
+};
+
 void ib_pack(const struct ib_field        *desc,
 	     int                           desc_len,
 	     void                         *structure,
@@ -234,10 +253,17 @@ void ib_ud_header_init(int     		   payload_bytes,
 		       int    		   grh_present,
 		       struct ib_ud_header *header);
 
+void ib_rdmaoe_ud_header_init(int     		   payload_bytes,
+			   int    		   grh_present,
+			   struct eth_ud_header   *header);
+
 int ib_ud_header_pack(struct ib_ud_header *header,
 		      void                *buf);
 
 int ib_ud_header_unpack(void                *buf,
 			struct ib_ud_header *header);
 
+int rdmaoe_ud_header_pack(struct eth_ud_header *header,
+		       void                 *buf);
+
 #endif /* IB_PACK_H */
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:29:50 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:29:50 +0300
Subject: [ofa-general] [PATCHv4 08/10] ib_core: Add API to support RDMAoE
	from userspace
Message-ID: <20090805082950.GI5599@mtls03>

Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote
port's MAC address. The port transport is also returned by ibv_query_port().
The ABI version is incremented from 6 to 7.
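
The layout of the new command can be sketched in userspace as follows. This is
only a sketch under stated assumptions: the struct and function names are
hypothetical mirrors of struct ib_uverbs_get_mac and
struct ib_uverbs_get_mac_resp from the diff, written with fixed-width types,
and it does not show how libibverbs actually issues the command.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct ib_uverbs_get_mac from the patch. */
struct get_mac_cmd {
	uint64_t response;	/* user address of the response buffer */
	uint32_t pd_handle;
	uint8_t  port;
	uint8_t  reserved[3];
	uint8_t  gid[16];
};

/* Hypothetical mirror of struct ib_uverbs_get_mac_resp from the patch. */
struct get_mac_resp {
	uint8_t  mac[6];
	uint16_t reserved;
};

/* The explicit reserved fields keep both structs free of implicit padding. */
_Static_assert(sizeof(struct get_mac_cmd) == 32, "unexpected command size");
_Static_assert(sizeof(struct get_mac_resp) == 8, "unexpected response size");

/* Hypothetical helper: fill a command that resolves gid on the given port. */
static void fill_get_mac_cmd(struct get_mac_cmd *cmd,
			     struct get_mac_resp *resp,
			     uint32_t pd_handle, uint8_t port,
			     const uint8_t gid[16])
{
	memset(cmd, 0, sizeof *cmd);
	cmd->response  = (uint64_t)(uintptr_t)resp;
	cmd->pd_handle = pd_handle;
	cmd->port      = port;
	memcpy(cmd->gid, gid, sizeof cmd->gid);
}
```

On success the kernel side copies the 6-byte MAC into the buffer that the
response field points to.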

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/uverbs.h      |    1 +
 drivers/infiniband/core/uverbs_cmd.c  |   32 ++++++++++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_main.c |    1 +
 drivers/infiniband/core/verbs.c       |   10 ++++++++++
 include/rdma/ib_user_verbs.h          |   21 ++++++++++++++++++---
 include/rdma/ib_verbs.h               |   12 ++++++++++++
 6 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index b3ea958..e69b04c 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -194,5 +194,6 @@ IB_UVERBS_DECLARE_CMD(create_srq);
 IB_UVERBS_DECLARE_CMD(modify_srq);
 IB_UVERBS_DECLARE_CMD(query_srq);
 IB_UVERBS_DECLARE_CMD(destroy_srq);
+IB_UVERBS_DECLARE_CMD(get_mac);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 56feab6..012aadf 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -452,6 +452,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
 	resp.active_width    = attr.active_width;
 	resp.active_speed    = attr.active_speed;
 	resp.phys_state      = attr.phys_state;
+	resp.transport	     = attr.transport;
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
@@ -1824,6 +1825,37 @@ err:
 	return ret;
 }
 
+ssize_t ib_uverbs_get_mac(struct ib_uverbs_file *file, const char __user *buf,
+			  int in_len, int out_len)
+{
+	struct ib_uverbs_get_mac        cmd;
+	struct ib_uverbs_get_mac_resp   resp;
+	int              ret;
+	struct ib_pd    *pd;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	pd = idr_read_pd(cmd.pd_handle, file->ucontext);
+	if (!pd)
+		return -EINVAL;
+
+	ret = ib_get_mac(pd->device, cmd.port, cmd.gid, resp.mac);
+	put_pd_read(pd);
+	if (!ret) {
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			return -EFAULT;
+
+		return in_len;
+	}
+
+	return ret;
+}
+
 ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 			     const char __user *buf, int in_len, int out_len)
 {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index eb36a81..2641845 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -108,6 +108,7 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
 	[IB_USER_VERBS_CMD_MODIFY_SRQ]    	= ib_uverbs_modify_srq,
 	[IB_USER_VERBS_CMD_QUERY_SRQ]     	= ib_uverbs_query_srq,
 	[IB_USER_VERBS_CMD_DESTROY_SRQ]   	= ib_uverbs_destroy_srq,
+	[IB_USER_VERBS_CMD_GET_MAC]		= ib_uverbs_get_mac,
 };
 
 static struct vfsmount *uverbs_event_mnt;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 3b2f00b..7cce5d6 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -912,3 +912,13 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
 	return qp->device->detach_mcast(qp, gid, lid);
 }
 EXPORT_SYMBOL(ib_detach_mcast);
+
+int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac)
+{
+	if (!device->get_mac)
+		return -ENOSYS;
+
+	return device->get_mac(device, port, gid, mac);
+}
+EXPORT_SYMBOL(ib_get_mac);
+
diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h
index a17f771..49eee8a 100644
--- a/include/rdma/ib_user_verbs.h
+++ b/include/rdma/ib_user_verbs.h
@@ -42,7 +42,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define IB_USER_VERBS_ABI_VERSION	6
+#define IB_USER_VERBS_ABI_VERSION	7
 
 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -81,7 +81,8 @@ enum {
 	IB_USER_VERBS_CMD_MODIFY_SRQ,
 	IB_USER_VERBS_CMD_QUERY_SRQ,
 	IB_USER_VERBS_CMD_DESTROY_SRQ,
-	IB_USER_VERBS_CMD_POST_SRQ_RECV
+	IB_USER_VERBS_CMD_POST_SRQ_RECV,
+	IB_USER_VERBS_CMD_GET_MAC
 };
 
 /*
@@ -205,7 +206,8 @@ struct ib_uverbs_query_port_resp {
 	__u8  active_width;
 	__u8  active_speed;
 	__u8  phys_state;
-	__u8  reserved[3];
+	__u8  transport;
+	__u8  reserved[2];
 };
 
 struct ib_uverbs_alloc_pd {
@@ -621,6 +623,19 @@ struct ib_uverbs_destroy_ah {
 	__u32 ah_handle;
 };
 
+struct ib_uverbs_get_mac {
+	__u64	response;
+	__u32	pd_handle;
+	__u8	port;
+	__u8	reserved[3];
+	__u8	gid[16];
+};
+
+struct ib_uverbs_get_mac_resp {
+	__u8	mac[6];
+	__u16	reserved;
+};
+
 struct ib_uverbs_attach_mcast {
 	__u8  gid[16];
 	__u32 qp_handle;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4eec70f..9470e1a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1131,6 +1131,9 @@ struct ib_device {
 						  struct ib_grh *in_grh,
 						  struct ib_mad *in_mad,
 						  struct ib_mad *out_mad);
+	int                        (*get_mac)(struct ib_device *device, u8 port,
+					      u8 *gid, u8 *mac);
+
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
@@ -2035,4 +2038,13 @@ int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
  */
 int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 
+/**
+  * ib_get_mac - get the mac address for the specified gid
+  * @device: IB device used for traffic
+  * @port: port number used.
+  * @gid: gid to be resolved into mac
+  * @mac: mac of the port bearing this gid
+  */
+int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac);
+
 #endif /* IB_VERBS_H */
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:30:08 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:30:08 +0300
Subject: [ofa-general] [PATCHv4 09/10] mlx4: Add support for RDMAoE - address
	resolution
Message-ID: <20090805083008.GJ5599@mtls03>

The following patch handles address vector creation for RDMAoE ports. mlx4
needs the MAC address of the remote node to include it in the WQE of a UD QP or
in the QP context of connected QPs. Address resolution is done atomically in
the case of a link-local address or a multicast GID; otherwise -EINVAL is
returned. mlx4 transport packets were also changed to accommodate RDMAoE.
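
The resolution rule described above can be sketched as a standalone userspace
C function. The names below are hypothetical; the logic mirrors
rdma_link_local_addr(), rdma_get_ll_mac(), rdma_is_multicast_addr() and
rdma_get_mcast_mac() from the earlier ib_addr.h patch in this series, operating
on a raw 16-byte GID instead of a struct in6_addr.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the GID starts with the link-local prefix fe80::/64. */
static int gid_is_link_local(const uint8_t gid[16])
{
	static const uint8_t prefix[8] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0 };

	return memcmp(gid, prefix, sizeof prefix) == 0;
}

/* Returns 1 if the GID is a multicast address (first byte 0xff). */
static int gid_is_multicast(const uint8_t gid[16])
{
	return gid[0] == 0xff;
}

/*
 * Hypothetical helper: resolve a GID to a MAC without sleeping.
 * Returns 0 on success and -EINVAL for any other kind of GID.
 */
static int gid_to_mac(const uint8_t gid[16], uint8_t mac[6], int *is_mcast)
{
	*is_mcast = 0;

	if (gid_is_link_local(gid)) {
		/* Undo the modified EUI-64 mapping: skip ff:fe, flip bit 1. */
		memcpy(mac, gid + 8, 3);
		memcpy(mac + 3, gid + 13, 3);
		mac[0] ^= 2;
		return 0;
	}

	if (gid_is_multicast(gid)) {
		/* The patch maps every multicast GID to the broadcast MAC. */
		memset(mac, 0xff, 6);
		*is_mcast = 1;
		return 0;
	}

	return -EINVAL;
}
```

For example, the GID fe80::202:c9ff:fe12:3456 resolves to the MAC
00:02:c9:12:34:56, while a global (non link-local, non multicast) GID is
rejected.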

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/hw/mlx4/ah.c      |  187 ++++++++++++++++++++++++++++------
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   19 +++-
 drivers/infiniband/hw/mlx4/qp.c      |  172 +++++++++++++++++++++----------
 drivers/net/mlx4/fw.c                |    3 +-
 include/linux/mlx4/device.h          |   31 ++++++-
 include/linux/mlx4/qp.h              |    8 +-
 6 files changed, 327 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index c75ac94..0a015c3 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -31,63 +31,166 @@
  */
 
 #include "mlx4_ib.h"
+#include <rdma/ib_addr.h>
+#include <linux/inet.h>
+#include <linux/string.h>
 
-struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
+			u8 *mac, int *is_mcast)
 {
-	struct mlx4_dev *dev = to_mdev(pd->device)->dev;
-	struct mlx4_ib_ah *ah;
+	struct mlx4_ib_rdmaoe *rdmaoe = &dev->rdmaoe;
+	struct sockaddr_in6 s6 = {0};
+	struct net_device *netdev;
+	int ifidx;
 
-	ah = kmalloc(sizeof *ah, GFP_ATOMIC);
-	if (!ah)
-		return ERR_PTR(-ENOMEM);
+	*is_mcast = 0;
+	spin_lock(&rdmaoe->lock);
+	netdev = rdmaoe->netdevs[ah_attr->port_num - 1];
+	if (!netdev) {
+		spin_unlock(&rdmaoe->lock);
+		return -EINVAL;
+	}
+	ifidx = netdev->ifindex;
+	spin_unlock(&rdmaoe->lock);
 
-	memset(&ah->av, 0, sizeof ah->av);
+	memcpy(s6.sin6_addr.s6_addr, ah_attr->grh.dgid.raw, sizeof s6.sin6_addr.s6_addr);
+	s6.sin6_family = AF_INET6;
+	s6.sin6_scope_id = ifidx;
+	if (rdma_link_local_addr(&s6.sin6_addr))
+		rdma_get_ll_mac(&s6.sin6_addr, mac);
+	else if (rdma_is_multicast_addr(&s6.sin6_addr)) {
+		rdma_get_mcast_mac(&s6.sin6_addr, mac);
+		*is_mcast = 1;
+	} else
+		return -EINVAL;
 
-	ah->av.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
-	ah->av.g_slid  = ah_attr->src_path_bits;
-	ah->av.dlid    = cpu_to_be16(ah_attr->dlid);
-	if (ah_attr->static_rate) {
-		ah->av.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
-		while (ah->av.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
-		       !(1 << ah->av.stat_rate & dev->caps.stat_rate_support))
-			--ah->av.stat_rate;
-	}
-	ah->av.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
+	return 0;
+}
+
+static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
+				  struct mlx4_ib_ah *ah)
+{
+	struct mlx4_dev *dev = to_mdev(pd->device)->dev;
+
+	ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
+	ah->av.ib.g_slid  = ah_attr->src_path_bits;
 	if (ah_attr->ah_flags & IB_AH_GRH) {
-		ah->av.g_slid   |= 0x80;
-		ah->av.gid_index = ah_attr->grh.sgid_index;
-		ah->av.hop_limit = ah_attr->grh.hop_limit;
-		ah->av.sl_tclass_flowlabel |=
+		ah->av.ib.g_slid   |= 0x80;
+		ah->av.ib.gid_index = ah_attr->grh.sgid_index;
+		ah->av.ib.hop_limit = ah_attr->grh.hop_limit;
+		ah->av.ib.sl_tclass_flowlabel |=
 			cpu_to_be32((ah_attr->grh.traffic_class << 20) |
 				    ah_attr->grh.flow_label);
-		memcpy(ah->av.dgid, ah_attr->grh.dgid.raw, 16);
+		memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16);
+	}
+
+	ah->av.ib.dlid    = cpu_to_be16(ah_attr->dlid);
+	if (ah_attr->static_rate) {
+		ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
+		while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
+		       !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support))
+			--ah->av.ib.stat_rate;
 	}
+	ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
 
 	return &ah->ibah;
 }
 
+static struct ib_ah *create_rdmaoe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
+				   struct mlx4_ib_ah *ah)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(pd->device);
+	struct mlx4_dev *dev = ibdev->dev;
+	u8 mac[6];
+	int err;
+	int is_mcast;
+
+	err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast);
+	if (err)
+		return ERR_PTR(err);
+
+	memcpy(ah->av.eth.mac_0_1, mac, 2);
+	memcpy(ah->av.eth.mac_2_5, mac + 2, 4);
+	ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
+	ah->av.ib.g_slid = 0x80;
+	if (ah_attr->static_rate) {
+		ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
+		while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
+		       !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support))
+			--ah->av.ib.stat_rate;
+	}
+
+	/*
+	 * HW requires multicast LID so we just choose one.
+	 */
+	if (is_mcast)
+		ah->av.ib.dlid = cpu_to_be16(0xc000);
+
+	memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16);
+	ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
+
+	return &ah->ibah;
+}
+
+struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+{
+	struct mlx4_ib_ah *ah;
+	enum rdma_transport_type transport;
+	struct ib_ah *ret;
+
+	ah = kzalloc(sizeof *ah, GFP_ATOMIC);
+	if (!ah)
+		return ERR_PTR(-ENOMEM);
+
+	transport = rdma_port_get_transport(pd->device, ah_attr->port_num);
+	if (transport == RDMA_TRANSPORT_RDMAOE) {
+		if (!(ah_attr->ah_flags & IB_AH_GRH)) {
+			ret = ERR_PTR(-EINVAL);
+			goto out;
+		} else {
+			/* TBD: need to handle the case when we get called
+			in an atomic context and there we might sleep. We
+			don't expect this currently since we're working with
+			link local addresses which we can translate without
+			going to sleep */
+			ret = create_rdmaoe_ah(pd, ah_attr, ah);
+			if (IS_ERR(ret))
+				goto out;
+			else
+				return ret;
+		}
+	} else
+		return create_ib_ah(pd, ah_attr, ah); /* never fails */
+
+out:
+	kfree(ah);
+	return ret;
+}
+
 int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
 {
 	struct mlx4_ib_ah *ah = to_mah(ibah);
+	enum rdma_transport_type transport;
 
+	transport = rdma_port_get_transport(ibah->device, ah_attr->port_num);
 	memset(ah_attr, 0, sizeof *ah_attr);
-	ah_attr->dlid	       = be16_to_cpu(ah->av.dlid);
-	ah_attr->sl	       = be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28;
-	ah_attr->port_num      = be32_to_cpu(ah->av.port_pd) >> 24;
-	if (ah->av.stat_rate)
-		ah_attr->static_rate = ah->av.stat_rate - MLX4_STAT_RATE_OFFSET;
-	ah_attr->src_path_bits = ah->av.g_slid & 0x7F;
+	ah_attr->dlid = transport == RDMA_TRANSPORT_IB ? be16_to_cpu(ah->av.ib.dlid) : 0;
+	ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+	ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24;
+	if (ah->av.ib.stat_rate)
+		ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET;
+	ah_attr->src_path_bits = ah->av.ib.g_slid & 0x7F;
 
 	if (mlx4_ib_ah_grh_present(ah)) {
 		ah_attr->ah_flags = IB_AH_GRH;
 
 		ah_attr->grh.traffic_class =
-			be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20;
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20;
 		ah_attr->grh.flow_label =
-			be32_to_cpu(ah->av.sl_tclass_flowlabel) & 0xfffff;
-		ah_attr->grh.hop_limit  = ah->av.hop_limit;
-		ah_attr->grh.sgid_index = ah->av.gid_index;
-		memcpy(ah_attr->grh.dgid.raw, ah->av.dgid, 16);
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) & 0xfffff;
+		ah_attr->grh.hop_limit  = ah->av.ib.hop_limit;
+		ah_attr->grh.sgid_index = ah->av.ib.gid_index;
+		memcpy(ah_attr->grh.dgid.raw, ah->av.ib.dgid, 16);
 	}
 
 	return 0;
@@ -98,3 +201,21 @@ int mlx4_ib_destroy_ah(struct ib_ah *ah)
 	kfree(to_mah(ah));
 	return 0;
 }
+
+int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac)
+{
+	int err;
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct ib_ah_attr ah_attr = {
+		.port_num = port,
+	};
+	int is_mcast;
+
+	memcpy(ah_attr.grh.dgid.raw, gid, 16);
+	err = mlx4_ib_resolve_grh(ibdev, &ah_attr, mac, &is_mcast);
+	if (err)
+		return err;
+
+	return 0;
+}
+
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 8a7dd67..c644cac 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -138,6 +138,7 @@ struct mlx4_ib_qp {
 	u8			resp_depth;
 	u8			sq_no_prefetch;
 	u8			state;
+	int			mlx_type;
 };
 
 struct mlx4_ib_srq {
@@ -157,7 +158,14 @@ struct mlx4_ib_srq {
 
 struct mlx4_ib_ah {
 	struct ib_ah		ibah;
-	struct mlx4_av		av;
+	union mlx4_ext_av       av;
+};
+
+struct mlx4_ib_rdmaoe {
+	spinlock_t		lock;
+	struct net_device      *netdevs[MLX4_MAX_PORTS];
+	struct notifier_block 	nb;
+	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
 };
 
 struct mlx4_ib_dev {
@@ -175,6 +183,8 @@ struct mlx4_ib_dev {
 	spinlock_t		sm_lock;
 
 	struct mutex		cap_mask_mutex;
+
+	struct mlx4_ib_rdmaoe	rdmaoe;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)
@@ -313,9 +323,14 @@ int mlx4_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, int npages,
 int mlx4_ib_unmap_fmr(struct list_head *fmr_list);
 int mlx4_ib_fmr_dealloc(struct ib_fmr *fmr);
 
+int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
+			u8 *mac, int *is_mcast);
+
+int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac);
+
 static inline int mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah)
 {
-	return !!(ah->av.g_slid & 0x80);
+	return !!(ah->av.ib.g_slid & 0x80);
 }
 
 #endif /* MLX4_IB_H */
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 20724ae..4b391fa 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -32,6 +32,7 @@
  */
 
 #include <linux/log2.h>
+#include <linux/netdevice.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
@@ -47,14 +48,21 @@ enum {
 
 enum {
 	MLX4_IB_DEFAULT_SCHED_QUEUE	= 0x83,
-	MLX4_IB_DEFAULT_QP0_SCHED_QUEUE	= 0x3f
+	MLX4_IB_DEFAULT_QP0_SCHED_QUEUE	= 0x3f,
+	MLX4_IB_LINK_TYPE_IB		= 0,
+	MLX4_IB_LINK_TYPE_ETH		= 1
 };
 
 enum {
 	/*
 	 * Largest possible UD header: send with GRH and immediate data.
+	 * 4 bytes added to accommodate for eth header instead of lrh
 	 */
-	MLX4_IB_UD_HEADER_SIZE		= 72
+	MLX4_IB_UD_HEADER_SIZE		= 76
+};
+
+enum {
+	MLX4_RDMAOE_ETHERTYPE = 0x8915
 };
 
 struct mlx4_ib_sqp {
@@ -62,7 +70,10 @@ struct mlx4_ib_sqp {
 	int			pkey_index;
 	u32			qkey;
 	u32			send_psn;
-	struct ib_ud_header	ud_header;
+	union {
+		struct ib_ud_header	ib;
+		struct eth_ud_header	eth;
+	} hdr;
 	u8			header_buf[MLX4_IB_UD_HEADER_SIZE];
 };
 
@@ -546,9 +557,9 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		}
 	}
 
-	if (sqpn) {
+	if (sqpn)
 		qpn = sqpn;
-	} else {
+	else {
 		err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn);
 		if (err)
 			goto err_wrid;
@@ -843,6 +854,12 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port)
 static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 			 struct mlx4_qp_path *path, u8 port)
 {
+	int err;
+	int is_eth = rdma_port_get_transport(&dev->ib_dev, port) ==
+		RDMA_TRANSPORT_RDMAOE ? 1 : 0;
+	u8 mac[6];
+	int is_mcast;
+
 	path->grh_mylmc     = ah->src_path_bits & 0x7f;
 	path->rlid	    = cpu_to_be16(ah->dlid);
 	if (ah->static_rate) {
@@ -873,6 +890,21 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 	path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE |
 		((port - 1) << 6) | ((ah->sl & 0xf) << 2);
 
+	if (is_eth) {
+		if (!(ah->ah_flags & IB_AH_GRH))
+			return -1;
+
+		err = mlx4_ib_resolve_grh(dev, ah, mac, &is_mcast);
+		if (err)
+			return err;
+
+		memcpy(path->dmac_h, mac, 2);
+		memcpy(path->dmac_l, mac + 2, 4);
+		path->ackto = MLX4_IB_LINK_TYPE_ETH;
+		/* use index 0 into MAC table for RDMAoE */
+		path->grh_mylmc &= 0x80;
+	}
+
 	return 0;
 }
 
@@ -972,7 +1004,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_TIMEOUT) {
-		context->pri_path.ackto = attr->timeout << 3;
+		context->pri_path.ackto |= (attr->timeout << 3);
 		optpar |= MLX4_QP_OPTPAR_ACK_TIMEOUT;
 	}
 
@@ -1218,79 +1250,109 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 	int header_size;
 	int spc;
 	int i;
+	void *tmp;
+	struct ib_ud_header *ib = NULL;
+	struct eth_ud_header *eth = NULL;
+	struct ib_unpacked_grh *grh;
+	struct ib_unpacked_bth  *bth;
+	struct ib_unpacked_deth *deth;
 
 	send_size = 0;
 	for (i = 0; i < wr->num_sge; ++i)
 		send_size += wr->sg_list[i].length;
 
-	ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), &sqp->ud_header);
+	if (rdma_port_get_transport(sqp->qp.ibqp.device, sqp->qp.port) == RDMA_TRANSPORT_IB) {
+		ib = &sqp->hdr.ib;
+		grh = &ib->grh;
+		bth = &ib->bth;
+		deth = &ib->deth;
+		ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), ib);
+		ib->lrh.service_level   =
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+		ib->lrh.destination_lid = ah->av.ib.dlid;
+		ib->lrh.source_lid      = cpu_to_be16(ah->av.ib.g_slid & 0x7f);
+	} else {
+		eth = &sqp->hdr.eth;
+		grh = &eth->grh;
+		bth = &eth->bth;
+		deth = &eth->deth;
+		ib_rdmaoe_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), eth);
+	}
 
-	sqp->ud_header.lrh.service_level   =
-		be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28;
-	sqp->ud_header.lrh.destination_lid = ah->av.dlid;
-	sqp->ud_header.lrh.source_lid      = cpu_to_be16(ah->av.g_slid & 0x7f);
 	if (mlx4_ib_ah_grh_present(ah)) {
-		sqp->ud_header.grh.traffic_class =
-			(be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff;
-		sqp->ud_header.grh.flow_label    =
-			ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff);
-		sqp->ud_header.grh.hop_limit     = ah->av.hop_limit;
-		ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24,
-				  ah->av.gid_index, &sqp->ud_header.grh.source_gid);
-		memcpy(sqp->ud_header.grh.destination_gid.raw,
-		       ah->av.dgid, 16);
+		grh->traffic_class =
+			(be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 0xff;
+		grh->flow_label    =
+			ah->av.ib.sl_tclass_flowlabel & cpu_to_be32(0xfffff);
+		grh->hop_limit     = ah->av.ib.hop_limit;
+		ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.ib.port_pd) >> 24,
+				  ah->av.ib.gid_index, &grh->source_gid);
+		memcpy(grh->destination_gid.raw,
+		       ah->av.ib.dgid, 16);
 	}
 
 	mlx->flags &= cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE);
-	mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) |
-				  (sqp->ud_header.lrh.destination_lid ==
-				   IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) |
-				  (sqp->ud_header.lrh.service_level << 8));
-	mlx->rlid   = sqp->ud_header.lrh.destination_lid;
+
+	if (ib) {
+		mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) |
+					  (ib->lrh.destination_lid ==
+					   IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) |
+					  (ib->lrh.service_level << 8));
+		mlx->rlid   = ib->lrh.destination_lid;
+	}
 
 	switch (wr->opcode) {
 	case IB_WR_SEND:
-		sqp->ud_header.bth.opcode	 = IB_OPCODE_UD_SEND_ONLY;
-		sqp->ud_header.immediate_present = 0;
+		bth->opcode	 = IB_OPCODE_UD_SEND_ONLY;
+		if (ib)
+			ib->immediate_present = 0;
+		else
+			eth->immediate_present = 0;
 		break;
 	case IB_WR_SEND_WITH_IMM:
-		sqp->ud_header.bth.opcode	 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
-		sqp->ud_header.immediate_present = 1;
-		sqp->ud_header.immediate_data    = wr->ex.imm_data;
+		bth->opcode	 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
+		if (ib) {
+			ib->immediate_present = 1;
+			ib->immediate_data    = wr->ex.imm_data;
+		} else {
+			eth->immediate_present = 1;
+			eth->immediate_data    = wr->ex.imm_data;
+		}
 		break;
 	default:
 		return -EINVAL;
 	}
 
-	sqp->ud_header.lrh.virtual_lane    = !sqp->qp.ibqp.qp_num ? 15 : 0;
-	if (sqp->ud_header.lrh.destination_lid == IB_LID_PERMISSIVE)
-		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
-	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
+	if (ib) {
+		ib->lrh.virtual_lane    = !sqp->qp.ibqp.qp_num ? 15 : 0;
+		if (ib->lrh.destination_lid == IB_LID_PERMISSIVE)
+			ib->lrh.source_lid = IB_LID_PERMISSIVE;
+	} else {
+		memcpy(eth->eth.dmac_h, ah->av.eth.mac_0_1, 2);
+		memcpy(eth->eth.dmac_h + 2, ah->av.eth.mac_2_5, 2);
+		memcpy(eth->eth.dmac_l, ah->av.eth.mac_2_5 + 2, 2);
+		tmp = to_mdev(sqp->qp.ibqp.device)->rdmaoe.netdevs[sqp->qp.port - 1]->dev_addr;
+		memcpy(eth->eth.smac_h, tmp, 2);
+		memcpy(eth->eth.smac_l, tmp + 2, 4);
+		eth->eth.type = cpu_to_be16(MLX4_RDMAOE_ETHERTYPE);
+	}
+	bth->solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
+
 	if (!sqp->qp.ibqp.qp_num)
 		ib_get_cached_pkey(ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
 		ib_get_cached_pkey(ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
-	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
-	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
-	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
-	sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ?
+	bth->pkey = cpu_to_be16(pkey);
+	bth->destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
+	bth->psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
+	deth->qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ?
 					       sqp->qkey : wr->wr.ud.remote_qkey);
-	sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num);
-
-	header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf);
-
-	if (0) {
-		printk(KERN_ERR "built UD header of size %d:\n", header_size);
-		for (i = 0; i < header_size / 4; ++i) {
-			if (i % 8 == 0)
-				printk("  [%02x] ", i * 4);
-			printk(" %08x",
-			       be32_to_cpu(((__be32 *) sqp->header_buf)[i]));
-			if ((i + 1) % 8 == 0)
-				printk("\n");
-		}
-		printk("\n");
-	}
+	deth->source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num);
+
+	if (ib)
+		header_size = ib_ud_header_pack(ib, sqp->header_buf);
+	else
+		header_size = rdmaoe_ud_header_pack(eth, sqp->header_buf);
 
 	/*
 	 * Inline data segments may not cross a 64 byte boundary.  If
@@ -1414,6 +1476,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg,
 	memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av));
 	dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey);
+	dseg->vlan = to_mah(wr->wr.ud.ah)->av.eth.vlan;
+	memcpy(dseg->mac_0_1, to_mah(wr->wr.ud.ah)->av.eth.mac_0_1, 6);
 }
 
 static void set_mlx_icrc_seg(void *dseg)
diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index cee199c..20526ce 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -96,7 +96,8 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags)
 		[20] = "Address vector port checking support",
 		[21] = "UD multicast support",
 		[24] = "Demand paging support",
-		[25] = "Router support"
+		[25] = "Router support",
+		[30] = "RDMAoE support"
 	};
 	int i;
 
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 3aff8a6..b73b5f0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -66,7 +66,8 @@ enum {
 	MLX4_DEV_CAP_FLAG_ATOMIC	= 1 << 18,
 	MLX4_DEV_CAP_FLAG_RAW_MCAST	= 1 << 19,
 	MLX4_DEV_CAP_FLAG_UD_AV_PORT	= 1 << 20,
-	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21
+	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21,
+	MLX4_DEV_CAP_FLAG_RDMAOE	= 1 << 30
 };
 
 enum {
@@ -371,6 +372,28 @@ struct mlx4_av {
 	u8			dgid[16];
 };
 
+struct mlx4_eth_av {
+	__be32		port_pd;
+	u8		reserved1;
+	u8		smac_idx;
+	u16		reserved2;
+	u8		reserved3;
+	u8		gid_index;
+	u8		stat_rate;
+	u8		hop_limit;
+	__be32		sl_tclass_flowlabel;
+	u8		dgid[16];
+	u32		reserved4[2];
+	__be16		vlan;
+	u8		mac_0_1[2];
+	u8		mac_2_5[4];
+};
+
+union mlx4_ext_av {
+	struct mlx4_av		ib;
+	struct mlx4_eth_av	eth;
+};
+
 struct mlx4_dev {
 	struct pci_dev	       *pdev;
 	unsigned long		flags;
@@ -399,6 +422,12 @@ struct mlx4_init_port_param {
 		if (((type) == MLX4_PORT_TYPE_IB ? (dev)->caps.port_mask : \
 		     ~(dev)->caps.port_mask) & 1 << ((port) - 1))
 
+#define mlx4_foreach_ib_transport_port(port, dev)			\
+	for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++)	\
+		if (((dev)->caps.port_mask & 1 << ((port) - 1)) ||	\
+		    ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE))
+
+
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		   struct mlx4_buf *buf);
 void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index bf8f119..d73534f 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -112,7 +112,9 @@ struct mlx4_qp_path {
 	u8			snooper_flags;
 	u8			reserved3[2];
 	u8			counter_index;
-	u8			reserved4[7];
+	u8			reserved4;
+	u8			dmac_h[2];
+	u8			dmac_l[4];
 };
 
 struct mlx4_qp_context {
@@ -218,7 +220,9 @@ struct mlx4_wqe_datagram_seg {
 	__be32			av[8];
 	__be32			dqpn;
 	__be32			qkey;
-	__be32			reservd[2];
+	__be16			vlan;
+	u8			mac_0_1[2];
+	u8			mac_2_5[4];
 };
 
 struct mlx4_wqe_lso_seg {
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:30:23 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:30:23 +0300
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow
	interfaces to correspond to each other
Message-ID: <20090805083023.GK5599@mtls03>

This patch adds RDMAoE support to mlx4. Since mlx4_ib now needs to reference
mlx4_en netdevices, a new mechanism was added: two new fields in struct
mlx4_interface define a protocol and a get_prot_dev method that retrieves the
corresponding protocol's net device.  An implementation of the new
get_port_transport verb, mlx4_ib_port_get_transport(), was added.
mlx4_ib_query_port() has been modified to support eth link types. An interface
is considered to be active if its corresponding eth interface is active. Code
for setting the GID table of a port has been added. Currently, each IB port has
a single GID entry in its table, and that GID entry equals the link-local IPv6
address.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
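The link-local GID mentioned above is built from the netdev MAC in the same
way as an IPv6 modified EUI-64 interface ID. Below is a minimal userspace
sketch of that derivation, assuming a 6-byte MAC. The function name
mac_to_link_local_gid is made up for illustration and is not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): build a 16-byte link-local GID,
 * fe80::/64 plus a modified EUI-64 interface ID, from a 6-byte MAC.
 */
static void mac_to_link_local_gid(const uint8_t mac[6], uint8_t gid[16])
{
	memset(gid, 0, 16);

	/* Subnet prefix fe80:0000:0000:0000 */
	gid[0] = 0xfe;
	gid[1] = 0x80;

	/* Interface ID: first 3 MAC bytes, ff:fe, last 3 MAC bytes */
	memcpy(gid + 8, mac, 3);
	gid[11] = 0xff;
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);

	/* Flip the universal/local bit of the first interface ID byte */
	gid[8] ^= 2;
}
```

For the MAC 00:02:c9:01:02:03 this yields the GID fe80::202:c9ff:fe01:203.
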
 drivers/infiniband/hw/mlx4/main.c |  309 +++++++++++++++++++++++++++++++++----
 drivers/net/mlx4/en_main.c        |   15 ++-
 drivers/net/mlx4/en_port.c        |    4 +-
 drivers/net/mlx4/en_port.h        |    3 +-
 drivers/net/mlx4/intf.c           |   20 +++
 drivers/net/mlx4/main.c           |    6 +
 drivers/net/mlx4/mlx4.h           |    1 +
 include/linux/mlx4/cmd.h          |    1 +
 include/linux/mlx4/driver.h       |   16 ++-
 9 files changed, 335 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..737c6b9 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -34,9 +34,12 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
 
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
+#include <rdma/ib_addr.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/cmd.h>
@@ -57,6 +60,15 @@ static const char mlx4_ib_version[] =
 	DRV_NAME ": Mellanox ConnectX InfiniBand driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+struct update_gid_work {
+	struct work_struct work;
+	union ib_gid gids[128];
+	int port;
+	struct mlx4_ib_dev *dev;
+};
+
+static struct workqueue_struct *wq;
+
 static void init_query_mad(struct ib_smp *mad)
 {
 	mad->base_version  = 1;
@@ -152,28 +164,19 @@ out:
 	return err;
 }
 
-static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
-			      struct ib_port_attr *props)
+static enum rdma_transport_type
+mlx4_ib_port_get_transport(struct ib_device *device, u8 port_num)
 {
-	struct ib_smp *in_mad  = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-
-	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
-
-	memset(props, 0, sizeof *props);
-
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
-	in_mad->attr_mod = cpu_to_be32(port);
+	struct mlx4_dev *dev = to_mdev(device)->dev;
 
-	err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad);
-	if (err)
-		goto out;
+	return dev->caps.port_mask & (1 << (port_num - 1)) ?
+		RDMA_TRANSPORT_IB : RDMA_TRANSPORT_RDMAOE;
+}
 
+static void ib_link_query_port(struct ib_device *ibdev, u8 port,
+			       struct ib_port_attr *props,
+			       struct ib_smp *out_mad)
+{
 	props->lid		= be16_to_cpup((__be16 *) (out_mad->data + 16));
 	props->lmc		= out_mad->data[34] & 0x7;
 	props->sm_lid		= be16_to_cpup((__be16 *) (out_mad->data + 18));
@@ -193,6 +196,67 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
 	props->subnet_timeout	= out_mad->data[51] & 0x1f;
 	props->max_vl_num	= out_mad->data[37] >> 4;
 	props->init_type_reply	= out_mad->data[41] >> 4;
+	props->transport	= RDMA_TRANSPORT_IB;
+}
+
+static void eth_link_query_port(struct ib_device *ibdev, u8 port,
+				struct ib_port_attr *props,
+				struct ib_smp *out_mad)
+{
+	struct mlx4_ib_rdmaoe *rdmaoe = &to_mdev(ibdev)->rdmaoe;
+	struct net_device *ndev;
+
+	props->port_cap_flags	= IB_PORT_CM_SUP;
+	props->gid_tbl_len	= to_mdev(ibdev)->dev->caps.gid_table_len[port];
+	props->max_msg_sz	= to_mdev(ibdev)->dev->caps.max_msg_sz;
+	props->pkey_tbl_len	= 1;
+	props->bad_pkey_cntr	= be16_to_cpup((__be16 *) (out_mad->data + 46));
+	props->qkey_viol_cntr	= be16_to_cpup((__be16 *) (out_mad->data + 48));
+	props->active_width	= 0;
+	props->active_speed	= 0;
+	props->max_mtu		= out_mad->data[41] & 0xf;
+	props->subnet_timeout	= 0;
+	props->max_vl_num	= out_mad->data[37] >> 4;
+	props->init_type_reply	= 0;
+	props->transport	= RDMA_TRANSPORT_RDMAOE;
+	spin_lock(&rdmaoe->lock);
+	ndev = rdmaoe->netdevs[port - 1];
+	if (!ndev)
+		goto out;
+
+	props->active_mtu	= rdmaoe_get_mtu(ndev->mtu);
+	props->state		= netif_running(ndev) &&  netif_oper_up(ndev) ?
+					IB_PORT_ACTIVE : IB_PORT_DOWN;
+	props->phys_state	= props->state;
+out:
+	spin_unlock(&rdmaoe->lock);
+}
+
+static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
+			      struct ib_port_attr *props)
+{
+	struct ib_smp *in_mad  = NULL;
+	struct ib_smp *out_mad = NULL;
+	int err = -ENOMEM;
+
+	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+	if (!in_mad || !out_mad)
+		goto out;
+
+	memset(props, 0, sizeof *props);
+
+	init_query_mad(in_mad);
+	in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
+	in_mad->attr_mod = cpu_to_be32(port);
+
+	err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad);
+	if (err)
+		goto out;
+
+	mlx4_ib_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB ?
+		ib_link_query_port(ibdev, port, props, out_mad) :
+		eth_link_query_port(ibdev, port, props, out_mad);
 
 out:
 	kfree(in_mad);
@@ -201,8 +265,8 @@ out:
 	return err;
 }
 
-static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
-			     union ib_gid *gid)
+static int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
+			       union ib_gid *gid)
 {
 	struct ib_smp *in_mad  = NULL;
 	struct ib_smp *out_mad = NULL;
@@ -239,6 +303,25 @@ out:
 	return err;
 }
 
+static int rdmaoe_query_gid(struct ib_device *ibdev, u8 port, int index,
+			    union ib_gid *gid)
+{
+	struct mlx4_ib_dev *dev = to_mdev(ibdev);
+
+	*gid = dev->rdmaoe.gid_table[port - 1][index];
+
+	return 0;
+}
+
+static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
+			     union ib_gid *gid)
+{
+	if (rdma_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB)
+		return __mlx4_ib_query_gid(ibdev, port, index, gid);
+	else
+		return rdmaoe_query_gid(ibdev, port, index, gid);
+}
+
 static int mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
 			      u16 *pkey)
 {
@@ -287,6 +370,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols,
 {
 	struct mlx4_cmd_mailbox *mailbox;
 	int err;
+	u8 is_eth = dev->dev->caps.port_type[port] == MLX4_PORT_TYPE_ETH;
 
 	mailbox = mlx4_alloc_cmd_mailbox(dev->dev);
 	if (IS_ERR(mailbox))
@@ -302,7 +386,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols,
 		((__be32 *) mailbox->buf)[1] = cpu_to_be32(cap_mask);
 	}
 
-	err = mlx4_cmd(dev->dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
+	err = mlx4_cmd(dev->dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT,
 		       MLX4_CMD_TIME_CLASS_B);
 
 	mlx4_free_cmd_mailbox(dev->dev, mailbox);
@@ -538,19 +622,146 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
+static void mlx4_addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+{
+	memcpy(eui, dev->dev_addr, 3);
+	memcpy(eui + 5, dev->dev_addr + 3, 3);
+	eui[3] = 0xFF;
+	eui[4] = 0xFE;
+	eui[0] ^= 2;
+}
+
+static void update_gids_task(struct work_struct *work)
+{
+	struct update_gid_work *gw = container_of(work, struct update_gid_work, work);
+	struct mlx4_cmd_mailbox *mailbox;
+	union ib_gid *gids;
+	int err;
+	struct mlx4_dev	*dev = gw->dev->dev;
+	struct ib_event event;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox)) {
+		printk(KERN_WARNING "update gid table failed %ld\n", PTR_ERR(mailbox));
+		return;
+	}
+
+	gids = mailbox->buf;
+	memcpy(gids, gw->gids, sizeof gw->gids);
+
+	err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
+		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B);
+	if (err)
+		printk(KERN_WARNING "set port command failed\n");
+	else {
+		memcpy(gw->dev->rdmaoe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids);
+		event.device = &gw->dev->ib_dev;
+		event.element.port_num = gw->port;
+		event.event    = IB_EVENT_LID_CHANGE;
+		ib_dispatch_event(&event);
+	}
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	kfree(gw);
+}
+
+static int update_ipv6_gids(struct mlx4_ib_dev *dev, int port, int clear)
+{
+	struct net_device *ndev = dev->rdmaoe.netdevs[port - 1];
+	struct update_gid_work *work;
+
+	work = kzalloc(sizeof *work, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	if (!clear) {
+		mlx4_addrconf_ifid_eui48(&work->gids[0].raw[8], ndev);
+		work->gids[0].global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	}
+
+	INIT_WORK(&work->work, update_gids_task);
+	work->port = port;
+	work->dev = dev;
+	queue_work(wq, &work->work);
+
+	return 0;
+}
+
+static void handle_en_event(struct mlx4_ib_dev *dev, int port, unsigned long event)
+{
+	switch (event) {
+	case NETDEV_UP:
+		update_ipv6_gids(dev, port, 0);
+		break;
+
+	case NETDEV_DOWN:
+		update_ipv6_gids(dev, port, 1);
+	}
+}
+
+static void netdev_added(struct mlx4_ib_dev *dev, int port)
+{
+	update_ipv6_gids(dev, port, 0);
+}
+
+static void netdev_removed(struct mlx4_ib_dev *dev, int port)
+{
+	update_ipv6_gids(dev, port, 1);
+}
+
+static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event,
+				void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mlx4_ib_dev *ibdev;
+	struct net_device *oldnd;
+	struct mlx4_ib_rdmaoe *rdmaoe;
+	int port;
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return NOTIFY_DONE;
+
+	ibdev = container_of(this, struct mlx4_ib_dev, rdmaoe.nb);
+	rdmaoe = &ibdev->rdmaoe;
+
+	spin_lock(&rdmaoe->lock);
+	mlx4_foreach_ib_transport_port(port, ibdev->dev) {
+		oldnd = rdmaoe->netdevs[port - 1];
+		rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(ibdev->dev, MLX4_PROT_EN, port);
+		if (oldnd != rdmaoe->netdevs[port - 1]) {
+			if (rdmaoe->netdevs[port - 1])
+				netdev_added(ibdev, port);
+			else
+				netdev_removed(ibdev, port);
+		}
+	}
+
+	if (dev == rdmaoe->netdevs[0])
+		handle_en_event(ibdev, 1, event);
+	else if (dev == rdmaoe->netdevs[1])
+		handle_en_event(ibdev, 2, event);
+
+	spin_unlock(&rdmaoe->lock);
+
+	return NOTIFY_DONE;
+}
+
 static void *mlx4_ib_add(struct mlx4_dev *dev)
 {
 	static int mlx4_ib_version_printed;
 	struct mlx4_ib_dev *ibdev;
 	int num_ports = 0;
 	int i;
+	int err;
+	int port;
+	struct mlx4_ib_rdmaoe *rdmaoe;
 
 	if (!mlx4_ib_version_printed) {
 		printk(KERN_INFO "%s", mlx4_ib_version);
 		++mlx4_ib_version_printed;
 	}
 
-	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
+	mlx4_foreach_ib_transport_port(i, dev)
 		num_ports++;
 
 	/* No point in registering a device with no ports... */
@@ -563,6 +774,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		return NULL;
 	}
 
+	rdmaoe = &ibdev->rdmaoe;
+
 	if (mlx4_pd_alloc(dev, &ibdev->priv_pdn))
 		goto err_dealloc;
 
@@ -607,10 +820,12 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		(1ull << IB_USER_VERBS_CMD_CREATE_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_QUERY_SRQ)		|
-		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
+		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)		|
+		(1ull << IB_USER_VERBS_CMD_GET_MAC);
 
 	ibdev->ib_dev.query_device	= mlx4_ib_query_device;
 	ibdev->ib_dev.query_port	= mlx4_ib_query_port;
+	ibdev->ib_dev.get_port_transport = mlx4_ib_port_get_transport;
 	ibdev->ib_dev.query_gid		= mlx4_ib_query_gid;
 	ibdev->ib_dev.query_pkey	= mlx4_ib_query_pkey;
 	ibdev->ib_dev.modify_device	= mlx4_ib_modify_device;
@@ -654,15 +869,26 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	ibdev->ib_dev.map_phys_fmr	= mlx4_ib_map_phys_fmr;
 	ibdev->ib_dev.unmap_fmr		= mlx4_ib_unmap_fmr;
 	ibdev->ib_dev.dealloc_fmr	= mlx4_ib_fmr_dealloc;
+	ibdev->ib_dev.get_mac		= mlx4_ib_get_mac;
+
+	mlx4_foreach_ib_transport_port(port, dev)
+		rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(dev, MLX4_PROT_EN, port);
+	spin_lock_init(&rdmaoe->lock);
+	if (dev->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE && !rdmaoe->nb.notifier_call) {
+		rdmaoe->nb.notifier_call = mlx4_ib_netdev_event;
+		err = register_netdevice_notifier(&rdmaoe->nb);
+		if (err)
+			goto err_map;
+	}
 
 	if (init_node_data(ibdev))
-		goto err_map;
+		goto err_notif;
 
 	spin_lock_init(&ibdev->sm_lock);
 	mutex_init(&ibdev->cap_mask_mutex);
 
 	if (ib_register_device(&ibdev->ib_dev))
-		goto err_map;
+		goto err_notif;
 
 	if (mlx4_ib_mad_init(ibdev))
 		goto err_reg;
@@ -678,6 +904,10 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 err_reg:
 	ib_unregister_device(&ibdev->ib_dev);
 
+err_notif:
+	flush_workqueue(wq);
+	unregister_netdevice_notifier(&rdmaoe->nb);
+
 err_map:
 	iounmap(ibdev->uar_map);
 
@@ -700,11 +930,16 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 
 	mlx4_ib_mad_cleanup(ibdev);
 	ib_unregister_device(&ibdev->ib_dev);
+	if (ibdev->rdmaoe.nb.notifier_call) {
+		flush_workqueue(wq);
+		unregister_netdevice_notifier(&ibdev->rdmaoe.nb);
+		ibdev->rdmaoe.nb.notifier_call = NULL;
+	}
+	iounmap(ibdev->uar_map);
 
-	for (p = 1; p <= ibdev->num_ports; ++p)
+	mlx4_foreach_port(p, dev, MLX4_PORT_TYPE_IB)
 		mlx4_CLOSE_PORT(dev, p);
 
-	iounmap(ibdev->uar_map);
 	mlx4_uar_free(dev, &ibdev->priv_uar);
 	mlx4_pd_free(dev, ibdev->priv_pdn);
 	ib_dealloc_device(&ibdev->ib_dev);
@@ -745,17 +980,31 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 static struct mlx4_interface mlx4_ib_interface = {
 	.add	= mlx4_ib_add,
 	.remove	= mlx4_ib_remove,
-	.event	= mlx4_ib_event
+	.event	= mlx4_ib_event,
+	.protocol	= MLX4_PROT_IB
 };
 
 static int __init mlx4_ib_init(void)
 {
-	return mlx4_register_interface(&mlx4_ib_interface);
+	int err;
+
+	wq = create_singlethread_workqueue("mlx4_ib");
+	if (!wq)
+		return -ENOMEM;
+
+	err = mlx4_register_interface(&mlx4_ib_interface);
+	if (err) {
+		destroy_workqueue(wq);
+		return err;
+	}
+
+	return 0;
 }
 
 static void __exit mlx4_ib_cleanup(void)
 {
 	mlx4_unregister_interface(&mlx4_ib_interface);
+	destroy_workqueue(wq);
 }
 
 module_init(mlx4_ib_init);
diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c
index 510633f..6f30eca 100644
--- a/drivers/net/mlx4/en_main.c
+++ b/drivers/net/mlx4/en_main.c
@@ -51,6 +51,13 @@ static const char mlx4_en_version[] =
 	DRV_NAME ": Mellanox ConnectX HCA Ethernet driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+static void *get_netdev(struct mlx4_dev *dev, void *ctx, u8 port)
+{
+	struct mlx4_en_dev *endev = ctx;
+
+	return endev->pndev[port];
+}
+
 static void mlx4_en_event(struct mlx4_dev *dev, void *endev_ptr,
 			  enum mlx4_dev_event event, int port)
 {
@@ -229,9 +236,11 @@ err_free_res:
 }
 
 static struct mlx4_interface mlx4_en_interface = {
-	.add	= mlx4_en_add,
-	.remove	= mlx4_en_remove,
-	.event	= mlx4_en_event,
+	.add		= mlx4_en_add,
+	.remove		= mlx4_en_remove,
+	.event		= mlx4_en_event,
+	.get_prot_dev	= get_netdev,
+	.protocol	= MLX4_PROT_EN,
 };
 
 static int __init mlx4_en_init(void)
diff --git a/drivers/net/mlx4/en_port.c b/drivers/net/mlx4/en_port.c
index a29abe8..a249887 100644
--- a/drivers/net/mlx4/en_port.c
+++ b/drivers/net/mlx4/en_port.c
@@ -127,8 +127,8 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn,
 	memset(context, 0, sizeof *context);
 
 	context->base_qpn = cpu_to_be32(base_qpn);
-	context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_SHIFT | base_qpn);
-	context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_SHIFT | base_qpn);
+	context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_EN_SHIFT | base_qpn);
+	context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_MODE_SHIFT | base_qpn);
 	context->intra_no_vlan = 0;
 	context->no_vlan = MLX4_NO_VLAN_IDX;
 	context->intra_vlan_miss = 0;
diff --git a/drivers/net/mlx4/en_port.h b/drivers/net/mlx4/en_port.h
index e6477f1..9354891 100644
--- a/drivers/net/mlx4/en_port.h
+++ b/drivers/net/mlx4/en_port.h
@@ -36,7 +36,8 @@
 
 
 #define SET_PORT_GEN_ALL_VALID	0x7
-#define SET_PORT_PROMISC_SHIFT	31
+#define SET_PORT_PROMISC_EN_SHIFT	31
+#define SET_PORT_PROMISC_MODE_SHIFT	30
 
 enum {
 	MLX4_CMD_SET_VLAN_FLTR  = 0x47,
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 0e7eb10..d64530e 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -159,3 +159,23 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 
 	mutex_unlock(&intf_mutex);
 }
+
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	struct mlx4_device_context *dev_ctx;
+	unsigned long flags;
+	void *result = NULL;
+
+	spin_lock_irqsave(&priv->ctx_lock, flags);
+
+	list_for_each_entry(dev_ctx, &priv->ctx_list, list)
+		if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) {
+			result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port);
+			break;
+		}
+
+	spin_unlock_irqrestore(&priv->ctx_lock, flags);
+
+	return result;
+}
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 30bea96..c72af51 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -100,6 +100,12 @@ module_param_named(use_prio, use_prio, bool, 0444);
 MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports "
 		  "(0/1, default 0)");
 
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	return mlx4_find_get_prot_dev(dev, proto, port);
+}
+EXPORT_SYMBOL(mlx4_get_prot_dev);
+
 int mlx4_check_port_params(struct mlx4_dev *dev,
 			   enum mlx4_port_type *port_type)
 {
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 5bd79c2..db068c9 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -364,6 +364,7 @@ int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port);
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 struct mlx4_dev_cap;
 struct mlx4_init_hca_param;
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 0f82293..22bd8d3 100644
--- a/include/linux/mlx4/cmd.h
+++ b/include/linux/mlx4/cmd.h
@@ -140,6 +140,7 @@ enum {
 	MLX4_SET_PORT_MAC_TABLE = 0x2,
 	MLX4_SET_PORT_VLAN_TABLE = 0x3,
 	MLX4_SET_PORT_PRIO_MAP  = 0x4,
+	MLX4_SET_PORT_GID_TABLE = 0x5,
 };
 
 struct mlx4_dev;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 53c5fdb..0083256 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -44,15 +44,23 @@ enum mlx4_dev_event {
 	MLX4_DEV_EVENT_PORT_REINIT,
 };
 
+enum mlx4_prot {
+	MLX4_PROT_IB,
+	MLX4_PROT_EN,
+};
+
 struct mlx4_interface {
-	void *			(*add)	 (struct mlx4_dev *dev);
-	void			(*remove)(struct mlx4_dev *dev, void *context);
-	void			(*event) (struct mlx4_dev *dev, void *context,
-					  enum mlx4_dev_event event, int port);
+	void *	(*add)	 (struct mlx4_dev *dev);
+	void   	(*remove)(struct mlx4_dev *dev, void *context);
+	void   	(*event) (struct mlx4_dev *dev, void *context,
+			  enum mlx4_dev_event event, int port);
+	void *	(*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port);
+	enum mlx4_prot		protocol;
 	struct list_head	list;
 };
 
 int mlx4_register_interface(struct mlx4_interface *intf);
 void mlx4_unregister_interface(struct mlx4_interface *intf);
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 #endif /* MLX4_DRIVER_H */
-- 
1.6.3.3


From eli at mellanox.co.il  Wed Aug  5 01:34:22 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:34:22 +0300
Subject: [ofa-general] [PATCHv4] libibverbs: Add RDMAoE support
Message-ID: <20090805083422.GA6659@mtls03>

Extend the ibv_query_port() verb to return a port transport protocol which can
be one of RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP or RDMA_TRANSPORT_RDMAOE.
This can be used by applications to know if they must use GRH as is the case in
RDMAoE.  Add a new system call to get the MAC address of the remote port that a
UD address vector refers to.  Update ibv_rc_pingpong and ibv_ud_pingpong to
accept a remote GID so that they can be used with an RDMAoE port.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
Changed the reference to a port from link type to protocol type. This
patch is tagged v4 to create correspondence with the kernel patches.
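
Both pingpong examples take the remote GID as an IPv6-style string and turn on
the GRH only when its interface ID is non-zero. Below is a small sketch of that
parsing step, with the inet_pton() result checked. struct gid_bytes and both
function names are stand-ins invented here, not libibverbs API.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the raw form of union ibv_gid (assumption for this sketch) */
struct gid_bytes {
	uint8_t raw[16];
};

/* Parse a textual GID; returns 0 on success, -1 on a malformed string. */
static int parse_remote_gid(const char *text, struct gid_bytes *gid)
{
	struct in6_addr addr;

	if (!text || inet_pton(AF_INET6, text, &addr) != 1)
		return -1;

	memcpy(gid->raw, addr.s6_addr, sizeof gid->raw);
	return 0;
}

/* Same test the examples apply: a non-zero interface ID (low 8 bytes). */
static int gid_needs_grh(const struct gid_bytes *gid)
{
	static const uint8_t zero[8];

	return memcmp(gid->raw + 8, zero, sizeof zero) != 0;
}
```

A caller would reject the -g argument when parse_remote_gid() fails, and set
is_global in the address handle attributes only when gid_needs_grh() is true.
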


 examples/devinfo.c            |   15 ++++++++++++
 examples/pingpong.c           |    9 +++++++
 examples/pingpong.h           |    2 +
 examples/rc_pingpong.c        |   50 ++++++++++++++++++++++++++++++++--------
 examples/ud_pingpong.c        |   38 +++++++++++++++++++++++++++----
 include/infiniband/driver.h   |    1 +
 include/infiniband/kern-abi.h |   25 ++++++++++++++++++--
 include/infiniband/verbs.h    |   12 +++++++++
 src/cmd.c                     |   20 ++++++++++++++++
 src/libibverbs.map            |    1 +
 10 files changed, 155 insertions(+), 18 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index caa5d5f..a42a6dc 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -175,6 +175,20 @@ static int print_all_port_gids(struct ibv_context *ctx, uint8_t port_num, int tb
 	return rc;
 }
 
+static const char *transport_type_str(enum rdma_transport_type type)
+{
+	switch (type) {
+	case RDMA_TRANSPORT_IB:
+		return "IB";
+	case RDMA_TRANSPORT_IWARP:
+		return "IWARP";
+	case RDMA_TRANSPORT_RDMAOE:
+		return "RDMAOE";
+	default:
+		return "Unknown";
+	}
+}
+
 static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 {
 	struct ibv_context *ctx;
@@ -273,6 +287,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 		printf("\t\t\tsm_lid:\t\t\t%d\n", port_attr.sm_lid);
 		printf("\t\t\tport_lid:\t\t%d\n", port_attr.lid);
 		printf("\t\t\tport_lmc:\t\t0x%02x\n", port_attr.lmc);
+		printf("\t\t\ttransport_type:\t\t%s\n", transport_type_str(port_attr.transport));
 
 		if (verbose) {
 			printf("\t\t\tmax_msg_sz:\t\t0x%x\n", port_attr.max_msg_sz);
diff --git a/examples/pingpong.c b/examples/pingpong.c
index b916f59..d4a46e4 100644
--- a/examples/pingpong.c
+++ b/examples/pingpong.c
@@ -31,6 +31,8 @@
  */
 
 #include "pingpong.h"
+#include <arpa/inet.h>
+#include <stdlib.h>
 
 enum ibv_mtu pp_mtu_to_enum(int mtu)
 {
@@ -53,3 +55,10 @@ uint16_t pp_get_local_lid(struct ibv_context *context, int port)
 
 	return attr.lid;
 }
+
+int pp_get_port_info(struct ibv_context *context, int port,
+		     struct ibv_port_attr *attr)
+{
+	return ibv_query_port(context, port, attr);
+}
+
diff --git a/examples/pingpong.h b/examples/pingpong.h
index 71d7c3f..16d3466 100644
--- a/examples/pingpong.h
+++ b/examples/pingpong.h
@@ -37,5 +37,7 @@
 
 enum ibv_mtu pp_mtu_to_enum(int mtu);
 uint16_t pp_get_local_lid(struct ibv_context *context, int port);
+int pp_get_port_info(struct ibv_context *context, int port,
+		     struct ibv_port_attr *attr);
 
 #endif /* IBV_PINGPONG_H */
diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index 26fa45c..4250cdf 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -67,6 +67,8 @@ struct pingpong_context {
 	int			 size;
 	int			 rx_depth;
 	int			 pending;
+	struct ibv_port_attr     portinfo;
+	union ibv_gid		 dgid;
 };
 
 struct pingpong_dest {
@@ -94,6 +96,12 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
 			.port_num	= port
 		}
 	};
+
+	if (ctx->dgid.global.interface_id) {
+		attr.ah_attr.is_global = 1;
+		attr.ah_attr.grh.hop_limit = 1;
+		attr.ah_attr.grh.dgid = ctx->dgid;
+	}
 	if (ibv_modify_qp(ctx->qp, &attr,
 			  IBV_QP_STATE              |
 			  IBV_QP_AV                 |
@@ -289,11 +297,11 @@ out:
 
 static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 					    int rx_depth, int port,
-					    int use_event)
+					    int use_event, int is_server)
 {
 	struct pingpong_context *ctx;
 
-	ctx = malloc(sizeof *ctx);
+	ctx = calloc(1, sizeof *ctx);
 	if (!ctx)
 		return NULL;
 
@@ -306,7 +314,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 		return NULL;
 	}
 
-	memset(ctx->buf, 0, size);
+	memset(ctx->buf, 0x7b + is_server, size);
 
 	ctx->context = ibv_open_device(ib_dev);
 	if (!ctx->context) {
@@ -481,6 +489,7 @@ static void usage(const char *argv0)
 	printf("  -n, --iters=<iters>    number of exchanges (default 1000)\n");
 	printf("  -l, --sl=<sl>          service level value\n");
 	printf("  -e, --events           sleep on CQ events (default poll)\n");
+	printf("  -g, --gid=<remote gid> gid of the other port\n");
 }
 
 int main(int argc, char *argv[])
@@ -504,6 +513,7 @@ int main(int argc, char *argv[])
 	int                      rcnt, scnt;
 	int                      num_cq_events = 0;
 	int                      sl = 0;
+	char			*grh = NULL;
 
 	srand48(getpid() * time(NULL));
 
@@ -520,10 +530,11 @@ int main(int argc, char *argv[])
 			{ .name = "iters",    .has_arg = 1, .val = 'n' },
 			{ .name = "sl",       .has_arg = 1, .val = 'l' },
 			{ .name = "events",   .has_arg = 0, .val = 'e' },
+			{ .name = "gid",      .has_arg = 1, .val = 'g' },
 			{ 0 }
 		};
 
-		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:e", long_options, NULL);
+		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:", long_options, NULL);
 		if (c == -1)
 			break;
 
@@ -575,6 +586,10 @@ int main(int argc, char *argv[])
 			++use_event;
 			break;
 
+		case 'g':
+			grh = strdupa(optarg);
+			break;
+
 		default:
 			usage(argv[0]);
 			return 1;
@@ -614,7 +629,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	ctx = pp_init_ctx(ib_dev, size, rx_depth, ib_port, use_event);
+	ctx = pp_init_ctx(ib_dev, size, rx_depth, ib_port, use_event, !servername);
 	if (!ctx)
 		return 1;
 
@@ -630,17 +645,31 @@ int main(int argc, char *argv[])
 			return 1;
 		}
 
-	my_dest.lid = pp_get_local_lid(ctx->context, ib_port);
-	my_dest.qpn = ctx->qp->qp_num;
-	my_dest.psn = lrand48() & 0xffffff;
-	if (!my_dest.lid) {
-		fprintf(stderr, "Couldn't get local LID\n");
+
+	if (pp_get_port_info(ctx->context, ib_port, &ctx->portinfo)) {
+		fprintf(stderr, "Couldn't get port info\n");
 		return 1;
 	}
 
+	my_dest.lid = ctx->portinfo.lid;
+	if (ctx->portinfo.transport == RDMA_TRANSPORT_RDMAOE) {
+		if (!grh) {
+			fprintf(stderr, "Must specify remote GID\n");
+			return 1;
+		}
+		inet_pton(AF_INET6, grh, &ctx->dgid);
+	} else {
+		if (!my_dest.lid) {
+			fprintf(stderr, "Couldn't get local LID\n");
+			return 1;
+		}
+	}
+	my_dest.qpn = ctx->qp->qp_num;
+	my_dest.psn = lrand48() & 0xffffff;
 	printf("  local address:  LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n",
 	       my_dest.lid, my_dest.qpn, my_dest.psn);
 
+
 	if (servername)
 		rem_dest = pp_client_exch_dest(servername, port, &my_dest);
 	else
@@ -705,6 +734,7 @@ int main(int argc, char *argv[])
 					fprintf(stderr, "poll CQ failed %d\n", ne);
 					return 1;
 				}
+
 			} while (!use_event && ne < 1);
 
 			for (i = 0; i < ne; ++i) {
diff --git a/examples/ud_pingpong.c b/examples/ud_pingpong.c
index 8f3d50b..b3aa55d 100644
--- a/examples/ud_pingpong.c
+++ b/examples/ud_pingpong.c
@@ -68,6 +68,8 @@ struct pingpong_context {
 	int			 size;
 	int			 rx_depth;
 	int			 pending;
+	struct ibv_port_attr     portinfo;
+	union ibv_gid            dgid;
 };
 
 struct pingpong_dest {
@@ -105,6 +107,12 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
 		return 1;
 	}
 
+	if (ctx->dgid.global.interface_id) {
+		ah_attr.is_global = 1;
+		ah_attr.grh.hop_limit = 1;
+		ah_attr.grh.dgid = ctx->dgid;
+	}
+
 	ctx->ah = ibv_create_ah(ctx->pd, &ah_attr);
 	if (!ctx->ah) {
 		fprintf(stderr, "Failed to create AH\n");
@@ -478,6 +486,7 @@ static void usage(const char *argv0)
 	printf("  -r, --rx-depth=<dep>   number of receives to post at a time (default 500)\n");
 	printf("  -n, --iters=<iters>    number of exchanges (default 1000)\n");
 	printf("  -e, --events           sleep on CQ events (default poll)\n");
+	printf("  -g, --gid              specify remote gid\n");
 }
 
 int main(int argc, char *argv[])
@@ -500,6 +509,7 @@ int main(int argc, char *argv[])
 	int                      rcnt, scnt;
 	int                      num_cq_events = 0;
 	int                      sl = 0;
+	char 			*gid = NULL;
 
 	srand48(getpid() * time(NULL));
 
@@ -515,10 +525,11 @@ int main(int argc, char *argv[])
 			{ .name = "iters",    .has_arg = 1, .val = 'n' },
 			{ .name = "sl",       .has_arg = 1, .val = 'l' },
 			{ .name = "events",   .has_arg = 0, .val = 'e' },
+			{ .name = "gid",      .has_arg = 1, .val = 'g' },
 			{ 0 }
 		};
 
-		c = getopt_long(argc, argv, "p:d:i:s:r:n:l:e", long_options, NULL);
+		c = getopt_long(argc, argv, "p:d:i:s:r:n:l:eg:", long_options, NULL);
 		if (c == -1)
 			break;
 
@@ -563,6 +574,10 @@ int main(int argc, char *argv[])
 			++use_event;
 			break;
 
+		case 'g':
+			gid = strdupa(optarg);
+			break;
+
 		default:
 			usage(argv[0]);
 			return 1;
@@ -618,12 +633,25 @@ int main(int argc, char *argv[])
 			return 1;
 		}
 
-	my_dest.lid = pp_get_local_lid(ctx->context, ib_port);
+	if (pp_get_port_info(ctx->context, ib_port, &ctx->portinfo)) {
+		fprintf(stderr, "Couldn't get port info\n");
+		return 1;
+	}
+	my_dest.lid = ctx->portinfo.lid;
+
 	my_dest.qpn = ctx->qp->qp_num;
 	my_dest.psn = lrand48() & 0xffffff;
-	if (!my_dest.lid) {
-		fprintf(stderr, "Couldn't get local LID\n");
-		return 1;
+	if (ctx->portinfo.transport == RDMA_TRANSPORT_IB) {
+		if (!my_dest.lid) {
+			fprintf(stderr, "Couldn't get local LID\n");
+			return 1;
+		}
+	} else {
+		if (!gid) {
+			fprintf(stderr, "must specify remote GID\n");
+			return 1;
+		}
+		inet_pton(AF_INET6, gid, &ctx->dgid);
 	}
 
 	printf("  local address:  LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n",
diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 67a3bf8..cbd261f 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -131,6 +131,7 @@ int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah,
 int ibv_cmd_destroy_ah(struct ibv_ah *ah);
 int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
 int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
+int ibv_cmd_get_mac(struct ibv_pd *pd, uint8_t port, uint8_t *gid, uint8_t *mac);
 
 int ibv_dontfork_range(void *base, size_t size);
 int ibv_dofork_range(void *base, size_t size);
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 0db083a..7823da8 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -46,7 +46,7 @@
  * The minimum and maximum kernel ABI that we can handle.
  */
 #define IB_USER_VERBS_MIN_ABI_VERSION	1
-#define IB_USER_VERBS_MAX_ABI_VERSION	6
+#define IB_USER_VERBS_MAX_ABI_VERSION	7
 
 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -85,7 +85,8 @@ enum {
 	IB_USER_VERBS_CMD_MODIFY_SRQ,
 	IB_USER_VERBS_CMD_QUERY_SRQ,
 	IB_USER_VERBS_CMD_DESTROY_SRQ,
-	IB_USER_VERBS_CMD_POST_SRQ_RECV
+	IB_USER_VERBS_CMD_POST_SRQ_RECV,
+	IB_USER_VERBS_CMD_GET_MAC,
 };
 
 /*
@@ -223,7 +224,8 @@ struct ibv_query_port_resp {
 	__u8  active_width;
 	__u8  active_speed;
 	__u8  phys_state;
-	__u8  reserved[3];
+	__u8  transport;
+	__u8  reserved[2];
 };
 
 struct ibv_alloc_pd {
@@ -798,6 +800,7 @@ enum {
 	IB_USER_VERBS_CMD_QUERY_SRQ_V2,
 	IB_USER_VERBS_CMD_DESTROY_SRQ_V2,
 	IB_USER_VERBS_CMD_POST_SRQ_RECV_V2,
+	IB_USER_VERBS_CMD_GET_MAC_V2 = -1,
 	/*
 	 * Set commands that didn't exist to -1 so our compile-time
 	 * trick opcodes in IBV_INIT_CMD() doesn't break.
@@ -878,4 +881,20 @@ struct ibv_create_srq_resp_v5 {
 	__u32 srq_handle;
 };
 
+struct ibv_get_mac {
+	__u32 command;
+	__u16 in_words;
+	__u16 out_words;
+	__u64 response;
+	__u32 pd_handle;
+	__u8  port;
+	__u8  reserved[3];
+	__u8  dgid[16];
+};
+
+struct ibv_get_mac_resp {
+	__u8	mac[6];
+	__u16	reserved;
+};
+
 #endif /* KERN_ABI_H */
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index a04cc62..f81f17f 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -61,6 +61,7 @@ union ibv_gid {
 		uint64_t	subnet_prefix;
 		uint64_t	interface_id;
 	} global;
+	uint32_t		dwords[4];
 };
 
 enum ibv_node_type {
@@ -161,6 +162,16 @@ enum ibv_port_state {
 	IBV_PORT_ACTIVE_DEFER	= 5
 };
 
+enum rdma_transport_type {
+	RDMA_TRANSPORT_IB,
+	RDMA_TRANSPORT_IWARP,
+	RDMA_TRANSPORT_RDMAOE
+};
+enum ibv_port_link_type {
+	PORT_LINK_IB,
+	PORT_LINK_ETH
+};
+
 struct ibv_port_attr {
 	enum ibv_port_state	state;
 	enum ibv_mtu		max_mtu;
@@ -181,6 +192,7 @@ struct ibv_port_attr {
 	uint8_t			active_width;
 	uint8_t			active_speed;
 	uint8_t			phys_state;
+	enum rdma_transport_type transport;
 };
 
 enum ibv_event_type {
diff --git a/src/cmd.c b/src/cmd.c
index 66d7134..30754ac 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -162,6 +162,7 @@ int ibv_cmd_query_device(struct ibv_context *context,
 	return 0;
 }
 
+#include <stdio.h>
 int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num,
 		       struct ibv_port_attr *port_attr,
 		       struct ibv_query_port *cmd, size_t cmd_size)
@@ -196,6 +197,7 @@ int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num,
 	port_attr->active_width    = resp.active_width;
 	port_attr->active_speed    = resp.active_speed;
 	port_attr->phys_state      = resp.phys_state;
+	port_attr->transport       = resp.transport;
 
 	return 0;
 }
@@ -1122,3 +1124,21 @@ int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid)
 
 	return 0;
 }
+
+int ibv_cmd_get_mac(struct ibv_pd *pd, uint8_t port, uint8_t *gid, uint8_t *mac)
+{
+	struct ibv_get_mac cmd;
+	struct ibv_get_mac_resp resp;
+
+	IBV_INIT_CMD_RESP(&cmd, sizeof cmd, GET_MAC, &resp, sizeof resp);
+	memcpy(cmd.dgid, gid, sizeof cmd.dgid);
+	cmd.pd_handle = pd->handle;
+	cmd.port = port;
+
+	if (write(pd->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd)
+		return errno;
+
+	memcpy(mac, resp.mac, 6);
+
+	return 0;
+}
diff --git a/src/libibverbs.map b/src/libibverbs.map
index 1827da0..1688e73 100644
--- a/src/libibverbs.map
+++ b/src/libibverbs.map
@@ -64,6 +64,7 @@ IBVERBS_1.0 {
 		ibv_cmd_destroy_ah;
 		ibv_cmd_attach_mcast;
 		ibv_cmd_detach_mcast;
+		ibv_cmd_get_mac;
 		ibv_copy_qp_attr_from_kern;
 		ibv_copy_path_rec_from_kern;
 		ibv_copy_path_rec_to_kern;
-- 
1.6.3.3
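
A note on the pingpong hunk above: the result of inet_pton() is not
checked there, so a mistyped --gid value would pass silently. A minimal
sketch of a checked conversion follows. The helper name parse_gid() and
the raw 16-byte output buffer are assumptions made for illustration; they
are not part of the posted patch.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: parse a textual GID
 * (IPv6 notation) into 16 raw bytes.  Returns 0 on success and -1 if
 * the string is missing or malformed.
 */
static int parse_gid(const char *str, uint8_t raw[16])
{
	struct in6_addr addr;

	if (!str)
		return -1;

	/*
	 * inet_pton() returns 1 on success, 0 if str is not a valid
	 * address, and -1 (with errno set) if the family is unsupported.
	 */
	if (inet_pton(AF_INET6, str, &addr) != 1)
		return -1;

	memcpy(raw, addr.s6_addr, 16);
	return 0;
}
```

A caller could then print an error and return 1 when parse_gid() fails,
in the same way the hunk already does for a missing GID.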


From eli at mellanox.co.il  Wed Aug  5 01:36:48 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 11:36:48 +0300
Subject: [ofa-general] [PATCHv4] libmlx4: Add RDMAoE support
Message-ID: <20090805083648.GA6696@mtls03>

Modify mlx4_create_ah() to check the port's transport protocol and, for
RDMAoE ports, make a system call to retrieve the remote port's MAC
address. Modify the address vector data structures and code to
accommodate RDMAoE.
---
Changed the reference to a port from link type to protocol type. This
patch is tagged v4 to create correspondence with the kernel patches.

 src/mlx4.h  |    3 +++
 src/qp.c    |    2 ++
 src/verbs.c |   29 +++++++++++++++++++++++++++++
 src/wqe.h   |    3 ++-
 4 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/src/mlx4.h b/src/mlx4.h
index 827a201..20d3fdd 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -236,11 +236,14 @@ struct mlx4_av {
 	uint8_t				hop_limit;
 	uint32_t			sl_tclass_flowlabel;
 	uint8_t				dgid[16];
+	uint8_t				mac[8];
 };
 
 struct mlx4_ah {
 	struct ibv_ah			ibv_ah;
 	struct mlx4_av			av;
+	uint16_t			vlan;
+	uint8_t				mac[6];
 };
 
 static inline unsigned long align(unsigned long val, unsigned long align)
diff --git a/src/qp.c b/src/qp.c
index d194ae3..cd8fab0 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -143,6 +143,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg,
 	memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av));
 	dseg->dqpn = htonl(wr->wr.ud.remote_qpn);
 	dseg->qkey = htonl(wr->wr.ud.remote_qkey);
+	dseg->vlan = htons(to_mah(wr->wr.ud.ah)->vlan);
+	memcpy(dseg->mac, to_mah(wr->wr.ud.ah)->mac, 6);
 }
 
 static void __set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg)
diff --git a/src/verbs.c b/src/verbs.c
index cc179a0..e60ab05 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -614,9 +614,21 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	return 0;
 }
 
+static int mcast_mac(uint8_t *mac)
+{
+	int i;
+	uint8_t val = 0xff;
+
+	for (i = 0; i < 6; ++i)
+		val &= mac[i];
+
+	return val == 0xff;
+}
+
 struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 {
 	struct mlx4_ah *ah;
+	struct ibv_port_attr port_attr;
 
 	ah = malloc(sizeof *ah);
 	if (!ah)
@@ -642,7 +654,24 @@ struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 		memcpy(ah->av.dgid, attr->grh.dgid.raw, 16);
 	}
 
+	if (ibv_query_port(pd->context, attr->port_num, &port_attr))
+		goto err;
+
+	if (port_attr.transport == RDMA_TRANSPORT_RDMAOE) {
+		if (ibv_cmd_get_mac(pd, attr->port_num, ah->av.dgid, ah->mac))
+			goto err;
+
+		ah->vlan = 0;
+		if (mcast_mac(ah->mac))
+			ah->av.dlid = htons(0xc000);
+
+	}
+
+
 	return &ah->ibv_ah;
+err:
+	free(ah);
+	return NULL;
 }
 
 int mlx4_destroy_ah(struct ibv_ah *ah)
diff --git a/src/wqe.h b/src/wqe.h
index 6f7f309..ea6f27f 100644
--- a/src/wqe.h
+++ b/src/wqe.h
@@ -78,7 +78,8 @@ struct mlx4_wqe_datagram_seg {
 	uint32_t		av[8];
 	uint32_t		dqpn;
 	uint32_t		qkey;
-	uint32_t		reserved[2];
+	__be16			vlan;
+	uint8_t			mac[6];
 };
 
 struct mlx4_wqe_data_seg {
-- 
1.6.3.3
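
A note on mcast_mac() in the verbs.c hunk above: as posted, it returns
true only when all six octets are 0xff, that is, for the Ethernet
broadcast address. Whether other group addresses can come back from
ibv_cmd_get_mac() is not stated in this thread. Should that distinction
matter, the two tests can be sketched side by side. mac_is_group() and
the constant name below are hypothetical and not part of the patch; the
group test relies on the least significant bit of the first octet, which
marks group (multicast and broadcast) addresses in Ethernet.

```c
#include <stdint.h>

#define SKETCH_ETH_ALEN 6	/* length of an Ethernet MAC address */

/*
 * Same test as mcast_mac() in the patch: true only when every octet
 * is 0xff, i.e. the Ethernet broadcast address.
 */
static int mac_is_broadcast(const uint8_t *mac)
{
	uint8_t val = 0xff;
	int i;

	for (i = 0; i < SKETCH_ETH_ALEN; ++i)
		val &= mac[i];

	return val == 0xff;
}

/*
 * Hypothetical companion, not in the patch: true for any group
 * address (multicast or broadcast), which is indicated by the least
 * significant bit of the first octet.
 */
static int mac_is_group(const uint8_t *mac)
{
	return mac[0] & 0x01;
}
```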


From kliteyn at dev.mellanox.co.il  Wed Aug  5 01:55:43 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 11:55:43 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <20090804153509.GH7993@me>
References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me>
Message-ID: <4A79490F.4000704@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> On 17:32 Tue 04 Aug     , Yevgeny Kliteynik wrote:
>> opt.max_wire_smps is uint32, but then when it's propagated
>> into the VL15 poller it's casted to int32. Fixing the
>> parameter handling to protect it from wrong values.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/main.c       |    2 +-
>>  opensm/opensm/osm_subnet.c |    7 +++++++
>>  2 files changed, 8 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>> index 296d5d5..9cb9990 100644
>> --- a/opensm/opensm/main.c
>> +++ b/opensm/opensm/main.c
>> @@ -722,7 +722,7 @@ int main(int argc, char *argv[])
>>
>>  		case 'n':
>>  			opt.max_wire_smps = strtol(optarg, NULL, 0);
> 
> Then you likely want to use strtoul().

Right
 
>> -			if (opt.max_wire_smps <= 0)
>> +			if (opt.max_wire_smps > 0x7FFFFFFF)
>>  				opt.max_wire_smps = 0x7FFFFFFF;
> 
> What about opt.max_wire_smps == 0?

Good point.

> Sasha
> 
>>  			printf(" Max wire smp's = %d\n", opt.max_wire_smps);
>>  			break;
>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>> index ec15f8a..c07d823 100644
>> --- a/opensm/opensm/osm_subnet.c
>> +++ b/opensm/opensm/osm_subnet.c
>> @@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
>>  		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
>>  	}
>>
>> +	if (p_opts->max_wire_smps > 0x7FFFFFFF) {
>> +		log_report(" Invalid Cached Option Value: max_wire_smps = %u,"
>> +			   " Using Default: %u\n",
>> +			   p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE);
>> +		p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
>> +	}
> 
> Ditto.

Right again.
And since we're on this, perhaps the right thing here would
be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal
valid value (0x7FFFFFFF)?

-- Yevgeny

> Sasha
> 


From sashak at voltaire.com  Wed Aug  5 02:05:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 12:05:13 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <4A79490F.4000704@dev.mellanox.co.il>
References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me>
	<4A79490F.4000704@dev.mellanox.co.il>
Message-ID: <20090805090513.GO7993@me>

On 11:55 Wed 05 Aug     , Yevgeny Kliteynik wrote:
> And since we're on this, perhaps the right thing here would
> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal
> valid value (0x7FFFFFFF)?

In which case? When provided max_wire_smps is 0 or invalid?

Sasha


From sashak at voltaire.com  Wed Aug  5 02:32:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 12:32:28 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for
	lash
In-Reply-To: <20090722151615.GA24576@comcast.net>
References: <20090722151615.GA24576@comcast.net>
Message-ID: <20090805093228.GP7993@me>

Hi Hal,

On 11:16 Wed 22 Jul     , Hal Rosenstock wrote:
> 
> +/*
> + * sort_switches - reorder switch array
> + */
> +static void sort_switches(lash_t *p_lash, mesh_t *mesh)
> +{
> +	int i, j;
> +	int num_switches = p_lash->num_switches;
> +	sort_ctx_t sort_ctx;
> +	comp_t *index;
> +	int *reverse;
> +	switch_t *s;
> +	switch_t **switches;
> +
> +	index = malloc(num_switches * sizeof(comp_t));
> +	reverse = malloc(num_switches * sizeof(int));
> +	switches = malloc(num_switches * sizeof(switch_t *));
> +	if (!index || !reverse || !switches) {
> +		OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR,
> +			"Failed memory allocation - switches not sorted!\n");
> +		goto Exit;
> +	}
> +
> +	sort_ctx.mesh = mesh;
> +	sort_ctx.p_lash = p_lash;
> +	
> +	for (i = 0; i < num_switches; i++) {
> +		index[i].index = i;
> +		index[i].ctx = &sort_ctx;
> +	}
> +
> +	qsort(index, num_switches, sizeof(comp_t), compare_switch);
> +
> +	for (i = 0; i < num_switches; i++)
> +		reverse[index[i].index] = i;
> +
> +	for (i = 0; i < num_switches; i++) {
> +		s = p_lash->switches[index[i].index];
> +		switches[i] = s;
> +		s->id = i;
> +		for (j = 0; j < s->node->num_links; j++)
> +			s->node->links[j]->switch_id =
> +				reverse[s->node->links[j]->switch_id];

Isn't it the same as:

	s->node->links[j]->switch_id =
	    index[s->node->links[j]->switch_id].index;

(and then reverse array is obsolete)?

Sasha

> +	}
> +
> +	for (i = 0; i < num_switches; i++)
> +		p_lash->switches[i] = switches[i];
> +
> +Exit:
> +	if (switches)
> +		free(switches);
> +	if (index)
> +		free(index);
> +	if (reverse)
> +		free(reverse);
> +}
> +
> +/*
>   * osm_mesh_delete - free per mesh resources
>   */
>  static void mesh_delete(mesh_t *mesh)
> @@ -1470,6 +1561,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  		if (reorder_links(p_lash, mesh))
>  			goto err;
>  
> +		sort_switches(p_lash, mesh);
> +
>  		p = buf;
>  		p += sprintf(p, "found ");
>  		for (i = 0; i < mesh->dimension; i++)
> 


From sashak at voltaire.com  Wed Aug  5 02:44:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 12:44:33 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for
	lash
In-Reply-To: <20090722151615.GA24576@comcast.net>
References: <20090722151615.GA24576@comcast.net>
Message-ID: <20090805094433.GQ7993@me>

On 11:16 Wed 22 Jul     , Hal Rosenstock wrote:
> 
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 23fad87..dce2ea1 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -185,6 +185,16 @@ typedef struct _mesh {
>  	int dim_order[MAX_DIMENSION];
>  } mesh_t;
>  
> +typedef struct sort_ctx {
> +	lash_t *p_lash;
> +	mesh_t *mesh;
> +} sort_ctx_t;
> +
> +typedef struct comp {
> +	int index;
> +	sort_ctx_t *ctx;
> +} comp_t;

And wouldn't it be simpler to use:

struct comp {
	switch_t **s;
	sort_ctx_t ctx;
};

? So you will have already sorted switches and only will need to care
about s->id and s->links fixing (and will not need switches[] array too).

Sasha
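
The suggestion above (sort the switch pointers directly, then fix ids and
links) can be sketched roughly as follows. The toy_switch type, its
fields, and the comparison key are stand-ins invented for illustration;
they do not match the OpenSM definitions. Note that this sketch still
keeps a small old-to-new id map to fix the links; whether that can be
avoided depends on how links refer to their peers, which the quoted hunk
does not show.

```c
#include <stdlib.h>

/*
 * Toy stand-in for switch_t and its links; the field names are
 * illustrative only and do not match the OpenSM definitions.
 */
struct toy_switch {
	int id;		/* position in the switches[] array */
	int key;	/* whatever the comparison orders by */
	int num_links;
	int link_id[4];	/* ids of neighbouring switches */
};

static int cmp_switch_ptr(const void *a, const void *b)
{
	const struct toy_switch *sa = *(struct toy_switch * const *)a;
	const struct toy_switch *sb = *(struct toy_switch * const *)b;

	return (sa->key > sb->key) - (sa->key < sb->key);
}

/*
 * Sort the pointer array in place, then renumber ids and links.
 * Returns 0 on success, -1 if the temporary map cannot be allocated.
 */
static int sort_and_renumber(struct toy_switch **sw, int n)
{
	int *new_id;
	int i, j;

	if (n <= 0)
		return 0;

	new_id = malloc(n * sizeof(*new_id));
	if (!new_id)
		return -1;

	qsort(sw, n, sizeof(*sw), cmp_switch_ptr);

	/* sw[i]->id still holds the old position at this point. */
	for (i = 0; i < n; i++)
		new_id[sw[i]->id] = i;

	for (i = 0; i < n; i++) {
		sw[i]->id = i;
		for (j = 0; j < sw[i]->num_links; j++)
			sw[i]->link_id[j] = new_id[sw[i]->link_id[j]];
	}

	free(new_id);
	return 0;
}
```

With this layout the separate switches[] copy from the patch is no longer
needed, since qsort() reorders the pointer array itself.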


From vlad at lists.openfabrics.org  Wed Aug  5 03:09:06 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed,  5 Aug 2009 03:09:06 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090805-0200 daily build status
Message-ID: <20090805100906.9E06DE616F0@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From kliteyn at dev.mellanox.co.il  Wed Aug  5 03:07:21 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 13:07:21 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <20090805090513.GO7993@me>
References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me>
	<4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me>
Message-ID: <4A7959D9.1080805@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 11:55 Wed 05 Aug     , Yevgeny Kliteynik wrote:
>> And since we're on this, perhaps the right thing here would
>> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal
>> valid value (0x7FFFFFFF)?
> 
> In which case? When provided max_wire_smps is 0 or invalid?

Both.

-- Yevgeny
 
> Sasha
> 


From sashak at voltaire.com  Wed Aug  5 04:03:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 14:03:35 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <4A7959D9.1080805@dev.mellanox.co.il>
References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me>
	<4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me>
	<4A7959D9.1080805@dev.mellanox.co.il>
Message-ID: <20090805110335.GR7993@me>

On 13:07 Wed 05 Aug     , Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> > On 11:55 Wed 05 Aug     , Yevgeny Kliteynik wrote:
> >> And since we're on this, perhaps the right thing here would
> >> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal
> >> valid value (0x7FFFFFFF)?
> > 
> > In which case? When provided max_wire_smps is 0 or invalid?
> 
> Both.

I think that for an invalid value, falling back to the default
is better. Of course we can discuss what the default value should
be, but that is a different story.

Sasha


From kliteyn at dev.mellanox.co.il  Wed Aug  5 04:10:42 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 14:10:42 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <20090805110335.GR7993@me>
References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me>
	<4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me>
	<4A7959D9.1080805@dev.mellanox.co.il> <20090805110335.GR7993@me>
Message-ID: <4A7968B2.1010808@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 13:07 Wed 05 Aug     , Yevgeny Kliteynik wrote:
>> Sasha Khapyorsky wrote:
>>> On 11:55 Wed 05 Aug     , Yevgeny Kliteynik wrote:
>>>> And since we're on this, perhaps the right thing here would
>>>> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal
>>>> valid value (0x7FFFFFFF)?
>>> In which case? When provided max_wire_smps is 0 or invalid?
>> Both.
> 
> I think that for an invalid value, falling back to the default
> is better. Of course we can discuss what the default value should
> be, but that is a different story.

OK, so 0 will go to 0x7FFFFFFF, and invalid value will
fall back to default.

Patch in 3...2...1...

-- Yevgeny
 
> Sasha
> 


From kliteyn at dev.mellanox.co.il  Wed Aug  5 04:20:53 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Aug 2009 14:20:53 +0300
Subject: [ofa-general] [PATCH v2] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <4A784698.10803@dev.mellanox.co.il>
References: <4A784698.10803@dev.mellanox.co.il>
Message-ID: <4A796B15.7000802@dev.mellanox.co.il>

Hi Sasha,

V2 of this patch:

opt.max_wire_smps is uint32, but when it is propagated
into the VL15 poller it is cast to int32. Fix the
parameter handling to protect it from out-of-range values.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/main.c       |    5 +++--
 opensm/opensm/osm_subnet.c |   12 ++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 296d5d5..ca20ff9 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -721,8 +721,9 @@ int main(int argc, char *argv[])
 			break;

 		case 'n':
-			opt.max_wire_smps = strtol(optarg, NULL, 0);
-			if (opt.max_wire_smps <= 0)
+			opt.max_wire_smps = strtoul(optarg, NULL, 0);
+			if (opt.max_wire_smps == 0 ||
+			    opt.max_wire_smps > 0x7FFFFFFF)
 				opt.max_wire_smps = 0x7FFFFFFF;
 			printf(" Max wire smp's = %d\n", opt.max_wire_smps);
 			break;
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..c43bef7 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1066,6 +1066,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
 		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
 	}

+	if (p_opts->max_wire_smps == 0) {
+		log_report(" Invalid Cached Option Value: max_wire_smps = 0,"
+			   " Using unlimited: 0x7FFFFFFF\n");
+		p_opts->max_wire_smps = 0x7FFFFFFF;
+	}
+	else if (p_opts->max_wire_smps > 0x7FFFFFFF) {
+		log_report(" Invalid Cached Option Value: max_wire_smps = %u,"
+			   " Using Default: %u\n",
+			   p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE);
+		p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
+	}
+
 	if (strcmp(p_opts->console, OSM_DISABLE_CONSOLE)
 	    && strcmp(p_opts->console, OSM_LOCAL_CONSOLE)
 #ifdef ENABLE_OSM_CONSOLE_SOCKET
-- 
1.5.1.4
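
A note on the main.c hunk above: opt.max_wire_smps is described as
uint32, while strtoul() returns unsigned long. Where unsigned long is
wider than 32 bits, the assignment narrows the value before the range
check runs, so an input such as 0x100000001 would be seen as 1. The
sketch below does the comparison before narrowing. It follows the policy
stated earlier in the thread (0 maps to 0x7FFFFFFF, invalid input falls
back to the default), which differs slightly from the posted main.c hunk,
where both cases are clamped to 0x7FFFFFFF. parse_max_wire_smps() and
both macro names are hypothetical, and the default value 4 is only a
placeholder, not taken from the OpenSM headers.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_WIRE_SMPS_LIMIT 0x7FFFFFFFUL
/* Placeholder default for this sketch; not taken from the OpenSM headers. */
#define SKETCH_DEFAULT_ON_WIRE 4U

/*
 * Hypothetical helper: 0 means "unlimited" (0x7FFFFFFF); unparsable
 * or out-of-range input falls back to the default.
 */
static uint32_t parse_max_wire_smps(const char *arg)
{
	unsigned long val;
	char *end;

	errno = 0;
	val = strtoul(arg, &end, 0);

	/* No digits, trailing garbage, or overflow of unsigned long. */
	if (end == arg || *end != '\0' || errno == ERANGE)
		return SKETCH_DEFAULT_ON_WIRE;
	if (val == 0)
		return SKETCH_MAX_WIRE_SMPS_LIMIT;
	/* Compare before narrowing: unsigned long may be 64 bits wide. */
	if (val > SKETCH_MAX_WIRE_SMPS_LIMIT)
		return SKETCH_DEFAULT_ON_WIRE;

	return (uint32_t)val;
}
```

Negative input such as "-1" is accepted by strtoul() and wraps to a very
large value, so it also ends up on the fallback path.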


From hal.rosenstock at gmail.com  Wed Aug  5 04:24:45 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 07:24:45 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090804201505.GI7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
Message-ID: <f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>

On Tue, Aug 4, 2009 at 4:15 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

<snip...>

> You can setup new_lfts arrays in routing engines and at the end of cycle
> call single osm_*setup*_lfts() which will do everything - setup TOPs and
> start to run LFT blocks update.


Are you saying to move the calls in the individual routing engines to
osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing
so consolidates the changes I had made to the various routing engines in one
place) ? Just wanted to be sure I understand what you mean.

-- Hal

<snip...>

From hal.rosenstock at gmail.com  Wed Aug  5 05:59:16 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 08:59:16 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches 
	for lash
In-Reply-To: <20090805093228.GP7993@me>
References: <20090722151615.GA24576@comcast.net> <20090805093228.GP7993@me>
Message-ID: <f0e08f230908050559nc74e78ake066784343beabd6@mail.gmail.com>

Hi Sasha,

On Wed, Aug 5, 2009 at 5:32 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Hal,
>
> On 11:16 Wed 22 Jul     , Hal Rosenstock wrote:
> >
> > +/*
> > + * sort_switches - reorder switch array
> > + */
> > +static void sort_switches(lash_t *p_lash, mesh_t *mesh)
> > +{
> > +     int i, j;
> > +     int num_switches = p_lash->num_switches;
> > +     sort_ctx_t sort_ctx;
> > +     comp_t *index;
> > +     int *reverse;
> > +     switch_t *s;
> > +     switch_t **switches;
> > +
> > +     index = malloc(num_switches * sizeof(comp_t));
> > +     reverse = malloc(num_switches * sizeof(int));
> > +     switches = malloc(num_switches * sizeof(switch_t *));
> > +     if (!index || !reverse || !switches) {
> > +             OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR,
> > +                     "Failed memory allocation - switches not sorted!\n");
> > +             goto Exit;
> > +     }
> > +
> > +     sort_ctx.mesh = mesh;
> > +     sort_ctx.p_lash = p_lash;
> > +
> > +     for (i = 0; i < num_switches; i++) {
> > +             index[i].index = i;
> > +             index[i].ctx = &sort_ctx;
> > +     }
> > +
> > +     qsort(index, num_switches, sizeof(comp_t), compare_switch);
> > +
> > +     for (i = 0; i < num_switches; i++)
> > +             reverse[index[i].index] = i;
> > +
> > +     for (i = 0; i < num_switches; i++) {
> > +             s = p_lash->switches[index[i].index];
> > +             switches[i] = s;
> > +             s->id = i;
> > +             for (j = 0; j < s->node->num_links; j++)
> > +                     s->node->links[j]->switch_id =
> > +                             reverse[s->node->links[j]->switch_id];
>
> Isn't it the same as:
>
>        s->node->links[j]->switch_id =
>            index[s->node->links[j]->switch_id].index;


No.

-- Hal
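
To make the distinction concrete: after qsort(), index[i].index is the
old position of the switch that now sits at new position i. A link
stores the old id of its peer and needs that peer's new id, which is the
inverse mapping (the patch's reverse[] array). Indexing index[] with an
old id instead answers a different question: which old switch landed at
that position. The two agree only when the permutation happens to be its
own inverse. A minimal sketch follows; the function name and the plain
int arrays are illustrative, not taken from the patch.

```c
#include <stddef.h>

/*
 * order[i] is the old position of the element that ends up at new
 * position i (what index[i].index holds after qsort in the patch).
 * inverse[old] is the new position of the element that used to be
 * at position old (the patch's reverse[] array).
 */
static void invert_permutation(const int *order, int *inverse, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		inverse[order[i]] = (int)i;
}
```

For example, with order = {2, 0, 1} the inverse is {1, 2, 0}: the switch
formerly at position 0 moves to position 1, whereas order[0] is 2.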


>
> (and then reverse array is obsolete)?
>
> Sasha
>
> > +     }
> > +
> > +     for (i = 0; i < num_switches; i++)
> > +             p_lash->switches[i] = switches[i];
> > +
> > +Exit:
> > +     if (switches)
> > +             free(switches);
> > +     if (index)
> > +             free(index);
> > +     if (reverse)
> > +             free(reverse);
> > +}
> > +
> > +/*
> >   * osm_mesh_delete - free per mesh resources
> >   */
> >  static void mesh_delete(mesh_t *mesh)
> > @@ -1470,6 +1561,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
> >               if (reorder_links(p_lash, mesh))
> >                       goto err;
> >
> > +             sort_switches(p_lash, mesh);
> > +
> >               p = buf;
> >               p += sprintf(p, "found ");
> >               for (i = 0; i < mesh->dimension; i++)
> >

From sashak at voltaire.com  Wed Aug  5 06:43:52 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 16:43:52 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
Message-ID: <20090805134352.GS7993@me>

On 07:24 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Are you saying to move the calls in the individual routing engines to
> osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing
> so consolidates the changes I had made to the various routing engines in one
> place) ?

Yes.

Sasha


From slavas at Voltaire.COM  Wed Aug  5 06:47:02 2009
From: slavas at Voltaire.COM (Slava Strebkov)
Date: Wed, 05 Aug 2009 16:47:02 +0300
Subject: [ofa-general] [PATCH 1/2 v3] opensm: Storage organization for
	multicast groups
Message-ID: <4A798D56.2020408@Voltaire.COM>


Subject: [PATCH 1/2] Storage organization for multicast groups

The main purpose is to prepare the infrastructure for compressing (many)
mgids to one mlid. The following changes are proposed:
1. An element in the mlid array is now a multicast group holder.
2. mgrp_holder keeps a list of mgroups sharing the same mlid.
        With the introduction of compression, there will be many
        multicast groups per mlid. The current implementation keeps
        a one-to-one mgid to mlid ratio.
3. mgrp_holder has a map of ports sharing the same mlid, sorted
        by port guid. The port map is necessary for building a spanning
        tree per mgrp_holder, not just for a single mgroup.
4. An element in the port map keeps a list of mgroups opened by this port.
        This allows quick deletion of mgroups when the port changes
        state to DOWN.
5. Multicast processing functions use the mgrp_holder object instead
        of mgroup.
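
As a rough illustration of items 1 and 2 in the list above, a holder that
owns every group mapped to one mlid could look like the sketch below.
The toy_* types and helper names are invented for this sketch; the patch
itself uses the complib containers (cl_qlist_t, cl_qmap_t), which are
not reproduced here, and the port map from items 3 and 4 is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-ins; the patch itself uses complib containers
 * (cl_qlist_t, cl_qmap_t), which are not reproduced here.
 */
struct toy_mgrp {
	uint8_t mgid[16];
	struct toy_mgrp *next;	/* next group sharing the same mlid */
};

struct toy_mgrp_holder {
	uint16_t mlid;
	struct toy_mgrp *groups;	/* all mgroups mapped to this mlid */
	unsigned int num_groups;
};

/* Attach a group to the holder. */
static void holder_add_group(struct toy_mgrp_holder *h, struct toy_mgrp *g)
{
	g->next = h->groups;
	h->groups = g;
	h->num_groups++;
}

/*
 * Detach a group; returns 1 when the holder became empty and can be
 * deleted, 0 otherwise.
 */
static int holder_remove_group(struct toy_mgrp_holder *h, struct toy_mgrp *g)
{
	struct toy_mgrp **pp;

	for (pp = &h->groups; *pp; pp = &(*pp)->next) {
		if (*pp == g) {
			*pp = g->next;
			g->next = NULL;
			h->num_groups--;
			break;
		}
	}
	return h->groups == NULL;
}
```

With the current one mgid per mlid ratio each holder would carry a single
group; the list only becomes longer once compression is introduced.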

Signed-off-by: Slava Strebkov <slavas at voltaire.com>
---
 opensm/include/opensm/osm_multicast.h  |  343 +++++++++++++++++++++++++++++---
 opensm/include/opensm/osm_sm.h         |   10 +-
 opensm/include/opensm/osm_subnet.h     |   38 ++--
 opensm/opensm/osm_drop_mgr.c           |   14 +-
 opensm/opensm/osm_mcast_mgr.c          |  228 +++++++++++++---------
 opensm/opensm/osm_multicast.c          |  198 +++++++++++++++++--
 opensm/opensm/osm_qos_policy.c         |   38 ++--
 opensm/opensm/osm_sa.c                 |   31 +--
 opensm/opensm/osm_sa_mcmember_record.c |   94 +++++----
 opensm/opensm/osm_sa_path_record.c     |   13 +-
 opensm/opensm/osm_sm.c                 |   81 +++++++-
 opensm/opensm/osm_subnet.c             |   31 +++-
 12 files changed, 855 insertions(+), 264 deletions(-)

diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h
index 9a47de5..61d1ba6 100644
--- a/opensm/include/opensm/osm_multicast.h
+++ b/opensm/include/opensm/osm_multicast.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -107,6 +107,82 @@ typedef struct osm_mcast_mgr_ctxt {
 *
 * SEE ALSO
 *********/
+/****s* OpenSM: Multicast Group Holder/osm_mgrp_holder_t
+* NAME
+*       osm_mgrp_holder_t
+*
+* DESCRIPTION
+*       Holder for mgroups.
+*
+*       The osm_mgrp_holder_t object should be treated as opaque and
+*       should be manipulated only through the provided functions.
+*
+* SYNOPSIS
+*/
+
+typedef struct osm_mgrp_holder {
+	cl_qmap_t mgrp_port_map;
+	cl_qlist_t mgrp_list;
+	osm_mtree_node_t *p_root;
+	ib_net16_t mlid;
+	boolean_t to_be_deleted;
+	uint32_t last_tree_id;
+	uint32_t last_change_id;
+} osm_mgrp_holder_t;
+
+/*
+* FIELDS
+*	mgrp_port_map
+*		Map of all ports joined to the same mlid
+*
+*	mgrp_list
+*		List of mgroups sharing the same mlid
+*
+*	p_root
+*		Pointer to the root "tree node" in the single spanning tree
+*		for this multicast group holder.  The nodes of the tree represent
+*		switches.  Member ports are not represented in the tree.
+*
+*	mlid
+*		mlid of current group holder
+*
+*	to_be_deleted
+*		Set when no mgroups are left in the holder and it is to be deleted.
+*
+*	last_change_id
+*		a counter for the number of changes applied to the group in this holder.
+*		This counter should be incremented on any modification
+*		to the group: joining or leaving of ports.
+*
+*	last_tree_id
+*		the last change id used for building the current tree.
+*/
+/****s* OpenSM: Multicast Group Port/osm_mgrp_port_t
+* NAME
+*	osm_mgrp_port_t
+*
+* DESCRIPTION
+*	Holder for pointers to mgroups and port guid.
+*
+*
+* SYNOPSIS
+*/
+typedef struct _osm_mgrp_port {
+	cl_map_item_t guid_item;
+	cl_qlist_t mgroups;
+	ib_net64_t port_guid;
+} osm_mgrp_port_t;
+/*
+* FIELDS
+*	guid_item
+*		Map item for the port map. Must be the first element.
+*
+*	mgroups
+*		List of mgroups opened by this port.
+*
+*	port_guid
+*		GUID of the port this structure represents.
+*/
 
 /****s* OpenSM: Multicast Group/osm_mgrp_t
 * NAME
@@ -122,14 +198,13 @@ typedef struct osm_mcast_mgr_ctxt {
 */
 typedef struct osm_mgrp {
 	cl_fmap_item_t map_item;
+	cl_list_item_t mlid_item;
+	cl_list_item_t port_item;
 	ib_net16_t mlid;
-	osm_mtree_node_t *p_root;
 	cl_qmap_t mcm_port_tbl;
 	ib_member_rec_t mcmember_rec;
 	boolean_t well_known;
 	boolean_t to_be_deleted;
-	uint32_t last_change_id;
-	uint32_t last_tree_id;
 	unsigned full_members;
 } osm_mgrp_t;
 /*
@@ -141,10 +216,11 @@ typedef struct osm_mgrp {
 *		The network ordered LID of this Multicast Group (must be
 *		>= 0xC000).
 *
-*	p_root
-*		Pointer to the root "tree node" in the single spanning tree
-*		for this multicast group.  The nodes of the tree represent
-*		switches.  Member ports are not represented in the tree.
+*	mlid_item
+*		List item for groups with same MLID
+*
+*	port_item
+*		List item for groups opened on same port
 *
 *	mcm_port_tbl
 *		Table (sorted by port GUID) of osm_mcm_port_t objects
@@ -163,14 +239,6 @@ typedef struct osm_mgrp {
 *		track the fact the group is about to be deleted so we can
 *		track the fact a new join is actually a create request.
 *
-*	last_change_id
-*		a counter for the number of changes applied to the group.
-*		This counter shuold be incremented on any modification
-*		to the group: joining or leaving of ports.
-*
-*	last_tree_id
-*		the last change id used for building the current tree.
-*
 * SEE ALSO
 *********/
 
@@ -456,30 +524,111 @@ osm_mgrp_delete_port(IN osm_subn_t * const p_subn,
 int osm_mgrp_remove_port(osm_subn_t *subn, osm_log_t *log, osm_mgrp_t *mgrp,
 			 osm_mcm_port_t *mcm, uint8_t join_state);
 
-/****f* OpenSM: Multicast Group/osm_mgrp_apply_func
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_new
 * NAME
-*	osm_mgrp_apply_func
+*	osm_mgrp_holder_new
 *
 * DESCRIPTION
-*	Calls the specified function for each element in the tree.
-*	Elements are passed to the callback function in no particular order.
+*	Allocates and initializes a Multicast Group Holder for use.
 *
 * SYNOPSIS
 */
-void
-osm_mgrp_apply_func(const osm_mgrp_t * const p_mgrp,
-		    osm_mgrp_func_t p_func, void *context);
+osm_mgrp_holder_t *osm_mgrp_holder_new(IN osm_subn_t * p_subn,
+					IN ib_net16_t mlid);
+/*
+* PARAMETERS
+*	p_subn
+*		[in] Pointer to an osm_subn_t object.
+*	mlid
+*		[in] Multicast LID for this multicast group holder.
+*
+* RETURN VALUES
+*	pointer to initialized osm_mgrp_holder_t
+*	or NULL, if unsuccessful
+*
+* SEE ALSO
+*	Multicast Group Holder, osm_mgrp_holder_delete
+*********/
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_delete
+* NAME
+*	osm_mgrp_holder_delete
+*
+* DESCRIPTION
+*	Removes the entry from the array of holders
+*	Removes port from mgroup port list
+*
+* SYNOPSIS
+*/
+void osm_mgrp_holder_delete(IN osm_subn_t * p_subn,
+				IN ib_net16_t mlid);
+
 /*
 * PARAMETERS
+*
+*	p_subn
+*		[in] Pointer to an osm_subn_t object.
+*
+*	mlid
+*		[in] holder's mlid
+*
+* RETURN VALUES
+*	None.
+*
+* NOTES
+*
+* SEE ALSO
+*
+*********/
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_port_add_mgrp
+* NAME
+*	osm_mgrp_holder_port_add_mgrp
+*
+* DESCRIPTION
+*	Allocates an osm_mgrp_port_t for a port newly joined to this holder's mlid,
+*	and adds the mgroup to the mgroup list of an existing osm_mgrp_port_t object.
+*
+* SYNOPSIS
+*/
+ib_api_status_t osm_mgrp_holder_port_add_mgrp(IN osm_mgrp_holder_t *
+						p_mgrp_holder,
+						IN osm_mgrp_t * p_mgrp,
+						IN ib_net64_t port_guid);
+/*
+* PARAMETERS
+*	p_mgrp_holder
+*		[in] Pointer to an osm_mgrp_holder_t object.
 *	p_mgrp
-*		[in] Pointer to an osm_mgrp_t object.
+*		[in] Pointer to an osm_mgrp_t object.
 *
-*	p_func
-*		[in] Pointer to the users callback function.
+* RETURN VALUES
+*	IB_SUCCESS or
+*	IB_INSUFFICIENT_MEMORY
 *
-*	context
-*		[in] User context passed to the callback function.
+* SEE ALSO
+*	Multicast Group Holder, osm_mgrp_holder_port_delete_mgrp
+*********/
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_port_delete_mgrp
+* NAME
+*	osm_mgrp_holder_port_delete_mgrp
 *
+* DESCRIPTION
+*	Deletes the osm_mgrp_port_t for the specified port
+*
+* SYNOPSIS
+*/
+void osm_mgrp_holder_port_delete_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder,
+					IN osm_mgrp_t * p_mgrp,
+					IN ib_net64_t port_guid);
+/*
+* PARAMETERS
+*	p_mgrp_holder
+*		[in] Pointer to an osm_mgrp_holder_t object.
+*
+*	p_mgrp
+*		[in] Pointer to an osm_mgrp_t object.
+*
+*	port_guid
+*		[in] Port guid of the departing port.
 *
 * RETURN VALUES
 *	None.
@@ -487,8 +636,144 @@ osm_mgrp_apply_func(const osm_mgrp_t * const p_mgrp,
 * NOTES
 *
 * SEE ALSO
-*	Multicast Group
+*	Multicast Group Holder, osm_mgrp_holder_port_add_mgrp
+*********/
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_add_mgrp
+* NAME
+*	osm_mgrp_holder_add_mgrp
+*
+* DESCRIPTION
+*	Adds mgroup to holder according to its mgid
+*
+*
+* SYNOPSIS
+*/
+void osm_mgrp_holder_add_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder,
+				IN osm_mgrp_t * p_mgrp,
+				IN osm_log_t * const p_log);
+/*
+* PARAMETERS
+*
+*	p_mgrp_holder
+*		[in] Pointer to an osm_mgrp_holder_t object.
+*
+*	p_mgrp
+*		[in] mgroup to add.
+*
+* RETURN VALUES
+*	None.
+*
+* NOTES
+* Updates common_mgid when holder is being reused
+* SEE ALSO
+*	Multicast Group Holder,osm_mgrp_holder_delete_mgrp
+*********/
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_delete_mgrp
+* NAME
+*	osm_mgrp_holder_delete_mgrp
+*
+* DESCRIPTION
+*	Deletes mgroup from holder according to its mgid
+*
+*
+* SYNOPSIS
+*/
+void osm_mgrp_holder_delete_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder,
+					IN osm_mgrp_t * p_mgrp);
+/*
+* PARAMETERS
+*
+*	p_mgrp_holder
+*		[in] Pointer to an osm_mgrp_holder_t object.
+*
+*	p_mgrp
+*		[in] mgroup to delete.
+*
+* RETURN VALUES
+*	None.
+*
+* NOTES
+*
+* SEE ALSO
+*	Multicast Group Holder,osm_mgrp_holder_add_mgrp
 *********/
 
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_remove_port
+* NAME
+*	osm_mgrp_holder_remove_port
+*
+* DESCRIPTION
+*	Removes the osm_mgrp_port_t from the mgrp_port_map of the holder
+*	Removes port from mgroup port list
+*
+* SYNOPSIS
+*/
+void osm_mgrp_holder_remove_port(IN osm_subn_t * const p_subn,
+				IN osm_log_t * const p_log,
+				IN osm_mgrp_holder_t * const p_mgrp_holder,
+				IN const ib_net64_t port_guid);
+/*
+* PARAMETERS
+*
+*	p_subn
+*		[in] Pointer to the subnet object
+*
+*	p_log
+*		[in] The log object pointer
+*
+*	p_mgrp_holder
+*		[in] Pointer to an osm_mgrp_holder_t object.
+*
+*	port_guid
+*		[in] Port guid of the departing port.
+*
+* RETURN VALUES
+*	None.
+*
+* NOTES
+*
+* SEE ALSO
+*
+*********/
+/****f* OpenSM: Subnet/osm_get_mgrp_holder_by_mlid
+* NAME
+*	osm_get_mgrp_holder_by_mlid
+*
+* DESCRIPTION
+*	Looks up the multicast group holder in the subnet table by mlid.
+*	NOTE: this code is not thread safe. Need to grab the lock before
+*	calling it.
+*
+* SYNOPSIS
+*/
+static inline struct osm_mgrp_holder *osm_get_mgrp_holder_by_mlid(osm_subn_t const
+									*p_subn,
+									ib_net16_t mlid)
+{
+	return p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO];
+}
+/*
+* PARAMETERS
+*	p_subn
+*		[in] Pointer to an osm_subn_t object
+*
+*	mlid
+*		[in] The multicast group mlid in network order
+*
+* RETURN VALUES
+*	The multicast group holder pointer if found. NULL otherwise.
+*********/
+static inline ib_net16_t osm_mgrp_holder_get_mlid(IN osm_mgrp_holder_t *
+							const p_mgrp_holder)
+{
+	return (p_mgrp_holder->mlid);
+}
+
+static inline boolean_t osm_mgrp_holder_is_empty(IN const osm_mgrp_holder_t *
+							const p_mgrp_holder)
+{
+	return (cl_qmap_count(&p_mgrp_holder->mgrp_port_map) == 0);
+}
+
 END_C_DECLS
 #endif				/* _OSM_MULTICAST_H_ */
diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
index cc8321d..7f898ad 100644
--- a/opensm/include/opensm/osm_sm.h
+++ b/opensm/include/opensm/osm_sm.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -61,6 +61,7 @@
 #include <opensm/osm_port.h>
 #include <opensm/osm_db.h>
 #include <opensm/osm_remote_sm.h>
+#include <opensm/osm_multicast.h>
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -539,7 +540,8 @@ osm_resp_send(IN osm_sm_t * sm,
 ib_api_status_t
 osm_sm_mcgrp_join(IN osm_sm_t * const p_sm,
 		  IN const ib_net16_t mlid,
-		  IN const ib_net64_t port_guid);
+		  IN const ib_net64_t port_guid,
+		  IN const ib_gid_t * p_mgid);
 /*
 * PARAMETERS
 *	p_sm
@@ -551,6 +553,8 @@ osm_sm_mcgrp_join(IN osm_sm_t * const p_sm,
 *	port_guid
 *		[in] Port GUID to add to the group.
 *
+*	p_mgid
+*		[in] MGID to add to the group holder.
 * RETURN VALUES
 *	None
 *
@@ -572,7 +576,7 @@ osm_sm_mcgrp_join(IN osm_sm_t * const p_sm,
 */
 ib_api_status_t
 osm_sm_mcgrp_leave(IN osm_sm_t * const p_sm,
-		   IN const ib_net16_t mlid, IN const ib_net64_t port_guid);
+		   IN osm_mgrp_t * p_mgrp, IN ib_net64_t port_guid);
 /*
 * PARAMETERS
 *	p_sm
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 6c20de8..fad8780 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -513,7 +513,7 @@ typedef struct osm_subn {
 	boolean_t coming_out_of_standby;
 	unsigned need_update;
 	cl_fmap_t mgrp_mgid_tbl;
-	void *mgroups[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1];
+	void *mgroup_holders[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1];
 } osm_subn_t;
 /*
 * FIELDS
@@ -634,8 +634,8 @@ typedef struct osm_subn {
 *		This flag should be on during first non-master heavy
 *		(including pre-master discovery stage)
 *
-*	mgroups
-*		Array of pointers to all Multicast Group objects in the subnet.
+*	mgroup_holders
+*		Array of pointers to all Multicast Group Holder objects in the subnet.
 *		Indexed by MLID offset from base MLID.
 *
 * SEE ALSO
@@ -935,32 +935,34 @@ struct osm_port *osm_get_port_by_guid(IN osm_subn_t const *p_subn,
 *	osm_port_t
 *********/
 
-/****f* OpenSM: Subnet/osm_get_mgrp_by_mlid
+/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_get_mlid_by_mgid
 * NAME
-*	osm_get_mgrp_by_mlid
+*	osm_mgrp_holder_get_mlid_by_mgid
 *
 * DESCRIPTION
-*	The looks for the given multicast group in the subnet table by mlid.
-*	NOTE: this code is not thread safe. Need to grab the lock before
-*	calling it.
+*	Searches for the mgroup with the given mgid.
+*	Returns the mlid of the found mgroup.
 *
 * SYNOPSIS
 */
-static inline
-struct osm_mgrp *osm_get_mgrp_by_mlid(osm_subn_t const *p_subn, ib_net16_t mlid)
-{
-	return p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO];
-}
+ib_net16_t osm_mgrp_holder_get_mlid_by_mgid(IN osm_subn_t const *p_subn,
+					IN const ib_gid_t * const p_mgid);
 /*
 * PARAMETERS
+*
 *	p_subn
-*		[in] Pointer to an osm_subn_t object
+*		[in] Pointer to osm_subn_t object
 *
-*	mlid
-*		[in] The multicast group mlid in network order
+*	p_mgid
+*		[in] pointer to mgid
 *
 * RETURN VALUES
-*	The multicast group structure pointer if found. NULL otherwise.
+*	mlid of found holder, or zero.
+*
+* NOTES
+*
+* SEE ALSO
+*
 *********/
 
 /****f* OpenSM: Helper/osm_get_physp_by_mad_addr
diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
index c9a4f33..e1f2bd3 100644
--- a/opensm/opensm/osm_drop_mgr.c
+++ b/opensm/opensm/osm_drop_mgr.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -158,7 +158,6 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port)
 	osm_port_t *p_port_check;
 	cl_qmap_t *p_sm_guid_tbl;
 	osm_mcm_info_t *p_mcm;
-	osm_mgrp_t *p_mgrp;
 	cl_ptr_vector_t *p_port_lid_tbl;
 	uint16_t min_lid_ho;
 	uint16_t max_lid_ho;
@@ -168,6 +167,7 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port)
 	ib_gid_t port_gid;
 	ib_mad_notice_attr_t notice;
 	ib_api_status_t status;
+	osm_mgrp_holder_t *p_mgrp_holder;
 
 	OSM_LOG_ENTER(sm->p_log);
 
@@ -212,10 +212,12 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port)
 
 	p_mcm = (osm_mcm_info_t *) cl_qlist_remove_head(&p_port->mcm_list);
 	while (p_mcm != (osm_mcm_info_t *) cl_qlist_end(&p_port->mcm_list)) {
-		p_mgrp = osm_get_mgrp_by_mlid(sm->p_subn, p_mcm->mlid);
-		if (p_mgrp) {
-			osm_mgrp_delete_port(sm->p_subn, sm->p_log,
-					     p_mgrp, p_port->guid);
+		p_mgrp_holder =
+		    osm_get_mgrp_holder_by_mlid(sm->p_subn, p_mcm->mlid);
+		if (p_mgrp_holder) {
+			osm_mgrp_holder_remove_port(sm->p_subn, sm->p_log,
+						    p_mgrp_holder,
+						    p_port->guid);
 			osm_mcm_info_delete((osm_mcm_info_t *) p_mcm);
 		}
 		p_mcm =
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index 4dbbaa0..f506393 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -55,6 +55,7 @@
 #include <opensm/osm_switch.h>
 #include <opensm/osm_helper.h>
 #include <opensm/osm_msgdef.h>
+#include <arpa/inet.h>
 
 /**********************************************************************
  **********************************************************************/
@@ -111,14 +112,15 @@ static void mcast_mgr_purge_tree_node(IN osm_mtree_node_t * p_mtn)
 
 /**********************************************************************
  **********************************************************************/
-static void mcast_mgr_purge_tree(osm_sm_t * sm, IN osm_mgrp_t * p_mgrp)
+static void mcast_mgr_purge_tree(osm_sm_t * sm,
+				 IN osm_mgrp_holder_t * p_mgrp_holder)
 {
 	OSM_LOG_ENTER(sm->p_log);
 
-	if (p_mgrp->p_root)
-		mcast_mgr_purge_tree_node(p_mgrp->p_root);
+	if (p_mgrp_holder->p_root)
+		mcast_mgr_purge_tree_node(p_mgrp_holder->p_root);
 
-	p_mgrp->p_root = NULL;
+	p_mgrp_holder->p_root = NULL;
 
 	OSM_LOG_EXIT(sm->p_log);
 }
@@ -126,41 +128,40 @@ static void mcast_mgr_purge_tree(osm_sm_t * sm, IN osm_mgrp_t * p_mgrp)
 /**********************************************************************
  **********************************************************************/
 static float osm_mcast_mgr_compute_avg_hops(osm_sm_t * sm,
-					    const osm_mgrp_t * p_mgrp,
+					    const osm_mgrp_holder_t *
+					    p_mgrp_holder,
 					    const osm_switch_t * p_sw)
 {
 	float avg_hops = 0;
 	uint32_t hops = 0;
 	uint32_t num_ports = 0;
 	const osm_port_t *p_port;
-	const osm_mcm_port_t *p_mcm_port;
-	const cl_qmap_t *p_mcm_tbl;
+	const osm_mgrp_port_t *p_holder_port;
 
 	OSM_LOG_ENTER(sm->p_log);
 
-	p_mcm_tbl = &p_mgrp->mcm_port_tbl;
 
 	/*
 	   For each member of the multicast group, compute the
 	   number of hops to its base LID.
 	 */
-	for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl);
-	     p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl);
-	     p_mcm_port =
-	     (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) {
+	for (p_holder_port =
+	     (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map);
+	     p_holder_port !=
+	     (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map);
+	     p_holder_port =
+	     (osm_mgrp_port_t *) cl_qmap_next(&p_holder_port->guid_item)) {
 		/*
 		   Acquire the port object for this port guid, then create
 		   the new worker object to build the list.
 		 */
 		p_port = osm_get_port_by_guid(sm->p_subn,
-					      ib_gid_get_guid(&p_mcm_port->
-							      port_gid));
+					      p_holder_port->port_guid);
 
 		if (!p_port) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A18: "
 				"No port object for port 0x%016" PRIx64 "\n",
-				cl_ntoh64(ib_gid_get_guid
-					  (&p_mcm_port->port_gid)));
+				cl_ntoh64(p_holder_port->port_guid));
 			continue;
 		}
 
@@ -185,40 +186,39 @@ static float osm_mcast_mgr_compute_avg_hops(osm_sm_t * sm,
  of the group HCAs
  **********************************************************************/
 static float osm_mcast_mgr_compute_max_hops(osm_sm_t * sm,
-					    const osm_mgrp_t * p_mgrp,
+					    const osm_mgrp_holder_t *
+					    p_mgrp_holder,
 					    const osm_switch_t * p_sw)
 {
 	uint32_t max_hops = 0;
 	uint32_t hops = 0;
 	const osm_port_t *p_port;
-	const osm_mcm_port_t *p_mcm_port;
-	const cl_qmap_t *p_mcm_tbl;
+	const osm_mgrp_port_t *p_mgrp_holder_port;
 
 	OSM_LOG_ENTER(sm->p_log);
 
-	p_mcm_tbl = &p_mgrp->mcm_port_tbl;
 
 	/*
 	   For each member of the multicast group, compute the
 	   number of hops to its base LID.
 	 */
-	for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl);
-	     p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl);
-	     p_mcm_port =
-	     (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) {
+	for (p_mgrp_holder_port =
+	     (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map);
+	     p_mgrp_holder_port !=
+	     (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map);
+	     p_mgrp_holder_port =
+	     (osm_mgrp_port_t *) cl_qmap_next(&p_mgrp_holder_port->guid_item)) {
 		/*
 		   Acquire the port object for this port guid, then create
 		   the new worker object to build the list.
 		 */
 		p_port = osm_get_port_by_guid(sm->p_subn,
-					      ib_gid_get_guid(&p_mcm_port->
-							      port_gid));
+					      p_mgrp_holder_port->port_guid);
 
 		if (!p_port) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A1A: "
 				"No port object for port 0x%016" PRIx64 "\n",
-				cl_ntoh64(ib_gid_get_guid
-					  (&p_mcm_port->port_gid)));
+				cl_ntoh64(p_mgrp_holder_port->port_guid));
 			continue;
 		}
 
@@ -244,7 +244,8 @@ static float osm_mcast_mgr_compute_max_hops(osm_sm_t * sm,
    of the multicast group.
 **********************************************************************/
 static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm,
-						   const osm_mgrp_t * p_mgrp)
+						   const osm_mgrp_holder_t *
+						   p_mgrp_holder)
 {
 	cl_qmap_t *p_sw_tbl;
 	const osm_switch_t *p_sw;
@@ -252,7 +253,7 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm,
 	float hops = 0;
 	float best_hops = 10000;	/* any big # will do */
 #ifdef OSM_VENDOR_INTF_ANAFA
-	boolean_t use_avg_hops = TRUE;	/* anafa2 - bug hca on switch *//* use max hops for root */
+	boolean_t use_avg_hops = TRUE; /* anafa2 - bug hca on switch *//* use max hops for root */
 #else
 	boolean_t use_avg_hops = FALSE;	/* use max hops for root */
 #endif
@@ -261,7 +262,7 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm,
 
 	p_sw_tbl = &sm->p_subn->sw_guid_tbl;
 
-	CL_ASSERT(!osm_mgrp_is_empty(p_mgrp));
+	CL_ASSERT(!osm_mgrp_holder_is_empty(p_mgrp_holder));
 
 	for (p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
 	     p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl);
@@ -270,9 +271,13 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm,
 			continue;
 
 		if (use_avg_hops)
-			hops = osm_mcast_mgr_compute_avg_hops(sm, p_mgrp, p_sw);
+			hops =
+			    osm_mcast_mgr_compute_avg_hops(sm, p_mgrp_holder,
+							   p_sw);
 		else
-			hops = osm_mcast_mgr_compute_max_hops(sm, p_mgrp, p_sw);
+			hops =
+			    osm_mcast_mgr_compute_max_hops(sm, p_mgrp_holder,
+							   p_sw);
 
 		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
 			"Switch 0x%016" PRIx64 ", hops = %f\n",
@@ -301,7 +306,8 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm,
    This function returns the existing or optimal root swtich for the tree.
 **********************************************************************/
 static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm,
-						const osm_mgrp_t * p_mgrp)
+						const osm_mgrp_holder_t *
+						p_mgrp_holder)
 {
 	const osm_switch_t *p_sw = NULL;
 
@@ -313,7 +319,7 @@ static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm,
 	   the root will be always on the first switch attached to it.
 	   - Very bad ...
 	 */
-	p_sw = mcast_mgr_find_optimal_switch(sm, p_mgrp);
+	p_sw = mcast_mgr_find_optimal_switch(sm, p_mgrp_holder);
 
 	OSM_LOG_EXIT(sm->p_log);
 	return (osm_switch_t *) p_sw;
@@ -393,7 +399,8 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
   spanning tree that eminate from this switch.  On input, the p_list
   contains the group members that must be routed from this switch.
 **********************************************************************/
-static void mcast_mgr_subdivide(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
+static void mcast_mgr_subdivide(osm_sm_t * sm,
+				osm_mgrp_holder_t * p_mgrp_holder,
 				osm_switch_t * p_sw, cl_qlist_t * p_list,
 				cl_qlist_t * list_array, uint8_t array_size)
 {
@@ -404,7 +411,7 @@ static void mcast_mgr_subdivide(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
 
 	OSM_LOG_ENTER(sm->p_log);
 
-	mlid_ho = cl_ntoh16(osm_mgrp_get_mlid(p_mgrp));
+	mlid_ho = cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder));
 
 	/*
 	   For Multicast Groups, we want not to count on previous
@@ -494,7 +501,8 @@ static void mcast_mgr_purge_list(osm_sm_t * sm, cl_qlist_t * p_list)
 
   The function returns the newly created mtree node element.
 **********************************************************************/
-static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
+static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm,
+					  osm_mgrp_holder_t * p_mgrp_holder,
 					  osm_switch_t * p_sw,
 					  cl_qlist_t * p_list, uint8_t depth,
 					  uint8_t upstream_port,
@@ -520,7 +528,7 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
 
 	node_guid = osm_node_get_node_guid(p_sw->p_node);
 	node_guid_ho = cl_ntoh64(node_guid);
-	mlid_ho = cl_ntoh16(osm_mgrp_get_mlid(p_mgrp));
+	mlid_ho = cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder));
 
 	OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
 		"Routing MLID 0x%X through switch 0x%" PRIx64
@@ -597,7 +605,8 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
 	for (i = 0; i < max_children; i++)
 		cl_qlist_init(&list_array[i]);
 
-	mcast_mgr_subdivide(sm, p_mgrp, p_sw, p_list, list_array, max_children);
+	mcast_mgr_subdivide(sm, p_mgrp_holder, p_sw, p_list, list_array,
+			    max_children);
 
 	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
 
@@ -680,8 +689,9 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp,
 			CL_ASSERT(p_remote_physp);
 
 			p_mtn->child_array[i] =
-			    mcast_mgr_branch(sm, p_mgrp, p_remote_node->sw,
-					     p_port_list, depth,
+			    mcast_mgr_branch(sm, p_mgrp_holder,
+					     p_remote_node->sw, p_port_list,
+					     depth,
 					     osm_physp_get_port_num
 					     (p_remote_physp), p_max_depth);
 		} else {
@@ -716,11 +726,11 @@ Exit:
 /**********************************************************************
  **********************************************************************/
 static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
-						     osm_mgrp_t * p_mgrp)
+						     osm_mgrp_holder_t *
+						     p_mgrp_holder)
 {
-	const cl_qmap_t *p_mcm_tbl;
 	const osm_port_t *p_port;
-	const osm_mcm_port_t *p_mcm_port;
+	const osm_mgrp_port_t *p_mgrp_port;
 	uint32_t num_ports;
 	cl_qlist_t port_list;
 	osm_switch_t *p_sw;
@@ -739,14 +749,13 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
 	   on multicast forwarding table information if the user wants to
 	   preserve existing multicast routes.
 	 */
-	mcast_mgr_purge_tree(sm, p_mgrp);
+	mcast_mgr_purge_tree(sm, p_mgrp_holder);
 
-	p_mcm_tbl = &p_mgrp->mcm_port_tbl;
-	num_ports = cl_qmap_count(p_mcm_tbl);
+	num_ports = cl_qmap_count(&p_mgrp_holder->mgrp_port_map);
 	if (num_ports == 0) {
 		OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
 			"MLID 0x%X has no members - nothing to do\n",
-			cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)));
+			cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)));
 		goto Exit;
 	}
 
@@ -766,11 +775,11 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
 	   Locate the switch around which to create the spanning
 	   tree for this multicast group.
 	 */
-	p_sw = mcast_mgr_find_root_switch(sm, p_mgrp);
+	p_sw = mcast_mgr_find_root_switch(sm, p_mgrp_holder);
 	if (p_sw == NULL) {
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A08: "
 			"Unable to locate a suitable switch for group 0x%X\n",
-			cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)));
+			cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)));
 		status = IB_ERROR;
 		goto Exit;
 	}
@@ -778,22 +787,22 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
 	/*
 	   Build the first "subset" containing all member ports.
 	 */
-	for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl);
-	     p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl);
-	     p_mcm_port =
-	     (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) {
+	for (p_mgrp_port =
+	     (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map);
+	     p_mgrp_port !=
+	     (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map);
+	     p_mgrp_port =
+	     (osm_mgrp_port_t *) cl_qmap_next(&p_mgrp_port->guid_item)) {
 		/*
 		   Acquire the port object for this port guid, then create
 		   the new worker object to build the list.
 		 */
-		p_port = osm_get_port_by_guid(sm->p_subn,
-					      ib_gid_get_guid(&p_mcm_port->
-							      port_gid));
+		p_port =
+		    osm_get_port_by_guid(sm->p_subn, p_mgrp_port->port_guid);
 		if (!p_port) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A09: "
 				"No port object for port 0x%016" PRIx64 "\n",
-				cl_ntoh64(ib_gid_get_guid
-					  (&p_mcm_port->port_gid)));
+				cl_ntoh64(p_mgrp_port->port_guid));
 			continue;
 		}
 
@@ -801,8 +810,7 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
 		if (p_wobj == NULL) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A10: "
 				"Insufficient memory to route port 0x%016"
-				PRIx64 "\n",
-				cl_ntoh64(osm_port_get_guid(p_port)));
+				PRIx64 "\n", cl_ntoh64(p_mgrp_port->port_guid));
 			continue;
 		}
 
@@ -810,12 +818,14 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm,
 	}
 
 	count = cl_qlist_count(&port_list);
-	p_mgrp->p_root = mcast_mgr_branch(sm, p_mgrp, p_sw, &port_list, 0, 0,
-					  &max_depth);
+	p_mgrp_holder->p_root =
+	    mcast_mgr_branch(sm, p_mgrp_holder, p_sw, &port_list, 0, 0,
+			     &max_depth);
 
 	OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
 		"Configured MLID 0x%X for %u ports, max tree depth = %u\n",
-		cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)), count, max_depth);
+		cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)), count,
+		max_depth);
 
 Exit:
 	OSM_LOG_EXIT(sm->p_log);
@@ -1023,17 +1033,20 @@ Exit:
  NOTE : The lock should be held externally!
  **********************************************************************/
 static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm,
-					      IN osm_mgrp_t * p_mgrp)
+					      IN osm_mgrp_holder_t * p_mgrp_holder)
 {
 	ib_api_status_t status = IB_SUCCESS;
 	ib_net16_t mlid;
+	osm_mgrp_t *p_mgrp;
+	cl_list_item_t *p_item;
+	unsigned has_full_members = 0;
 
 	OSM_LOG_ENTER(sm->p_log);
 
-	mlid = osm_mgrp_get_mlid(p_mgrp);
+	mlid = osm_mgrp_holder_get_mlid(p_mgrp_holder);
 
 	OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-		"Processing multicast group 0x%X\n", cl_ntoh16(mlid));
+		"Processing multicast group_holder 0x%X\n", cl_ntoh16(mlid));
 
 	/*
 	   Clear the multicast tables to start clean, then build
@@ -1042,27 +1055,52 @@ static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm,
 	 */
 	mcast_mgr_clear(sm, cl_ntoh16(mlid));
 
-	if (p_mgrp->full_members) {
-		status = mcast_mgr_build_spanning_tree(sm, p_mgrp);
+	p_item = cl_qlist_head(&p_mgrp_holder->mgrp_list);
+	while (p_item != cl_qlist_end(&p_mgrp_holder->mgrp_list)) {
+		char gid_str[INET6_ADDRSTRLEN];
+		p_mgrp = (osm_mgrp_t *)
+		    PARENT_STRUCT(p_item, osm_mgrp_t, mlid_item);
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"MLID  0x%x has mgrp  %s\n", cl_ntoh16(p_mgrp->mlid),
+			inet_ntop(AF_INET6,
+				  p_mgrp->mcmember_rec.mgid.raw,
+				  gid_str, sizeof(gid_str)));
+		p_item = cl_qlist_next(p_item);
+		if (p_mgrp->to_be_deleted) {
+			osm_mcm_port_t *p_mcm_port;
+			OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+				"Destroying mgrp  %s with lid:0x%x\n",
+				inet_ntop(AF_INET6,
+					  p_mgrp->mcmember_rec.mgid.raw,
+					  gid_str, sizeof(gid_str)),
+				cl_ntoh16(p_mgrp->mlid));
+			osm_mgrp_holder_delete_mgrp(p_mgrp_holder, p_mgrp);
+			p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(&p_mgrp->mcm_port_tbl);
+			while (p_mcm_port !=
+			       (osm_mcm_port_t *) cl_qmap_end(&p_mgrp->mcm_port_tbl)) {
+				osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp,
+					p_mcm_port->port_gid.unicast.interface_id);
+				p_mcm_port =
+				    (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item);
+			}
+			cl_fmap_remove_item(&sm->p_subn->mgrp_mgid_tbl,
+					    &p_mgrp->map_item);
+			osm_mgrp_delete(p_mgrp);
+		}
+		else if (!has_full_members)
+			has_full_members = p_mgrp->full_members;
+	}
+	if (has_full_members) {
+		status = mcast_mgr_build_spanning_tree(sm, p_mgrp_holder);
 		if (status != IB_SUCCESS) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: "
 				"Unable to create spanning tree (%s)\n",
 				ib_get_err_str(status));
 			goto Exit;
 		}
-	} else  if (p_mgrp->to_be_deleted) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-			"Destroying mgrp with lid:0x%x\n",
-			cl_ntoh16(p_mgrp->mlid));
-		sm->p_subn->mgroups[cl_ntoh16(p_mgrp->mlid) -
-				    IB_LID_MCAST_START_HO] = NULL;
-		cl_fmap_remove_item(&sm->p_subn->mgrp_mgid_tbl,
-				    &p_mgrp->map_item);
-		osm_mgrp_delete(p_mgrp);
-		goto Exit;
+		p_mgrp_holder->last_tree_id = p_mgrp_holder->last_change_id;
 	}
 
-	p_mgrp->last_tree_id = p_mgrp->last_change_id;
 
 Exit:
 	OSM_LOG_EXIT(sm->p_log);
@@ -1076,7 +1114,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 	osm_switch_t *p_sw;
 	cl_qmap_t *p_sw_tbl;
 	cl_qlist_t *p_list = &sm->mgrp_list;
-	osm_mgrp_t *p_mgrp;
+	osm_mgrp_holder_t *p_mgrp_holder;
 	int i, ret = 0;
 
 	OSM_LOG_ENTER(sm->p_log);
@@ -1104,9 +1142,10 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 		   of the subnet. Not due to a specific multicast request.
 		   So the request type is subnet_change and the port guid is 0.
 		 */
-		p_mgrp = sm->p_subn->mgroups[i];
-		if (p_mgrp)
-			mcast_mgr_process_mgrp(sm, p_mgrp);
+		p_mgrp_holder = sm->p_subn->mgroup_holders[i];
+		if (p_mgrp_holder) {
+			mcast_mgr_process_mgrp(sm, p_mgrp_holder);
+		}
 	}
 
 	/*
@@ -1141,7 +1180,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 	cl_qlist_t *p_list = &sm->mgrp_list;
 	osm_switch_t *p_sw;
 	cl_qmap_t *p_sw_tbl;
-	osm_mgrp_t *p_mgrp;
+	osm_mgrp_holder_t *p_mgrp_holder;
 	ib_net16_t mlid;
 	osm_mcast_mgr_ctxt_t *ctx;
 	int ret = 0;
@@ -1169,24 +1208,25 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 
 		/* since we delayed the execution we prefer to pass the
 		   mlid as the mgrp identifier and then find it or abort */
-		p_mgrp = osm_get_mgrp_by_mlid(sm->p_subn, mlid);
-		if (!p_mgrp)
+		p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sm->p_subn, mlid);
+		if (!p_mgrp_holder)
 			continue;
 
 		/* if there was no change from the last time
 		 * we processed the group we can skip doing anything
 		 */
-		if (p_mgrp->last_change_id == p_mgrp->last_tree_id) {
+		if (p_mgrp_holder->last_change_id ==
+		    p_mgrp_holder->last_tree_id) {
 			OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-				"Skip processing mgrp with lid:0x%X change id:%u\n",
-				cl_ntoh16(mlid), p_mgrp->last_change_id);
+				"Skip processing mgrp holder with lid:0x%X change id:%u\n",
+				cl_ntoh16(mlid), p_mgrp_holder->last_change_id);
 			continue;
 		}
 
 		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
 			"Processing mgrp with lid:0x%X change id:%u\n",
-			cl_ntoh16(mlid), p_mgrp->last_change_id);
-		mcast_mgr_process_mgrp(sm, p_mgrp);
+			cl_ntoh16(mlid), p_mgrp_holder->last_change_id);
+		mcast_mgr_process_mgrp(sm, p_mgrp_holder);
 	}
 
 	/*
diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
index d2733c4..072b591 100644
--- a/opensm/opensm/osm_multicast.c
+++ b/opensm/opensm/osm_multicast.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -48,6 +48,7 @@
 #include <opensm/osm_mcm_port.h>
 #include <opensm/osm_mtree.h>
 #include <opensm/osm_inform.h>
+#include <arpa/inet.h>
 
 /**********************************************************************
  **********************************************************************/
@@ -67,8 +68,6 @@ void osm_mgrp_delete(IN osm_mgrp_t * p_mgrp)
 		    (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item);
 		osm_mcm_port_delete(p_mcm_port);
 	}
-	/* destroy the mtree_node structure */
-	osm_mtree_destroy(p_mgrp->p_root);
 
 	free(p_mgrp);
 }
@@ -86,9 +85,6 @@ osm_mgrp_t *osm_mgrp_new(IN const ib_net16_t mlid)
 	memset(p_mgrp, 0, sizeof(*p_mgrp));
 	cl_qmap_init(&p_mgrp->mcm_port_tbl);
 	p_mgrp->mlid = mlid;
-	p_mgrp->last_change_id = 0;
-	p_mgrp->last_tree_id = 0;
-	p_mgrp->to_be_deleted = FALSE;
 
 	return p_mgrp;
 }
@@ -133,6 +129,7 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t * subn, osm_log_t * log,
 	ib_net64_t port_guid;
 	osm_mcm_port_t *p_mcm_port;
 	cl_map_item_t *prev_item;
+	osm_mgrp_holder_t *p_mgrp_holder;
 	uint8_t prev_join_state = 0;
 	uint8_t prev_scope;
 
@@ -167,9 +164,18 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t * subn, osm_log_t * log,
 		p_mcm_port->scope_state =
 		    ib_member_set_scope_state(prev_scope,
 					      prev_join_state | join_state);
-	} else {
-		/* track the fact we modified the group ports */
-		p_mgrp->last_change_id++;
+	}
+
+	p_mgrp_holder = osm_get_mgrp_holder_by_mlid(subn, p_mgrp->mlid);
+	if (! p_mgrp_holder ||
+			 (IB_SUCCESS != osm_mgrp_holder_port_add_mgrp(p_mgrp_holder,
+						p_mgrp, port_guid)) ) {
+			/* if the above failed and the added port is a new one, also remove it from mcm_port_tbl */
+			if (! prev_join_state) {
+				cl_qmap_remove_item(&p_mgrp->mcm_port_tbl, &p_mcm_port->map_item);
+				osm_mcm_port_delete(p_mcm_port);
+			}
+			return NULL;
 	}
 
 	if ((join_state & IB_JOIN_STATE_FULL) &&
@@ -212,7 +218,6 @@ int osm_mgrp_remove_port(osm_subn_t * subn, osm_log_t * log, osm_mgrp_t * mgrp,
 			cl_ntoh64(mcm->port_gid.unicast.interface_id));
 		osm_mcm_port_delete(mcm);
 		/* track the fact we modified the group */
-		mgrp->last_change_id++;
 		ret = 1;
 	}
 
@@ -285,16 +290,173 @@ static void mgrp_apply_func_sub(const osm_mgrp_t * p_mgrp,
 
 /**********************************************************************
  **********************************************************************/
-void osm_mgrp_apply_func(const osm_mgrp_t * p_mgrp, osm_mgrp_func_t p_func,
-			 void *context)
+static osm_mgrp_port_t *osm_mgrp_port_new(ib_net64_t port_guid)
+{
+	osm_mgrp_port_t *p_mgrp_port =
+		(osm_mgrp_port_t *) malloc(sizeof(osm_mgrp_port_t));
+	if (!p_mgrp_port) {
+		return NULL;
+	}
+	memset(p_mgrp_port, 0, sizeof(*p_mgrp_port));
+	p_mgrp_port->port_guid = port_guid;
+	cl_qlist_init(&p_mgrp_port->mgroups);
+	return p_mgrp_port;
+}
+
+/**********************************************************************
+ **********************************************************************/
+osm_mgrp_holder_t *osm_mgrp_holder_new(IN osm_subn_t * p_subn,
+					ib_net16_t mlid)
 {
-	osm_mtree_node_t *p_mtn;
+	osm_mgrp_holder_t *p_mgrp_holder;
+	p_mgrp_holder =
+		p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] =
+		(osm_mgrp_holder_t *) malloc(sizeof(*p_mgrp_holder));
+	if (!p_mgrp_holder)
+		return NULL;
 
-	CL_ASSERT(p_mgrp);
-	CL_ASSERT(p_func);
+	memset(p_mgrp_holder, 0, sizeof(*p_mgrp_holder));
+	p_mgrp_holder->mlid = mlid;
+	cl_qmap_init(&p_mgrp_holder->mgrp_port_map);
+	cl_qlist_init(&p_mgrp_holder->mgrp_list);
+	return p_mgrp_holder;
+}
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mgrp_holder_delete(IN osm_subn_t *p_subn, ib_net16_t mlid)
+{
+	osm_mgrp_port_t *p_osm_mgr_port;
+	cl_map_item_t *p_item;
+
+	osm_mgrp_holder_t *p_mgrp_holder =
+		p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO];
+	p_item = cl_qmap_head(&p_mgrp_holder->mgrp_port_map);
+	/* Delete the ports sharing this MLID */
+	while (p_item != cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) {
+		p_osm_mgr_port = (osm_mgrp_port_t *) p_item;
+		cl_qlist_remove_all(&p_osm_mgr_port->mgroups);
+		cl_qmap_remove_item(&p_mgrp_holder->mgrp_port_map, p_item);
+		p_item = cl_qmap_head(&p_mgrp_holder->mgrp_port_map);
+		free(p_osm_mgr_port);
+	}
+	/* Remove mgrp from this MLID */
+	cl_qlist_remove_all(&p_mgrp_holder->mgrp_list);
+	/* Destroy the mtree_node structure */
+	osm_mtree_destroy(p_mgrp_holder->p_root);
+	p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = NULL;
+	free(p_mgrp_holder);
+}
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mgrp_holder_remove_port(osm_subn_t * subn, osm_log_t * p_log,
+				osm_mgrp_holder_t * p_mgrp_holder,
+				ib_net64_t port_guid)
+{
+	osm_mgrp_t *p_mgrp;
+	cl_list_item_t *p_item;
+
+	OSM_LOG_ENTER(p_log);
+
+	osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *)
+		cl_qmap_remove(&p_mgrp_holder->mgrp_port_map, port_guid);
+	if (p_mgrp_port !=
+		(osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) {
+		char gid_str[INET6_ADDRSTRLEN];
+		OSM_LOG(p_log, OSM_LOG_DEBUG,
+			"port 0x%" PRIx64 " removed from mlid 0x%X\n",
+			cl_ntoh64(port_guid), cl_ntoh16(p_mgrp_holder->mlid));
+		while ((p_item =
+			cl_qlist_remove_head(&p_mgrp_port->mgroups)) !=
+			cl_qlist_end(&p_mgrp_port->mgroups)) {
+			p_mgrp = (osm_mgrp_t *)
+				PARENT_STRUCT(p_item, osm_mgrp_t,port_item);
+			OSM_LOG(p_log, OSM_LOG_DEBUG,
+				"removing mgrp mgid %s from port 0x%" PRIx64 "\n",
+				 inet_ntop(AF_INET6,p_mgrp->mcmember_rec.mgid.raw,
+					gid_str, sizeof(gid_str)),
+					cl_ntoh64(port_guid));
+			osm_mgrp_delete_port(subn, p_log, p_mgrp, port_guid);
+		}
+		free(p_mgrp_port);
+	}
+	OSM_LOG_EXIT(p_log);
+}
 
-	p_mtn = p_mgrp->p_root;
+/**********************************************************************
+ **********************************************************************/
+void osm_mgrp_holder_add_mgrp(osm_mgrp_holder_t * p_mgrp_holder,
+				osm_mgrp_t * p_mgrp, osm_log_t *  p_log)
+{
+	char gid_str[INET6_ADDRSTRLEN];
+
+	OSM_LOG_ENTER(p_log);
+	p_mgrp_holder->to_be_deleted = 0;
+	cl_qlist_insert_tail(&p_mgrp_holder->mgrp_list, &p_mgrp->mlid_item);
+	OSM_LOG(p_log, OSM_LOG_DEBUG,
+		"mgrp with MGID:%s added to holder with mlid = 0x%X\n",
+		inet_ntop(AF_INET6, p_mgrp->mcmember_rec.mgid.raw, gid_str,
+		sizeof(gid_str)), cl_ntoh16(p_mgrp_holder->mlid));
+	p_mgrp_holder->last_change_id++;
+	OSM_LOG_EXIT(p_log);
+}
 
-	if (p_mtn)
-		mgrp_apply_func_sub(p_mgrp, p_mtn, p_func, context);
+/**********************************************************************
+ **********************************************************************/
+void osm_mgrp_holder_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder,
+				 osm_mgrp_t * p_mgrp)
+{
+	p_mgrp->to_be_deleted = 1;
+	cl_qlist_remove_item(&p_mgrp_holder->mgrp_list, &p_mgrp->mlid_item);
+	if (0 == cl_qlist_count(&p_mgrp_holder->mgrp_list)) {
+		/* No more mgroups on this mlid */
+		p_mgrp_holder->to_be_deleted = 1;
+		p_mgrp_holder->last_tree_id = 0;
+		p_mgrp_holder->last_change_id = 0;
+	}
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t osm_mgrp_holder_port_add_mgrp(osm_mgrp_holder_t * p_mgrp_holder,
+						osm_mgrp_t * p_mgrp,
+						ib_net64_t port_guid)
+{
+	osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *)
+		cl_qmap_get(&p_mgrp_holder->mgrp_port_map, port_guid);
+	if (p_mgrp_port ==
+		(osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) {
+		/* new port to mlid */
+		p_mgrp_port = osm_mgrp_port_new(port_guid);
+		if (!p_mgrp_port) {
+			return IB_INSUFFICIENT_MEMORY;
+		}
+		cl_qmap_insert(&p_mgrp_holder->mgrp_port_map,
+			p_mgrp_port->port_guid, &p_mgrp_port->guid_item);
+	}
+	cl_qlist_insert_tail(&p_mgrp_port->mgroups, &p_mgrp->port_item);
+	p_mgrp_holder->last_change_id++;
+	return IB_SUCCESS;
+}
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mgrp_holder_port_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder,
+				      osm_mgrp_t * p_mgrp,
+				      ib_net64_t port_guid)
+{
+	osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *)
+		cl_qmap_get(&p_mgrp_holder->mgrp_port_map, port_guid);
+	if (p_mgrp_port !=
+		(osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) {
+		cl_qlist_remove_item(&p_mgrp_port->mgroups, &p_mgrp->port_item);
+		if (0 == cl_qlist_count(&p_mgrp_port->mgroups)) {
+			/* No mgroups registered on this port for current mlid */
+			cl_qmap_remove_item(&p_mgrp_holder->mgrp_port_map,
+				&p_mgrp_port->guid_item);
+			free(p_mgrp_port);
+		}
+		p_mgrp_holder->last_change_id++;
+	}
 }
diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 7826578..041377f 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -785,7 +785,9 @@ static void __qos_policy_validate_pkey(
 	uint8_t sl;
 	uint32_t flow;
 	uint8_t hop;
+	osm_mgrp_holder_t * p_mgrp_holder;
 	osm_mgrp_t * p_mgrp;
+	cl_list_item_t *p_item;
 
 	if (!p_qos_policy || !p_qos_match_rule || !p_prtn)
 		return;
@@ -809,31 +811,35 @@ static void __qos_policy_validate_pkey(
 	if (!p_prtn->mlid)
 		return;
 
-	p_mgrp = osm_get_mgrp_by_mlid(p_qos_policy->p_subn, p_prtn->mlid);
-	if (!p_mgrp) {
+	p_mgrp_holder =
+		osm_get_mgrp_holder_by_mlid(p_qos_policy->p_subn, p_prtn->mlid);
+	if (!p_mgrp_holder) {
 		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_ERROR,
-			"ERR AC16: MCast group for partition with "
-			"pkey 0x%04X not found\n",
-			cl_ntoh16(p_prtn->pkey));
+		"ERR AC16: MCast mgrp_holder for partition with pkey 0x%04X not found\n",
+		cl_ntoh16(p_prtn->pkey));
 		return;
 	}
 
-	CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) ==
-		  (cl_ntoh16(p_prtn->pkey) & 0x7fff));
-
-	ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop,
-				  &sl, &flow, &hop);
-	if (sl != p_prtn->sl) {
-		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+	p_item = cl_qlist_head(&p_mgrp_holder->mgrp_list);
+	while (p_item != cl_qlist_end(&p_mgrp_holder->mgrp_list)) {
+		p_mgrp = (osm_mgrp_t *) PARENT_STRUCT(p_item, osm_mgrp_t,
+			mlid_item);
+		p_item = cl_qlist_next(p_item);
+		CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) ==
+			(cl_ntoh16(p_prtn->pkey) & 0x7fff));
+		ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop,
+			&sl, &flow, &hop);
+		if (sl != p_prtn->sl) {
+			OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
 			"Updating MCGroup (MLID 0x%04x) SL to "
 			"match partition SL (%u)\n",
 			cl_hton16(p_mgrp->mcmember_rec.mlid),
 			p_prtn->sl);
-		p_mgrp->mcmember_rec.sl_flow_hop =
-			ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop);
+			p_mgrp->mcmember_rec.sl_flow_hop =
+				ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop);
+		}
 	}
 }
-
 /***************************************************
  ***************************************************/
 
diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index fcc3f27..22dd495 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -706,17 +706,15 @@ static void sa_dump_all_sa(osm_opensm_t * p_osm, FILE * file)
 {
 	struct opensm_dump_context dump_context;
 	osm_mgrp_t *p_mgrp;
-	int i;
 
 	dump_context.p_osm = p_osm;
 	dump_context.file = file;
 	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump multicast\n");
 	cl_plock_acquire(&p_osm->lock);
-	for (i = 0; i <= p_osm->subn.max_mcast_lid_ho - IB_LID_MCAST_START_HO;
-	     i++) {
-		p_mgrp = p_osm->subn.mgroups[i];
-		if (p_mgrp)
-			sa_dump_one_mgrp(p_mgrp, &dump_context);
+	p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_osm->subn.mgrp_mgid_tbl);
+	while (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_osm->subn.mgrp_mgid_tbl)) {
+		sa_dump_one_mgrp(p_mgrp, &dump_context);
+		p_mgrp = (osm_mgrp_t*) cl_fmap_next(&p_mgrp->map_item);
 	}
 	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump inform\n");
 	cl_qlist_apply_func(&p_osm->subn.sa_infr_list,
@@ -740,23 +738,16 @@ static osm_mgrp_t *load_mcgroup(osm_opensm_t * p_osm, ib_net16_t mlid,
 				unsigned well_known)
 {
 	ib_net64_t comp_mask;
-	osm_mgrp_t *p_mgrp;
 
+	cl_fmap_item_t *p_fitem;
+	osm_mgrp_t *p_mgrp = NULL;
 	cl_plock_excl_acquire(&p_osm->lock);
 
-	p_mgrp = osm_get_mgrp_by_mlid(&p_osm->subn, mlid);
-	if (p_mgrp) {
-		if (!memcmp(&p_mgrp->mcmember_rec.mgid, &p_mcm_rec->mgid,
-			    sizeof(ib_gid_t))) {
-			OSM_LOG(&p_osm->log, OSM_LOG_DEBUG,
-				"mgrp %04x is already here.", cl_ntoh16(mlid));
+	p_fitem = cl_fmap_get(&p_osm->subn.mgrp_mgid_tbl, &p_mcm_rec->mgid);
+	if (p_fitem != cl_fmap_end(&p_osm->subn.mgrp_mgid_tbl)) {
+		OSM_LOG(&p_osm->log, OSM_LOG_DEBUG,
+			"mgrp %04x is already here.\n", cl_ntoh16(mlid));
 			goto _out;
-		}
-		OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
-			"mlid %04x is already used by another MC group. Will "
-			"request clients reregistration.\n", cl_ntoh16(mlid));
-		p_mgrp = NULL;
-		goto _out;
 	}
 
 	comp_mask = IB_MCR_COMPMASK_MTU | IB_MCR_COMPMASK_MTU_SEL
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index a9e0a3b..3838a08 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -121,14 +121,17 @@ static ib_net16_t get_new_mlid(osm_sa_t * sa, ib_net16_t requested_mlid)
 
 	if (requested_mlid && cl_ntoh16(requested_mlid) >= IB_LID_MCAST_START_HO
 	    && cl_ntoh16(requested_mlid) <= p_subn->max_mcast_lid_ho
-	    && !osm_get_mgrp_by_mlid(p_subn, requested_mlid))
+	    && !osm_get_mgrp_holder_by_mlid(p_subn, requested_mlid))
 		return requested_mlid;
 
 	max = p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO + 1;
 	for (i = 0; i < max; i++) {
-		osm_mgrp_t *p_mgrp = sa->p_subn->mgroups[i];
-		if (!p_mgrp || p_mgrp->to_be_deleted)
-			return cl_hton16(i + IB_LID_MCAST_START_HO);
+		osm_mgrp_holder_t *p_mgrp_holder = sa->p_subn->mgroup_holders[i];
+		if (!p_mgrp_holder || p_mgrp_holder->to_be_deleted) {
+			OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "returning mgrp_holder to_be_deleted = %d\n",
+				p_mgrp_holder ? p_mgrp_holder->to_be_deleted : 0);
+			return cl_hton16(i + IB_LID_MCAST_START_HO);
+		}
 	}
 
 	return 0;
@@ -146,8 +149,9 @@ static void cleanup_mgrp(IN osm_sa_t * sa, osm_mgrp_t * mgrp)
 	/* Remove MGRP only if osm_mcm_port_t count is 0 and
 	   not a well known group */
 	if (cl_is_qmap_empty(&mgrp->mcm_port_tbl) && !mgrp->well_known) {
-		sa->p_subn->mgroups[cl_ntoh16(mgrp->mlid) -
-				    IB_LID_MCAST_START_HO] = NULL;
+		osm_mgrp_holder_t *p_mgrp_holder =
+			osm_get_mgrp_holder_by_mlid(sa->p_subn, mgrp->mlid);
+		osm_mgrp_holder_delete_mgrp(p_mgrp_holder, mgrp);
 		cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl,
 				    &mgrp->map_item);
 		osm_mgrp_delete(mgrp);
@@ -802,19 +806,19 @@ static boolean_t mgrp_request_is_realizable(IN osm_sa_t * sa,
  Call this function to create a new mgrp.
 **********************************************************************/
 ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
-					     IN ib_net64_t comp_mask,
-					     IN const ib_member_rec_t *
-					     const p_recvd_mcmember_rec,
-					     IN const osm_physp_t * p_physp,
-					     OUT osm_mgrp_t ** pp_mgrp)
+						IN ib_net64_t comp_mask,
+						IN const ib_member_rec_t *
+						const p_recvd_mcmember_rec,
+						IN const osm_physp_t * p_physp,
+						OUT osm_mgrp_t ** pp_mgrp)
 {
-	ib_net16_t mlid;
+	ib_net16_t mlid, existed_mlid;
 	unsigned zero_mgid, i;
 	uint8_t scope;
 	ib_gid_t *p_mgid;
-	osm_mgrp_t *p_prev_mgrp;
 	ib_api_status_t status = IB_SUCCESS;
 	ib_member_rec_t mcm_rec = *p_recvd_mcmember_rec;	/* copy for modifications */
+	osm_mgrp_holder_t * p_mgrp_holder;
 
 	OSM_LOG_ENTER(sa->p_log);
 
@@ -890,6 +894,15 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 		goto Exit;
 	}
 
+	if (0 != (existed_mlid = osm_mgrp_holder_get_mlid_by_mgid(sa->p_subn, p_mgid))) {
+		char gid_str[INET6_ADDRSTRLEN];
+		mlid = existed_mlid;
+		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
+			"found existing mlid 0x%04x for mgid %s\n",
+			cl_ntoh16(mlid), inet_ntop(AF_INET6, p_mgid->raw,
+						   gid_str, sizeof gid_str));
+	}
+
 	/* create a new MC Group */
 	*pp_mgrp = osm_mgrp_new(mlid);
 	if (*pp_mgrp == NULL) {
@@ -914,25 +927,26 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 
 	/* Insert the new group in the data base */
 
-	/* since we might have an old group by that mlid
-	   one whose deletion was delayed for an idle time
-	   we need to deallocate it first */
-	p_prev_mgrp = osm_get_mgrp_by_mlid(sa->p_subn, mlid);
-	if (p_prev_mgrp) {
+
+	p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sa->p_subn, mlid);
+	if (!p_mgrp_holder) {
 		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
-			"Found previous group for mlid:0x%04x - "
-			"Destroying it first\n", cl_ntoh16(mlid));
-		sa->p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] =
-		    NULL;
-		cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl,
-				    &p_prev_mgrp->map_item);
-		osm_mgrp_delete(p_prev_mgrp);
+			"Creating new mgrp_holder  for mlid:0x%04x\n",
+			cl_ntoh16(mlid));
+		p_mgrp_holder = osm_mgrp_holder_new(sa->p_subn,  mlid);
 	}
 
+	if (!p_mgrp_holder) {
+		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B08: "
+			"osm_mgrp_holder_new failed\n");
+		free_mlid(sa, mlid);
+		status = IB_INSUFFICIENT_MEMORY;
+		goto Exit;
+	}
 	cl_fmap_insert(&sa->p_subn->mgrp_mgid_tbl,
 		       &(*pp_mgrp)->mcmember_rec.mgid, &(*pp_mgrp)->map_item);
 
-	sa->p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = *pp_mgrp;
+	osm_mgrp_holder_add_mgrp(p_mgrp_holder, *pp_mgrp, sa->p_log);
 
 Exit:
 	OSM_LOG_EXIT(sa->p_log);
@@ -1074,7 +1088,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	CL_PLOCK_RELEASE(sa->p_lock);
 
 	/* we can leave if port was deleted from MCG */
-	if (removed && osm_sm_mcgrp_leave(sa->sm, mlid, portguid))
+	if (removed && osm_sm_mcgrp_leave(sa->sm, p_mgrp, portguid))
 		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B09: "
 			"osm_sm_mcgrp_leave failed\n");
 
@@ -1102,6 +1116,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	osm_physp_t *p_request_physp;
 	uint8_t is_new_group;	/* TRUE = there is a need to create a group */
 	uint8_t join_state;
+	osm_mgrp_holder_t *p_mgrp_holder;
 
 	OSM_LOG_ENTER(sa->p_log);
 
@@ -1275,6 +1290,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 		goto Exit;
 	}
 
+	p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sa->p_subn, mlid);
+	CL_ASSERT(p_mgrp_holder);
 	/* create or update existing port (join-state will be updated) */
 	status = add_new_mgrp_port(sa, p_mgrp, p_recvd_mcmember_rec,
 				   osm_madw_get_mad_addr_ptr(p_madw),
@@ -1282,6 +1299,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 
 	if (status != IB_SUCCESS) {
 		/* we fail to add the port so we might need to delete the group */
+		osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp,
+					p_recvd_mcmember_rec->port_gid.unicast.interface_id);
 		cleanup_mgrp(sa, p_mgrp);
 
 		CL_PLOCK_RELEASE(sa->p_lock);
@@ -1304,7 +1323,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	/* do the actual routing (actually schedule the update) */
 	status = osm_sm_mcgrp_join(sa->sm, mlid,
 				   p_recvd_mcmember_rec->port_gid.unicast.
-				   interface_id);
+				   interface_id, &p_recvd_mcmember_rec->mgid);
 
 	if (status != IB_SUCCESS) {
 		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: "
@@ -1315,9 +1334,10 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 		CL_PLOCK_EXCL_ACQUIRE(sa->p_lock);
 
 		/* the request for routing failed so we need to remove the port */
+		osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp,
+				p_recvd_mcmember_rec->port_gid.unicast.interface_id);
 		osm_mgrp_delete_port(sa->p_subn, sa->p_log, p_mgrp,
-				     p_recvd_mcmember_rec->port_gid.
-				     unicast.interface_id);
+				p_recvd_mcmember_rec->port_gid.unicast.interface_id);
 		cleanup_mgrp(sa, p_mgrp);
 		CL_PLOCK_RELEASE(sa->p_lock);
 		osm_sa_send_error(sa, p_madw, IB_SA_MAD_STATUS_NO_RESOURCES);
@@ -1549,7 +1569,6 @@ static void mcmr_query_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	osm_physp_t *p_req_physp;
 	boolean_t trusted_req;
 	osm_mgrp_t *p_mgrp;
-	int i;
 
 	OSM_LOG_ENTER(sa->p_log);
 
@@ -1578,12 +1597,11 @@ static void mcmr_query_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	CL_PLOCK_ACQUIRE(sa->p_lock);
 
 	/* simply go over all MCGs and match */
-	for (i = 0; i <= sa->p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO;
-	     i++) {
-		p_mgrp = sa->p_subn->mgroups[i];
-		if (p_mgrp)
-			mcmr_by_comp_mask(sa, p_rcvd_rec, comp_mask, p_mgrp,
-					  p_req_physp, trusted_req, &rec_list);
+	p_mgrp = (osm_mgrp_t *) cl_fmap_head(&sa->p_subn->mgrp_mgid_tbl);
+	while (p_mgrp != (osm_mgrp_t *) cl_fmap_end(&sa->p_subn->mgrp_mgid_tbl)) {
+		mcmr_by_comp_mask(sa, p_rcvd_rec, comp_mask, p_mgrp,
+				  p_req_physp, trusted_req, &rec_list);
+		p_mgrp = (osm_mgrp_t *) cl_fmap_next(&p_mgrp->map_item);
 	}
 
 	CL_PLOCK_RELEASE(sa->p_lock);
diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
index 75d9516..aa63d78 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
@@ -1468,11 +1468,14 @@ static osm_mgrp_t *pr_get_mgrp(IN osm_sa_t * sa, IN const osm_madw_t * p_madw)
 				mgrp = NULL;
 				goto Exit;
 			}
-		} else
-		    if (!(mgrp = osm_get_mgrp_by_mlid(sa->p_subn, p_pr->dlid)))
-			OSM_LOG(sa->p_log, OSM_LOG_ERROR,
-				"ERR 1F11: " "No MC group found for PathRecord "
+		} else {
+			mgrp = osm_get_mgrp_by_mgid(sa, &p_pr->dgid);
+			if (!mgrp)
+				OSM_LOG(sa->p_log, OSM_LOG_ERROR,
+				"ERR 1F11: "
+				"No MC group found for PathRecord "
 				"destination LID 0x%x\n", p_pr->dlid);
+		}
 	}
 
 Exit:
diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
index b3ce69a..d990450 100644
--- a/opensm/opensm/osm_sm.c
+++ b/opensm/opensm/osm_sm.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -47,6 +47,7 @@
 
 #include <stdlib.h>
 #include <string.h>
+#include <arpa/inet.h>
 #include <iba/ib_types.h>
 #include <complib/cl_qmap.h>
 #include <complib/cl_passivelock.h>
@@ -468,12 +469,15 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
 /**********************************************************************
  **********************************************************************/
 ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
-				  IN const ib_net64_t port_guid)
+				  IN const ib_net64_t port_guid,
+				  IN const ib_gid_t * p_mgid)
 {
-	osm_mgrp_t *p_mgrp;
+	osm_mgrp_t *p_mgrp = NULL;
 	osm_port_t *p_port;
 	ib_api_status_t status = IB_SUCCESS;
 	osm_mcm_info_t *p_mcm;
+	cl_list_item_t *p_item;
+	osm_mgrp_holder_t *p_mgrp_holder;
 
 	OSM_LOG_ENTER(p_sm->p_log);
 
@@ -497,8 +501,44 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 	/*
 	 * If this multicast group does not already exist, create it.
 	 */
-	p_mgrp = osm_get_mgrp_by_mlid(p_sm->p_subn, mlid);
-	if (!p_mgrp || !osm_mgrp_is_guid(p_mgrp, port_guid)) {
+	p_mgrp_holder = osm_get_mgrp_holder_by_mlid(p_sm->p_subn, mlid);
+	if (p_mgrp_holder) {
+		char gid_str[INET6_ADDRSTRLEN];
+		if (TRUE) {
+			size_t gr_count = cl_qlist_count(&p_mgrp_holder->mgrp_list);
+			OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+				"mlid 0x%X has %lu mgroups\n", cl_ntoh16(mlid), (unsigned long) gr_count);
+			if (gr_count) {
+				p_item =
+				    cl_qlist_head(&p_mgrp_holder->mgrp_list);
+				while (p_item !=
+				       cl_qlist_end(&p_mgrp_holder->mgrp_list)) {
+					p_mgrp = (osm_mgrp_t *)
+					    PARENT_STRUCT(p_item, osm_mgrp_t,
+							  mlid_item);
+					OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+						"mlid  0x%X has mgrp with MGID: %s\n",
+						cl_ntoh16(mlid),
+						inet_ntop(AF_INET6,
+							  p_mgrp->mcmember_rec.
+							  mgid.raw, gid_str,
+							  sizeof gid_str));
+					p_item = cl_qlist_next(p_item);
+				}
+			}
+		}
+		p_mgrp  = (osm_mgrp_t *)cl_fmap_get(&p_sm->p_subn->mgrp_mgid_tbl, p_mgid);
+		if (p_mgrp == (osm_mgrp_t *)cl_fmap_end(&p_sm->p_subn->mgrp_mgid_tbl)) {
+			p_mgrp = NULL;
+			OSM_LOG(p_sm->p_log, OSM_LOG_ERROR,
+				"group with MGID: %s not found on mlid 0x%X\n",
+				inet_ntop(AF_INET6,
+					  p_mgid->raw,
+					  gid_str, sizeof gid_str),
+				cl_ntoh16(mlid));
+		}
+	}
+	if (!p_mgrp_holder || !p_mgrp || !osm_mgrp_is_guid(p_mgrp, port_guid)) {
 		/*
 		 * The group removed or the port is not a
 		 * member of the group, then fail immediately.
@@ -513,6 +553,22 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 		goto Exit;
 	}
 
+	/* if there was no change from the last time
+	 * we processed the group we can skip doing anything
+	 */
+	if (p_mgrp_holder->last_change_id == p_mgrp_holder->last_tree_id) {
+		OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE,
+			"Skip processing mgrp holder with lid:0x%X last change id:%u\n",
+			cl_ntoh16(mlid), p_mgrp_holder->last_change_id);
+		goto Exit;
+	} else {
+		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+			"processing mgrp holder with lid:0x%X port: 0x%016"
+			PRIx64 " last change id:%u tree id:%u\n",
+			cl_ntoh16(mlid), cl_ntoh64(port_guid),
+			p_mgrp_holder->last_change_id,
+			p_mgrp_holder->last_tree_id);
+	}
 	/*
 	 * Check if the object (according to mlid) already exists on this port.
 	 * If it does - then no need to update it again, and no need to
@@ -549,12 +605,13 @@ Exit:
 
 /**********************************************************************
  **********************************************************************/
-ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
+ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN osm_mgrp_t * p_mgrp,
 				   IN const ib_net64_t port_guid)
 {
-	osm_mgrp_t *p_mgrp;
 	osm_port_t *p_port;
 	ib_api_status_t status;
+	osm_mgrp_holder_t *p_mgrp_holder;
+	ib_net16_t mlid = p_mgrp->mlid;
 
 	OSM_LOG_ENTER(p_sm->p_log);
 
@@ -577,21 +634,25 @@ ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 	}
 
 	/*
-	 * Get the multicast group object for this group.
+	 * Get the multicast group holder object for this group.
 	 */
-	p_mgrp = osm_get_mgrp_by_mlid(p_sm->p_subn, mlid);
-	if (!p_mgrp) {
+	p_mgrp_holder = osm_get_mgrp_holder_by_mlid(p_sm->p_subn, mlid);
+	if (!p_mgrp_holder) {
 		OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E08: "
 			"No multicast group for MLID 0x%X\n", cl_ntoh16(mlid));
 		status = IB_INVALID_PARAMETER;
 		goto Exit;
 	}
 
+	osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp, port_guid);
 	/*
 	 * Walk the list of ports in the group, and remove the appropriate one.
 	 */
 	osm_port_remove_mgrp(p_port, mlid);
 
+	OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+		" Calling sm_mgrp_process for mgrp with mlid = 0x%X\n",
+		cl_ntoh16(mlid));
 	status = sm_mgrp_process(p_sm, p_mgrp);
 Exit:
 	CL_PLOCK_RELEASE(p_sm->p_lock);
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 0d11811..6ed95d4 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
@@ -428,8 +428,9 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn)
 	osm_switch_t *p_sw, *p_next_sw;
 	osm_remote_sm_t *p_rsm, *p_next_rsm;
 	osm_prtn_t *p_prtn, *p_next_prtn;
-	osm_mgrp_t *p_mgrp;
+	osm_mgrp_holder_t *p_mgrp_holder;
 	osm_infr_t *p_infr, *p_next_infr;
+	osm_mgrp_t *p_mgrp;
 
 	/* it might be a good idea to de-allocate all known objects */
 	p_next_node = (osm_node_t *) cl_qmap_head(&p_subn->node_guid_tbl);
@@ -471,14 +472,20 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn)
 		osm_prtn_delete(&p_prtn);
 	}
 
-	cl_fmap_remove_all(&p_subn->mgrp_mgid_tbl);
 
 	for (i = 0; i <= p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO;
 	     i++) {
-		p_mgrp = p_subn->mgroups[i];
-		p_subn->mgroups[i] = NULL;
-		if (p_mgrp)
-			osm_mgrp_delete(p_mgrp);
+		p_mgrp_holder = p_subn->mgroup_holders[i];
+		if (p_mgrp_holder){
+				osm_mgrp_holder_delete(p_subn, p_mgrp_holder->mlid);
+		}
+	}
+
+	p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_subn->mgrp_mgid_tbl);
+	while (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_subn->mgrp_mgid_tbl)) {
+		cl_fmap_remove_item(&p_subn->mgrp_mgid_tbl, (cl_fmap_item_t*)p_mgrp);
+		osm_mgrp_delete(p_mgrp);
+		p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_subn->mgrp_mgid_tbl);
 	}
 
 	p_next_infr = (osm_infr_t *) cl_qlist_head(&p_subn->sa_infr_list);
@@ -1646,3 +1653,13 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
 
 	return 0;
 }
+
+ib_net16_t osm_mgrp_holder_get_mlid_by_mgid(IN osm_subn_t const *p_subn,
+						IN const ib_gid_t * const p_mgid)
+{
+	osm_mgrp_t *p_mgrp = (osm_mgrp_t*)cl_fmap_get(&p_subn->mgrp_mgid_tbl, p_mgid);
+	if (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_subn->mgrp_mgid_tbl)) {
+		return p_mgrp->mlid;
+	}
+	return 0;
+}
-- 
1.6.3.3


From slavas at Voltaire.COM  Wed Aug  5 06:48:35 2009
From: slavas at Voltaire.COM (Slava Strebkov)
Date: Wed, 05 Aug 2009 16:48:35 +0300
Subject: [ofa-general] [PATCH 2/2 v3] opensm: Compression of multicast group
 according to pkey
Message-ID: <4A798DB3.90604@Voltaire.COM>


Subject: [PATCH 2/2] Compression of multicast group according to pkey

Two additional data structures are added:
1. A map of all partition keys opened in the fabric.
2. A map of all multicast group holders that share the same pkey.
MLID assignment for multicast groups works in the usual
manner, allocating a free entry for each newly created group.
The proposed compression algorithm takes over when there
are no more free entries in the mlid array. The candidate
MLIDs for the new multicast group are taken from the
pkey-indexed map according to the requested pkey, and the
MLID that is shared by the fewest ports is given to the
newly created multicast group.

Signed-off-by: Slava Strebkov <slavas at voltaire.com>
---
 opensm/include/opensm/osm_multicast.h  |  133 ++++++++++++++++++++++++++++++++
 opensm/include/opensm/osm_subnet.h     |   36 +++++++++
 opensm/opensm/osm_mcast_mgr.c          |    4 +
 opensm/opensm/osm_multicast.c          |  109 +++++++++++++++++++++++++-
 opensm/opensm/osm_sa_mcmember_record.c |   38 +++++----
 opensm/opensm/osm_subnet.c             |    8 ++
 6 files changed, 308 insertions(+), 20 deletions(-)

diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h
index 61d1ba6..7bd2f81 100644
--- a/opensm/include/opensm/osm_multicast.h
+++ b/opensm/include/opensm/osm_multicast.h
@@ -128,6 +128,7 @@ typedef struct osm_mgrp_holder {
 	boolean_t to_be_deleted;
 	uint32_t last_tree_id;
 	uint32_t last_change_id;
+	cl_map_item_t mlid_item;
 } osm_mgrp_holder_t;
 
 /*
@@ -156,6 +157,9 @@ typedef struct osm_mgrp_holder {
 *
 *	last_tree_id
 *		the last change id used for building the current tree.
+*
+*	mlid_item
+*		map item in the map of holders sharing the same pkey.
 */
  /****s* OpenSM: Multicast group Port /osm_mgrp_port _t
 * NAME
@@ -775,5 +779,134 @@ static inline boolean_t osm_mgrp_holder_is_empty(IN const osm_mgrp_holder_t *
 	return (cl_qmap_count(&p_mgrp_holder->mgrp_port_map) == 0);
 }
 
+/****f* OpenSM: Subnet/osm_mlid_pkey_delete
+* NAME
+*	osm_mlid_pkey_delete
+*
+* DESCRIPTION
+*	Frees the objects.
+*
+* SYNOPSIS
+*/
+void osm_mlid_pkey_delete(osm_mlid_pkey_t *p_mlid_pkey);
+/*
+* PARAMETERS
+*	p_mlid_pkey
+*		[in] Pointer to an osm_mlid_pkey_t object
+*
+* RETURN VALUES
+*	None.
+*
+*
+* SEE ALSO
+*	osm_mlid_pkey_new
+*********/
+
+/****f* OpenSM: Subnet/osm_mlid_pkey_new
+* NAME
+*	osm_mlid_pkey_new
+*
+* DESCRIPTION
+*	Creates new object of osm_mlid_pkey_t.
+*
+* SYNOPSIS
+*/
+osm_mlid_pkey_t *osm_mlid_pkey_new(IN ib_net16_t pkey);
+/*
+* PARAMETERS
+*	pkey
+*		[in] Partition key for the object
+*
+* RETURN VALUES
+*	Pointer to osm_mlid_pkey_t, or NULL.
+*
+* SEE ALSO
+*	osm_mlid_pkey_delete
+*********/
+
+/****f* OpenSM: Subnet/osm_mlid_pkey_add_holder
+* NAME
+*	osm_mlid_pkey_add_holder
+*
+* DESCRIPTION
+*	Adds an osm_mgrp_holder_t object to the holder map of the given pkey
+*
+* SYNOPSIS
+*/
+void osm_mlid_pkey_add_holder(osm_mgrp_holder_t *p_mgrp_holder,
+				ib_net16_t pkey,  osm_subn_t  *p_subn);
+/*
+* PARAMETERS
+*	p_mgrp_holder
+*		[in] Pointer to osm_mgrp_holder_t
+*
+*	pkey
+*		[in] Partition key for the object
+*
+*	p_subn
+*		[in] Pointer to an osm_subn_t object
+*
+* RETURN VALUES
+*	None.
+*
+* SEE ALSO
+*	osm_mlid_pkey_remove_holder
+*********/
+
+/****f* OpenSM: Subnet/osm_mlid_pkey_remove_holder
+* NAME
+*	osm_mlid_pkey_remove_holder
+*
+* DESCRIPTION
+*	Removes an osm_mgrp_holder_t object from the holder map of the given pkey
+*
+* SYNOPSIS
+*/
+void osm_mlid_pkey_remove_holder(osm_mgrp_holder_t *p_mgrp_holder,
+				ib_net16_t pkey, osm_subn_t  *p_subn);
+/*
+* PARAMETERS
+*	p_mgrp_holder
+*		[in] Pointer to osm_mgrp_holder_t
+*
+*	pkey
+*		[in] Partition key for the object
+*
+*	p_subn
+*		[in] Pointer to an osm_subn_t object
+*
+* RETURN VALUES
+*	None.
+*
+* SEE ALSO
+*	osm_mlid_pkey_add_holder
+*********/
+
+/****f* OpenSM: Subnet/osm_mlid_pkey_get_existed_mlid
+* NAME
+*	osm_mlid_pkey_get_existed_mlid
+*
+* DESCRIPTION
+*	Returns the in-use mlid with the minimum number of ports among those matching the pkey
+*
+* SYNOPSIS
+*/
+ib_net16_t osm_mlid_pkey_get_existed_mlid(IN osm_subn_t  *p_subn, IN ib_net16_t pkey);
+/*
+* PARAMETERS
+*
+*	p_subn
+*		[in] Pointer to an osm_subn_t object
+*
+*	pkey
+*		[in] Partition key for the object
+*
+* RETURN VALUES
+*	matched mlid or 0 if not found
+*
+* SEE ALSO
+*	osm_mlid_pkey_add_holder
+*********/
+
 END_C_DECLS
 #endif				/* _OSM_MULTICAST_H_ */
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index fad8780..aea6c45 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -469,6 +469,37 @@ typedef struct osm_subn_opt {
 *	Subnet object
 *********/
 
+/****s* OpenSM: Subnet/osm_mlid_pkey_t
+* NAME
+*       osm_mlid_pkey_t
+*
+* DESCRIPTION
+*	Structure that groups all MLIDs opened on the same pkey value.
+*	Used for mgid to mlid compression.
+*
+* SYNOPSIS
+*/
+typedef struct osm_mlid_pkey {
+	cl_map_item_t pkey_item;
+	ib_net16_t pkey;
+	cl_qmap_t mlid_holder_map;
+} osm_mlid_pkey_t;
+/*
+* FIELDS
+*	pkey_item
+*		Map Item for qmap linkage.  Must be first element!!
+*		Indexed by pkey.
+*
+*	pkey
+*		Partition key (P_Key) for multicast group(s).
+*
+*	mlid_holder_map
+*		Map of osm_mgrp_holder_t objects. Indexed by mlid
+*
+* SEE ALSO
+*	osm_mgrp_holder_t
+*********/
+
 /****s* OpenSM: Subnet/osm_subn_t
 * NAME
 *	osm_subn_t
@@ -514,6 +545,7 @@ typedef struct osm_subn {
 	unsigned need_update;
 	cl_fmap_t mgrp_mgid_tbl;
 	void *mgroup_holders[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1];
+	cl_qmap_t mlid_pkey_tbl;
 } osm_subn_t;
 /*
 * FIELDS
@@ -638,6 +670,10 @@ typedef struct osm_subn {
 *		Array of pointers to all Multicast Group Holder objects in the subnet.
 *		Indexed by MLID offset from base MLID.
 *
+*	mlid_pkey_tbl
+*		Map of osm_mlid_pkey_t objects. Arranged by mgrp pkey value.
+*		Contains MLIDs for mgroups with the same pkey.
+*
 * SEE ALSO
 *	Subnet object
 *********/
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index f506393..ec3dec6 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -1075,6 +1075,10 @@ static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm,
 						gid_str, sizeof(gid_str)),
 						cl_ntoh16(p_mgrp->mlid));
 					osm_mgrp_holder_delete_mgrp(p_mgrp_holder, p_mgrp);
+					if (p_mgrp_holder->to_be_deleted) {
+						osm_mlid_pkey_remove_holder(p_mgrp_holder,
+							p_mgrp->mcmember_rec.pkey,sm->p_subn);
+					}
 					p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(&p_mgrp->mcm_port_tbl);
 					while (p_mcm_port !=
 						(osm_mcm_port_t *) cl_qmap_end(&p_mgrp->mcm_port_tbl)) {
diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
index 072b591..4724bd3 100644
--- a/opensm/opensm/osm_multicast.c
+++ b/opensm/opensm/osm_multicast.c
@@ -366,10 +366,9 @@ void osm_mgrp_holder_remove_port(osm_subn_t * subn, osm_log_t * p_log,
 		char gid_str[INET6_ADDRSTRLEN];
 		OSM_LOG(p_log, OSM_LOG_DEBUG,
 		"port  0x%" PRIx64 " removed from  mlid 0x%X\n",
-		port_guid, cl_ntoh16(p_mgrp_holder->mlid));
-		while ((p_item =
-			cl_qlist_remove_head(&p_mgrp_port->mgroups)) !=
-			cl_qlist_end(&p_mgrp_port->mgroups)) {
+		cl_ntoh64(port_guid), cl_ntoh16(p_mgrp_holder->mlid));
+		while (!cl_is_qlist_empty(&p_mgrp_port->mgroups)) {
+			p_item = cl_qlist_remove_head(&p_mgrp_port->mgroups);
 			p_mgrp = (osm_mgrp_t *)
 				PARENT_STRUCT(p_item, osm_mgrp_t,port_item);
 			OSM_LOG(p_log, OSM_LOG_DEBUG,
@@ -460,3 +459,105 @@ void osm_mgrp_holder_port_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder,
 	p_mgrp_holder->last_change_id++;
 	}
 }
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mlid_pkey_delete(osm_mlid_pkey_t *p_mlid_pkey)
+{
+	cl_qmap_remove_all(&p_mlid_pkey->mlid_holder_map);
+	free(p_mlid_pkey);
+}
+
+/**********************************************************************
+ **********************************************************************/
+osm_mlid_pkey_t *osm_mlid_pkey_new(ib_net16_t pkey)
+{
+	osm_mlid_pkey_t *p_mlid_pkey = malloc(sizeof(osm_mlid_pkey_t));
+	if (!p_mlid_pkey) {
+		return NULL;
+	}
+	memset(p_mlid_pkey, 0, sizeof(*p_mlid_pkey));
+	cl_qmap_init(&p_mlid_pkey->mlid_holder_map);
+	p_mlid_pkey->pkey = pkey;
+	return p_mlid_pkey;
+}
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mlid_pkey_add_holder(osm_mgrp_holder_t *p_mgrp_holder,
+				ib_net16_t pkey,  osm_subn_t  *p_subn)
+{
+	osm_mlid_pkey_t *p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_get(&p_subn->mlid_pkey_tbl,
+							0x7fff & pkey);
+	if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) {
+		cl_qmap_insert(&p_mlid_pkey->mlid_holder_map, p_mgrp_holder->mlid,&p_mgrp_holder->mlid_item);
+	}
+	else {
+		p_mlid_pkey = osm_mlid_pkey_new(pkey);
+		if (p_mlid_pkey) {
+			cl_qmap_insert(&p_mlid_pkey->mlid_holder_map, p_mgrp_holder->mlid,
+						   &p_mgrp_holder->mlid_item);
+			cl_qmap_insert(&p_subn->mlid_pkey_tbl, 0x7fff & pkey,&p_mlid_pkey->pkey_item);
+		}
+	}
+}
+
+/**********************************************************************
+ **********************************************************************/
+void osm_mlid_pkey_remove_holder(osm_mgrp_holder_t *p_mgrp_holder,
+				ib_net16_t pkey, osm_subn_t  *p_subn)
+{
+	osm_mlid_pkey_t *p_mlid_pkey = (osm_mlid_pkey_t*)
+		cl_qmap_get(&p_subn->mlid_pkey_tbl, 0x7fff & pkey);
+	if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) {
+		cl_qmap_remove_item(&p_mlid_pkey->mlid_holder_map, &p_mgrp_holder->mlid_item);
+		if (!cl_qmap_count(&p_mlid_pkey->mlid_holder_map)) {
+			/* no more groups with given pkey exist */
+			osm_mlid_pkey_delete(p_mlid_pkey);
+		}
+	}
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_net16_t osm_mlid_pkey_get_mlid(IN osm_mlid_pkey_t *p_mlid_pkey)
+{
+	cl_map_item_t *p_item;
+	osm_mgrp_holder_t *p_mgrp_holder;
+	osm_mgrp_holder_t *p_matched_holder = NULL;
+	size_t port_count = 0;
+	for (p_item = cl_qmap_head(&p_mlid_pkey->mlid_holder_map);
+			p_item != cl_qmap_end(&p_mlid_pkey->mlid_holder_map);
+			p_item = cl_qmap_next(p_item)) {
+		p_mgrp_holder = (osm_mgrp_holder_t*)
+			PARENT_STRUCT(p_item, osm_mgrp_holder_t,mlid_item);
+		if (!port_count) {
+			/* init p_matched_holder and count */
+			port_count = cl_qmap_count(&p_mgrp_holder->mgrp_port_map);
+			p_matched_holder = p_mgrp_holder;
+		}
+		else {
+			if (port_count > cl_qmap_count(&p_mgrp_holder->mgrp_port_map)) {
+				port_count = cl_qmap_count(&p_mgrp_holder->mgrp_port_map);
+				p_matched_holder = p_mgrp_holder;
+			}
+		}
+	}
+	if (p_matched_holder) {
+		return p_matched_holder->mlid;
+	}
+	return 0;
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_net16_t osm_mlid_pkey_get_existed_mlid(IN osm_subn_t  *p_subn, IN ib_net16_t pkey)
+{
+	osm_mlid_pkey_t *p_mlid_pkey =
+		(osm_mlid_pkey_t*)cl_qmap_get(&p_subn->mlid_pkey_tbl, 0x7fff & pkey);
+	if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) {
+		/* found object with mgroups matching the requested pkey */
+		return osm_mlid_pkey_get_mlid(p_mlid_pkey);
+	}
+	return 0;
+}
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 3838a08..3c34592 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -152,6 +152,10 @@ static void cleanup_mgrp(IN osm_sa_t * sa, osm_mgrp_t * mgrp)
 		osm_mgrp_holder_t *p_mgrp_holder =
 			osm_get_mgrp_holder_by_mlid(sa->p_subn, mgrp->mlid);
 		osm_mgrp_holder_delete_mgrp(p_mgrp_holder, mgrp);
+		if (p_mgrp_holder->to_be_deleted) {
+			osm_mlid_pkey_remove_holder(p_mgrp_holder,
+				mgrp->mcmember_rec.pkey,sa->p_subn);
+		}
 		cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl,
 				    &mgrp->map_item);
 		osm_mgrp_delete(mgrp);
@@ -812,13 +816,14 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 						IN const osm_physp_t * p_physp,
 						OUT osm_mgrp_t ** pp_mgrp)
 {
-	ib_net16_t mlid, existed_mlid;
+	ib_net16_t mlid;
 	unsigned zero_mgid, i;
 	uint8_t scope;
 	ib_gid_t *p_mgid;
 	ib_api_status_t status = IB_SUCCESS;
 	ib_member_rec_t mcm_rec = *p_recvd_mcmember_rec;	/* copy for modifications */
 	osm_mgrp_holder_t * p_mgrp_holder;
+	boolean_t new_mlid = TRUE;
 
 	OSM_LOG_ENTER(sa->p_log);
 
@@ -836,15 +841,22 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 	 */
 	mlid = get_new_mlid(sa, mcm_rec.mlid);
 	if (mlid == 0) {
-		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
-			"get_new_mlid failed request mlid 0x%04x\n",
-			cl_ntoh16(mcm_rec.mlid));
-		status = IB_SA_MAD_STATUS_NO_RESOURCES;
-		goto Exit;
+		/* try to add the mcgroup to an existing mlid */
+		mlid = osm_mlid_pkey_get_existed_mlid(sa->p_subn, mcm_rec.pkey);
+		if (mlid ==  0) {
+			OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
+				"get_new_mlid failed request mlid 0x%04x\n",
+				cl_ntoh16(mcm_rec.mlid));
+			status = IB_SA_MAD_STATUS_NO_RESOURCES;
+			goto Exit;
+		}
+		new_mlid = FALSE;
+		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
+			"Found existed mlid 0x%X\n", cl_ntoh16(mlid));
 	}
 
 	OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
-		"Obtained new mlid 0x%X\n", cl_ntoh16(mlid));
+		"Obtained  mlid 0x%X\n", cl_ntoh16(mlid));
 
 	/* we need to create the new MGID if it was not defined */
 	if (zero_mgid) {
@@ -894,15 +906,6 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 		goto Exit;
 	}
 
-	if (0 != (existed_mlid = osm_mgrp_holder_get_mlid_by_mgid(sa->p_subn, p_mgid))) {
-		char gid_str[INET6_ADDRSTRLEN];
-		mlid = existed_mlid;
-		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
-			"found existed  mlid  0x%04x for mgid %s\n",
-			cl_ntoh16(mlid), inet_ntop(AF_INET6, p_mgid->raw,
-						   gid_str, sizeof gid_str));
-	}
-
 	/* create a new MC Group */
 	*pp_mgrp = osm_mgrp_new(mlid);
 	if (*pp_mgrp == NULL) {
@@ -947,6 +950,9 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 		       &(*pp_mgrp)->mcmember_rec.mgid, &(*pp_mgrp)->map_item);
 
 	osm_mgrp_holder_add_mgrp(p_mgrp_holder, *pp_mgrp, sa->p_log);
+	if (new_mlid)
+		osm_mlid_pkey_add_holder(p_mgrp_holder,
+			(*pp_mgrp)->mcmember_rec.pkey, sa->p_subn);
 
 Exit:
 	OSM_LOG_EXIT(sa->p_log);
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 6ed95d4..1826219 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -416,6 +416,7 @@ void osm_subn_construct(IN osm_subn_t * const p_subn)
 	cl_qmap_init(&p_subn->rtr_guid_tbl);
 	cl_qmap_init(&p_subn->prtn_pkey_tbl);
 	cl_fmap_init(&p_subn->mgrp_mgid_tbl, compar_mgids);
+	cl_qmap_init(&p_subn->mlid_pkey_tbl);
 }
 
 /**********************************************************************
@@ -431,6 +432,7 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn)
 	osm_mgrp_holder_t *p_mgrp_holder;
 	osm_infr_t *p_infr, *p_next_infr;
 	osm_mgrp_t *p_mgrp;
+	osm_mlid_pkey_t *p_mlid_pkey;
 
 	/* it might be a good idea to de-allocate all known objects */
 	p_next_node = (osm_node_t *) cl_qmap_head(&p_subn->node_guid_tbl);
@@ -472,6 +474,12 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn)
 		osm_prtn_delete(&p_prtn);
 	}
 
+	p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_head(&p_subn->mlid_pkey_tbl);
+	while (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) {
+		cl_qmap_remove_item(&p_subn->mlid_pkey_tbl, (cl_map_item_t*)p_mlid_pkey);
+		osm_mlid_pkey_delete(p_mlid_pkey);
+		p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_head(&p_subn->mlid_pkey_tbl);
+	}
 
 	for (i = 0; i <= p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO;
 	     i++) {
-- 
1.6.3.3
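
The MLID selection described in the commit message above can be sketched
in isolation as follows. This is only an illustration under stated
assumptions, not the patch's code: struct toy_holder and
pick_shared_mlid() are hypothetical names, a flat array stands in for the
cl_qmap of holders, and pkeys and mlids are plain host-order integers.
The 0x7fff mask mirrors the one used in the patch to ignore the
membership bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a multicast group holder. */
struct toy_holder {
	uint16_t mlid;		/* MLID owned by this holder */
	uint16_t pkey;		/* pkey the holder was opened with */
	size_t port_count;	/* number of ports sharing this MLID */
};

/*
 * Return the MLID of the holder with the fewest ports among those whose
 * pkey matches (membership bit ignored), or 0 if no holder matches.
 */
uint16_t pick_shared_mlid(const struct toy_holder *holders, size_t n,
			  uint16_t pkey)
{
	const struct toy_holder *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((holders[i].pkey & 0x7fff) != (pkey & 0x7fff))
			continue;
		/* a NULL pointer, not a zero count, marks "nothing chosen yet" */
		if (!best || holders[i].port_count < best->port_count)
			best = &holders[i];
	}
	return best ? best->mlid : 0;
}
```

Tracking the best candidate by pointer keeps a holder with zero ports
from being mistaken for "no candidate yet"; whether that case can occur
in OpenSM is not something this sketch tries to answer.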


From sashak at voltaire.com  Wed Aug  5 07:12:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 17:12:56 +0300
Subject: [ofa-general] Re: [PATCH v2] opensm: fixing handling of
	opt.max_wire_smps
In-Reply-To: <4A796B15.7000802@dev.mellanox.co.il>
References: <4A784698.10803@dev.mellanox.co.il>
	<4A796B15.7000802@dev.mellanox.co.il>
Message-ID: <20090805141256.GT7993@me>

On 14:20 Wed 05 Aug     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> V2 of this patch:
> 
> opt.max_wire_smps is uint32, but then when it's propagated
> into the VL15 poller it's casted to int32. Fixing the
> parameter handling to protect it from wrong values.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied with change noted below. Thanks.

> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..c43bef7 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1066,6 +1066,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
>  		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
>  	}
> 
> +	if (p_opts->max_wire_smps == 0) {
> +		log_report(" Invalid Cached Option Value: max_wire_smps = 0,"
> +			   " Using unlimited: 0x7FFFFFFF\n");
> +		p_opts->max_wire_smps = 0x7FFFFFFF;
> +	}

'0' is not an invalid value, it means "unlimited", so I'm removing
this error message.

Sasha
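
For illustration, the handling discussed above (a uint32 option that
ends up in a signed 32-bit value, with 0 meaning "unlimited") could be
sketched as below. The helper name and the macro are hypothetical, and
clamping values above 0x7FFFFFFF is an assumption made for the sketch;
only the mapping of 0 to 0x7FFFFFFF is visible in the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the "unlimited" value used in the quoted hunk. */
#define TOY_UNLIMITED_SMPS 0x7FFFFFFF

/*
 * Map the configured uint32 option to a value that is safe to store in
 * an int32: 0 means "unlimited", and anything that would not fit in a
 * signed 32-bit value is treated as unlimited too (assumption).
 */
int32_t toy_effective_max_wire_smps(uint32_t configured)
{
	if (configured == 0 || configured > TOY_UNLIMITED_SMPS)
		return TOY_UNLIMITED_SMPS;
	return (int32_t)configured;
}
```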


From hal.rosenstock at gmail.com  Wed Aug  5 07:21:15 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 10:21:15 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches 
	for lash
In-Reply-To: <20090805094433.GQ7993@me>
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me>
Message-ID: <f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>

On Wed, Aug 5, 2009 at 5:44 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 11:16 Wed 22 Jul     , Hal Rosenstock wrote:
> >
> > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> > index 23fad87..dce2ea1 100644
> > --- a/opensm/opensm/osm_mesh.c
> > +++ b/opensm/opensm/osm_mesh.c
> > @@ -185,6 +185,16 @@ typedef struct _mesh {
> >       int dim_order[MAX_DIMENSION];
> >  } mesh_t;
> >
> > +typedef struct sort_ctx {
> > +     lash_t *p_lash;
> > +     mesh_t *mesh;
> > +} sort_ctx_t;
> > +
> > +typedef struct comp {
> > +     int index;
> > +     sort_ctx_t *ctx;
> > +} comp_t;
>
> And wouldn't it be simpler to use:
>
> struct comp {
>        switch_t **s;


Are you thinking this is:
           s = &p_lash->switches[i];


>
>        sort_ctx_t ctx;
> };


> ? So you will have already sorted switches and only will need to care
> about s->id and s->links fixing (and will not need switches[] array too).


Then comp would contain an ordered list of p_lash->switches array pointers
which would need to be walked through for actually reordering that array. If
so, it's the cost of the new switches array v. the cost of reordering the
original lash switches array. I haven't thought that through yet.

Is this what you mean or am I missing your idea on how the p_lash->switches
array is to be reordered ?

-- Hal


>
>
> Sasha

From hal.rosenstock at gmail.com  Wed Aug  5 07:43:55 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 10:43:55 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090805134352.GS7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
	<20090805134352.GS7993@me>
Message-ID: <f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>

On Wed, Aug 5, 2009 at 9:43 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 07:24 Wed 05 Aug     , Hal Rosenstock wrote:
> >
> > Are you saying to move the calls in the individual routing engines to
> > osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing
> > so consolidates the changes I had made to the various routing engines in
> one
> > place) ?
>
> Yes.


Should this be done as a separate step on the way to the LFT parallelization
across switches ?

-- Hal


>
>
> Sasha
>

From bart.vanassche at gmail.com  Wed Aug  5 08:00:03 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 17:00:03 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adaljlzl0ne.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	<adaljlzl0ne.fsf@cisco.com>
Message-ID: <e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>

On Tue, Aug 4, 2009 at 11:39 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > By the way, Vladislav Bolkhovitin was so kind to inform me that this
>  > issue is not specific to the SRP initiator. For more information, see
>  > also http://thread.gmane.org/gmane.linux.scsi/26166.
>
> I'm not sure I follow this exactly -- the idea is that sg_reset
> generates SCSI commands that are somehow different?  What does the LLD
> have to do to handle them?
>
> Is the problem that we get a command with bogus host_scribble (since SRP
> never saw it before) and so srp_find_req() gets confused?

A search with grep for the text '->eh_device_reset_handler' through
the kernel sources showed me that this handler can be invoked from
the following two functions:
* scsi_try_bus_device_reset() in drivers/scsi/scsi_error.c;
* try_to_reset_cmd_device() in drivers/scsi/libsas/sas_scsi_host.c.

So if the function srp_reset_device() is called, it is called from
scsi_try_bus_device_reset(). This last function can be invoked by
scsi_abort_eh_cmnd(), by scsi_eh_bus_device_reset() or by
scsi_reset_provider(). The last function, scsi_reset_provider(), is
invoked by the sg_reset command by issuing an SG_SCSI_RESET ioctl.

The NULL pointer dereference happens when srp_reset_device() calls
srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
req->scmnd->device == NULL. When the sg_reset command issues an
SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
scmnd structure and sets scmnd->device to NULL. It is this scmnd
structure that is passed to srp_reset_device(). What I'm not sure
about is whether scsi_reset_provider() should set req->scmnd->device
to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
handle the condition req->scmnd->device == NULL.

Bart.
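
One of the two options mentioned above, letting the task management
path tolerate a command without a device, could look roughly like the
sketch below. It is a self-contained model, not kernel code: the toy_*
structures and toy_send_tsk_mgmt() are hypothetical stand-ins for the
SCSI and SRP types, and the sketch does not take a position on which
layer should carry the fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, minimal stand-ins for the SCSI/SRP structures. */
struct toy_device {
	unsigned int lun;
};

struct toy_cmnd {
	struct toy_device *device;	/* may be NULL for a reset-only command */
};

struct toy_req {
	struct toy_cmnd *scmnd;
};

/*
 * Build the LUN for a task management request. Returns 0 and stores the
 * LUN on success, or -EINVAL when no device is attached to the command,
 * instead of dereferencing a NULL pointer.
 */
int toy_send_tsk_mgmt(const struct toy_req *req, unsigned int *lun_out)
{
	if (!req || !req->scmnd || !req->scmnd->device)
		return -EINVAL;

	*lun_out = req->scmnd->device->lun;
	return 0;
}
```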


From hnrose at comcast.net  Wed Aug  5 08:22:58 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 11:22:58 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Handle calloc failure
	in generate_cdg_for_sp
Message-ID: <20090805152258.GA16417@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 168a758..b3107f0 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -323,8 +323,8 @@ static int generate_routing_func_for_mst(lash_t * p_lash, int sw_id,
 	return 0;
 }
 
-static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
-				int lane)
+static int generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
+			       int lane)
 {
 	unsigned num_switches = p_lash->num_switches;
 	switch_t **switches = p_lash->switches;
@@ -339,6 +339,8 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
 
 		if (cdg_vertex_matrix[lane][sw][next_switch] == NULL) {
 			v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
+			if (!v)
+				return -1;
 			v->from = sw;
 			v->to = next_switch;
 			v->temp = 1;
@@ -380,6 +382,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
 
 		prev = v;
 	}
+	return 0;
 }
 
 static void set_temp_depend_to_permanent_for_sp(lash_t * p_lash, int sw,
@@ -448,7 +451,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch,
 	}
 }
 
-static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
+static int balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
 {
 	unsigned num_switches = p_lash->num_switches;
 	cdg_vertex_t ****cdg_vertex_matrix = p_lash->cdg_vertex_matrix;
@@ -499,8 +502,9 @@ static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
 			}
 		}
 
-		generate_cdg_for_sp(p_lash, src, dest, min_filled_lane);
-		generate_cdg_for_sp(p_lash, dest, src, min_filled_lane);
+		if (generate_cdg_for_sp(p_lash, src, dest, min_filled_lane) ||
+		    generate_cdg_for_sp(p_lash, dest, src, min_filled_lane))
+			return -1;
 
 		output_link = p_lash->switches[src]->routing_table[dest].out_link;
 		next_switch = get_next_switch(p_lash, src, output_link);
@@ -596,6 +600,7 @@ static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
 						virtual_location[i][j][old_max_filled_lane] = 1;
 		}
 	}
+	return 0;
 }
 
 static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw)
@@ -837,8 +842,12 @@ static int lash_core(lash_t * p_lash)
 				v_lane = 0;
 				stop = 0;
 				while (v_lane < lanes_needed && stop == 0) {
-					generate_cdg_for_sp(p_lash, i, dest_switch, v_lane);
-					generate_cdg_for_sp(p_lash, dest_switch, i, v_lane);
+					if (generate_cdg_for_sp(p_lash, i, dest_switch, v_lane) ||
+					    generate_cdg_for_sp(p_lash, dest_switch, i, v_lane)) {
+						OSM_LOG(p_log, OSM_LOG_ERROR,
+							"ERR 4D07: generate_cdg_for_sp failed\n");
+						goto Exit;
+					}
 
 					output_link =
 					    switches[i]->routing_table[dest_switch].out_link;
@@ -903,8 +912,12 @@ static int lash_core(lash_t * p_lash)
 					if (++lanes_needed > p_lash->vl_min)
 						goto Error_Not_Enough_Lanes;
 
-					generate_cdg_for_sp(p_lash, i, dest_switch, v_lane);
-					generate_cdg_for_sp(p_lash, dest_switch, i, v_lane);
+					if (generate_cdg_for_sp(p_lash, i, dest_switch, v_lane) ||
+					    generate_cdg_for_sp(p_lash, dest_switch, i, v_lane)) {
+						OSM_LOG(p_log, OSM_LOG_ERROR,
+							"ERR 4D08: generate_cdg_for_sp failed\n");
+						goto Exit;
+					}
 
 					set_temp_depend_to_permanent_for_sp(p_lash, i, dest_switch,
 									    v_lane);
@@ -929,7 +942,10 @@ static int lash_core(lash_t * p_lash)
 	OSM_LOG(p_log, OSM_LOG_INFO,
 		"Lanes needed: %d, Balancing\n", lanes_needed);
 
-	balance_virtual_lanes(p_lash, lanes_needed);
+	if (balance_virtual_lanes(p_lash, lanes_needed)) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D09: Balancing failed\n");
+		goto Exit;
+	}
 
 	for (i = 0; i < lanes_needed; i++)
 		OSM_LOG(p_log, OSM_LOG_INFO, "Lanes in layer %d: %d\n",
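
The pattern the patch above applies (check the calloc result before the
first dereference, return an error code, and let every caller propagate
it) can be shown in a reduced form. Everything named toy_* here is
hypothetical; the allocator is passed in as a parameter only so that
the failure path can be exercised, which the real code does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*toy_alloc_fn)(size_t nmemb, size_t size);

/* Hypothetical vertex with a trailing array, sized as in the patch. */
struct toy_vertex {
	int from;
	int to;
	int temp;
	int deps[1];	/* really num_switches entries */
};

/* Allocator that always fails; only here to exercise the error path. */
void *toy_failing_calloc(size_t nmemb, size_t size)
{
	(void)nmemb;
	(void)size;
	return NULL;
}

/* Returns 0 on success, -1 if the vertex could not be allocated. */
int toy_generate_vertex(toy_alloc_fn alloc, struct toy_vertex **slot,
			unsigned num_switches, int from, int to)
{
	struct toy_vertex *v;

	if (*slot)
		return 0;	/* vertex already exists */
	if (num_switches == 0)
		return -1;
	v = alloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
	if (!v)
		return -1;	/* checked before the first dereference */
	v->from = from;
	v->to = to;
	v->temp = 1;
	*slot = v;
	return 0;
}

/* Caller side: a failure in either direction aborts the whole step. */
int toy_generate_both(toy_alloc_fn alloc, struct toy_vertex **fwd,
		      struct toy_vertex **rev, unsigned num_switches,
		      int src, int dest)
{
	if (toy_generate_vertex(alloc, fwd, num_switches, src, dest) ||
	    toy_generate_vertex(alloc, rev, num_switches, dest, src))
		return -1;
	return 0;
}
```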


From sean.hefty at intel.com  Wed Aug  5 08:46:43 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 5 Aug 2009 08:46:43 -0700
Subject: [ofa-general] [PATCH] cma: fix access to freed memory
In-Reply-To: <20090803092528.GA25528@mtls03>
References: <20090803092528.GA25528@mtls03>
Message-ID: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com>

>rdma_join_multicast() allocates struct cma_multicast and then proceeds to join
>to a multicast address. However, the join operation completes in another
>context and the allocated struct could be released if the user destroys either
>the rdma_id object or decides to leave the multicast group while the join is in
>progress. This patch uses reference counting to avoid such a situation. It
>also protects removal from id_priv->mc_list in cma_leave_mc_groups().

rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast.  This call
will block until the join callback completes or is canceled.  Can you describe
the race with cma_ib_mc_handler in more detail?

Also, cma_leave_mc_groups is only called from rdma_destroy_id.  Locking around
the mc->list shouldn't be required, since calls to join/leave aren't allowed.

- Sean


From eli at dev.mellanox.co.il  Wed Aug  5 09:16:10 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 5 Aug 2009 19:16:10 +0300
Subject: [ofa-general] [PATCH] cma: fix access to freed memory
In-Reply-To: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com>
References: <20090803092528.GA25528@mtls03>
	<04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com>
Message-ID: <20090805161610.GA13892@mtls03>

On Wed, Aug 05, 2009 at 08:46:43AM -0700, Sean Hefty wrote:

> rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast.  This call
> will block until the join callback completes or is canceled.  Can you describe
> the race with cma_ib_mc_handler in more detail?

That explains it. I was using a different "join" implementation for
RDMAoE without a "leave" operation so I had to use this kref solution.
So there is no need for this patch. I will provide a separate solution for the
RDMAoE case.

Thanks.


From sashak at voltaire.com  Wed Aug  5 09:22:36 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 19:22:36 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder
	switches for lash
In-Reply-To: <f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me>
	<f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>
Message-ID: <20090805162236.GU7993@me>

On 10:21 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Is this what you mean or am I missing your idea on how the p_lash->switches
> array is to be reordered ?

Thinking more about this I suppose that an original structure is good
enough for doing what you need without intermediate buffers. It could be
something like this:

	qsort(index....);

	for (i = 0; i < num_switches; i++)
		lash->switches[index[i].index]->id = i;

	for (i = 0; i < num_switches; i++) {
		s = lash->switches[i];
		for (j = 0; j < s->num_links; j++)
			s->links[j]->switch_id =
			    lash->switches[s->links[j]->switch_id]->id;
	}

	for (i = 0; i < num_switches; i++) {
		s = lash->switches[i];
		while (s->id != i) {
			s1 = lash->switches[s->id];
			lash->switches[s->id] = s;
			s = s1;
		}
		/* s now belongs in slot i; store it to close the cycle */
		lash->switches[i] = s;
	}

Would it work?
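
The final placement pass can be sketched as a stand-alone function. This is
only an illustration: struct sw and place_by_id() are hypothetical names, and
the sketch assumes the id fields already form a permutation of 0..n-1 (which
the first loop above would guarantee). Note the store into slot i after the
inner loop; without it the switch that closes each cycle would never be
written back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the LASH switch structure: only the field
 * needed for the reordering is kept. */
struct sw {
	int id;		/* target slot of this switch after the sort */
};

/*
 * Move every switch to slot switches[s->id], in place.
 * Assumes the id fields form a permutation of 0..n-1.
 */
static void place_by_id(struct sw **switches, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		struct sw *s = switches[i];

		/* Follow the cycle that starts at slot i: drop each switch
		 * into its target slot and pick up the one that was there. */
		while (s->id != i) {
			struct sw *next;

			assert(s->id >= 0 && s->id < n);
			next = switches[s->id];
			switches[s->id] = s;
			s = next;
		}
		/* The switch whose target is slot i closes the cycle. */
		switches[i] = s;
	}
}
```

Each slot is read once and written once, so the pass stays linear in the
number of switches and needs no second array.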

Sasha


From sashak at voltaire.com  Wed Aug  5 09:25:52 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 19:25:52 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Handle calloc
 failure in generate_cdg_for_sp
In-Reply-To: <20090805152258.GA16417@comcast.net>
References: <20090805152258.GA16417@comcast.net>
Message-ID: <20090805162552.GV7993@me>

On 11:22 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Aug  5 09:31:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 19:31:40 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
	<20090805134352.GS7993@me>
	<f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>
Message-ID: <20090805163140.GW7993@me>

On 10:43 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Should this be done as a separate step on the way to the LFT parallelization
> across switches ?

What do you mean by "separate step" (separate from what)?

Let me restate the idea: each routing engine calculates the LFTs and
fills the sw->new_lfts array accordingly; after that it calls a procedure
that sends the switches' LFT blocks (and TOPs). So the routing engine
itself should not care how the submission of the LFT block update MADs is
actually implemented.
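
A rough sketch of that split, not OpenSM code: every name below (struct sw,
new_lft, hw_lft, set_fwd_tables, toy_engine, route_and_program) is made up
for illustration, and a memcpy stands in for the MAD submission.

```c
#include <assert.h>
#include <string.h>

#define MAX_LIDS 8	/* hypothetical table size for the sketch */

/* Hypothetical, reduced switch: only the forwarding tables are kept. */
struct sw {
	unsigned char new_lft[MAX_LIDS];	/* staged by the routing engine */
	unsigned char hw_lft[MAX_LIDS];		/* what the switch currently holds */
};

/* A routing engine only computes tables; it never sends anything. */
typedef void (*routing_engine_fn)(struct sw *sws, int n);

/* Common step shared by all engines: push the staged tables out. How the
 * updates are ordered or striped is hidden in here. Returns the number of
 * switches whose table actually changed. */
static int set_fwd_tables(struct sw *sws, int n)
{
	int i, updated = 0;

	for (i = 0; i < n; i++) {
		if (memcmp(sws[i].hw_lft, sws[i].new_lft, MAX_LIDS) == 0)
			continue;
		memcpy(sws[i].hw_lft, sws[i].new_lft, MAX_LIDS);
		updated++;
	}
	return updated;
}

/* Toy engine: route every LID out of port 1. */
static void toy_engine(struct sw *sws, int n)
{
	int i;

	for (i = 0; i < n; i++)
		memset(sws[i].new_lft, 1, MAX_LIDS);
}

static int route_and_program(routing_engine_fn engine, struct sw *sws, int n)
{
	assert(engine != NULL);
	engine(sws, n);			/* step 1: calculate the LFTs */
	return set_fwd_tables(sws, n);	/* step 2: common submission path */
}
```

With this shape, changing the submission strategy only touches the common
step, and every engine picks it up unchanged.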

Sasha


From hal.rosenstock at gmail.com  Wed Aug  5 10:03:55 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 13:03:55 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches 
	for lash
In-Reply-To: <20090805162236.GU7993@me>
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me>
	<f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>
	<20090805162236.GU7993@me>
Message-ID: <f0e08f230908051003m5d9009a7w8ebd232a436e41d2@mail.gmail.com>

On Wed, Aug 5, 2009 at 12:22 PM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 10:21 Wed 05 Aug     , Hal Rosenstock wrote:
> >
> > Is this what you mean or am I missing your idea on how the
> p_lash->switches
> > array is to be reordered ?
>
> Thinking more about this I suppose that an original structure is good
> enough for doing what you need without intermediate buffers. It could be
> something like this:
>
>        qsort(index....);
>
>        for (i = 0; i < num_switches; i++)
>                lash->switches[index[i].index]->id = i;
>
>        for (i = 0; i < num_switches; i++) {
>                s = lash->switches[i];
>                for (j = 0; j < s->num_links; j++)
>                        s->links[j]->switch_id =
>                            lash->switches[s->links[j]->switch_id]->id;
>        }
>
>        for (i = 0; i < num_switches; i++) {
>                s = lash->switches[i];
>                while (s->id != i) {
>                        s1 = lash->switches[s->id];
>                        lash->switches[s->id] = s;
>                        s = s1;
>                }
>        }
>
> Would it work?


Even if something like this works (haven't played with it yet), is it worth
iterating over lash->switches array to save a memory allocation ? That seems
to be what is being optimized to me. Also, couldn't this be a subsequent
step in the evolution of this code ?

-- Hal


>
>
> Sasha
>

From hal.rosenstock at gmail.com  Wed Aug  5 10:07:10 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 13:07:10 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090805163140.GW7993@me>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
	<20090805134352.GS7993@me>
	<f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>
	<20090805163140.GW7993@me>
Message-ID: <f0e08f230908051007t66c799adgcebe61f15ed10c80@mail.gmail.com>

On Wed, Aug 5, 2009 at 12:31 PM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 10:43 Wed 05 Aug     , Hal Rosenstock wrote:
> >
> > Should this be done as a separate step on the way to the LFT
> parallelization
> > across switches ?
>
> What do you mean by "separate step" (separate from what)?


Separate patches: a first one to move the osm_ucast_mgr_set_fwd_table call up
a level and a second one to implement the LFT parallelization across
switches underneath that.


>
>
> I'm trying to replay the idea again: each routing engine calculates LFTs
> and fill sw->new_lfts array accordingly, after all it calls a procedure
> for sending switches' LFT blocks (and TOPs). So routing engine itself
> should not care about how exactly LFT blocks update MADs submission is
> actually implemented.
>


Yes, understood.

-- Hal


>
> Sasha
>

From rdreier at cisco.com  Wed Aug  5 10:32:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:32:02 -0700
Subject: [ofa-general] Re: [PATCH 5/5] RDMA/nes: Rework the disconn routine
	for terminate and flushing
In-Reply-To: <20090723220051.GA5304@dewood-MOBL> (Don Wood's message of "Thu, 
	23 Jul 2009 17:00:51 -0500")
References: <20090723220051.GA5304@dewood-MOBL>
Message-ID: <adar5vqjhf1.fsf@cisco.com>

thanks, applied all 10 pending patches.


From rdreier at cisco.com  Wed Aug  5 10:40:38 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:40:38 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP initiator triggered by a SCSI reset after the SRP
	connection has been closed
In-Reply-To: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	(Bart Van Assche's message of "Mon, 3 Aug 2009 15:21:21 +0200")
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
Message-ID: <adamy6ejh0p.fsf@cisco.com>

Now I'm confused about this patch for another reason:

 > @@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_
 >  		return FAILED;
 >  	if (req->tsk_status)
 >  		return FAILED;
 > +	if (!req->scmnd->device)
 > +		return FAILED;
 > 
 >  	spin_lock_irq(target->scsi_host->host_lock);

This adds the check *after* the call to srp_send_tsk_mgmt() -- which is
where scmnd->device will be dereferenced.  So how does this fix the bug?

 - R.


From sashak at voltaire.com  Wed Aug  5 10:45:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 20:45:30 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT
	sets across switches
In-Reply-To: <f0e08f230908051007t66c799adgcebe61f15ed10c80@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
	<20090805134352.GS7993@me>
	<f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>
	<20090805163140.GW7993@me>
	<f0e08f230908051007t66c799adgcebe61f15ed10c80@mail.gmail.com>
Message-ID: <20090805174530.GX7993@me>

On 13:07 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Separate patches: first to move the osm_ucast_mgr_set_fwd_table call up a
> level and a second one to implement the LFT parallelization across
> switches underneath that.

Basically I'm fine with a single patch too. And yes, it could be done as
you are proposing - it is up to you.

Sasha


From rdreier at cisco.com  Wed Aug  5 10:44:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:44:23 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP initiator triggered by a SCSI reset after the SRP
	connection has been closed
In-Reply-To: <e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>
	(Bart Van Assche's message of "Wed, 5 Aug 2009 17:00:03 +0200")
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	<adaljlzl0ne.fsf@cisco.com>
	<e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>
Message-ID: <adaiqh2jgug.fsf@cisco.com>


 > The NULL pointer dereference happens when srp_reset_device() calls
 > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
 > req->scmnd->device == NULL. When the sg_reset command issues an
 > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
 > scmnd structure and sets scmnd->device to NULL. It is this scmnd
 > structure that is passed to srp_reset_device(). What I'm not sure
 > about is whether scsi_reset_provider() should set req->scmnd->device
 > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
 > handle the condition req->scmnd->device == NULL.

Well, I don't see how the reset ioctl can do anything useful unless it
passes a device in with the scsi command -- otherwise for example
srp_reset_device() has no idea what LUN to try and reset.

 - R.


From bart.vanassche at gmail.com  Wed Aug  5 10:48:40 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:48:40 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adamy6ejh0p.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adamy6ejh0p.fsf@cisco.com>
Message-ID: <e2e108260908051048w1366047bg5d9932bf6a8396ac@mail.gmail.com>

On Wed, Aug 5, 2009 at 7:40 PM, Roland Dreier<rdreier at cisco.com> wrote:
> Now I'm confused about this patch for another reason:
>
>  > @@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_
>  >              return FAILED;
>  >      if (req->tsk_status)
>  >              return FAILED;
>  > +    if (!req->scmnd->device)
>  > +            return FAILED;
>  >
>  >      spin_lock_irq(target->scsi_host->host_lock);
>
> This adds the check *after* the call to srp_send_tsk_mgmt() -- which is
> where scmnd->device will be dereferenced.  So how does this fix the bug?

I made a mistake while preparing and posting the patch. The check
should have been inserted before the call to srp_send_tsk_mgmt() of
course.
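
A minimal sketch of the intended ordering, using made-up stand-in types
rather than the real SCSI/SRP structures (struct device, struct cmd,
struct request, send_tsk_mgmt and reset_device are all hypothetical):

```c
#include <assert.h>
#include <stddef.h>

enum { SUCCESS = 0, FAILED = 1 };	/* stand-ins for the SCSI result codes */

struct device { int lun; };
struct cmd { struct device *device; };
struct request { struct cmd *scmnd; };

/* Stand-in for srp_send_tsk_mgmt(): it dereferences scmnd->device, so
 * the caller has to guarantee that the pointer is valid. */
static int send_tsk_mgmt(struct request *req)
{
	assert(req->scmnd->device != NULL);
	return req->scmnd->device->lun >= 0 ? 0 : -1;
}

static int reset_device(struct request *req)
{
	/* Validate first: the call below dereferences scmnd->device. */
	if (!req || !req->scmnd || !req->scmnd->device)
		return FAILED;
	if (send_tsk_mgmt(req))
		return FAILED;
	return SUCCESS;
}
```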

Bart.


From sashak at voltaire.com  Wed Aug  5 10:50:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 20:50:00 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder
	switches for lash
In-Reply-To: <f0e08f230908051003m5d9009a7w8ebd232a436e41d2@mail.gmail.com>
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me>
	<f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>
	<20090805162236.GU7993@me>
	<f0e08f230908051003m5d9009a7w8ebd232a436e41d2@mail.gmail.com>
Message-ID: <20090805175000.GY7993@me>

On 13:03 Wed 05 Aug     , Hal Rosenstock wrote:
> >
> > Thinking more about this I suppose that an original structure is good
> > enough for doing what you need without intermediate buffers. It could be
> > something like this:
> >
> >        qsort(index....);
> >
> >        for (i = 0; i < num_switches; i++)
> >                lash->switches[index[i].index]->id = i;
> >
> >        for (i = 0; i < num_switches; i++) {
> >                s = lash->switches[i];
> >                for (j = 0; j < s->num_links; j++)
> >                        s->links[j]->switch_id =
> >                            lash->switches[s->links[j]->switch_id]->id;
> >        }
> >
> >        for (i = 0; i < num_switches; i++) {
> >                s = lash->switches[i];
> >                while (s->id != i) {
> >                        s1 = lash->switches[s->id];
> >                        lash->switches[s->id] = s;
> >                        s = s1;
> >                }
> >        }
> >
> > Would it work?
> 
> 
> Even if something like this works (haven't played with it yet), is it worth
> iterating over lash->switches array to save a memory allocation ?

It is a single pass in the end - it just puts everything in its place. I don't
think that this introduces more calculations than the original code did.

> Also, couldn't this be a subsequent
> step in the evolution of this code ?

Yes, I think it could.

Sasha


From bart.vanassche at gmail.com  Wed Aug  5 10:48:47 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:48:47 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adaiqh2jgug.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	<adaljlzl0ne.fsf@cisco.com>
	<e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>
	<adaiqh2jgug.fsf@cisco.com>
Message-ID: <e2e108260908051048l41939aesd8b4769aae22f4b@mail.gmail.com>

On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > The NULL pointer dereference happens when srp_reset_device() calls
>  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>  > req->scmnd->device == NULL. When the sg_reset command issues an
>  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>  > structure that is passed to srp_reset_device(). What I'm not sure
>  > about is whether scsi_reset_provider() should set req->scmnd->device
>  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>  > handle the condition req->scmnd->device == NULL.
>
> Well, I don't see how the reset ioctl can do anything useful unless it
> passes a device in with the scsi command -- otherwise for example
> srp_reset_device() has no idea what LUN to try and reset.

(added linux-scsi in CC)

I hope one of the SCSI people can tell us why scsi_reset_provider()
passes the value NULL in req->scmnd->device to


From rdreier at cisco.com  Wed Aug  5 10:48:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:48:53 -0700
Subject: [ofa-general] [PATCH linux-next 2/5] RDMA/cxgb3: Don't free the
	endpoint early.
In-Reply-To: <20090731193230.2550.42865.stgit@build.ogc.int> (Steve Wise's
	message of "Fri, 31 Jul 2009 14:32:30 -0500")
References: <20090731193225.2550.35448.stgit@build.ogc.int>
	<20090731193230.2550.42865.stgit@build.ogc.int>
Message-ID: <adaeirqjgmy.fsf@cisco.com>


 > - Endpoint flags now need to be set via atomic bitops because they can
 > be set on both the iw_cxgb3 workqueue thread and user disconnect threads.

 > +	if (!test_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) {
 > +		set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags);

for atomicity, should all the places that do test_bit then set_bit
really be using test_and_set_bit()?

it would be cleaner anyway.
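
To illustrate the difference outside the kernel, here is a userspace sketch
that uses C11 atomic_fetch_or() as a stand-in for test_and_set_bit(). The
helper names and the bit number are assumptions made for the example, not
values taken from iw_cxgb3.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ABORT_REQ_IN_PROGRESS 0	/* bit number picked for the sketch */

/* Userspace analogue of test_and_set_bit(): atomically set bit nr in
 * *flags and report whether it was already set. */
static bool test_and_set_flag(atomic_ulong *flags, unsigned int nr)
{
	unsigned long mask;

	assert(nr < sizeof(unsigned long) * 8);
	mask = 1UL << nr;
	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Returns true only for the single caller that gets to start the abort.
 * With a separate test followed by a set, two threads could both see the
 * bit clear and both proceed. */
static bool try_start_abort(atomic_ulong *flags)
{
	return !test_and_set_flag(flags, ABORT_REQ_IN_PROGRESS);
}
```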

 - R.


From swise at opengridcomputing.com  Wed Aug  5 10:54:41 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 05 Aug 2009 12:54:41 -0500
Subject: [ofa-general] [PATCH linux-next 2/5] RDMA/cxgb3: Don't free the
	endpoint early.
In-Reply-To: <adaeirqjgmy.fsf@cisco.com>
References: <20090731193225.2550.35448.stgit@build.ogc.int>	<20090731193230.2550.42865.stgit@build.ogc.int>
	<adaeirqjgmy.fsf@cisco.com>
Message-ID: <4A79C761.6010105@opengridcomputing.com>

Roland Dreier wrote:
>  > - Endpoint flags now need to be set via atomic bitops because they can
>  > be set on both the iw_cxgb3 workqueue thread and user disconnect threads.
>
>  > +	if (!test_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) {
>  > +		set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags);
>
> for atomicity, should all the places that do test_bit then set_bit
> really be using test_and_set_bit()?
>
> it would be cleaner anyway.
>
>  - R.
>   


This particular bit is only set/read on the workq thread.  But I agree I 
should be using test_and_set_bit().   

I'll resend.


Steve.


From bart.vanassche at gmail.com  Wed Aug  5 10:54:17 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:54:17 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
	by SRP 
	initiator triggered by a SCSI reset after the SRP connection has been
	closed
In-Reply-To: <adaiqh2jgug.fsf@cisco.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	<adaljlzl0ne.fsf@cisco.com>
	<e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>
	<adaiqh2jgug.fsf@cisco.com>
Message-ID: <e2e108260908051054r11262096j3b659de24c820967@mail.gmail.com>

On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > The NULL pointer dereference happens when srp_reset_device() calls
>  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>  > req->scmnd->device == NULL. When the sg_reset command issues an
>  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>  > structure that is passed to srp_reset_device(). What I'm not sure
>  > about is whether scsi_reset_provider() should set req->scmnd->device
>  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>  > handle the condition req->scmnd->device == NULL.
>
> Well, I don't see how the reset ioctl can do anything useful unless it
> passes a device in with the scsi command -- otherwise for example
> srp_reset_device() has no idea what LUN to try and reset.

(added linux-scsi in CC)

I hope one of the SCSI people can tell us whether it is correct that
scsi_reset_provider() passes the value NULL in req->scmnd->device to
scsi_try_bus_device_reset().

Bart.


From rdreier at cisco.com  Wed Aug  5 11:31:34 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 11:31:34 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion
	dependency detected
In-Reply-To: <e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	(Bart Van Assche's message of "Thu, 23 Jul 2009 08:35:59 +0200")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
Message-ID: <adatz0mi03d.fsf@cisco.com>

So I queued up the patch below for 2.6.32... this is almost the same as
the patch I proposed before except that I fixed two places where I
dropped the lock *after* calling ipoib_send() -- which missed the whole
point of what I was trying to do.  So this patch has a much better
chance of actually working!

[PATCH] IPoIB: Drop priv->lock before calling ipoib_send()

IPoIB currently must use irqsave locking for priv->lock, since it is
taken from interrupt context in one path.  However, ipoib_send() does
skb_orphan(), and the network stack locking is not IRQ-safe.
Therefore we need to make sure we don't hold priv->lock when calling
ipoib_send() to avoid lockdep warnings (the code was almost certainly
safe in practice, since the only code path that takes priv->lock from
interrupt context would never call into the network stack).

Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757
Reported-by: Bart Van Assche <bart.vanassche at gmail.com>
Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |    7 ++++++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    2 ++
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e319d91..2bf5116 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -604,8 +604,11 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
 					   skb_queue_len(&neigh->queue));
 				goto err_drop;
 			}
-		} else
+		} else {
+			spin_unlock_irqrestore(&priv->lock, flags);
 			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb_dst(skb)->neighbour->ha));
+			return;
+		}
 	} else {
 		neigh->ah  = NULL;
 
@@ -688,7 +691,9 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
+		spin_unlock_irqrestore(&priv->lock, flags);
 		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		return;
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a0e9753..a0825fe 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -720,7 +720,9 @@ out:
 			}
 		}
 
+		spin_unlock_irqrestore(&priv->lock, flags);
 		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		return;
 	}
 
 unlock:
-- 
1.6.3.3
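
The unlock-then-call-then-return pattern from the patch can be sketched in
userspace as follows. A pthread mutex stands in for priv->lock, and struct
priv, do_send() and xmit() are hypothetical; the lock_held field exists only
so that the sketch can check its own invariant.

```c
#include <assert.h>
#include <pthread.h>

struct priv {
	pthread_mutex_t lock;	/* stand-in for priv->lock */
	int path_valid;		/* state protected by the lock */
	int lock_held;		/* bookkeeping for the sketch only */
	int sent;
};

/* Stand-in for ipoib_send(): must not be entered with the lock held. */
static void do_send(struct priv *p)
{
	assert(!p->lock_held);
	p->sent++;
}

static void xmit(struct priv *p)
{
	pthread_mutex_lock(&p->lock);
	p->lock_held = 1;

	if (p->path_valid) {
		/* Drop the lock first, then call out, then return so the
		 * common unlock below is not reached a second time. */
		p->lock_held = 0;
		pthread_mutex_unlock(&p->lock);
		do_send(p);
		return;
	}

	/* ... queue the packet for later while still holding the lock ... */

	p->lock_held = 0;
	pthread_mutex_unlock(&p->lock);
}
```

The early return matters as much as the unlock: falling through to the
shared unlock label would release the lock twice.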


From hal.rosenstock at gmail.com  Wed Aug  5 11:49:56 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 14:49:56 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches 
	for lash
In-Reply-To: <20090805175000.GY7993@me>
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me>
	<f0e08f230908050721q25ebe7e8ufa3d204d7d48c2f3@mail.gmail.com>
	<20090805162236.GU7993@me>
	<f0e08f230908051003m5d9009a7w8ebd232a436e41d2@mail.gmail.com>
	<20090805175000.GY7993@me>
Message-ID: <f0e08f230908051149m25408af2ve850a1c0f934485@mail.gmail.com>

On Wed, Aug 5, 2009 at 1:50 PM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 13:03 Wed 05 Aug     , Hal Rosenstock wrote:
> > >
> > > Thinking more about this I suppose that an original structure is good
> > > enough for doing what you need without intermediate buffers. It could
> be
> > > something like this:
> > >
> > >        qsort(index....);
> > >
> > >        for (i = 0; i < num_switches; i++)
> > >                lash->switches[index[i].index]->id = i;
> > >
> > >        for (i = 0; i < num_switches; i++) {
> > >                s = lash->switches[i];
> > >                for (j = 0; j < s->num_links; j++)
> > >                        s->links[j]->switch_id =
> > >                            lash->switches[s->links[j]->switch_id]->id;
> > >        }
> > >
> > >        for (i = 0; i < num_switches; i++) {
> > >                s = lash->switches[i];
> > >                while (s->id != i) {
> > >                        s1 = lash->switches[s->id];
> > >                        lash->switches[s->id] = s;
> > >                        s = s1;
> > >                }
> > >        }
> > >
> > > Would it work?
> >
> >
> > Even if something like this works (haven't played with it yet), is it
> worth
> > iterating over lash->switches array to save a memory allocation ?
>
> It is single pass finally - just put everything in the places. I don't
> think that this introduces more calculations than the original code did.


I'll work on it in the background (and note this in the updated patch
description).


>
> > Also, couldn't this be a subsequent
> > step in the evolution of this code ?
>
> Yes, I think it could.


Good; I'll resubmit a slightly updated version shortly.

-- Hal


>
>
> Sasha
>

From hnrose at comcast.net  Wed Aug  5 11:48:22 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 14:48:22 -0400
Subject: [ofa-general] [PATCHv3] opensm/osm_mesh.c: Reorder switches for lash
Message-ID: <20090805184822.GA21614@comcast.net>


The goal of this patch is to change the order of the switches in the array kept
in the lash context from the original order to one in which the switches are
presented in 'odometer order'.

When the main routine in lash is called the switches are in an order that is
likely based on the order that the switches were originally visited by SM 
topology discovery which is some sort of tree walk. All of the analysis up to
this point is independent of the actual order of the switches, but lash will
use that order to enumerate the paths in the fabric and add them to the VL bins.

Odometer order means that the switches are labelled s[X0, ..., Xn-1] and
ordered s[0, ..., 0], s[0, ..., 1], s[0, ..., Ln-1], s[0, .. 1, 0] etc.
The dimensions are also reordered so that the dimension changing the fastest
has the largest length, i.e. Ln >= Ln-1 >= ... >= L1. [All this is modulo
possible end to end reversal but the basic idea is that the longest axis
changes fastest.]

TO INVESTIGATE: Whether sort_switches can reorder p_lash->switches in place
rather than using an additional switches array.

Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v2:
Made completion struct contain context rather than context pointer
Renamed variable from index to comp in sort_switches for better code clarity

Changes since v1:
Made change reentrant FWIW
Added more to patch description
Added memory allocation failure handling

diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 23fad87..72a9aa9 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -185,6 +185,16 @@ typedef struct _mesh {
 	int dim_order[MAX_DIMENSION];
 } mesh_t;
 
+typedef struct sort_ctx {
+	lash_t *p_lash;
+	mesh_t *mesh;
+} sort_ctx_t;
+
+typedef struct comp {
+	int index;
+	sort_ctx_t ctx;
+} comp_t;
+
 /*
  * poly_alloc
  *
@@ -1272,6 +1282,84 @@ static int reorder_links(lash_t *p_lash, mesh_t *mesh)
 }
 
 /*
+ * compare two switches in a sort
+ */
+static int compare_switches(const void *p1, const void *p2)
+{
+	int i, j, d;
+	const comp_t *cp1 = p1, *cp2 = p2;
+	const sort_ctx_t *ctx = &cp1->ctx;
+	switch_t *s1 = ctx->p_lash->switches[cp1->index];
+	switch_t *s2 = ctx->p_lash->switches[cp2->index];
+
+	for (i = 0; i < ctx->mesh->dimension; i++) {
+		j = ctx->mesh->dim_order[i];
+		d = s1->node->coord[j] - s2->node->coord[j];
+
+		if (d > 0)
+			return 1;
+
+		if (d < 0)
+			return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * sort_switches - reorder switch array
+ */
+static void sort_switches(lash_t *p_lash, mesh_t *mesh)
+{
+	int i, j;
+	int num_switches = p_lash->num_switches;
+	comp_t *comp;
+	int *reverse;
+	switch_t *s;
+	switch_t **switches;
+
+	comp = malloc(num_switches * sizeof(comp_t));
+	reverse = malloc(num_switches * sizeof(int));
+	switches = malloc(num_switches * sizeof(switch_t *));
+	if (!comp || !reverse || !switches) {
+		OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR,
+			"Failed memory allocation - switches not sorted!\n");
+		goto Exit;
+	}
+
+	for (i = 0; i < num_switches; i++) {
+		comp[i].index = i;
+		comp[i].ctx.mesh = mesh;
+		comp[i].ctx.p_lash = p_lash;
+	}
+
+	qsort(comp, num_switches, sizeof(comp_t), compare_switches);
+
+	for (i = 0; i < num_switches; i++)
+		reverse[comp[i].index] = i;
+
+	for (i = 0; i < num_switches; i++) {
+		s = p_lash->switches[comp[i].index];
+		switches[i] = s;
+		s->id = i;
+		for (j = 0; j < s->node->num_links; j++)
+			s->node->links[j]->switch_id =
+				reverse[s->node->links[j]->switch_id];
+	}
+
+	for (i = 0; i < num_switches; i++)
+		p_lash->switches[i] = switches[i];
+
+Exit:
+	if (switches)
+		free(switches);
+	if (comp)
+		free(comp);
+	if (reverse)
+		free(reverse);
+}
+
+/*
  * osm_mesh_delete - free per mesh resources
  */
 static void mesh_delete(mesh_t *mesh)
@@ -1470,6 +1558,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 		if (reorder_links(p_lash, mesh))
 			goto err;
 
+		sort_switches(p_lash, mesh);
+
 		p = buf;
 		p += sprintf(p, "found ");
 		for (i = 0; i < mesh->dimension; i++)
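
As a stand-alone illustration of 'odometer order', the sketch below sorts a
small 2-D mesh so that the last axis in dim_order changes fastest. It is a
simplification of the patch, not a copy of it: struct node, DIMS, the fixed
dim_order and sort_odometer() are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define DIMS 2	/* hypothetical 2-D mesh for the sketch */

struct node {
	int coord[DIMS];
};

/* dim_order[0] is the slowest-changing axis, dim_order[DIMS - 1] the
 * fastest. Kept at file scope only because qsort() passes no context
 * argument to the comparison function. */
static const int dim_order[DIMS] = { 0, 1 };

/* Compare coordinates axis by axis, most significant axis first. */
static int compare_nodes(const void *p1, const void *p2)
{
	const struct node *a = p1, *b = p2;
	int i;

	for (i = 0; i < DIMS; i++) {
		int j = dim_order[i];

		if (a->coord[j] != b->coord[j])
			return a->coord[j] > b->coord[j] ? 1 : -1;
	}
	return 0;
}

static void sort_odometer(struct node *nodes, size_t n)
{
	assert(nodes != NULL || n == 0);
	qsort(nodes, n, sizeof(*nodes), compare_nodes);
}
```

For a 2 x 3 mesh this yields (0,0), (0,1), (0,2), (1,0), (1,1), (1,2). The
patch carries the context inside each sorted element instead of using file
scope, which keeps the comparison reentrant.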


From sashak at voltaire.com  Wed Aug  5 12:04:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 22:04:42 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is
 144 before checking for NodeDescription changed
In-Reply-To: <20090804124717.GA12236@comcast.net>
References: <20090804124717.GA12236@comcast.net>
Message-ID: <20090805190442.GZ7993@me>

On 08:47 Tue 04 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index bf39926..925cb27 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm,
>  		}
>  	}
>  
> -	/* Check for node description update. IB Spec v1.2.1 pg 823 */
> -	if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
> -	    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
> -		OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n");
> -
> -		if (p_physp) {
> -			CL_PLOCK_ACQUIRE(sm->p_lock);
> -			osm_req_get_node_desc(sm, p_physp);
> -			CL_PLOCK_RELEASE(sm->p_lock);
> -		} else {
> -			OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> -				"ERR 3812: No physical port found for "
> -				"trap 144: \"node description update\"\n");
> +	if (ib_notice_is_generic(p_ntci)) {
> +		/* Check for node description update. IB Spec v1.2.1 pg 823 */
> +		if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
> +			if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
> +			    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
> +				OSM_LOG(sm->p_log, OSM_LOG_INFO,
> +					"Trap 144 Node description update\n");
> +
> +				if (p_physp) {
> +					CL_PLOCK_ACQUIRE(sm->p_lock);
> +					osm_req_get_node_desc(sm, p_physp);
> +					CL_PLOCK_RELEASE(sm->p_lock);
> +				} else
> +					OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> +						"ERR 3812: No physical port found for "
> +						"trap 144: \"node description update\"\n");
> +			}
>  		}
> -	}
>  
> -	/* do a sweep if we received a trap */
> -	if (sm->p_subn->opt.sweep_on_trap) {
> -		/* if this is trap number 128 or run_heavy_sweep is TRUE -
> -		   update the force_heavy_sweep flag of the subnet.
> -		   Sweep also on traps 144/145 - these traps signal a change of
> -		   certain port capabilities/system image guid.
> -		   TODO: In the future this can be changed to just getting
> -		   PortInfo on this port instead of sweeping the entire subnet. */
> -		if (ib_notice_is_generic(p_ntci) &&
> -		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> -		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> -		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> -		     run_heavy_sweep)) {
> -			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> -				"Forcing heavy sweep. Received trap:%u\n",
> -				cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> +		/* do a sweep if we received a trap */
> +		if (sm->p_subn->opt.sweep_on_trap) {
> +			/* if this is trap number 128 or run_heavy_sweep is
> +			   TRUE - update the force_heavy_sweep flag of the
> +			   subnet. Also, sweep also on traps 144/145 -
> +			   these traps signal a change of certain port
> +			   capabilities/system image guid.
> +			   TODO: In the future this can be changed to just
> +			   getting PortInfo on this port instead of sweeping
> +			   the entire subnet. */
> +			if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> +			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> +			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> +			    run_heavy_sweep) {
> +				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> +					"Forcing heavy sweep. Received trap:%u\n",
> +					cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
>  
> -			sm->p_subn->force_heavy_sweep = TRUE;
> +				sm->p_subn->force_heavy_sweep = TRUE;
> +			}
> +			osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
>  		}
> -		osm_sm_signal(sm, OSM_SIGNAL_SWEEP);

Actually this disables sweep (light) on non-generic traps. Was it a desired
change? Could you see any potential issues with it?

Sasha

>  	}
>  
>  	/* If we reached here due to trap 129/130/131 - do not need to do
> 
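
[Editorial sketch] The behaviour change noted in the review above can be
modelled in a few lines of C. This is only an illustration under stated
assumptions: the enum and function names are hypothetical, the trap number is
taken in host byte order, and "heavy" stands in for setting force_heavy_sweep
before signalling the sweep. It is not the OpenSM code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the sweep outcome for one received trap. */
enum sweep { SWEEP_NONE, SWEEP_LIGHT, SWEEP_HEAVY };

/* Conditions that force a heavy sweep in the quoted code. */
bool wants_heavy(uint16_t trap_num, bool run_heavy_sweep)
{
	return trap_num == 128 || trap_num == 144 || trap_num == 145 ||
	       run_heavy_sweep;
}

/* Before the patch: every trap signals a sweep when sweep_on_trap is set;
 * only generic traps can upgrade it to a heavy sweep. */
enum sweep sweep_before_patch(bool sweep_on_trap, bool is_generic,
			      uint16_t trap_num, bool run_heavy_sweep)
{
	if (!sweep_on_trap)
		return SWEEP_NONE;
	if (is_generic && wants_heavy(trap_num, run_heavy_sweep))
		return SWEEP_HEAVY;
	return SWEEP_LIGHT;
}

/* v1 of the patch: the whole sweep block sits inside the
 * ib_notice_is_generic() branch, so non-generic traps never sweep. */
enum sweep sweep_v1(bool sweep_on_trap, bool is_generic,
		    uint16_t trap_num, bool run_heavy_sweep)
{
	if (!is_generic || !sweep_on_trap)
		return SWEEP_NONE;
	return wants_heavy(trap_num, run_heavy_sweep) ? SWEEP_HEAVY
						      : SWEEP_LIGHT;
}
```

For a non-generic trap with sweep_on_trap set, the first function yields a
light sweep and the second yields none, which is the regression pointed out
in the review.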


From hnrose at comcast.net  Wed Aug  5 12:03:44 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 15:03:44 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mesh.h: Fix SFW copyright
Message-ID: <20090805190344.GA28221@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
index 173fa86..3800372 100644
--- a/opensm/include/opensm/osm_mesh.h
+++ b/opensm/include/opensm/osm_mesh.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2088      System Fabric Works, Inc.
+ * Copyright (c) 2008,2009  System Fabric Works, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU


From hal.rosenstock at gmail.com  Wed Aug  5 12:10:59 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 15:10:59 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is
	144 before checking for NodeDescription changed
In-Reply-To: <20090805190442.GZ7993@me>
References: <20090804124717.GA12236@comcast.net> <20090805190442.GZ7993@me>
Message-ID: <f0e08f230908051210s7a1ad8fiea3a1cb76df232e1@mail.gmail.com>

On Wed, Aug 5, 2009 at 3:04 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

>  On 08:47 Tue 04 Aug     , Hal Rosenstock wrote:
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> > ---
> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> > index bf39926..925cb27 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -2,6 +2,7 @@
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> >   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > + * Copyright (c) 2009 HNR Consulting. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm,
> >               }
> >       }
> >
> > -     /* Check for node description update. IB Spec v1.2.1 pg 823 */
> > -     if (p_ntci->data_details.ntc_144.local_changes &
> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
> > -         p_ntci->data_details.ntc_144.change_flgs &
> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
> > -             OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description
> update\n");
> > -
> > -             if (p_physp) {
> > -                     CL_PLOCK_ACQUIRE(sm->p_lock);
> > -                     osm_req_get_node_desc(sm, p_physp);
> > -                     CL_PLOCK_RELEASE(sm->p_lock);
> > -             } else {
> > -                     OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> > -                             "ERR 3812: No physical port found for "
> > -                             "trap 144: \"node description update\"\n");
> > +     if (ib_notice_is_generic(p_ntci)) {
> > +             /* Check for node description update. IB Spec v1.2.1 pg 823
> */
> > +             if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
> > +                     if (p_ntci->data_details.ntc_144.local_changes &
> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
> > +                         p_ntci->data_details.ntc_144.change_flgs &
> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
> > +                             OSM_LOG(sm->p_log, OSM_LOG_INFO,
> > +                                     "Trap 144 Node description
> update\n");
> > +
> > +                             if (p_physp) {
> > +                                     CL_PLOCK_ACQUIRE(sm->p_lock);
> > +                                     osm_req_get_node_desc(sm, p_physp);
> > +                                     CL_PLOCK_RELEASE(sm->p_lock);
> > +                             } else
> > +                                     OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> > +                                             "ERR 3812: No physical port
> found for "
> > +                                             "trap 144: \"node
> description update\"\n");
> > +                     }
> >               }
> > -     }
> >
> > -     /* do a sweep if we received a trap */
> > -     if (sm->p_subn->opt.sweep_on_trap) {
> > -             /* if this is trap number 128 or run_heavy_sweep is TRUE -
> > -                update the force_heavy_sweep flag of the subnet.
> > -                Sweep also on traps 144/145 - these traps signal a
> change of
> > -                certain port capabilities/system image guid.
> > -                TODO: In the future this can be changed to just getting
> > -                PortInfo on this port instead of sweeping the entire
> subnet. */
> > -             if (ib_notice_is_generic(p_ntci) &&
> > -                 (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> > -                  run_heavy_sweep)) {
> > -                     OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> > -                             "Forcing heavy sweep. Received trap:%u\n",
> > -
> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> > +             /* do a sweep if we received a trap */
> > +             if (sm->p_subn->opt.sweep_on_trap) {
> > +                     /* if this is trap number 128 or run_heavy_sweep is
> > +                        TRUE - update the force_heavy_sweep flag of the
> > +                        subnet. Also, sweep also on traps 144/145 -
> > +                        these traps signal a change of certain port
> > +                        capabilities/system image guid.
> > +                        TODO: In the future this can be changed to just
> > +                        getting PortInfo on this port instead of
> sweeping
> > +                        the entire subnet. */
> > +                     if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 128 ||
> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 144 ||
> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 145 ||
> > +                         run_heavy_sweep) {
> > +                             OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> > +                                     "Forcing heavy sweep. Received
> trap:%u\n",
> > +
> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> >
> > -                     sm->p_subn->force_heavy_sweep = TRUE;
> > +                             sm->p_subn->force_heavy_sweep = TRUE;
> > +                     }
> > +                     osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
> >               }
> > -             osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
>
> Actually this disables sweep (light) on non-generic traps. Was it a desired
> change?


It was unintended; I'll resubmit adding that back.

-- Hal


> Could you see any potential issues with it?
>
> Sasha
>
> >       }
> >
> >       /* If we reached here due to trap 129/130/131 - do not need to do
> >

From hnrose at comcast.net  Wed Aug  5 12:19:41 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 15:19:41 -0400
Subject: [ofa-general] [PATCHv2] opensm/osm_trap_rcv.c: Validate trap is 144
	before checking for NodeDescription changed
Message-ID: <20090805191941.GA29886@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Add back the light sweep on non-generic traps, which was inadvertently removed
in the original version of the patch

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index bf39926..26a052e 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -546,43 +547,49 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 		}
 	}
 
-	/* Check for node description update. IB Spec v1.2.1 pg 823 */
-	if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
-	    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
-		OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n");
-
-		if (p_physp) {
-			CL_PLOCK_ACQUIRE(sm->p_lock);
-			osm_req_get_node_desc(sm, p_physp);
-			CL_PLOCK_RELEASE(sm->p_lock);
-		} else {
-			OSM_LOG(sm->p_log, OSM_LOG_ERROR,
-				"ERR 3812: No physical port found for "
-				"trap 144: \"node description update\"\n");
+	if (ib_notice_is_generic(p_ntci)) {
+		/* Check for node description update. IB Spec v1.2.1 pg 823 */
+		if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
+			if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
+			    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
+				OSM_LOG(sm->p_log, OSM_LOG_INFO,
+					"Trap 144 Node description update\n");
+
+				if (p_physp) {
+					CL_PLOCK_ACQUIRE(sm->p_lock);
+					osm_req_get_node_desc(sm, p_physp);
+					CL_PLOCK_RELEASE(sm->p_lock);
+				} else
+					OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+						"ERR 3812: No physical port found for "
+						"trap 144: \"node description update\"\n");
+			}
 		}
-	}
 
-	/* do a sweep if we received a trap */
-	if (sm->p_subn->opt.sweep_on_trap) {
-		/* if this is trap number 128 or run_heavy_sweep is TRUE -
-		   update the force_heavy_sweep flag of the subnet.
-		   Sweep also on traps 144/145 - these traps signal a change of
-		   certain port capabilities/system image guid.
-		   TODO: In the future this can be changed to just getting
-		   PortInfo on this port instead of sweeping the entire subnet. */
-		if (ib_notice_is_generic(p_ntci) &&
-		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
-		     run_heavy_sweep)) {
-			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-				"Forcing heavy sweep. Received trap:%u\n",
-				cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
+		/* do a sweep if we received a trap */
+		if (sm->p_subn->opt.sweep_on_trap) {
+			/* if this is trap number 128 or run_heavy_sweep is
+			   TRUE - update the force_heavy_sweep flag of the
+			   subnet. Also, sweep also on traps 144/145 -
+			   these traps signal a change of certain port
+			   capabilities/system image guid.
+			   TODO: In the future this can be changed to just
+			   getting PortInfo on this port instead of sweeping
+			   the entire subnet. */
+			if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
+			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
+			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
+			    run_heavy_sweep) {
+				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
+					"Forcing heavy sweep. Received trap:%u\n",
+					cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
 
-			sm->p_subn->force_heavy_sweep = TRUE;
+				sm->p_subn->force_heavy_sweep = TRUE;
+			}
+			osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
 		}
+	} else if (sm->p_subn->opt.sweep_on_trap)
 		osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
-	}
 
 	/* If we reached here due to trap 129/130/131 - do not need to do
 	   the notice report. Just goto exit. We know this is the case


From sashak at voltaire.com  Wed Aug  5 12:46:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 22:46:11 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_trap_rcv.c: Validate trap is
 144 before checking for NodeDescription changed
In-Reply-To: <20090805191941.GA29886@comcast.net>
References: <20090805191941.GA29886@comcast.net>
Message-ID: <20090805194611.GA7993@me>

On 15:19 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> -	/* do a sweep if we received a trap */
> -	if (sm->p_subn->opt.sweep_on_trap) {
> -		/* if this is trap number 128 or run_heavy_sweep is TRUE -
> -		   update the force_heavy_sweep flag of the subnet.
> -		   Sweep also on traps 144/145 - these traps signal a change of
> -		   certain port capabilities/system image guid.
> -		   TODO: In the future this can be changed to just getting
> -		   PortInfo on this port instead of sweeping the entire subnet. */
> -		if (ib_notice_is_generic(p_ntci) &&
> -		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> -		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> -		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> -		     run_heavy_sweep)) {
> -			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> -				"Forcing heavy sweep. Received trap:%u\n",
> -				cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> +		/* do a sweep if we received a trap */
> +		if (sm->p_subn->opt.sweep_on_trap) {
> +			/* if this is trap number 128 or run_heavy_sweep is
> +			   TRUE - update the force_heavy_sweep flag of the
> +			   subnet. Also, sweep also on traps 144/145 -
> +			   these traps signal a change of certain port
> +			   capabilities/system image guid.
> +			   TODO: In the future this can be changed to just
> +			   getting PortInfo on this port instead of sweeping
> +			   the entire subnet. */
> +			if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> +			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> +			    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> +			    run_heavy_sweep) {
> +				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> +					"Forcing heavy sweep. Received trap:%u\n",
> +					cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
>  
> -			sm->p_subn->force_heavy_sweep = TRUE;
> +				sm->p_subn->force_heavy_sweep = TRUE;
> +			}
> +			osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
>  		}
> +	} else if (sm->p_subn->opt.sweep_on_trap)
>  		osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
> -	}

For me this part seems simpler in the original code, so I applied this
patch as:

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index bf39926..d2e4202 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -547,7 +548,9 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 	}
 
 	/* Check for node description update. IB Spec v1.2.1 pg 823 */
-	if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
+	if (ib_notice_is_generic(p_ntci) &&
+	    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 &&
+	    p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
 	    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
 		OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n");
 
@@ -555,11 +558,10 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 			CL_PLOCK_ACQUIRE(sm->p_lock);
 			osm_req_get_node_desc(sm, p_physp);
 			CL_PLOCK_RELEASE(sm->p_lock);
-		} else {
+		} else
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR,
 				"ERR 3812: No physical port found for "
 				"trap 144: \"node description update\"\n");
-		}
 	}
 
 	/* do a sweep if we received a trap */


Hope it is fine for you.

Sasha
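
[Editorial sketch] The reason for checking the trap number first is that the
notice data details are a union, so the ntc_144 fields only mean something
when the notice really is generic trap 144. A minimal model of the guarded
check follows. The struct layout, the host-order trap number, and both mask
values are placeholders chosen for illustration; they are not the definitions
from ib_types.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder mask values, NOT the real ib_types.h definitions. */
#define MASK_OTHER_LOCAL_CHANGES 0x01
#define MASK_NODE_DESC_CHANGE    0x0001

/* Simplified stand-in for a received notice. */
struct notice {
	bool is_generic;
	uint16_t trap_num;		/* host byte order in this sketch */
	union {
		struct {
			uint8_t local_changes;
			uint16_t change_flgs;
		} ntc_144;
		uint8_t raw[54];	/* payload of any other trap type */
	} data_details;
};

/* Only look at the trap 144 view of the union when the notice is a
 * generic trap 144, mirroring the condition in the applied patch. */
bool is_node_desc_update(const struct notice *n)
{
	return n->is_generic && n->trap_num == 144 &&
	       (n->data_details.ntc_144.local_changes &
		MASK_OTHER_LOCAL_CHANGES) &&
	       (n->data_details.ntc_144.change_flgs & MASK_NODE_DESC_CHANGE);
}
```

Without the first two conditions, the payload of an unrelated trap whose
bytes happen to have those bits set would be reported as a node description
update.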


From hal.rosenstock at gmail.com  Wed Aug  5 12:48:51 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 15:48:51 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_trap_rcv.c: Validate trap 
	is 144 before checking for NodeDescription changed
In-Reply-To: <20090805194611.GA7993@me>
References: <20090805191941.GA29886@comcast.net> <20090805194611.GA7993@me>
Message-ID: <f0e08f230908051248r63d3544ag3d7411b084a060df@mail.gmail.com>

On Wed, Aug 5, 2009 at 3:46 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

>  On 15:19 Wed 05 Aug     , Hal Rosenstock wrote:
> >
> > -     /* do a sweep if we received a trap */
> > -     if (sm->p_subn->opt.sweep_on_trap) {
> > -             /* if this is trap number 128 or run_heavy_sweep is TRUE -
> > -                update the force_heavy_sweep flag of the subnet.
> > -                Sweep also on traps 144/145 - these traps signal a
> change of
> > -                certain port capabilities/system image guid.
> > -                TODO: In the future this can be changed to just getting
> > -                PortInfo on this port instead of sweeping the entire
> subnet. */
> > -             if (ib_notice_is_generic(p_ntci) &&
> > -                 (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
> > -                  run_heavy_sweep)) {
> > -                     OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> > -                             "Forcing heavy sweep. Received trap:%u\n",
> > -
> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> > +             /* do a sweep if we received a trap */
> > +             if (sm->p_subn->opt.sweep_on_trap) {
> > +                     /* if this is trap number 128 or run_heavy_sweep is
> > +                        TRUE - update the force_heavy_sweep flag of the
> > +                        subnet. Also, sweep also on traps 144/145 -
> > +                        these traps signal a change of certain port
> > +                        capabilities/system image guid.
> > +                        TODO: In the future this can be changed to just
> > +                        getting PortInfo on this port instead of
> sweeping
> > +                        the entire subnet. */
> > +                     if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 128 ||
> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 144 ||
> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
> 145 ||
> > +                         run_heavy_sweep) {
> > +                             OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
> > +                                     "Forcing heavy sweep. Received
> trap:%u\n",
> > +
> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
> >
> > -                     sm->p_subn->force_heavy_sweep = TRUE;
> > +                             sm->p_subn->force_heavy_sweep = TRUE;
> > +                     }
> > +                     osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
> >               }
> > +     } else if (sm->p_subn->opt.sweep_on_trap)
> >               osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
> > -     }
>
> For me this part seems simpler in the original code, so I applied this
> patch as:
>
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index bf39926..d2e4202 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -2,6 +2,7 @@
>  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
>  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>  *
>  * This software is available to you under a choice of one of two
>  * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -547,7 +548,9 @@ trap_rcv_process_request(IN osm_sm_t * sm,
>        }
>
>        /* Check for node description update. IB Spec v1.2.1 pg 823 */
> -       if (p_ntci->data_details.ntc_144.local_changes &
> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
> +       if (ib_notice_is_generic(p_ntci) &&
> +           cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 &&
> +           p_ntci->data_details.ntc_144.local_changes &
> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
>            p_ntci->data_details.ntc_144.change_flgs &
> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
>                OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description
> update\n");
>
> @@ -555,11 +558,10 @@ trap_rcv_process_request(IN osm_sm_t * sm,
>                        CL_PLOCK_ACQUIRE(sm->p_lock);
>                        osm_req_get_node_desc(sm, p_physp);
>                        CL_PLOCK_RELEASE(sm->p_lock);
> -               } else {
> +               } else
>                        OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>                                "ERR 3812: No physical port found for "
>                                "trap 144: \"node description update\"\n");
> -               }
>        }
>
>        /* do a sweep if we received a trap */
>
>
> Hope it is fine for you.


Sure (that's a smaller/simpler change). I'll retest to be sure when it's in
the tree.

There's more coming which may head more towards where I was trying to go but
we'll see what happens with the next steps with this...

-- Hal


>
>
> Sasha

From sashak at voltaire.com  Wed Aug  5 12:50:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 22:50:18 +0300
Subject: [ofa-general] Re: [PATCHv3] opensm/osm_mesh.c: Reorder switches for
	lash
In-Reply-To: <20090805184822.GA21614@comcast.net>
References: <20090805184822.GA21614@comcast.net>
Message-ID: <20090805195018.GB7993@me>

On 14:48 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> The goal of this patch is to change the order of the switches in the array kept
> in the lash context from the original order to one in which the switches are
> presented in 'odometer order'.
> 
> When the main routine in lash is called the switches are in an order that is
> likely based on the order that the switches were originally visited by SM 
> topology discovery which is some sort of tree walk. All of the analysis up to
> this point is independent of the actual order of the switches, but lash will
> use that order to enumerate the paths in the fabric and add them to the VL bins.
> 
> Odometer order means that the switches are labelled s[X0, ..., Xn-1] and
> ordered s[0, ..., 0], s[0, ..., 1], s[0, ..., Ln-1], s[0, .. 1, 0] etc.
> The dimensions are also reordered so that the dimension changing the fastest
> has the largest length, i.e. Ln >= Ln-1 >= ... >= L1. [All this is modulo
> possible end to end reversal but the basic idea is that the longest axis
> changes fastest.]
> 
> TO INVESTIGATE: Rather than using an additional switches array in
> sort_switches whether it can be done in place using p_lash->switches.
> 
> Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha
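
[Editorial sketch] The "odometer order" described in the commit message can
be illustrated with a small ranking function. It assumes each switch already
has mesh coordinates and that the length of every dimension is known. The
function names and the permutation array are hypothetical. This is not the
sort_switches() routine from the patch, and it ignores the possible
end-to-end reversal the message mentions.

```c
/* Build a permutation of the dimensions sorted by ascending length, so
 * that the longest dimension comes last and therefore changes fastest. */
void order_dims(const int *len, int ndim, int *perm)
{
	int i, j, p;

	for (i = 0; i < ndim; i++)
		perm[i] = i;

	/* stable insertion sort on dimension length */
	for (i = 1; i < ndim; i++) {
		p = perm[i];
		j = i;
		while (j > 0 && len[perm[j - 1]] > len[p]) {
			perm[j] = perm[j - 1];
			j--;
		}
		perm[j] = p;
	}
}

/* Position of a switch in odometer order: the coordinate in dimension
 * perm[0] is the most significant digit, perm[ndim - 1] the least. */
unsigned odometer_rank(const int *coord, const int *len,
		       const int *perm, int ndim)
{
	unsigned rank = 0;
	int i;

	for (i = 0; i < ndim; i++)
		rank = rank * (unsigned)len[perm[i]] +
		       (unsigned)coord[perm[i]];
	return rank;
}
```

Sorting the switch array by this rank gives the ordering the message
describes. For a 3 x 2 mesh the dimension of length 3 becomes the fastest
changing one, so the switch at coordinates (2, 1) is the last of the six.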


From sashak at voltaire.com  Wed Aug  5 12:50:37 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 22:50:37 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mesh.h: Fix SFW copyright
In-Reply-To: <20090805190344.GA28221@comcast.net>
References: <20090805190344.GA28221@comcast.net>
Message-ID: <20090805195037.GC7993@me>

On 15:03 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From rdreier at cisco.com  Wed Aug  5 13:04:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 13:04:44 -0700
Subject: [ofa-general] Re: [PATCH linux-next 1/5] RDMA/cxgb3: unregister
	leaks memory.
In-Reply-To: <20090731193225.2550.35448.stgit@build.ogc.int> (Steve Wise's
	message of "Fri, 31 Jul 2009 14:32:25 -0500")
References: <20090731193225.2550.35448.stgit@build.ogc.int>
Message-ID: <adaljlyhvs3.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Wed Aug  5 13:06:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 13:06:22 -0700
Subject: [ofa-general] Re: [PATCH linux-next 3/5] RDMA/cxgb3: wake up any
	waiters on peer close/abort.
In-Reply-To: <20090731193235.2550.20835.stgit@build.ogc.int> (Steve Wise's
	message of "Fri, 31 Jul 2009 14:32:35 -0500")
References: <20090731193225.2550.35448.stgit@build.ogc.int>
	<20090731193235.2550.20835.stgit@build.ogc.int>
Message-ID: <adahbwmhvpd.fsf@cisco.com>

this one won't apply without 2/5, so I'll wait for you to resend both patches...


From rdreier at cisco.com  Wed Aug  5 13:06:28 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 13:06:28 -0700
Subject: [ofa-general] Re: [PATCH linux-next 4/5] RDMA/cxgb3: Set the
	appropriate IO channel in rdma_init work requests.
In-Reply-To: <20090731193241.2550.43016.stgit@build.ogc.int> (Steve Wise's
	message of "Fri, 31 Jul 2009 14:32:41 -0500")
References: <20090731193225.2550.35448.stgit@build.ogc.int>
	<20090731193241.2550.43016.stgit@build.ogc.int>
Message-ID: <adad47ahvp7.fsf@cisco.com>

thanks, applied this and 5/5.


From rdreier at cisco.com  Wed Aug  5 13:38:30 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 13:38:30 -0700
Subject: [ofa-general] Re: [PATCH] RDMA/nes: map MTU to IB_MTU_* and
	correctly report link state
In-Reply-To: <20090710204506.GA5060@ctung-MOBL> (Chien Tung's message of "Fri, 
	10 Jul 2009 15:45:06 -0500")
References: <20090710204506.GA5060@ctung-MOBL>
Message-ID: <ada8whyhu7t.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Wed Aug  5 13:39:58 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 13:39:58 -0700
Subject: [ofa-general] Re: [PATCH] ipath: strncpy does not null terminate
	string
In-Reply-To: <4A608922.7060900@gmail.com> (Roel Kluin's message of "Fri, 17
	Jul 2009 16:22:26 +0200")
References: <4A607754.4040204@gmail.com> <4A608922.7060900@gmail.com>
Message-ID: <ada4osmhu5d.fsf@cisco.com>


 > --- a/drivers/infiniband/hw/ipath/ipath_mad.c
 > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c
 > @@ -60,7 +60,7 @@ static int recv_subn_get_nodedescription(struct ib_smp *smp,
 >  	if (smp->attr_mod)
 >  		smp->status |= IB_SMP_INVALID_FIELD;
 >  
 > -	strncpy(smp->data, ibdev->node_desc, sizeof(smp->data));
 > +	strlcpy(smp->data, ibdev->node_desc, sizeof(smp->data));
 >  
 >  	return reply(smp);
 >  }

node_desc isn't really a string, is it?  Seems that we should be
using memcpy() here (since I think it is perfectly valid according to
the IB architecture to have NULs in the node description)

 - R.
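
[Editorial sketch] The difference between the string copy and the memcpy()
suggested above shows up as soon as the description contains an embedded NUL.
The example below is plain user-space C. The function names are hypothetical,
and the 64-byte length is an assumption based on the size of the
NodeDescription field and of the SMP data area.

```c
#include <string.h>

#define NODE_DESC_LEN 64	/* assumed size of the NodeDescription field */

/* Copy the field as opaque bytes: embedded NULs are preserved. */
void copy_node_desc(unsigned char *dst, const unsigned char *src)
{
	memcpy(dst, src, NODE_DESC_LEN);
}

/* Copy the field as a C string: strncpy() stops reading at the first
 * NUL and zero-fills the rest, so bytes after that NUL are lost. */
void copy_node_desc_as_string(unsigned char *dst, const unsigned char *src)
{
	strncpy((char *)dst, (const char *)src, NODE_DESC_LEN);
}
```

With a source buffer of 'a', 'b', NUL, 'c', 'd', the memcpy() version copies
all five bytes, while the string version leaves zeros where 'c' and 'd'
should be.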


From jgunthorpe at obsidianresearch.com  Wed Aug  5 13:42:59 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 5 Aug 2009 14:42:59 -0600
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow
	interfaces to correspond to each other
In-Reply-To: <20090805083023.GK5599@mtls03>
References: <20090805083023.GK5599@mtls03>
Message-ID: <20090805204259.GB16677@obsidianresearch.com>

On Wed, Aug 05, 2009 at 11:30:23AM +0300, Eli Cohen wrote:

> for setting the GID table of a port has been added. Currently, each
> IB port has a single GID entry in its table and that GID entry
> equals the link local IPv6 address.

FWIW, I like this approach, and mapping to/from this GID to the MAC
without an ND operation nicely divorces the RDMAoE stuff from the IPv6
stack.

What about multicast though? Switches are going to have trouble with
group membership lists for non-IP packets. Even just sending an ICMPv6
packet (with an IPv6 ethertype) isn't guaranteed to fix it.

Jason


From sean.hefty at intel.com  Wed Aug  5 13:43:12 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 5 Aug 2009 13:43:12 -0700
Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device
	personality	from node type to port type
In-Reply-To: <20090805082808.GB5599@mtls03>
References: <20090805082808.GB5599@mtls03>
Message-ID: <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com>

>As a preparation to devices that, in general, support different transport
>protocol for each port, specifically RDMAoE, this patch defines transport type
>for each of a device's ports. As a result rdma_node_get_transport() has been
>unexported and is used internally by the implementation of the new API,
>rdma_port_get_transport() which gives the transport protocol of the queried
>port. All references to rdma_node_get_transport() are changed to use
>rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum
>rdma_transport_type.

Can resources (PDs, CQs, MRs, etc.) between the different transports be shared?
Does QP failover between transports work?

>diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>index 5130fc5..f930f1d 100644
>--- a/drivers/infiniband/core/cm.c
>+++ b/drivers/infiniband/core/cm.c
>@@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device)
> 	unsigned long flags;
> 	int ret;
> 	u8 i;
>-
>-	if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
>-		return;

Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask
of the supported transports available on the device?  I'm wondering if something
like this makes sense, to allow skipping devices that are not of interest to a
particular module.  This would be in addition to the rdma_port_get_transport
call.

There's just a lot of new checks to handle the transport on a port by port
basis.

- Sean
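
A rough sketch of the bitmask idea, so that a module could skip a device with a single test instead of checking every port. All types and names below (demo_transport, demo_device, demo_device_transports) are hypothetical stand-ins, not the ib_core API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport identifiers; one bit per value in the mask. */
enum demo_transport {
	DEMO_TRANSPORT_IB,
	DEMO_TRANSPORT_IWARP,
	DEMO_TRANSPORT_RDMAOE
};

/* Hypothetical device: an array holding the transport of each port. */
struct demo_device {
	size_t num_ports;
	const enum demo_transport *port_transport;
};

/* OR together one bit for every transport found on any port. */
static unsigned int demo_device_transports(const struct demo_device *dev)
{
	unsigned int mask = 0;
	size_t i;

	assert(dev);
	for (i = 0; i < dev->num_ports; i++)
		mask |= 1u << dev->port_transport[i];
	return mask;
}

/* Nonzero if at least one port of the device uses transport t. */
static int demo_device_has_transport(const struct demo_device *dev,
				     enum demo_transport t)
{
	return (demo_device_transports(dev) & (1u << t)) != 0;
}
```

A client such as the CM could then return early from its add_one callback when demo_device_has_transport() is zero for the transport it cares about, and still use the per-port query for the ports it keeps.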


From rdreier at cisco.com  Wed Aug  5 14:10:38 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 14:10:38 -0700
Subject: [ofa-general] [PATCH] cma: fix access to freed memory
In-Reply-To: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> (Sean
	Hefty's message of "Wed, 5 Aug 2009 08:46:43 -0700")
References: <20090803092528.GA25528@mtls03>
	<04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com>
Message-ID: <adaws5ige5t.fsf@cisco.com>


 > rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast.  This call
 > will block until the join callback completes or is canceled.  Can you describe
 > the race with cma_ib_mc_handler in more detail?
 > 
 > Also, cma_leave_mc_groups is only called from rdma_destroy_id.  Locking around
 > the mc->list shouldn't be required, since calls to join/leave aren't allowed.

So where does this leave things?  Is any part of Eli's patch needed?

 - R.


From sean.hefty at intel.com  Wed Aug  5 14:17:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 5 Aug 2009 14:17:10 -0700
Subject: [ofa-general] [PATCH] cma: fix access to freed memory
In-Reply-To: <adaws5ige5t.fsf@cisco.com>
References: <20090803092528.GA25528@mtls03>	<04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com>
	<adaws5ige5t.fsf@cisco.com>
Message-ID: <ED38826676C54A0496F9313799678669@amr.corp.intel.com>

>So where does this leave things?  Is any part of Eli's patch needed?

I don't believe the patch is needed, and Eli agreed with this.

- Sean


From hal.rosenstock at gmail.com  Wed Aug  5 14:48:06 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 5 Aug 2009 17:48:06 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is
	144 before checking for NodeDescription changed
In-Reply-To: <f0e08f230908051210s7a1ad8fiea3a1cb76df232e1@mail.gmail.com>
References: <20090804124717.GA12236@comcast.net> <20090805190442.GZ7993@me>
	<f0e08f230908051210s7a1ad8fiea3a1cb76df232e1@mail.gmail.com>
Message-ID: <f0e08f230908051448g5c218f86p1a16c616e4ab7f17@mail.gmail.com>

On Wed, Aug 5, 2009 at 3:10 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

>
>
>   On Wed, Aug 5, 2009 at 3:04 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>>  On 08:47 Tue 04 Aug     , Hal Rosenstock wrote:
>> >
>> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> > ---
>> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
>> > index bf39926..925cb27 100644
>> > --- a/opensm/opensm/osm_trap_rcv.c
>> > +++ b/opensm/opensm/osm_trap_rcv.c
>> > @@ -2,6 +2,7 @@
>> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>> >   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
>> reserved.
>> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> > + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>> >   *
>> >   * This software is available to you under a choice of one of two
>> >   * licenses.  You may choose to be licensed under the terms of the GNU
>> > @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm,
>> >               }
>> >       }
>> >
>> > -     /* Check for node description update. IB Spec v1.2.1 pg 823 */
>> > -     if (p_ntci->data_details.ntc_144.local_changes &
>> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
>> > -         p_ntci->data_details.ntc_144.change_flgs &
>> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
>> > -             OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node
>> description update\n");
>> > -
>> > -             if (p_physp) {
>> > -                     CL_PLOCK_ACQUIRE(sm->p_lock);
>> > -                     osm_req_get_node_desc(sm, p_physp);
>> > -                     CL_PLOCK_RELEASE(sm->p_lock);
>> > -             } else {
>> > -                     OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>> > -                             "ERR 3812: No physical port found for "
>> > -                             "trap 144: \"node description
>> update\"\n");
>> > +     if (ib_notice_is_generic(p_ntci)) {
>> > +             /* Check for node description update. IB Spec v1.2.1 pg
>> 823 */
>> > +             if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
>> > +                     if (p_ntci->data_details.ntc_144.local_changes &
>> TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
>> > +                         p_ntci->data_details.ntc_144.change_flgs &
>> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
>> > +                             OSM_LOG(sm->p_log, OSM_LOG_INFO,
>> > +                                     "Trap 144 Node description
>> update\n");
>> > +
>> > +                             if (p_physp) {
>> > +                                     CL_PLOCK_ACQUIRE(sm->p_lock);
>> > +                                     osm_req_get_node_desc(sm,
>> p_physp);
>> > +                                     CL_PLOCK_RELEASE(sm->p_lock);
>> > +                             } else
>> > +                                     OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>> > +                                             "ERR 3812: No physical
>> port found for "
>> > +                                             "trap 144: \"node
>> description update\"\n");
>> > +                     }
>> >               }
>> > -     }
>> >
>> > -     /* do a sweep if we received a trap */
>> > -     if (sm->p_subn->opt.sweep_on_trap) {
>> > -             /* if this is trap number 128 or run_heavy_sweep is TRUE -
>> > -                update the force_heavy_sweep flag of the subnet.
>> > -                Sweep also on traps 144/145 - these traps signal a
>> change of
>> > -                certain port capabilities/system image guid.
>> > -                TODO: In the future this can be changed to just getting
>> > -                PortInfo on this port instead of sweeping the entire
>> subnet. */
>> > -             if (ib_notice_is_generic(p_ntci) &&
>> > -                 (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
>> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
>> > -                  cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
>> > -                  run_heavy_sweep)) {
>> > -                     OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>> > -                             "Forcing heavy sweep. Received trap:%u\n",
>> > -
>> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
>> > +             /* do a sweep if we received a trap */
>> > +             if (sm->p_subn->opt.sweep_on_trap) {
>> > +                     /* if this is trap number 128 or run_heavy_sweep
>> is
>> > +                        TRUE - update the force_heavy_sweep flag of the
>> > +                        subnet. Also, sweep also on traps 144/145 -
>> > +                        these traps signal a change of certain port
>> > +                        capabilities/system image guid.
>> > +                        TODO: In the future this can be changed to just
>> > +                        getting PortInfo on this port instead of
>> sweeping
>> > +                        the entire subnet. */
>> > +                     if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
>> 128 ||
>> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
>> 144 ||
>> > +                         cl_ntoh16(p_ntci->g_or_v.generic.trap_num) ==
>> 145 ||
>> > +                         run_heavy_sweep) {
>> > +                             OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>> > +                                     "Forcing heavy sweep. Received
>> trap:%u\n",
>> > +
>> cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
>> >
>> > -                     sm->p_subn->force_heavy_sweep = TRUE;
>> > +                             sm->p_subn->force_heavy_sweep = TRUE;
>> > +                     }
>> > +                     osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
>> >               }
>> > -             osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
>>
>> Actually this disables sweep (light) on non generic traps. Was it desired
>> change?
>
>
> It was unintended; I'll resubmit adding that back.
>
> -- Hal
>
>
>> Could you see any potential issues with it?
>
>
In thinking about it, I'm not sure what light sweep on non generic trap
accomplishes anyhow.

-- Hal


>
>>
>> Sasha
>>
>> >       }
>> >
>> >       /* If we reached here due to trap 129/130/131 - do not need to do
>> >
>>
>
>

From hnrose at comcast.net  Wed Aug  5 15:06:13 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 18:06:13 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mesh.c: Remove edges in lash matrix
Message-ID: <20090805220613.GA7155@comcast.net>


The intent of this change to remove edge nodes is to *not* count
them.

The point of this heuristic is to deal with the case of small
lattices which can easily have more surface than interior leading to
choosing a non representative seed. This causes impossible counts to
get reported.

Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 72a9aa9..b5d141d 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -170,6 +170,11 @@ static const struct mesh_info {
 
 	{8, {2, 2, 2, 2, 2, 2, 2, 2},	8, {-1792, -6144, -8960, -7168, -3360, -896, -112, 0, 1},	},
 
+	/*
+	 * mesh errors
+	 */
+	{2, {6, 6},                     4, {-192, -256, -80, 0, 1}, },
+
 	{-1, {0,}, 0, {0, },					},
 };
 
@@ -727,6 +732,36 @@ done:
 }
 
 /*
+ * remove_edges
+ *
+ * remove type from nodes that have fewer links
+ * than adjacent nodes
+ */
+static void remove_edges(lash_t *p_lash)
+{
+	int sw;
+	mesh_node_t *n, *nn;
+	int i;
+
+	for (sw = 0; sw < p_lash->num_switches; sw++) {
+		n = p_lash->switches[sw]->node;
+		if (!n->type)
+			continue;
+
+		for (i = 0; i < n->num_links; i++) {
+			nn = p_lash->switches[n->links[i]->switch_id]->node;
+
+			if (nn->num_links > n->num_links) {
+				printf("removed edge switch %s\n",
+				       p_lash->switches[sw]->p_sw->p_node->print_desc);
+				n->type = -1;
+				break;
+			}
+		}
+	}
+}
+
+/*
  * get_local_geometry
  *
  * analyze the local geometry around each switch
@@ -735,6 +770,7 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
 	int sw;
+	int status = 0;
 
 	OSM_LOG_ENTER(p_log);
 
@@ -747,15 +783,38 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh)
 			continue;
 
 		if (get_switch_metric(p_lash, sw)) {
-			OSM_LOG_EXIT(p_log);
-			return -1;
+			status = -1;
+			goto Exit;
 		}
-		classify_switch(p_lash, mesh, sw);
 		classify_mesh_type(p_lash, sw);
 	}
 
+	remove_edges(p_lash);
+
+	for (sw = 0; sw < p_lash->num_switches; sw++) {
+		if (p_lash->switches[sw]->node->type < 0)
+			continue;
+		classify_switch(p_lash, mesh, sw);
+	}
+
+Exit:
 	OSM_LOG_EXIT(p_log);
-	return 0;
+	return status;
+}
+
+static void print_axis(lash_t *p_lash, int sw, int port)
+{
+	mesh_node_t *node = p_lash->switches[sw]->node;
+	char *name = p_lash->switches[sw]->p_sw->p_node->print_desc;
+	int c = node->axes[port];
+
+	printf("%s[%d] = ", name, port);
+	if (c)
+		printf("%s%c -> ", ((c - 1) & 1) ? "-" : "+", 'X' + (c - 1)/2);
+	else
+		printf("N/A -> ");
+	printf("%s\n",
+	       p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc);
 }
 
 /*
@@ -805,6 +864,11 @@ static void seed_axes(lash_t *p_lash, int sw)
 		}
 	}
 
+	for (i = 0; i < n; i++) {
+		printf("seed: ");
+		print_axis(p_lash, sw, i);
+	}
+
 done:
 	OSM_LOG_EXIT(p_log);
 }
@@ -878,6 +942,12 @@ static void make_geometry(lash_t *p_lash, int sw)
 			n = s1->node->num_links;
 
 			/*
+			 * ignore chain fragments
+			 */
+			if (n < seed->node->num_links && n <= 2)
+				continue;
+
+			/*
 			 * only process 'mesh' switches
 			 */
 			if (!s1->node->matrix)
@@ -908,7 +978,8 @@ static void make_geometry(lash_t *p_lash, int sw)
 					if (j == i)
 						continue;
 
-					if (s1->node->matrix[i][j] != 2) {
+					if (s1->node->matrix[i][j] != 2 &&
+						s1->node->matrix[i][j] <= 4) {
 						if (s1->node->axes[j]) {
 							if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) {
 								OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n");
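
The edge-removal heuristic in this patch can be tried in isolation. The sketch below is a standalone model, not OpenSM code: demo_node, demo_remove_edges and demo_build_grid are invented for illustration, and the grid builder assumes a plain rows x cols lattice without wraparound. The marking rule itself mirrors remove_edges() in the diff: a typed node is dropped when any neighbour has more links than it does.

```c
#include <assert.h>

#define DEMO_MAX_LINKS 4

struct demo_node {
	int type;			/* nonzero: candidate, -1: removed as edge */
	int num_links;
	int links[DEMO_MAX_LINKS];	/* indices of neighbouring nodes */
};

/* Mark nodes with fewer links than some neighbour; return how many. */
static int demo_remove_edges(struct demo_node *nodes, int num_nodes)
{
	int removed = 0;
	int sw, i;

	assert(nodes && num_nodes >= 0);
	for (sw = 0; sw < num_nodes; sw++) {
		struct demo_node *n = &nodes[sw];

		if (!n->type)
			continue;

		for (i = 0; i < n->num_links; i++) {
			if (nodes[n->links[i]].num_links > n->num_links) {
				n->type = -1;
				removed++;
				break;
			}
		}
	}
	return removed;
}

/* Fill nodes[] with a rows x cols lattice (no wraparound links). */
static void demo_build_grid(struct demo_node *nodes, int rows, int cols)
{
	int r, c;

	assert(nodes && rows > 0 && cols > 0);
	for (r = 0; r < rows; r++) {
		for (c = 0; c < cols; c++) {
			struct demo_node *n = &nodes[r * cols + c];

			n->type = 1;
			n->num_links = 0;
			if (r > 0)
				n->links[n->num_links++] = (r - 1) * cols + c;
			if (r < rows - 1)
				n->links[n->num_links++] = (r + 1) * cols + c;
			if (c > 0)
				n->links[n->num_links++] = r * cols + c - 1;
			if (c < cols - 1)
				n->links[n->num_links++] = r * cols + c + 1;
		}
	}
}
```

On a 3x3 lattice built this way, the four corners (2 links) and four side nodes (3 links) are all marked and only the centre node (4 links) stays a candidate, which is the "more surface than interior" case the commit message describes.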


From hnrose at comcast.net  Wed Aug  5 15:27:37 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 5 Aug 2009 18:27:37 -0400
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: In
	trap_rcv_process_request, no
	need to sweep on trap 145 and certain trap 144s
Message-ID: <20090805222737.GA8523@comcast.net>


NodeDescription changed trap only needs to query the new NodeDescription
and not cause sweep

Similarly for SystemImageGUID changed (trap 145)

LinkWidth/SpeedEnabled changed traps (at least right now) and
SM priority changed traps do need to sweep.

In the future, LinkWidth/SpeedEnabled changed trap handling
can query PortInfo (may also need to bounce port too).

Also, as noted in related email thread, it's unclear what
sweeping on non generic traps accomplishes but this behavior
is preserved.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index d2e4202..e5bd529 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -291,8 +291,9 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 	osm_physp_t *p_physp;
 	cl_ptr_vector_t *p_tbl;
 	osm_port_t *p_port;
+	osm_node_t *p_node;
 	ib_net16_t source_lid = 0;
-	boolean_t is_gsi = TRUE;
+	boolean_t is_gsi = TRUE, is_trap144_sweep = FALSE;
 	uint8_t port_num = 0;
 	boolean_t physp_change_trap = FALSE;
 	uint64_t event_wheel_timeout = OSM_DEFAULT_TRAP_SUPRESSION_TIMEOUT;
@@ -547,44 +548,59 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 		}
 	}
 
-	/* Check for node description update. IB Spec v1.2.1 pg 823 */
-	if (ib_notice_is_generic(p_ntci) &&
-	    cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 &&
-	    p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES &&
-	    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
-		OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n");
-
-		if (p_physp) {
-			CL_PLOCK_ACQUIRE(sm->p_lock);
-			osm_req_get_node_desc(sm, p_physp);
-			CL_PLOCK_RELEASE(sm->p_lock);
-		} else
-			OSM_LOG(sm->p_log, OSM_LOG_ERROR,
-				"ERR 3812: No physical port found for "
-				"trap 144: \"node description update\"\n");
-	}
+	if (ib_notice_is_generic(p_ntci)) {
+		/* Check for node description update. IB Spec v1.2.1 pg 823 */
+		if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) {
+			/* update port's capability mask (in PortInfo) */
+			p_physp->port_info.capability_mask = p_ntci->data_details.ntc_144.new_cap_mask;
+			if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES) {
+				if (p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) {
+					OSM_LOG(sm->p_log, OSM_LOG_INFO,
+						"Trap 144 Node description update\n");
+
+					if (p_physp) {
+						CL_PLOCK_ACQUIRE(sm->p_lock);
+						osm_req_get_node_desc(sm, p_physp);
+						CL_PLOCK_RELEASE(sm->p_lock);
+					} else
+						OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+							"ERR 3812: No physical port found for "
+							"trap 144: \"node description update\"\n");
+				}
+			}
+			if (p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_LINK_WIDTH_ENABLE_CHANGE ||
+			    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_LINK_SPEED_ENABLE_CHANGE ||
+			    p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_SM_PRIORITY_CHANGE)
+				is_trap144_sweep = TRUE;
+		}
 
-	/* do a sweep if we received a trap */
-	if (sm->p_subn->opt.sweep_on_trap) {
-		/* if this is trap number 128 or run_heavy_sweep is TRUE -
-		   update the force_heavy_sweep flag of the subnet.
-		   Sweep also on traps 144/145 - these traps signal a change of
-		   certain port capabilities/system image guid.
-		   TODO: In the future this can be changed to just getting
-		   PortInfo on this port instead of sweeping the entire subnet. */
-		if (ib_notice_is_generic(p_ntci) &&
-		    (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 ||
-		     cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 ||
-		     run_heavy_sweep)) {
-			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-				"Forcing heavy sweep. Received trap:%u\n",
-				cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
+		if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145) {
+			/* update system image guid (in NodeInfo) */
+			p_node = osm_physp_get_node_ptr(p_physp);
+			if (p_node)
+				p_node->node_info.node_guid = p_ntci->data_details.ntc_145.new_sys_guid;
+		}
+
+		/* do a sweep if we received a trap */
+		if (sm->p_subn->opt.sweep_on_trap) {
+			/* if this is trap number 128 or run_heavy_sweep is
+			   TRUE - update the force_heavy_sweep flag of the
+			   subnet. Also, sweep on certain types of trap 144.
+			   TODO: In the future this can be changed to just
+			   getting PortInfo on this port instead of sweeping
+			   the entire subnet. */
+		    	if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 ||
+			    is_trap144_sweep || run_heavy_sweep) {
+				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
+					"Forcing heavy sweep. Received trap:%u\n",
+					cl_ntoh16(p_ntci->g_or_v.generic.trap_num));
 
-			sm->p_subn->force_heavy_sweep = TRUE;
+				sm->p_subn->force_heavy_sweep = TRUE;
+			}
+			osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
 		}
+	} else if (sm->p_subn->opt.sweep_on_trap)
 		osm_sm_signal(sm, OSM_SIGNAL_SWEEP);
-	}
 
 	/* If we reached here due to trap 129/130/131 - do not need to do
 	   the notice report. Just goto exit. We know this is the case
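
As a reading aid, the sweep decision made by this patch can be written as one pure function. This is a simplified model of the diff as posted, not OpenSM code: the DEMO_* flag values are placeholders (the real masks are the TRAP_144_MASK_* constants in the OpenSM headers), byte-order conversion is left out, and demo_sweep_for_trap is an invented name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values for illustration only. */
#define DEMO_144_LINK_WIDTH_ENABLE_CHANGE	0x1u
#define DEMO_144_LINK_SPEED_ENABLE_CHANGE	0x2u
#define DEMO_144_SM_PRIORITY_CHANGE		0x4u
#define DEMO_144_NODE_DESCRIPTION_CHANGE	0x8u

enum demo_sweep { DEMO_SWEEP_NONE, DEMO_SWEEP_LIGHT, DEMO_SWEEP_HEAVY };

/*
 * Which sweep results from a trap:
 *  - sweep_on_trap off: nothing is signalled
 *  - non generic trap: sweep signalled, heavy sweep not forced
 *  - generic trap 128, trap 144 with a width/speed/SM priority change,
 *    or run_heavy_sweep set: heavy sweep forced
 *  - any other generic trap (145, 144 with only a NodeDescription
 *    change, ...): sweep signalled, heavy sweep not forced
 */
static enum demo_sweep demo_sweep_for_trap(bool sweep_on_trap, bool is_generic,
					   uint16_t trap_num,
					   uint32_t change_flags,
					   bool run_heavy_sweep)
{
	const uint32_t sweep_flags = DEMO_144_LINK_WIDTH_ENABLE_CHANGE |
				     DEMO_144_LINK_SPEED_ENABLE_CHANGE |
				     DEMO_144_SM_PRIORITY_CHANGE;
	bool is_trap144_sweep;

	if (!sweep_on_trap)
		return DEMO_SWEEP_NONE;
	if (!is_generic)
		return DEMO_SWEEP_LIGHT;

	is_trap144_sweep = trap_num == 144 && (change_flags & sweep_flags) != 0;
	if (trap_num == 128 || is_trap144_sweep || run_heavy_sweep)
		return DEMO_SWEEP_HEAVY;
	return DEMO_SWEEP_LIGHT;
}
```

Note that in this reading a sweep is still signalled for trap 145 and for NodeDescription-only trap 144; what the patch removes for those cases is the forced heavy sweep.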


From nashwath at gmail.com  Wed Aug  5 17:03:04 2009
From: nashwath at gmail.com (Ashwath Narasimhan)
Date: Wed, 5 Aug 2009 20:03:04 -0400
Subject: [ofa-general] Setting the rate in Infiniband.
Message-ID: <ed1288770908051703i4289654cs23c4e3118bba41a4@mail.gmail.com>

Hello,
Problem:--
I am trying to set the "Rate" and "MTU" in infiniband using the config file
(opensm QoS) but I realize that I cannot set Rates like 500Mbps or 100Mbps.
The infiniband rates start from 2.5Gbps. (IBT specification vol 1 -> page
917, Table 207 "PathRecord" )

Background:--
The reason why I need such small rates is because I interface the Infiniband
HCA to an FPGA via an Infiniband physical link.  Imagine the FPGA as a
simple repeater that simply forwards the infiniband signals to the Target
HCA. The FPGA cannot handle such a high data rate and neither do I have as
much memory as required to buffer it on the FPGA (I might drop packets if
the buffer becomes full). Hence I wish to limit the rate to say 100Mbps
instead of 2.5Gbps.

Question:-
How can I set rates less than 2.5Gbps? Can this be changed at all?

regards,
Ashwath

From rdreier at cisco.com  Wed Aug  5 17:20:11 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 17:20:11 -0700
Subject: [ofa-general] Setting the rate in Infiniband.
In-Reply-To: <ed1288770908051703i4289654cs23c4e3118bba41a4@mail.gmail.com>
	(Ashwath Narasimhan's message of "Wed, 5 Aug 2009 20:03:04 -0400")
References: <ed1288770908051703i4289654cs23c4e3118bba41a4@mail.gmail.com>
Message-ID: <adaskg5hjyc.fsf@cisco.com>


 > The reason why I need such small rates is because I interface the Infiniband
 > HCA to an FPGA via an Infiniband physical link.  Imagine the FPGA as a
 > simple repeater that simply forwards the infiniband signals to the Target
 > HCA. The FPGA cannot handle such a high data rate and neither do I have as
 > much memory as required to buffer it on the FPGA (I might drop packets if
 > the buffer becomes full). Hence I wish to limit the rate to say 100Mbps
 > instead of 2.5Gbps.
 > 
 > Question:-
 > How can I set rates less than 2.5Gbps? Can this be changed at all?

The IB physical layer does not define any signaling slower than 2.5Gbps,
ie 1 lane of single data rate.  There is nothing slower than any real IB
HCA or switch will be able to do.  Therefore for your FPGA to be able to
talk IB at all, you will need to be able to handle a 1X SDR link (and do
8b/10b encoding, etc).  Note that the 8b/10b encoding means there is
really only 2 Gbps of data on a 1X SDR link.

However, it is OK (in theory) for your FPGA to handle only the minimal
IB MTU (256 bytes), and it is also OK for your FPGA to give only a small
number of link-layer credits (whatever it has buffering for).  This
should limit the data you have to buffer to what you can handle, and
lets you throttle the traffic at the link level.  You will still need to
be able to handle link packets, idles, etc at the full data rate.

With that said, I would expect any FPGA fancy enough to have a SERDES
capable of doing IB signaling to be able to handle 2 Gbps of real traffic,
since I've seen designs doing fairly complex processing using Virtex II
(ie 5+ year old FPGAs) able to handle full 4X SDR IB links.  I guess it
depends on the sophistication of your RTL.

 - R.
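
The 8b/10b arithmetic above, worked as a small helper. The function name and the Mbit/s units are choices made for this sketch; the only inputs taken from the message are the 2.5 Gbps SDR lane rate and the 8-data-bits-per-10-line-bits ratio:

```c
#include <assert.h>

#define IB_SDR_LANE_MBPS 2500	/* signaling rate of one SDR lane, Mbit/s */

/*
 * Data rate left after 8b/10b encoding: every 10 bits on the wire
 * carry 8 bits of data.
 */
static unsigned int ib_data_rate_mbps(unsigned int lanes,
				      unsigned int lane_mbps)
{
	assert(lanes > 0);
	return lanes * lane_mbps / 10 * 8;
}
```

ib_data_rate_mbps(1, IB_SDR_LANE_MBPS) gives 2000, the 2 Gbps figure for a 1X SDR link, and the same call with 4 lanes gives 8000 for 4X SDR.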


From bart.vanassche at gmail.com  Thu Aug  6 00:39:45 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 09:39:45 +0200
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
Message-ID: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>

On Wed, Aug 5, 2009 at 10:37 PM, James
Bottomley<James.Bottomley at hansenpartnership.com> wrote:
> On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote:
>> On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier<rdreier at cisco.com> wrote:
>> >
>> >  > The NULL pointer dereference happens when srp_reset_device() calls
>> >  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>> >  > req->scmnd->device == NULL. When the sg_reset command issues an
>> >  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>> >  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>> >  > structure that is passed to srp_reset_device(). What I'm not sure
>> >  > about is whether scsi_reset_provider() should set req->scmnd->device
>> >  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>> >  > handle the condition req->scmnd->device == NULL.
>> >
>> > Well, I don't see how the reset ioctl can do anything useful unless it
>> > passes a device in with the scsi command -- otherwise for example
>> > srp_reset_device() has no idea what LUN to try and reset.
>>
>> (added linux-scsi in CC)
>>
>> I hope one of the SCSI people can tell us whether the behavior that
>> scsi_reset_provider()
>> passes the value NULL in req->scmnd->device to
>> scsi_try_bus_device_reset() is correct ?
>
> Need more information.
>
> cmd->device is supposed to be initialised in scsi_get_command(), which
> scsi_reset_provider() calls ... why do you think it got set to null?

This thread started with the observation that it is easy to trigger a
NULL pointer dereference in the SRP initiator
(http://bugzilla.kernel.org/show_bug.cgi?id=13893). The following
sequence is sufficient:
* Remove the ib_srp kernel module (doing so closes all active SRP sessions).
* Insert the ib_srp kernel module.
* Create a new SRP connection.
* Issue the sg_reset -d ${srp_device} command in a shell.
The sg_reset command issues an SG_SCSI_RESET ioctl. This ioctl is
processed by invoking scsi_reset_provider(), which in turn invokes
the eh_device_reset_handler method of the SRP initiator. Further
analysis showed that scsi_reset_provider() passes a non-NULL
cmd->device pointer to the SRP initiator, but that the SRP initiator
does not use this value. Instead srp_find_req() looks up a struct
srp_request pointer based on the struct scsi_cmnd * argument and
continues with the struct scsi_cmnd pointer contained in the struct
srp_request.

While I'm not sure that the patch below makes any sense, it makes the
NULL pointer dereference disappear. This made me wonder which
assumptions srp_find_req() is based on ?

--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c  2009-08-03 12:13:11.000000000 +0200
+++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c  2009-08-06 08:50:30.000000000 +0200
@@ -1325,16 +1325,19 @@ static int srp_cm_handler(struct ib_cm_i
 }

 static int srp_send_tsk_mgmt(struct srp_target_port *target,
+                            struct scsi_cmnd *scmnd,
                             struct srp_request *req, u8 func)
 {
        struct srp_iu *iu;
        struct srp_tsk_mgmt *tsk_mgmt;

+       BUG_ON(!scmnd->device);
+
        spin_lock_irq(target->scsi_host->host_lock);

        if (target->state == SRP_TARGET_DEAD ||
            target->state == SRP_TARGET_REMOVED) {
-               req->scmnd->result = DID_BAD_TARGET << 16;
+               scmnd->result = DID_BAD_TARGET << 16;
                goto out;
        }

@@ -1348,7 +1351,7 @@ static int srp_send_tsk_mgmt(struct srp_
        memset(tsk_mgmt, 0, sizeof *tsk_mgmt);

        tsk_mgmt->opcode        = SRP_TSK_MGMT;
-       tsk_mgmt->lun           = cpu_to_be64((u64) req->scmnd->device->lun << 48);
+       tsk_mgmt->lun           = cpu_to_be64((u64) scmnd->device->lun << 48);
        tsk_mgmt->tag           = req->index | SRP_TAG_TSK_MGMT;
        tsk_mgmt->tsk_mgmt_func = func;
        tsk_mgmt->task_tag      = req->index;
@@ -1395,7 +1398,7 @@ static int srp_abort(struct scsi_cmnd *s
                return FAILED;
        if (srp_find_req(target, scmnd, &req))
                return FAILED;
-       if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK))
+       if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_ABORT_TASK))
                return FAILED;

        spin_lock_irq(target->scsi_host->host_lock);
@@ -1425,7 +1428,9 @@ static int srp_reset_device(struct scsi_
                return FAILED;
        if (srp_find_req(target, scmnd, &req))
                return FAILED;
-       if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET))
+       if (WARN_ON(!scmnd->device))
+               return FAILED;
+       if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_LUN_RESET))
                return FAILED;
        if (req->tsk_status)
                return FAILED;

Bart.
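
For readers unfamiliar with the tsk_mgmt->lun line in the patch: cpu_to_be64((u64) lun << 48) places the LUN in the first two bytes of the 8-byte field as it appears on the wire. A userspace sketch of that encoding follows; demo_put_be64 and demo_encode_lun are illustrative helpers, not kernel functions, and the LUN is assumed to fit in 16 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order, as cpu_to_be64() would. */
static void demo_put_be64(uint64_t v, uint8_t out[8])
{
	int i;

	assert(out);
	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Encode a LUN (assumed to fit in 16 bits) into the 8-byte LUN field. */
static void demo_encode_lun(unsigned int lun, uint8_t out[8])
{
	assert(lun <= 0xffff);
	demo_put_be64((uint64_t)lun << 48, out);
}
```

This is why the handler needs a valid device pointer at that point: the LUN field of the task management request is taken from device->lun.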


From bart.vanassche at gmail.com  Thu Aug  6 02:58:50 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 11:58:50 +0200
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion 
	dependency detected
In-Reply-To: <adatz0mi03d.fsf@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
Message-ID: <e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>

On Wed, Aug 5, 2009 at 8:31 PM, Roland Dreier<rdreier at cisco.com> wrote:
> So I queued up the patch below for 2.6.32... this is almost the same as
> the patch I proposed before except that I fixed two places where I
> dropped the lock *after* calling ipoib_send() -- which missed the whole
> point of what I was trying to do.  So this patch has a much better
> chance of actually working!

After having applied this patch it took somewhat longer before a
locking inversion report was generated, but unfortunately there still
was a locking inversion report generated (see also
http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details):

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.30.4-scst-debug #1
---------------------------------------------------------
swapper/0 just changed the state of lock:
 (&priv->lock){-.-...}, at: [<ffffffffa050cc8f>]
ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib]
but this lock took another, HARDIRQ-unsafe lock in the past:
 (&(&mad_agent_priv->timed_work)->timer){+.-...}

and interrupts could create inverse lock ordering between them.

[ ... ]

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.30.4-scst-debug #1
Call Trace:
 <IRQ>  [<ffffffff80272bec>] print_irq_inversion_bug+0x14c/0x1c0
 [<ffffffff80272cdd>] check_usage_forwards+0x7d/0xc0
 [<ffffffff80271faf>] mark_lock+0x20f/0x6a0
 [<ffffffff80272c60>] ? check_usage_forwards+0x0/0xc0
 [<ffffffff802743e4>] __lock_acquire+0xce4/0x1c80
 [<ffffffff80272c60>] ? check_usage_forwards+0x0/0xc0
 [<ffffffff80275488>] lock_acquire+0x108/0x150
 [<ffffffffa050cc8f>] ? ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib]
 [<ffffffff80515101>] _spin_lock_irqsave+0x41/0x60
 [<ffffffffa050cc8f>] ? ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib]
 [<ffffffffa050cc8f>] ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib]
 [<ffffffffa04e56aa>] mlx4_ib_qp_event+0x7a/0xf0 [mlx4_ib]
 [<ffffffffa0252d4f>] mlx4_qp_event+0x6f/0xe0 [mlx4_core]
 [<ffffffffa024a659>] mlx4_eq_int+0x289/0x2e0 [mlx4_core]
 [<ffffffffa024a73f>] mlx4_msi_x_interrupt+0xf/0x20 [mlx4_core]
 [<ffffffff8028bf35>] handle_IRQ_event+0x95/0x200
 [<ffffffff8028e3d8>] handle_edge_irq+0xc8/0x170
 [<ffffffff8020eeef>] handle_irq+0x1f/0x30
 [<ffffffff8020e5fe>] do_IRQ+0x6e/0xf0
 [<ffffffff8020c913>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffffa0012d9e>] ? acpi_idle_enter_bm+0x27d/0x2ad [processor]
 [<ffffffffa0012d94>] ? acpi_idle_enter_bm+0x273/0x2ad [processor]
 [<ffffffff8046eae5>] ? cpuidle_idle_call+0xa5/0x100
 [<ffffffff8020b144>] ? cpu_idle+0x64/0xd0
 [<ffffffff8050de61>] ? start_secondary+0x188/0x1e7

Bart.


From vlad at lists.openfabrics.org  Thu Aug  6 03:18:14 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu,  6 Aug 2009 03:18:14 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090806-0200 daily build status
Message-ID: <20090806101814.5BB66E300A1@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From vlad at dev.mellanox.co.il  Thu Aug  6 07:35:57 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 06 Aug 2009 17:35:57 +0300
Subject: [ofa-general] OFED-1.4.2 GA is available
Message-ID: <4A7AEA4D.6050103@dev.mellanox.co.il>

I am pleased to announce that the OFED-1.4.2 GA release is available.

The tarball is available on:
http://www.openfabrics.org/downloads/OFED/ofed-1.4.2/OFED-1.4.2.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/  for
OFED 1.4.2

Vladimir & Tziporet

========================================================================


Release information:
------------------------------
Linux Operating Systems:
      - RedHat EL4 up4:  2.6.9-42.ELsmp      *
      - RedHat EL4 up5:  2.6.9-55.ELsmp
      - RedHat EL4 up6:  2.6.9-67.ELsmp
      - RedHat EL4 up7:  2.6.9-78.ELsmp
      - RedHat EL5:      2.6.18-8.el5
      - RedHat EL5 up1:  2.6.18-53.el5
      - RedHat EL5 up2:  2.6.18-92.el5
      - RedHat EL5 up3:  2.6.18-128.el5
      - OEL 4.5:         2.6.9-55.ELsmp
      - OEL 5.2:         2.6.18-92.el5
      - CentOS 5.2:      2.6.18-92.el5
      - Fedora C9:       2.6.25-14.fc9          *
      - SLES10:          2.6.16.21-0.8-smp
      - SLES10 SP1:      2.6.16.46-0.12-smp
      - SLES10 SP1 up1:  2.6.16.53-0.16-smp
      - SLES10 SP2:      2.6.16.60-0.21-smp
      - SLES11 GA:       2.6.27.13-1-default
      - OpenSuSE 10.3:   2.6.22.5-31             *
      - kernel.org:      2.6.26 and 2.6.27

    * Minimal QA for these versions

Systems:
      * x86_64
      * x86
      * ia64
      * ppc64

Main Changes from OFED-1.4.1
============================
- NFSRDMA
Fix NULL pointer dereference when calling locks_release_private due to the fl_lmops pointer never being set.
crypto_alloc_hash calls are failing due to larval returning unexpected results.  Reverting nfs4_make_rec_clidname to use crypto_alloc_tfm.
kref safety checks were removed in previous versions due to kref behaving differently in 2.6.18 (and older).  

- RDS
Refactor end of __conn_create for readability
Fix completion notifications on blocking sockets
fix for double-def of assert_spin_locked in RHEL4_U4/5/6/7
RDS/IW:
Remove dead code
Remove page_shift variable from iwarp transport
RDS/IB: 
Always use PAGE_SIZE for FMR page size

- MLX4
map sufficient ICM memory for EQs
Failing probe function if not primary physical function
Add new device ID 0x6764
Fix post send of local invalidate and fast registration packets.

- MLX4_EN
Fix vlan flag endianness in LRO code.

- NES
Make LRO a default feature
fix qp refcount during disconnect
backport for LRO as default feature

- SDP
Fix BUG1672 - Data integrity error
Fix memory leak in bzcopy
Fix bad credits advertised when connection initiated
Fix compilation on i386 with gcc 3.4


- BACKPORTS
2.6.16_sles10_sp2: fix clear-dirty-page accounting.

- Bug fixes

See each component release notes for details on enhancements and bug
fixes


From bart.vanassche at gmail.com  Thu Aug  6 08:38:18 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 17:38:18 +0200
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
In-Reply-To: <4A7A949B.60408@panasas.com>
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
	<4A7A949B.60408@panasas.com>
Message-ID: <e2e108260908060838u34b97ab4n6e2007dbb6937ff5@mail.gmail.com>

On Thu, Aug 6, 2009 at 10:30 AM, Boaz Harrosh <bharrosh at panasas.com> wrote:
> [Just from memory; I've not inspected the code for a long time]
>
> It looks like an srp_request was never allocated for the reset
> command. (since it never went through .queuecommand)
>
> static int srp_find_req(struct srp_target_port *target,
>                        struct scsi_cmnd *scmnd,
>                        struct srp_request **req)
> {
>        if (scmnd->host_scribble == (void *) -1L)
>                return -1;
>
>        *req = &target->req_ring[(long) scmnd->host_scribble];
>
>        return 0;
> }
>
> Specifically scmnd->host_scribble can just be Zero.
> When queues are active that does not matter and a device is found
> since the reset does not really need the scsi_cmnd. But in above
> scenario the queues were never used and the array entry is empty.

Hello Boaz,

Thanks for the info. Do you know by heart which SCSI drivers process
the SG_SCSI_RESET ioctl correctly and that could be used as an example
for fixing the SRP initiator ?

Bart.


From bart.vanassche at gmail.com  Thu Aug  6 09:43:56 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 18:43:56 +0200
Subject: [ofa-general] IB kernel modules and the kobject release() method
Message-ID: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>

Hello,

After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I
noticed that messages appeared in the kernel log about the IB modules
missing a kobject release() method. This happens both with a vanilla
2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone
noticed this before ?

An example of the messages logged in /var/log/messages:

...
kobject: 'ib_cm' (ffffffffa067a810): does not have a release()
function, it is broken and must be fixed.
kobject: 'iw_cm' (ffffffffa06a58d0): does not have a release()
function, it is broken and must be fixed.
...

See also https://bugs.openfabrics.org/show_bug.cgi?id=1702.

Bart.


From ssufficool at sbcounty.gov  Wed Aug  5 08:42:19 2009
From: ssufficool at sbcounty.gov (Sufficool, Stanley)
Date: Wed, 5 Aug 2009 08:42:19 -0700
Subject: [ofa-general] SRP and Multiple Port HCA
Message-ID: <C2F174F99918D54CA2A96E57C5079B6F015C447D@sbc-exmsg2.sbcounty.gov>

When I attempt to connect WinOF SRP to an OFED SRP Target, WinOF SRP
immediately disconnects after establishing the connection. This is
occurring on a 2 port HCA initiator to a 2 port HCA target with
redundant paths through 2 switches.
 
IIRC the past solution was to unplug the redundant path because WinOF
SRP does not have multi path DSM to handle the visibility of the disk on
multiple paths. Unfortunately, we have IPoIB apps that use the
redundancy, so this is not an option.
 
I can mask off the redundant path at the target using SCST group to name
assignments, but the naming of the SRP is by NODE not PORT so I do not
have a unique name per port to close one of the paths. 
 
The only fix I see for Windows SRP and multi path is to provide an
option to use port based naming on the SRP Target for establishing
sessions. Or is this something that can be taken care of at the WinOF
side with a registry entry or discover / connect tool?
 
Does anyone have other ideas that will work with our current config?
 

From James.Bottomley at HansenPartnership.com  Wed Aug  5 13:37:44 2009
From: James.Bottomley at HansenPartnership.com (James Bottomley)
Date: Wed, 05 Aug 2009 15:37:44 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference
 by SRP 
 initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To: <e2e108260908051054r11262096j3b659de24c820967@mail.gmail.com>
References: <e2e108260908030621q102437e0ua60aa5bdfacb2e7e@mail.gmail.com>
	<adafxc8ocst.fsf@cisco.com>
	<e2e108260908040907l6537c2dcveb64615a664a047e@mail.gmail.com>
	<adafxc7mtn8.fsf@cisco.com>
	<e2e108260908041125w730869c0s8d212e2765598c42@mail.gmail.com>
	<adaljlzl0ne.fsf@cisco.com>
	<e2e108260908050800p3a6613bbib95fa670248a863@mail.gmail.com>
	<adaiqh2jgug.fsf@cisco.com>
	<e2e108260908051054r11262096j3b659de24c820967@mail.gmail.com>
Message-ID: <1249504664.4183.45.camel@mulgrave.site>

On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote:
> On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier<rdreier at cisco.com> wrote:
> >
> >  > The NULL pointer dereference happens when srp_reset_device() calls
> >  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
> >  > req->scmnd->device == NULL. When the sg_reset command issues an
> >  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
> >  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
> >  > structure that is passed to srp_reset_device(). What I'm not sure
> >  > about is whether scsi_reset_provider() should set req->scmnd->device
> >  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
> >  > handle the condition req->scmnd->device == NULL.
> >
> > Well, I don't see how the reset ioctl can do anything useful unless it
> > passes a device in with the scsi command -- otherwise for example
> > srp_reset_device() has no idea what LUN to try and reset.
> 
> (added linux-scsi in CC)
> 
> I hope one of the SCSI people can tell us whether the behavior that
> scsi_reset_provider()
> passes the value NULL in req->scmnd->device to
> scsi_try_bus_device_reset() is correct ?

Need more information.

cmd->device is supposed to be initialised in scsi_get_command(), which
scsi_reset_provider() calls ... why do you think it got set to null?

James


From bharrosh at panasas.com  Thu Aug  6 01:30:19 2009
From: bharrosh at panasas.com (Boaz Harrosh)
Date: Thu, 06 Aug 2009 11:30:19 +0300
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
In-Reply-To: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
Message-ID: <4A7A949B.60408@panasas.com>

On 08/06/2009 10:39 AM, Bart Van Assche wrote:
> On Wed, Aug 5, 2009 at 10:37 PM, James
> Bottomley<James.Bottomley at hansenpartnership.com> wrote:
>> On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote:
>>> On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier<rdreier at cisco.com> wrote:
>>>>
>>>>  > The NULL pointer dereference happens when srp_reset_device() calls
>>>>  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>>>>  > req->scmnd->device == NULL. When the sg_reset command issues an
>>>>  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>>>>  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>>>>  > structure that is passed to srp_reset_device(). What I'm not sure
>>>>  > about is whether scsi_reset_provider() should set req->scmnd->device
>>>>  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>>>>  > handle the condition req->scmnd->device == NULL.
>>>>
>>>> Well, I don't see how the reset ioctl can do anything useful unless it
>>>> passes a device in with the scsi command -- otherwise for example
>>>> srp_reset_device() has no idea what LUN to try and reset.
>>>
>>> (added linux-scsi in CC)
>>>
>>> I hope one of the SCSI people can tell us whether the behavior that
>>> scsi_reset_provider()
>>> passes the value NULL in req->scmnd->device to
>>> scsi_try_bus_device_reset() is correct ?
>>
>> Need more information.
>>
>> cmd->device is supposed to be initialised in scsi_get_command(), which
>> scsi_reset_provider() calls ... why do you think it got set to null?
> 
> This thread started with the observation that it is easy to trigger a
> NULL pointer dereference in the SRP initiator
> (http://bugzilla.kernel.org/show_bug.cgi?id=13893). The following
> sequence is sufficient:
> * Remove the ib_srp kernel module (doing so closes all active SRP sessions).
> * Insert the ib_srp kernel module.
> * Create a new SRP connection.
> * Issue the sg_reset -d ${srp_device} command in a shell.
> The sg_reset command issues an SG_SCSI_RESET ioctl. This ioctl is
> processed by invoking scsi_reset_provider(), which in turn invokes
> the eh_device_reset_handler method of the SRP initiator. Further
> analysis showed that scsi_reset_provider() passes a non-NULL
> cmd->device pointer to the SRP initiator, but that the SRP initiator
> does not use this value. Instead srp_find_req() looks up a struct
> srp_request pointer based on the struct scsi_cmnd * argument and
> continues with the struct scsi_cmnd pointer contained in the struct
> srp_request.
> 
> While I'm not sure that the patch below makes any sense, it makes the
> NULL pointer dereference disappear. This made me wonder which
> assumptions srp_find_req() is based on ?
> 

[Just from memory; I've not inspected the code for a long time]

It looks like an srp_request was never allocated for the reset
command. (since it never went through .queuecommand)

static int srp_find_req(struct srp_target_port *target,
			struct scsi_cmnd *scmnd,
			struct srp_request **req)
{
	if (scmnd->host_scribble == (void *) -1L)
		return -1;

	*req = &target->req_ring[(long) scmnd->host_scribble];

	return 0;
}

Specifically scmnd->host_scribble can just be Zero.
When queues are active that does not matter and a device is found
since the reset does not really need the scsi_cmnd. But in above
scenario the queues were never used and the array entry is empty.
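
The ambiguity described above can be reproduced in user space. This is a
minimal sketch, assuming simplified stand-in structures rather than the real
kernel definitions: the ring size, the reduced struct layouts and the helper
reset_cmd_hits_empty_slot() are all hypothetical. Only srp_find_req() follows
the code quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
#define REQ_RING_SIZE 4                 /* hypothetical ring size */

struct scsi_cmnd {
	void *host_scribble;            /* driver-private scratch field */
};

struct srp_request {
	struct scsi_cmnd *scmnd;        /* NULL while the slot is unused */
};

struct srp_target_port {
	struct srp_request req_ring[REQ_RING_SIZE];
};

/* Same logic as the srp_find_req() quoted above. */
static int srp_find_req(struct srp_target_port *target,
			struct scsi_cmnd *scmnd,
			struct srp_request **req)
{
	if (scmnd->host_scribble == (void *) -1L)
		return -1;

	*req = &target->req_ring[(long) scmnd->host_scribble];

	return 0;
}

/*
 * Hypothetical helper: returns 1 when a zero-initialised reset command is
 * mapped to ring slot 0 and that slot has no command attached.
 */
static int reset_cmd_hits_empty_slot(void)
{
	struct srp_target_port target = { 0 };  /* fresh target, ring unused */
	struct scsi_cmnd reset_cmd = { 0 };     /* never went through queuecommand */
	struct srp_request *req = NULL;

	assert(reset_cmd.host_scribble == NULL);

	if (srp_find_req(&target, &reset_cmd, &req))
		return 0;                       /* lookup refused the command */

	return req == &target.req_ring[0] && req->scmnd == NULL;
}
```

In this model the lookup succeeds for the reset command and hands back slot 0
with no command attached, so any later req->scmnd->device access would
dereference NULL.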

Boaz

> --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c  2009-08-03 12:13:11.000000000 +0200
> +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c  2009-08-06 08:50:30.000000000 +0200
> @@ -1325,16 +1325,19 @@ static int srp_cm_handler(struct ib_cm_i
>  }
> 
>  static int srp_send_tsk_mgmt(struct srp_target_port *target,
> +                            struct scsi_cmnd *scmnd,
>                              struct srp_request *req, u8 func)
>  {
>         struct srp_iu *iu;
>         struct srp_tsk_mgmt *tsk_mgmt;
> 
> +       BUG_ON(!scmnd->device);
> +
>         spin_lock_irq(target->scsi_host->host_lock);
> 
>         if (target->state == SRP_TARGET_DEAD ||
>             target->state == SRP_TARGET_REMOVED) {
> -               req->scmnd->result = DID_BAD_TARGET << 16;
> +               scmnd->result = DID_BAD_TARGET << 16;
>                 goto out;
>         }
> 
> @@ -1348,7 +1351,7 @@ static int srp_send_tsk_mgmt(struct srp_
>         memset(tsk_mgmt, 0, sizeof *tsk_mgmt);
> 
>         tsk_mgmt->opcode        = SRP_TSK_MGMT;
> -       tsk_mgmt->lun           = cpu_to_be64((u64) req->scmnd->device->lun << 48);
> +       tsk_mgmt->lun           = cpu_to_be64((u64) scmnd->device->lun << 48);
>         tsk_mgmt->tag           = req->index | SRP_TAG_TSK_MGMT;
>         tsk_mgmt->tsk_mgmt_func = func;
>         tsk_mgmt->task_tag      = req->index;
> @@ -1395,7 +1398,7 @@ static int srp_abort(struct scsi_cmnd *s
>                 return FAILED;
>         if (srp_find_req(target, scmnd, &req))
>                 return FAILED;
> -       if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK))
> +       if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_ABORT_TASK))
>                 return FAILED;
> 
>         spin_lock_irq(target->scsi_host->host_lock);
> @@ -1425,7 +1428,9 @@ static int srp_reset_device(struct scsi_
>                 return FAILED;
>         if (srp_find_req(target, scmnd, &req))
>                 return FAILED;
> -       if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET))
> +       if (WARN_ON(!scmnd->device))
> +               return FAILED;
> +       if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_LUN_RESET))
>                 return FAILED;
>         if (req->tsk_status)
>                 return FAILED;
> 
> Bart.


From James.Bottomley at HansenPartnership.com  Thu Aug  6 08:43:45 2009
From: James.Bottomley at HansenPartnership.com (James Bottomley)
Date: Thu, 06 Aug 2009 15:43:45 +0000
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
 the SRP initiator
In-Reply-To: <e2e108260908060838u34b97ab4n6e2007dbb6937ff5@mail.gmail.com>
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
	<4A7A949B.60408@panasas.com>
	<e2e108260908060838u34b97ab4n6e2007dbb6937ff5@mail.gmail.com>
Message-ID: <1249573425.7073.16.camel@mulgrave.site>

On Thu, 2009-08-06 at 17:38 +0200, Bart Van Assche wrote:
> On Thu, Aug 6, 2009 at 10:30 AM, Boaz Harrosh <bharrosh at panasas.com> wrote:
> > [Just from memory; I've not inspected the code for a long time]
> >
> > It looks like an srp_request was never allocated for the reset
> > command. (since it never went through .queuecommand)
> >
> > static int srp_find_req(struct srp_target_port *target,
> >                        struct scsi_cmnd *scmnd,
> >                        struct srp_request **req)
> > {
> >        if (scmnd->host_scribble == (void *) -1L)
> >                return -1;
> >
> >        *req = &target->req_ring[(long) scmnd->host_scribble];
> >
> >        return 0;
> > }
> >
> > Specifically scmnd->host_scribble can just be Zero.
> > When queues are active that does not matter and a device is found
> > since the reset does not really need the scsi_cmnd. But in above
> > scenario the queues were never used and the array entry is empty.
> 
> Hello Boaz,
> 
> Thanks for the info. Do you know by heart which SCSI drivers process
> the SG_SCSI_RESET ioctl correctly and that could be used as an example
> for fixing the SRP initiator ?

Basically all of those in regular use for clustering: for SAN, qla2xxx
and lpfc; for legacy SPI clusters, aic7xxx and mptspi.

James


From eli at dev.mellanox.co.il  Thu Aug  6 10:18:40 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 6 Aug 2009 20:18:40 +0300
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow
	interfaces to correspond to each other
In-Reply-To: <20090805204259.GB16677@obsidianresearch.com>
References: <20090805083023.GK5599@mtls03>
	<20090805204259.GB16677@obsidianresearch.com>
Message-ID: <20090806171840.GA32301@mtls03>

On Wed, Aug 05, 2009 at 02:42:59PM -0600, Jason Gunthorpe wrote:
> 
> What about multicast though? Switches are going to have trouble with
> group membership lists for non IP packets.. Even just sending a ICMPv6
> packet (with an IPv6 ethertype) isn't guaranteed to fix it.
> 

In this patch set, all multicast packets use the broadcast mac. We
will address this issue at a future time.


From eli at dev.mellanox.co.il  Thu Aug  6 10:20:35 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 6 Aug 2009 20:20:35 +0300
Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device
	personality from node type to port type
In-Reply-To: <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com>
References: <20090805082808.GB5599@mtls03>
	<73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com>
Message-ID: <20090806172035.GB32301@mtls03>

On Wed, Aug 05, 2009 at 01:43:12PM -0700, Sean Hefty wrote:
> 
> Can resources (PDs, CQs, MRs, etc.) between the different transports be shared?
> Does QP failover between transports work?

There is nothing in the architecture that precludes this; we are not
currently focusing on this.

> 
> Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask
> of the supported transports available on the device?  I'm wondering if something
> like this makes sense, to allow skipping devices that are not of interest to a
> particular module.  This would be in addition to the rdma_port_get_transport
> call.
> 
> There's just a lot of new checks to handle the transport on a port by port
> basis.
> 

We can use a function: rdma_is_transport_supported(ibdev, transport),
which will return true if at least one port runs the given transport.
Thus, as long as we have only a few transports, these checks will
amount to 1-2 lines of code in each module.
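
A rough user-space sketch of such a helper, under stated assumptions: the
struct ib_device below is a mock that only records a port count and a
per-port transport, rdma_port_get_transport() is a stand-in for the per-port
query from this patch set, and RDMA_TRANSPORT_RDMAOE is an assumed name for
the new transport value. mixed_device_demo() is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum rdma_transport_type {
	RDMA_TRANSPORT_IB,
	RDMA_TRANSPORT_IWARP,
	RDMA_TRANSPORT_RDMAOE           /* assumed name for the new transport */
};

#define MAX_PORTS 2                     /* hypothetical limit for the mock */

/* Mock device: only what the helper needs, not the real struct ib_device. */
struct ib_device {
	int phys_port_cnt;
	enum rdma_transport_type port_transport[MAX_PORTS];
};

/* Mock of the per-port query; ports are numbered starting at 1. */
static enum rdma_transport_type
rdma_port_get_transport(const struct ib_device *dev, int port_num)
{
	assert(port_num >= 1 && port_num <= dev->phys_port_cnt);
	return dev->port_transport[port_num - 1];
}

/* True if at least one port of the device runs the given transport. */
static bool rdma_is_transport_supported(const struct ib_device *dev,
					enum rdma_transport_type transport)
{
	for (int port = 1; port <= dev->phys_port_cnt; port++)
		if (rdma_port_get_transport(dev, port) == transport)
			return true;

	return false;
}

/* Illustrative check on a device with one IB port and one RDMAoE port. */
static int mixed_device_demo(void)
{
	struct ib_device dev = {
		.phys_port_cnt = 2,
		.port_transport = { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_RDMAOE },
	};

	return rdma_is_transport_supported(&dev, RDMA_TRANSPORT_IB) &&
	       rdma_is_transport_supported(&dev, RDMA_TRANSPORT_RDMAOE) &&
	       !rdma_is_transport_supported(&dev, RDMA_TRANSPORT_IWARP);
}
```

A module interested in a single transport could then skip a device with one
such call in its add_one path and still check individual ports where needed.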


From sean.hefty at intel.com  Thu Aug  6 10:34:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 10:34:19 -0700
Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device	personality
	from node type to port type
In-Reply-To: <20090806172035.GB32301@mtls03>
References: <20090805082808.GB5599@mtls03>	<73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com>
	<20090806172035.GB32301@mtls03>
Message-ID: <69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com>

>> Can resources (PDs, CQs, MRs, etc.) between the different transports be shared?
>> Does QP failover between transports work?
>
>There is nothing in the architecture that precludes this; we are not
>currently focusing on this.

Does the implementation allow this?  Right now PDs, CQs, etc are allocated per
device, not per port.  I'm not immediately concerned about QP failover.
However, I believe there needs to be some level of coordination between the
Infiniband side of the CM and the Ethernet side of the CM, since QPs are
associated with CA GUIDs.  I'm just trying to understand the impact of this
coordination.

- Sean


From rdreier at cisco.com  Thu Aug  6 10:37:19 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 10:37:19 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	(Bart Van Assche's message of "Thu, 6 Aug 2009 18:43:56 +0200")
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
Message-ID: <adad478hmi8.fsf@cisco.com>


 > 
 > After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I
 > noticed that messages appeared in the kernel log about the IB modules
 > missing a kobject release() method. This happens both with a vanilla
 > 2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone
 > noticed this before ?
 > 
 > An example of the messages logged in /var/log/messages:
 > 
 > ...
 > kobject: 'ib_cm' (ffffffffa067a810): does not have a release()
 > function, it is broken and must be fixed.

I don't see anything similar with CONFIG_DEBUG_KOBJECT enabled on
2.6.31-rc5 (without adding in any OFED confusion).

It seems as if you get this message for every module being loaded; do
you see it for any non-RDMA-related modules?  (Do you have any such
modules in your config?)  I can imagine the OFED build system messing
things up, but if you're just building the modules as part of the normal
kernel build (ie your vanilla 2.6.30 kernel) then I don't see anything
that would make ib_cm or iw_cm any different from any other module.

For example, if I load ib_cm on my kernel, the only log messages I see
from "dmesg|grep ib_cm" are:

    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_add_internal: parent: 'module', set: 'module'
    kobject: 'holders' (ffff88022c1c9df8): kobject_add_internal: parent: 'ib_cm', set: '<NULL>'
    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_uevent_env
    kobject: 'ib_cm' (ffffffffa024c8f0): fill_kobj_path: path = '/module/ib_cm'
    kobject: 'notes' (ffff88022c1c9be8): kobject_add_internal: parent: 'ib_cm', set: '<NULL>'

 - R.


From rdreier at cisco.com  Thu Aug  6 10:38:20 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 10:38:20 -0700
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow
	interfaces to correspond to each other
In-Reply-To: <20090806171840.GA32301@mtls03> (Eli Cohen's message of "Thu, 6
	Aug 2009 20:18:40 +0300")
References: <20090805083023.GK5599@mtls03>
	<20090805204259.GB16677@obsidianresearch.com>
	<20090806171840.GA32301@mtls03>
Message-ID: <ada8whwhmgj.fsf@cisco.com>


 > > What about multicast though? Switches are going to have trouble with
 > > group membership lists for non IP packets.. Even just sending a ICMPv6
 > > packet (with an IPv6 ethertype) isn't guaranteed to fix it.

 > In this patch set, all multicast packets use the broadcast mac. We
 > will address this issue at a future time.

I don't see how you can address it in the future -- if later on things
are changed to use multicast addresses, then systems running this code
will silently fail to receive multicasts.

 - R.


From rdreier at cisco.com  Thu Aug  6 10:41:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 10:41:03 -0700
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
In-Reply-To: <4A7A949B.60408@panasas.com> (Boaz Harrosh's message of "Thu, 06
	Aug 2009 11:30:19 +0300")
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
	<4A7A949B.60408@panasas.com>
Message-ID: <ada1vnohmc0.fsf@cisco.com>


 > Specifically scmnd->host_scribble can just be Zero.

I see at last, thanks!

The issue is that SRP is using host_scribble to hold an index, and index
0 is valid for us.

I guess the fix is a bit complex, but basically we should use
host_scribble to point to the request, and if we don't find a request in
reset_device we should allocate one.

It's a bit unfortunate that the SCSI midlayer bypasses queueing for the
device reset command because it means we may not have a slot in our
queue for the reset request etc but I suppose that's even more involved
to fix.
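
The idea can be sketched in user space as follows. This is only an
illustration under assumptions, not a patch: the structures are reduced
stand-ins, and find_req(), alloc_req(), get_req_for_reset() and
reset_path_demo() are hypothetical names. host_scribble holds a pointer to
the request, so NULL unambiguously means "no request yet".

```c
#include <assert.h>
#include <stddef.h>

#define REQ_RING_SIZE 4                 /* hypothetical ring size */

struct scsi_cmnd {
	void *host_scribble;            /* pointer to the owning request, or NULL */
};

struct srp_request {
	struct scsi_cmnd *scmnd;
	int in_use;
};

struct srp_target_port {
	struct srp_request req_ring[REQ_RING_SIZE];
};

/* With a pointer stored in host_scribble, NULL simply means "no request". */
static struct srp_request *find_req(struct scsi_cmnd *scmnd)
{
	return scmnd->host_scribble;
}

/* Hypothetical helper: claim a free slot, or return NULL if the ring is full. */
static struct srp_request *alloc_req(struct srp_target_port *target,
				     struct scsi_cmnd *scmnd)
{
	for (size_t i = 0; i < REQ_RING_SIZE; i++) {
		struct srp_request *req = &target->req_ring[i];

		if (!req->in_use) {
			req->in_use = 1;
			req->scmnd = scmnd;
			scmnd->host_scribble = req;
			return req;
		}
	}

	return NULL;                    /* no slot left for the reset request */
}

/* Reset path: reuse the command's request if it has one, else allocate. */
static struct srp_request *get_req_for_reset(struct srp_target_port *target,
					     struct scsi_cmnd *scmnd)
{
	struct srp_request *req = find_req(scmnd);

	if (!req)
		req = alloc_req(target, scmnd);

	assert(!req || req->scmnd == scmnd);
	return req;
}

/* Returns 1 if a fresh reset command gets one request and keeps it. */
static int reset_path_demo(void)
{
	struct srp_target_port target = { 0 };
	struct scsi_cmnd reset_cmd = { 0 };
	struct srp_request *first, *second;

	first = get_req_for_reset(&target, &reset_cmd);
	second = get_req_for_reset(&target, &reset_cmd);

	return first != NULL && first == second && first->scmnd == &reset_cmd;
}
```

The NULL return from alloc_req() corresponds to the "no slot in our queue"
case; a real driver would also need locking and a way to release the slot.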

 - R.


From sean.hefty at intel.com  Thu Aug  6 10:52:34 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 10:52:34 -0700
Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1
In-Reply-To: <20090805082854.GD5599@mtls03>
References: <20090805082854.GD5599@mtls03>
Message-ID: <9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com>

>diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>index 7b737c4..de83c71 100644
>--- a/drivers/infiniband/core/mad.c
>+++ b/drivers/infiniband/core/mad.c
>@@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
> 	unsigned long flags;
> 	u8 mgmt_class, vclass;
>
>+	/* Validate device and port */
>+	port_priv = ib_get_mad_port(device, port_num);
>+	if (!port_priv) {
>+		ret = ERR_PTR(-ENODEV);
>+		goto error1;
>+	}
>+
>+	if (!port_priv->qp_info[qp_type].qp)
>+		return NULL;

It seems odd that the first if has 'goto error1', but the second if simply
returns NULL. 

>+
> 	/* Validate parameters */
> 	qpn = get_spl_qp_index(qp_type);
> 	if (qpn == -1)
>@@ -260,13 +270,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
> 			goto error1;
> 	}
>
>-	/* Validate device and port */
>-	port_priv = ib_get_mad_port(device, port_num);
>-	if (!port_priv) {
>-		ret = ERR_PTR(-ENODEV);
>-		goto error1;
>-	}
>-
> 	/* Allocate structures */
> 	mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL);
> 	if (!mad_agent_priv) {
>@@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent)
> 	struct ib_mad_agent_private *mad_agent_priv;
> 	struct ib_mad_snoop_private *mad_snoop_priv;
>
>+	if (!mad_agent)
>+		return 0;

Why would a kernel client call ib_unregister_mad_agent with a NULL pointer?

>+
> 	/* If the TID is zero, the agent can only snoop. */
> 	if (mad_agent->hi_tid) {
> 		mad_agent_priv = container_of(mad_agent,
>@@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info)
> 	struct ib_mad_private *recv;
> 	struct ib_mad_list_head *mad_list;
>
>+	if (!qp_info->qp)
>+		return;
>+
> 	while (!list_empty(&qp_info->recv_queue.list)) {
>
> 		mad_list = list_entry(qp_info->recv_queue.list.next,
>@@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private
>*port_priv)
>
> 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
> 		qp = port_priv->qp_info[i].qp;
>+		if (!qp)
>+			continue;
>+
> 		/*
> 		 * PKey index for QP1 is irrelevant but
> 		 * one is needed for the Reset to Init transition
>@@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private
>*port_priv)
> 	}
>
> 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
>+		if (!port_priv->qp_info[i].qp)
>+			continue;
>+
> 		ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL);
> 		if (ret) {
> 			printk(KERN_ERR PFX "Couldn't post receive WRs\n");
>@@ -2762,6 +2777,9 @@ error:
>
> static void destroy_mad_qp(struct ib_mad_qp_info *qp_info)
> {
>+	if (!qp_info->qp)
>+		return;
>+
> 	ib_destroy_qp(qp_info->qp);
> 	kfree(qp_info->snoop_table);
> }
>@@ -2777,6 +2795,7 @@ static int ib_mad_port_open(struct ib_device *device,
> 	struct ib_mad_port_private *port_priv;
> 	unsigned long flags;
> 	char name[sizeof "ib_mad123"];
>+	int has_smi;
>
> 	/* Create new device info */
> 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
>@@ -2793,6 +2812,10 @@ static int ib_mad_port_open(struct ib_device *device,
> 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
>
> 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
>+	has_smi = rdma_port_get_transport(device, port_num) ==
>RDMA_TRANSPORT_IB;
>+	if (has_smi)
>+		cq_size *= 2;

cq_size is doubled twice
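
One way to read the remark above: the existing line already multiplies
(IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) by 2, so the added
"cq_size *= 2" gives four times the per-QP size on IB ports. A minimal
sketch of sizing the CQ by the number of MAD QPs actually created is
shown here. mad_cq_size() is a hypothetical helper and not part of mad.c,
and that the original factor of 2 stands for the two special QPs is an
assumption.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in mad.c): size the shared CQ from the
 * per-QP send and receive queue sizes and the number of MAD QPs.
 * Assumption: a port with an SMI has QP0 and QP1, a port without
 * one has QP1 only.
 */
static int mad_cq_size(int send_size, int recv_size, int has_smi)
{
	int num_qps = has_smi ? 2 : 1;

	assert(send_size > 0 && recv_size > 0);
	return (send_size + recv_size) * num_qps;
}
```

With this shape the multiplier appears exactly once, so an IB port gets
twice the per-QP size and an RDMAoE port gets the per-QP size.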

I really wish there were a cleaner way to add this support that didn't involve
adding so many checks throughout the code.  It's hard to know if checks were
added in all the places that were needed.  I can't think of a clever way to
handle QP 0.

- Sean


From rdreier at cisco.com  Thu Aug  6 10:56:26 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 10:56:26 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion
	dependency detected
In-Reply-To: <e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	(Bart Van Assche's message of "Thu, 6 Aug 2009 11:58:50 +0200")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
Message-ID: <adaws5gg71x.fsf@cisco.com>


 > After having applied this patch it took somewhat longer before a
 > locking inversion report was generated, but unfortunately there still
 > was a locking inversion report generated (see also
 > http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details):

ummm, yikes...

can you apply the hack patch I sent originally to take priv->lock from
an interrupt ASAP and try that along with the fix patch to drop
priv->lock before calling ipoib_send()?  That might make the lockdep
trace understandable.

 - R.


From sean.hefty at intel.com  Thu Aug  6 11:05:47 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 11:05:47 -0700
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for
	RDMAoE	ports
In-Reply-To: <20090805082910.GE5599@mtls03>
References: <20090805082910.GE5599@mtls03>
Message-ID: <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com>

>Initialize umad context for devices that have any of their ports either IB or
>RDMAoE so as to allow user space apps to send and receive MADs on QP1.

Is there a need to expose QP1 to user space?  The CM is in the kernel, and
there's not an SA.

- Sean


From sean.hefty at intel.com  Thu Aug  6 11:12:49 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 11:12:49 -0700
Subject: [ofa-general] [PATCHv4 05/10] ib/cm: Enable CM support for RDMAoE
In-Reply-To: <20090805082919.GF5599@mtls03>
References: <20090805082919.GF5599@mtls03>
Message-ID: <397FD7F95179400BB728A5807176EF61@amr.corp.intel.com>

>diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>index f930f1d..63d6de3 100644
>--- a/drivers/infiniband/core/cm.c
>+++ b/drivers/infiniband/core/cm.c
>@@ -3699,7 +3699,7 @@ static void cm_add_one(struct ib_device *ib_device)
> 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
> 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
> 		tt = rdma_port_get_transport(ib_device, i);
>-		if (tt != RDMA_TRANSPORT_IB)
>+		if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE)
> 			continue;
>
> 		port = kzalloc(sizeof *port, GFP_KERNEL);
>diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
>index 4f5096d..21c78f5 100644
>--- a/drivers/infiniband/core/ucm.c
>+++ b/drivers/infiniband/core/ucm.c
>@@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device)
> {
> 	struct ib_ucm_device *ucm_dev;
> 	int i;
>+	enum rdma_transport_type tt;
>
> 	if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH)
> 		return;
>
>-	for (i = 1; i <= device->phys_port_cnt; ++i)
>-		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
>-			return;
>+	for (i = 1; i <= device->phys_port_cnt; ++i) {
>+		tt = rdma_port_get_transport(device, i);
>+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
>+			break;
>+	}
>+
>+	if (i > device->phys_port_cnt)
>+		return;
>
> 	ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
> 	if (!ucm_dev)

nit: There's a slight change in logic here.  Previously, the cm/ucm added a
device only if all ports were the correct type.  Now, they add a device if any
port is the correct type.  In practice, this shouldn't be an issue, but other
code in the cm/ucm is assuming that all ports on the device are usable.
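
The difference between the two behaviours can be sketched in isolation.
This is an illustration only: the enum, port_usable(), all_ports_usable()
and any_port_usable() are hypothetical names and not kernel APIs, and a
single "usable" predicate is used for both variants so that only the
all/any distinction changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the transport type; values are hypothetical. */
enum transport { TRANSPORT_IB, TRANSPORT_RDMAOE, TRANSPORT_IWARP };

static bool port_usable(enum transport tt)
{
	return tt == TRANSPORT_IB || tt == TRANSPORT_RDMAOE;
}

/* Previous behaviour: add the device only if every port is usable. */
static bool all_ports_usable(const enum transport *ports, size_t n)
{
	size_t i;

	assert(ports != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (!port_usable(ports[i]))
			return false;
	return true;
}

/* Patched behaviour: add the device if at least one port is usable. */
static bool any_port_usable(const enum transport *ports, size_t n)
{
	size_t i;

	assert(ports != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (port_usable(ports[i]))
			return true;
	return false;
}
```

A device with one IB port and one port of another type passes the "any"
test but fails the "all" test, which is the case where code that assumes
every port is usable could be caught out.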


From hnrose at comcast.net  Thu Aug  6 11:19:28 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 6 Aug 2009 14:19:28 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_file.c: Fix return status
	from do_ucast_file_load when file name is not provided
Message-ID: <20090806181928.GA21698@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c
index 2505c46..a22b936 100644
--- a/opensm/opensm/osm_ucast_file.c
+++ b/opensm/opensm/osm_ucast_file.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2008      Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2009      HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context)
 		OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
 			"LFTs file name is not given; "
 			"using default routing algorithm\n");
-		return 1;
+		return -1;
 	}
 
 	file = fopen(file_name, "r");


From hnrose at comcast.net  Thu Aug  6 11:23:15 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 6 Aug 2009 14:23:15 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: In lash_core,
	return status -1 for all errors
Message-ID: <20090806182315.GB21698@comcast.net>


In lash_process, rename variable from return_status to status
Also, status is not really IB_SUCCESS or not (although that works)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index b3107f0..96bfebb 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -795,7 +795,7 @@ static int lash_core(lash_t * p_lash)
 	int stop = 0, output_link, i_next_switch;
 	int output_link2, i_next_switch2;
 	int cycle_found2 = 0;
-	int status = 0;
+	int status = -1;
 	int *switch_bitmap = NULL;	/* Bitmap to check if we have processed this pair */
 	unsigned start_vl = p_lash->p_osm->subn.opt.lash_start_vl;
 
@@ -810,7 +810,6 @@ static int lash_core(lash_t * p_lash)
 
 		shortest_path(p_lash, i);
 		if (generate_routing_func_for_mst(p_lash, i, &dests)) {
-			status = -1;
 			OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06: "
 				"generate_routing_func_for_mst failed\n");
 			goto Exit;
@@ -951,10 +950,10 @@ static int lash_core(lash_t * p_lash)
 		OSM_LOG(p_log, OSM_LOG_INFO, "Lanes in layer %d: %d\n",
 			i, p_lash->num_mst_in_lane[i]);
 
+	status = 0;
 	goto Exit;
 
 Error_Not_Enough_Lanes:
-	status = -1;
 	OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: "
 		"Lane requirements (%d) exceed available lanes (%d)"
 		" with starting lane (%d)\n",
@@ -1222,7 +1221,7 @@ static int lash_process(void *context)
 {
 	lash_t *p_lash = context;
 	osm_log_t *p_log = &p_lash->p_osm->log;
-	int return_status = IB_SUCCESS;
+	int status = 0;
 
 	OSM_LOG_ENTER(p_log);
 
@@ -1231,18 +1230,18 @@ static int lash_process(void *context)
 	/* everything starts here */
 	lash_cleanup(p_lash);
 
-	return_status = discover_network_properties(p_lash);
-	if (return_status != IB_SUCCESS)
+	status = discover_network_properties(p_lash);
+	if (status)
 		goto Exit;
 
-	return_status = init_lash_structures(p_lash);
-	if (return_status != IB_SUCCESS)
+	status = init_lash_structures(p_lash);
+	if (status)
 		goto Exit;
 
 	process_switches(p_lash);
 
-	return_status = lash_core(p_lash);
-	if (return_status != IB_SUCCESS)
+	status = lash_core(p_lash);
+	if (status)
 		goto Exit;
 
 	populate_fwd_tbls(p_lash);
@@ -1252,7 +1251,7 @@ Exit:
 		free_lash_structures(p_lash);
 	OSM_LOG_EXIT(p_log);
 
-	return return_status;
+	return status;
 }
 
 static lash_t *lash_create(osm_opensm_t * p_osm)
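
The status handling in the lash_core hunks above follows a
"fail by default" pattern: status starts at -1 and is set to 0 only once
every step has succeeded. A minimal standalone sketch of that pattern
follows. run_steps() and its parameters are hypothetical and only stand
in for a multi-step routine, they are not OpenSM code.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical multi-step routine: 'fail_at' names the step that
 * should fail (use -1 for "no failure").
 */
static int run_steps(int fail_at, int num_steps)
{
	int status = -1;	/* assume failure until the end is reached */
	int *scratch;
	int i;

	assert(num_steps > 0);

	scratch = calloc(num_steps, sizeof *scratch);
	if (!scratch)
		goto Exit;

	for (i = 0; i < num_steps; i++) {
		if (i == fail_at)
			goto Exit;	/* no per-site "status = -1" needed */
		scratch[i] = 1;
	}

	status = 0;		/* the only place success is recorded */
Exit:
	free(scratch);
	return status;
}
```

The benefit is that a new error path added later cannot forget to set
the status; the cost is that every success exit must pass through the
single "status = 0" line.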


From arlin.r.davis at intel.com  Thu Aug  6 11:39:40 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 6 Aug 2009 11:39:40 -0700
Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.21 release
Message-ID: <1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com>


New release for uDAPL 2.0 available on the OFA download page and in my git tree.

md5sum: 7874571e984c9d8ab315dcd90bfd7c44 dapl-2.0.21.tar.gz 

Summary of changes: 
v2 - scm: Fix disconnect. QP's need to move to ERROR state in 
v2 - dtest: modify dtest.c to cleanup CNO wait code and consolidate into 
v2 - common: CNO events, once triggered will not be returned during the cno wait. 
v2 - scm, cma: CNO support broken in both CMA and SCM providers. 
v2 - common osd: include winsock2.h for IPv6 definitions. 
v2 - common osd: include w2tcpip.h for sockaddr_in6 definitions. 
v2 - common: direct_wait objects pushed down to provider layer 
v2 - dapltest: Implement a malloc() threshold for the completion reaping. 
v2 - scm: handle connected state when freeing CM objects 
v2 - scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings. 
v2 - scm: set TCP_NODELAY sockopt on the server side for sends. 
v2 - windows: remove obsolete files in dapl/udapl source tree 
v2 - dtestcm: add UD type QP option to test 
v2 - scm: destroy QP called before disconnect 
v2 - cma: add support for rdma_cm TIME_WAIT event. 
v2 - scm: remove old udapl_scm code replaced by openib_scm. 
v2 - winof: fix build issues after consolidating cma, scm code base. 
v2 - cma: lock held when exiting as a result of a rdma_create_event_channel failure 
v2 - windows: all dlist functions have been moved to the header file. 
v2 - dtestcm windows: add build infrastructure for new dtestcm test suite 
v2 - openib_common: reorganize provider code base to share common mem, cq, qp, dto 
v2 - scm: fixes and optimizations for connection scaling 
v2 - scm: double the default fd_set_size 
v2 - scm: EP reference in CR should be cleared during ep_destroy 
v2 - dtestx: fix conn establishment event checking 
v2 - dtestcm: new test to measure dapl connection rates. 

Vlad, please pull new v2 package into OFED 1.5 and install the following:

NOTE: the reorder... v2 first and then v1
 
dapl-2.0.21-1 
dapl-utils-2.0.21-1 
dapl-devel-2.0.21-1 
dapl-debuginfo-2.0.21-1 
compat-dapl-1.2.14-1 
compat-dapl-devel-1.2.14-1 

See http://www.openfabrics.org/downloads/dapl/ for more details.

-arlin


From bart.vanassche at gmail.com  Thu Aug  6 11:46:25 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 20:46:25 +0200
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <adad478hmi8.fsf@cisco.com>
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
Message-ID: <e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>

On Thu, Aug 6, 2009 at 7:37 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  >
>  > After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I
>  > noticed that messages appeared in the kernel log about the IB modules
>  > missing a kobject release() method. This happens both with a vanilla
>  > 2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone
>  > noticed this before ?
>  >
>  > An example of the messages logged in /var/log/messages:
>  >
>  > ...
>  > kobject: 'ib_cm' (ffffffffa067a810): does not have a release()
>  > function, it is broken and must be fixed.
>
> I don't see anything similar with CONFIG_DEBUG_KOBJECT enabled on
> 2.6.31-rc5 (without adding in any OFED confusion).
>
> It seems as if you get this message for every module being loaded; do
> you see it for any non-RDMA-related modules?  (Do you have any such
> modules in your config?)  I can imagine the OFED build system messing
> things up, but if you're just building the modules as part of the normal
> kernel build (ie your vanilla 2.6.30 kernel) then I don't see anything
> that would make ib_cm or iw_cm any different from any other module.
>
> For example, if I load ib_cm on my kernel, the only log messages I see
> from "dmesg|grep ib_cm" are:
>
>    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_add_internal: parent: 'module', set: 'module'
>    kobject: 'holders' (ffff88022c1c9df8): kobject_add_internal: parent: 'ib_cm', set: '<NULL>'
>    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_uevent_env
>    kobject: 'ib_cm' (ffffffffa024c8f0): fill_kobj_path: path = '/module/ib_cm'
>    kobject: 'notes' (ffff88022c1c9be8): kobject_add_internal: parent: 'ib_cm', set: '<NULL>'

Just to be sure that I'm working with the vanilla 2.6.30.4 kernel
drivers and not with the OFED drivers, I ran the following commands
before any IB modules were loaded:

rm -rf /lib/modules/$(uname -r)
cd /usr/src/linux-2.6.30.4
make modules_install

Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start)
and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop)
the IB subsystem. The "broken" message was logged during module unload
only, not during module load.

The "broken" message was also logged for the following non-IB kernel
modules: snd_seq_dummy, snd_pcm_oss, snd_mixer_oss, snd_seq,
snd_seq_device and scsi_tgt.

Bart.


From rdreier at cisco.com  Thu Aug  6 12:22:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 12:22:02 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	(Bart Van Assche's message of "Thu, 6 Aug 2009 20:46:25 +0200")
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
Message-ID: <adaocqsg339.fsf@cisco.com>


 > Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start)
 > and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop)
 > the IB subsystem. The "broken" message was logged during module unload
 > only, not during module load.

Oh I see... yes I get it on unload too, for any module.  Seems like a
shortcoming in the kobject debugging code.

 - R.


From bart.vanassche at gmail.com  Thu Aug  6 12:29:43 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Thu, 6 Aug 2009 21:29:43 +0200
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <adaocqsg339.fsf@cisco.com>
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	<adaocqsg339.fsf@cisco.com>
Message-ID: <e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:22 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start)
>  > and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop)
>  > the IB subsystem. The "broken" message was logged during module unload
>  > only, not during module load.
>
> Oh I see... yes I get it on unload too, for any module.  Seems like a
> shortcoming in the kobject debugging code.

Are you sure that this indicates a shortcoming in the kobject
debugging code ? The most recent messages related to the message "does
not have a release() function, it is broken and must be fixed" I could
find on the LKML date from July 16, 2009
(http://lkml.org/lkml/2009/7/16/306 and
http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH
acknowledges that if this message is logged that this indicates a
problem that should be fixed.

Bart.


From ralph.campbell at qlogic.com  Thu Aug  6 12:48:10 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 6 Aug 2009 12:48:10 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ
	entries
Message-ID: <1249588090.23162.940.camel@chromite.mv.qlogic.com>

When ib_send_bw is run in bi-directional mode (-b), it doesn't
create enough completion queue entries for both the send *and*
the receive completions. Thus, CQ entries are lost due to the
queue being full and the test can hang.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/send_bw.c b/send_bw.c
index f842fb9..d5c4e63 100755
--- a/send_bw.c
+++ b/send_bw.c
@@ -489,7 +489,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 		}
 	}
 
-	ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel, 0);
+	ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth,
+				NULL, ctx->channel, 0);
 	if (!ctx->cq) {
 		fprintf(stderr, "Couldn't create CQ\n");
 		return NULL;


From rdreier at cisco.com  Thu Aug  6 12:58:40 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 12:58:40 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>
	(Bart Van Assche's message of "Thu, 6 Aug 2009 21:29:43 +0200")
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	<adaocqsg339.fsf@cisco.com>
	<e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>
Message-ID: <adafxc4g1e7.fsf@cisco.com>


 > Are you sure that this indicates a shortcoming in the kobject
 > debugging code ? The most recent messages related to the message "does
 > not have a release() function, it is broken and must be fixed" I could
 > find on the LKML date from July 16, 2009
 > (http://lkml.org/lkml/2009/7/16/306 and
 > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH
 > acknowledges that if this message is logged that this indicates a
 > problem that should be fixed.

I'm not sure -- I just assume that the core module unloading code is
working OK, since it is so heavily tested.  If there were really a "must
be fixed" problem with module unloading then someone would surely have
hit more than a warning message.

 - R.


From hal.rosenstock at gmail.com  Thu Aug  6 13:45:25 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Aug 2009 16:45:25 -0400
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <f0e08f230908051007t66c799adgcebe61f15ed10c80@mail.gmail.com>
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me>
	<f0e08f230908040945r41eb42cdnbcecb0e96c28278a@mail.gmail.com>
	<20090804201505.GI7993@me>
	<f0e08f230908050424s26cbe8d3y690adacaded59591@mail.gmail.com>
	<20090805134352.GS7993@me>
	<f0e08f230908050743m6a192bc6m684b24df9ed86259@mail.gmail.com>
	<20090805163140.GW7993@me>
	<f0e08f230908051007t66c799adgcebe61f15ed10c80@mail.gmail.com>
Message-ID: <f0e08f230908061345k3cfdd224q7b199eedde3bef5b@mail.gmail.com>

On Wed, Aug 5, 2009 at 1:07 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

>
>
>  On Wed, Aug 5, 2009 at 12:31 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>> On 10:43 Wed 05 Aug     , Hal Rosenstock wrote:
>> >
>> > Should this be done as a separate step on the way to the LFT
>> parallelization
>> > across switches ?
>>
>> What do you mean by "separate step" (separate from what)?
>
>
> Separate patches: first to move the osm_ucast_mgr_set_fwd_table call up a
> level and a second one to implement the LFT parallelization across
> switches underneath that.
>
>
>>
>>
>> I'm trying to replay the idea again: each routing engine calculates LFTs
>> and fill sw->new_lfts array accordingly, after all it calls a procedure
>> for sending switches' LFT blocks (and TOPs). So routing engine itself
>> should not care about how exactly LFT blocks update MADs submission is
>> actually implemented.
>>
>
>
> Yes, understood.
>

The one issue which gets in the way a bit here is the port order list (only
applicable to certain engines and not others). Due to this, there are two
places where the FT MAD pushing occurs. It'll be clearer when I submit the
patch for this.

One other thing I ran into (and related to the osm_ucast_file.c patch I sent
a little while ago) is the significance of > 0 returns from build_fwd_tables.
Is there a reason that a routing engine would want to run its
build_fwd_tables and then run the default one ? That seems to be what it
does.

It might be useful to document the status returns from build_lid_matrices
and build_fwd_tables.

-- Hal


>
> -- Hal
>
>
>>
>> Sasha
>>
>
>

From liranl at mellanox.co.il  Thu Aug  6 14:04:49 2009
From: liranl at mellanox.co.il (Liran Liss)
Date: Fri, 7 Aug 2009 00:04:49 +0300
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support -
	allowinterfaces to correspond to each other
In-Reply-To: <ada8whwhmgj.fsf@cisco.com>
References: <20090805083023.GK5599@mtls03><20090805204259.GB16677@obsidianresearch.com><20090806171840.GA32301@mtls03>
	<ada8whwhmgj.fsf@cisco.com>
Message-ID: <2ED289D4E09FBD4D92D911E869B97FDD50DF98@mtlexch01.mtl.com>


 > > What about multicast though? Switches are going to have trouble with
 > > group membership lists for non IP packets.. Even just sending a ICMPv6
 > > packet (with an IPv6 ethertype) isn't guaranteed to fix it.

 > In this patch set, all multicast packets use the broadcast mac. We
 > will address this issue at a future time.

I don't see how you can address it in the future -- if later on things
are changed to use multicast addresses, then systems running this code
will silently fail to receive multicasts.

 - R.

We initially intended to defer this to a separate patch set for brevity,
but I understand your point.
We will work out a solution and resend.

Thanks,
--Liran


From sashak at voltaire.com  Thu Aug  6 14:06:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 7 Aug 2009 00:06:01 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mesh.c: Remove edges in lash
	matrix
In-Reply-To: <20090805220613.GA7155@comcast.net>
References: <20090805220613.GA7155@comcast.net>
Message-ID: <20090806210601.GD7993@me>

Hi Hal,

On 18:06 Wed 05 Aug     , Hal Rosenstock wrote:
> 
> +
> +static void print_axis(lash_t *p_lash, int sw, int port)
> +{
> +	mesh_node_t *node = p_lash->switches[sw]->node;
> +	char *name = p_lash->switches[sw]->p_sw->p_node->print_desc;
> +	int c = node->axes[port];
> +
> +	printf("%s[%d] = ", name, port);
> +	if (c)
> +		printf("%s%c -> ", ((c - 1) & 1) ? "-" : "+", 'X' + (c - 1)/2);
> +	else
> +		printf("N/A -> ");
> +	printf("%s\n",
> +	       p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc);
>  }
>  
>  /*
> @@ -805,6 +864,11 @@ static void seed_axes(lash_t *p_lash, int sw)
>  		}
>  	}
>  
> +	for (i = 0; i < n; i++) {
> +		printf("seed: ");
> +		print_axis(p_lash, sw, i);
> +	}

Please remove debug prints or move it to use osm_log().

Sasha
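
A minimal sketch of one way to act on this request: build the axis label
in a buffer so that it can be passed to a logging call instead of being
written with printf(). format_axis() is a hypothetical helper and not
part of osm_mesh.c; it follows the encoding visible in the quoted
print_axis() (0 means no axis, odd codes are "+", even codes are "-"),
except that it treats any non-positive code as "N/A".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the label for axis code 'c' into 'buf'.
 * Codes 1 and 2 map to +X and -X, 3 and 4 to +Y and -Y, and so on.
 * Returns 'buf' so the call can be used directly as a "%s" argument.
 */
static const char *format_axis(int c, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (c > 0)
		snprintf(buf, len, "%s%c", ((c - 1) & 1) ? "-" : "+",
			 'X' + (c - 1) / 2);
	else
		snprintf(buf, len, "N/A");
	return buf;
}
```

The caller can then emit the switch name, port and label in a single
log line at debug level.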


From jgunthorpe at obsidianresearch.com  Thu Aug  6 14:12:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 6 Aug 2009 15:12:31 -0600
Subject: [ofa-general] Setting the rate in Infiniband.
In-Reply-To: <ed1288770908051703i4289654cs23c4e3118bba41a4@mail.gmail.com>
References: <ed1288770908051703i4289654cs23c4e3118bba41a4@mail.gmail.com>
Message-ID: <20090806211231.GG16677@obsidianresearch.com>

On Wed, Aug 05, 2009 at 08:03:04PM -0400, Ashwath Narasimhan wrote:

> The reason why I need such small rates is because I interface the
> Infiniband HCA to an FPGA via an Infiniband physical link.  Imagine
> the FPGA as a simple repeater that simply forwards the infiniband
> signals to the Target HCA. The FPGA cannot handle such a high data
> rate and neither do I have as much memory as required to buffer it
> on the FPGA (I might drop packets if the buffer becomes full). Hence
> I wish to limit the rate to say 100Mbps instead of 2.5Gbps.

The correct thing to do is manage the flow control credits you are
giving to the IB network so you don't lose packets.

Jason


From jgunthorpe at obsidianresearch.com  Thu Aug  6 14:19:29 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 6 Aug 2009 15:19:29 -0600
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ
	entries
In-Reply-To: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
Message-ID: <20090806211929.GH16677@obsidianresearch.com>

On Thu, Aug 06, 2009 at 12:48:10PM -0700, Ralph Campbell wrote:
> When ib_send_bw is run in bi-directional mode (-b), it doesn't
> create enough completion queue entries for both the send *and*
> the receive completions. Thus, CQ entries are lost due to the
> queue being full and the test can hang.

Is this on IB? I thought the required behavior on CQ exhaustion was
for the sendq to halt and incoming recvs to return RNR (same as recvq
exhaustion) - it shouldn't just hang, and CQ entries should never be
lost.

Clearly the patch is right, but the consequences you describe seem
wrong..

Jason


From sean.hefty at intel.com  Thu Aug  6 14:37:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 14:37:19 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few
	CQ	entries
In-Reply-To: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
Message-ID: <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>

>-	ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel,
>0);
>+	ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth,
>+				NULL, ctx->channel, 0);

I'm looking at a windows port of this test, but at least there, rx_depth is set
to rx_depth + tx_depth.


From ralph.campbell at qlogic.com  Thu Aug  6 14:46:44 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 06 Aug 2009 14:46:44 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ
	entries
In-Reply-To: <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
	<6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>
Message-ID: <1249595204.23162.951.camel@chromite.mv.qlogic.com>

On Thu, 2009-08-06 at 14:37 -0700, Sean Hefty wrote:
> >-	ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel,
> >0);
> >+	ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth,
> >+				NULL, ctx->channel, 0);
> 
> I'm looking at a windows port of this test, but at least there, rx_depth is set
> to rx_depth + tx_depth.

Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to
	ctx->rx_depth = rx_depth + tx_depth
but the rest of the code does ibv_post_send() and ibv_post_recv()
based on ctx->tx_depth and ctx->rx_depth which means the CQ needs
to be ctx->tx_depth + ctx->rx_depth big.
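
The sizing argument can be written down as a small sketch, assuming a
single CQ that collects both send and receive completions and is polled
only after everything has been posted. cq_entries_needed() is a
hypothetical helper and not part of send_bw.c.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of CQ entries needed when one CQ is
 * shared by the send queue and the receive queue.
 */
static int cq_entries_needed(int tx_depth, int rx_depth)
{
	assert(tx_depth >= 0 && rx_depth >= 0);

	/*
	 * Up to tx_depth sends and rx_depth receives can be outstanding
	 * at once, and all of them may complete before the CQ is polled.
	 */
	return tx_depth + rx_depth;
}
```

For example, with 300 sends and 600 receives posted, the CQ would need
900 entries rather than the 600 that the receive depth alone provides.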


From sean.hefty at intel.com  Thu Aug  6 14:56:08 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 14:56:08 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few
	CQ	entries
In-Reply-To: <1249595204.23162.951.camel@chromite.mv.qlogic.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>	<6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>
	<1249595204.23162.951.camel@chromite.mv.qlogic.com>
Message-ID: <D30D936A450A4BB88B7539A3F0227F06@amr.corp.intel.com>

>Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to
>	ctx->rx_depth = rx_depth + tx_depth
>but the rest of the code does ibv_post_send() and ibv_post_recv()
>based on ctx->tx_depth and ctx->rx_depth which means the CQ needs
>to be ctx->tx_depth + ctx->rx_depth big.

If the tx_depth is the same on both sides, why would there ever be more than the
initial tx_depth and rx_depth completions on the CQ?  How many receive
completions can there be on the CQ, and what throttles the sender? 

- Sean


From ralph.campbell at qlogic.com  Thu Aug  6 15:04:27 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 06 Aug 2009 15:04:27 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ
	entries
In-Reply-To: <D30D936A450A4BB88B7539A3F0227F06@amr.corp.intel.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>
	<6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>
	<1249595204.23162.951.camel@chromite.mv.qlogic.com>
	<D30D936A450A4BB88B7539A3F0227F06@amr.corp.intel.com>
Message-ID: <1249596267.23162.956.camel@chromite.mv.qlogic.com>

On Thu, 2009-08-06 at 14:56 -0700, Sean Hefty wrote:
> >Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to
> >	ctx->rx_depth = rx_depth + tx_depth
> >but the rest of the code does ibv_post_send() and ibv_post_recv()
> >based on ctx->tx_depth and ctx->rx_depth which means the CQ needs
> >to be ctx->tx_depth + ctx->rx_depth big.
> 
> If the tx_depth is the same on both sides, why would there ever be more than the
> initial tx_depth and rx_depth completions on the CQ?  How many receive
> completions can there be on the CQ, and what throttles the sender? 
> 
> - Sean

Remember that this fix only affects the bi-directional test.
Both client and server are going to post ctx->rx_depth receives
and ctx->tx_depth sends and then check for completions.
It won't post more sends or receives until the completions are
seen.


From hnrose at comcast.net  Thu Aug  6 15:34:17 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 6 Aug 2009 18:34:17 -0400
Subject: [ofa-general] [PATCHv2] opensm/osm_mesh.c: Remove edges in lash
	matrix
Message-ID: <20090806223417.GA2997@comcast.net>


The intent of this change is to remove edge nodes (by "not counting"
them).

The point of this heuristic is to deal with the case of small
lattices which can easily have more surface than interior,
which leads to choosing a non representative seed. This causes
impossible counts to get reported.

Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Replaced printfs with OSM_LOG calls

diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 72a9aa9..174bd7e 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -170,6 +170,11 @@ static const struct mesh_info {
 
 	{8, {2, 2, 2, 2, 2, 2, 2, 2},	8, {-1792, -6144, -8960, -7168, -3360, -896, -112, 0, 1},	},
 
+	/*
+	 * mesh errors
+	 */
+	{2, {6, 6},                     4, {-192, -256, -80, 0, 1}, },
+
 	{-1, {0,}, 0, {0, },					},
 };
 
@@ -727,6 +732,42 @@ done:
 }
 
 /*
+ * remove_edges
+ *
+ * remove type from nodes that have fewer links
+ * than adjacent nodes
+ */
+static void remove_edges(lash_t *p_lash)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int sw;
+	mesh_node_t *n, *nn;
+	int i;
+
+	OSM_LOG_ENTER(p_log);
+
+	for (sw = 0; sw < p_lash->num_switches; sw++) {
+		n = p_lash->switches[sw]->node;
+		if (!n->type)
+			continue;
+
+		for (i = 0; i < n->num_links; i++) {
+			nn = p_lash->switches[n->links[i]->switch_id]->node;
+
+			if (nn->num_links > n->num_links) {
+				OSM_LOG(p_log, OSM_LOG_DEBUG,
+					"removed edge switch %s\n",
+					p_lash->switches[sw]->p_sw->p_node->print_desc);
+				n->type = -1;
+				break;
+			}
+		}
+	}
+
+	OSM_LOG_EXIT(p_log);
+}
+
+/*
  * get_local_geometry
  *
  * analyze the local geometry around each switch
@@ -735,6 +776,7 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
 	int sw;
+	int status = 0;
 
 	OSM_LOG_ENTER(p_log);
 
@@ -747,15 +789,38 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh)
 			continue;
 
 		if (get_switch_metric(p_lash, sw)) {
-			OSM_LOG_EXIT(p_log);
-			return -1;
+			status = -1;
+			goto Exit;
 		}
-		classify_switch(p_lash, mesh, sw);
 		classify_mesh_type(p_lash, sw);
 	}
 
+	remove_edges(p_lash);
+
+	for (sw = 0; sw < p_lash->num_switches; sw++) {
+		if (p_lash->switches[sw]->node->type < 0)
+			continue;
+		classify_switch(p_lash, mesh, sw);
+	}
+
+Exit:
 	OSM_LOG_EXIT(p_log);
-	return 0;
+	return status;
+}
+
+static void print_axis(lash_t *p_lash, char *p, int sw, int port)
+{
+	mesh_node_t *node = p_lash->switches[sw]->node;
+	char *name = p_lash->switches[sw]->p_sw->p_node->print_desc;
+	int c = node->axes[port];
+
+	p += sprintf(p, "%s[%d] = ", name, port);
+	if (c)
+		p += sprintf(p, "%s%c -> ", ((c - 1) & 1) ? "-" : "+", 'X' + (c - 1)/2);
+	else
+		p += sprintf(p, "N/A -> ");
+	p += sprintf(p, "%s\n",
+		     p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc);
 }
 
 /*
@@ -773,6 +838,7 @@ static void seed_axes(lash_t *p_lash, int sw)
 	mesh_node_t *node = p_lash->switches[sw]->node;
 	int n = node->num_links;
 	int i, j, c;
+	char buf[256], *p;
 
 	OSM_LOG_ENTER(p_log);
 	if (!node->matrix || !node->dimension)
@@ -805,6 +871,12 @@ static void seed_axes(lash_t *p_lash, int sw)
 		}
 	}
 
+	for (i = 0; i < n; i++) {
+		p = buf;
+		print_axis(p_lash, p, sw, i);
+		OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
+	}
+
 done:
 	OSM_LOG_EXIT(p_log);
 }
@@ -878,6 +950,12 @@ static void make_geometry(lash_t *p_lash, int sw)
 			n = s1->node->num_links;
 
 			/*
+			 * ignore chain fragments
+			 */
+			if (n < seed->node->num_links && n <= 2)
+				continue;
+
+			/*
 			 * only process 'mesh' switches
 			 */
 			if (!s1->node->matrix)
@@ -908,7 +986,8 @@ static void make_geometry(lash_t *p_lash, int sw)
 					if (j == i)
 						continue;
 
-					if (s1->node->matrix[i][j] != 2) {
+					if (s1->node->matrix[i][j] != 2 &&
+						s1->node->matrix[i][j] <= 4) {
 						if (s1->node->axes[j]) {
 							if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) {
 								OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n");
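
For readers who want to see the edge heuristic in isolation, below is a
toy sketch, assuming a plain array of nodes instead of the lash_t and
mesh_node_t structures. All names (toy_node, toy_remove_edges,
toy_build_3x3) are hypothetical. It applies the same rule as
remove_edges(): a classified node that has fewer links than any one of
its neighbours gets type -1.

```c
#include <stddef.h>

#define MAX_LINKS 8

struct toy_node {
	int type;		/* 0: unclassified, >0: classified, -1: edge */
	int num_links;
	int links[MAX_LINKS];	/* indices of neighbouring nodes */
};

/* Mark every classified node that has fewer links than a neighbour. */
void toy_remove_edges(struct toy_node *nodes, size_t count)
{
	size_t sw;
	int i;

	for (sw = 0; sw < count; sw++) {
		struct toy_node *n = &nodes[sw];

		if (!n->type)
			continue;
		for (i = 0; i < n->num_links; i++) {
			if (nodes[n->links[i]].num_links > n->num_links) {
				n->type = -1;
				break;
			}
		}
	}
}

/* Build a 3x3 open mesh: corners get 2 links, sides 3, the centre 4. */
void toy_build_3x3(struct toy_node nodes[9])
{
	int r, c;

	for (r = 0; r < 3; r++) {
		for (c = 0; c < 3; c++) {
			struct toy_node *n = &nodes[r * 3 + c];

			n->type = 1;
			n->num_links = 0;
			if (r > 0)
				n->links[n->num_links++] = (r - 1) * 3 + c;
			if (r < 2)
				n->links[n->num_links++] = (r + 1) * 3 + c;
			if (c > 0)
				n->links[n->num_links++] = r * 3 + (c - 1);
			if (c < 2)
				n->links[n->num_links++] = r * 3 + (c + 1);
		}
	}
}
```

On the 3x3 mesh, eight of the nine nodes are surface nodes and all of
them are marked, so only the centre node is left as a seed candidate.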


From sean.hefty at intel.com  Thu Aug  6 15:40:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Aug 2009 15:40:21 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few
	CQ	entries
In-Reply-To: <1249596267.23162.956.camel@chromite.mv.qlogic.com>
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com>	
	<6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com>	
	<1249595204.23162.951.camel@chromite.mv.qlogic.com>	
	<D30D936A450A4BB88B7539A3F0227F06@amr.corp.intel.com>
	<1249596267.23162.956.camel@chromite.mv.qlogic.com>
Message-ID: <8103A2A3D9FB46AC85A3520AECC8897B@amr.corp.intel.com>

>Remember that this fix only affects the bi-directional test.
>Both client and server are going to post ctx->rx_depth receives
>and ctx->tx_depth sends and then check for completions.
>It won't post more sends or receives until the completions are
>seen.

Okay - I think I understand what's happening.

The maximum number of outstanding sends is limited to tx_depth / 2.  After
posting that many sends, the code waits for completions.  Once some sends
complete, additional sends may be posted, up to the iteration count.  There's
nothing that coordinates posting the sends with completing receives on the
remote side.  (This is what I was missing.)  Eventually, all posted receives
could be complete and generate CQ entries.  The send side is basically throttled
by RNR NACKs.

Now I don't understand the purpose behind doubling the rx_depth...

- Sean


From weiny2 at llnl.gov  Thu Aug  6 16:01:07 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 6 Aug 2009 16:01:07 -0700
Subject: [ofa-general] [PATCH] libibmad: make accessors function for retry
 values used in libibmad
Message-ID: <20090806160107.83193923.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 6 Aug 2009 15:27:30 -0700
Subject: [PATCH] libibmad: make accessors function for retry values used in libibmad

        In addition use this function to determine the retries used throughout the library.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 libibmad/include/infiniband/mad.h |    1 +
 libibmad/src/libibmad.map         |    1 +
 libibmad/src/mad.c                |    5 +++++
 libibmad/src/mad_internal.h       |    1 +
 libibmad/src/rpc.c                |   12 +++---------
 5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index c5d73d5..0d0dcf1 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -804,6 +804,7 @@ MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout);
 MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls);
 
 MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms);
+MAD_EXPORT int mad_get_retries(struct ibmad_port *srcport);
 
 
 /* register.c */
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index a8605b5..b9a890c 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -71,6 +71,7 @@ IBMAD_1.3 {
 		mad_rpc_set_retries;
 		mad_rpc_set_timeout;
 		mad_get_timeout;
+		mad_get_retries;
 		madrpc;
 		madrpc_def_timeout;
 		madrpc_init;
diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index bc64a0f..7192dd6 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -70,6 +70,11 @@ int mad_get_timeout(struct ibmad_port *srcport, int override_ms)
 	    srcport->timeout ? srcport->timeout : madrpc_timeout);
 }
 
+int mad_get_retries(struct ibmad_port *srcport)
+{
+	return (srcport->retries ? srcport->retries : madrpc_retries);
+}
+
 void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data)
 {
 	int is_resp = rpc->method & IB_MAD_RESPONSE;
diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
index 7a16a46..475adfc 100644
--- a/libibmad/src/mad_internal.h
+++ b/libibmad/src/mad_internal.h
@@ -44,5 +44,6 @@ struct ibmad_port {
 
 extern struct ibmad_port *ibmp;
 extern int madrpc_timeout;
+extern int madrpc_retries;
 
 #endif /* _MAD_INTERNAL_H_ */
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index bb83114..b5e4441 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -53,9 +53,9 @@ struct ibmad_port *ibmp = &mad_port;
 
 static int iberrs;
 
+int madrpc_retries = MAD_DEF_RETRIES;
 int madrpc_timeout = MAD_DEF_TIMEOUT_MS;
 
-static int madrpc_retries = MAD_DEF_RETRIES;
 static void *save_mad;
 static int save_mad_len = 256;
 
@@ -211,7 +211,6 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
 {
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
-	int retries;
 	int redirect = 1;
 
 	while (redirect) {
@@ -221,12 +220,10 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
 		if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 			return NULL;
 
-		retries = port->retries ? port->retries : madrpc_retries;
-
 		if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 				      port->class_agents[rpc->mgtclass],
 				      len, mad_get_timeout(port, rpc->timeout),
-				      retries)) < 0) {
+				      mad_get_retries(port))) < 0) {
 			IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 			return NULL;
 		}
@@ -267,7 +264,6 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc,
 {
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
-	int retries;
 
 	memset(sndbuf, 0, umad_size() + IB_MAD_SIZE);
 
@@ -276,12 +272,10 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
 		return NULL;
 
-	retries = port->retries ? port->retries : madrpc_retries;
-
 	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 			      port->class_agents[rpc->mgtclass],
 			      len, mad_get_timeout(port, rpc->timeout),
-			      retries)) < 0) {
+			      mad_get_retries(port))) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return NULL;
 	}
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Aug  6 16:01:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 6 Aug 2009 16:01:06 -0700
Subject: [ofa-general] [PATCH] libibmad: make accessors function for timeout
 values used in libibmad
Message-ID: <20090806160106.4725041e.weiny2@llnl.gov>

Sasha,

In using mad_send_via and mad_receive_via I have found getting the timeout and retry values from the mad layer to be beneficial.

This and the patch that follows export functions to get those values as well as standardize the use of them internally.

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 27 Jul 2009 13:48:17 -0700
Subject: [PATCH] libibmad: make accessors function for timeout values used in libibmad

	In addition use this function to determine the timeout to be used throughout the library.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 libibmad/include/infiniband/mad.h |    3 +++
 libibmad/src/libibmad.map         |    1 +
 libibmad/src/mad.c                |    8 ++++++++
 libibmad/src/mad_internal.h       |    1 +
 libibmad/src/rpc.c                |   17 ++++++++---------
 libibmad/src/serv.c               |    8 +++++---
 6 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index ee004a9..c5d73d5 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -803,6 +803,9 @@ MAD_EXPORT void mad_rpc_set_retries(struct ibmad_port *port, int retries);
 MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout);
 MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls);
 
+MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms);
+
+
 /* register.c */
 MAD_EXPORT int mad_register_port_client(int port_id, int mgmt,
 					uint8_t rmpp_version);
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 1462064..a8605b5 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -70,6 +70,7 @@ IBMAD_1.3 {
 		mad_rpc_class_agent;
 		mad_rpc_set_retries;
 		mad_rpc_set_timeout;
+		mad_get_timeout;
 		madrpc;
 		madrpc_def_timeout;
 		madrpc_init;
diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index 8defabd..bc64a0f 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -44,6 +44,8 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
+#include "mad_internal.h"
+
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
 
@@ -62,6 +64,12 @@ uint64_t mad_trid(void)
 	return next;
 }
 
+int mad_get_timeout(struct ibmad_port *srcport, int override_ms)
+{
+	return (override_ms ? override_ms :
+	    srcport->timeout ? srcport->timeout : madrpc_timeout);
+}
+
 void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data)
 {
 	int is_resp = rpc->method & IB_MAD_RESPONSE;
diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
index 24418cc..7a16a46 100644
--- a/libibmad/src/mad_internal.h
+++ b/libibmad/src/mad_internal.h
@@ -43,5 +43,6 @@ struct ibmad_port {
 };
 
 extern struct ibmad_port *ibmp;
+extern int madrpc_timeout;
 
 #endif /* _MAD_INTERNAL_H_ */
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index c6fd392..bb83114 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -53,8 +53,9 @@ struct ibmad_port *ibmp = &mad_port;
 
 static int iberrs;
 
+int madrpc_timeout = MAD_DEF_TIMEOUT_MS;
+
 static int madrpc_retries = MAD_DEF_RETRIES;
-static int madrpc_timeout = MAD_DEF_TIMEOUT_MS;
 static void *save_mad;
 static int save_mad_len = 256;
 
@@ -210,7 +211,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
 {
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
-	int timeout, retries;
+	int retries;
 	int redirect = 1;
 
 	while (redirect) {
@@ -220,13 +221,12 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
 		if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 			return NULL;
 
-		timeout = rpc->timeout ? rpc->timeout :
-			port->timeout ? port->timeout : madrpc_timeout;
 		retries = port->retries ? port->retries : madrpc_retries;
 
 		if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 				      port->class_agents[rpc->mgtclass],
-				      len, timeout, retries)) < 0) {
+				      len, mad_get_timeout(port, rpc->timeout),
+				      retries)) < 0) {
 			IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 			return NULL;
 		}
@@ -267,7 +267,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc,
 {
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
-	int timeout, retries;
+	int retries;
 
 	memset(sndbuf, 0, umad_size() + IB_MAD_SIZE);
 
@@ -276,13 +276,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
 		return NULL;
 
-	timeout = rpc->timeout ? rpc->timeout :
-	    port->timeout ? port->timeout : madrpc_timeout;
 	retries = port->retries ? port->retries : madrpc_retries;
 
 	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 			      port->class_agents[rpc->mgtclass],
-			      len, timeout, retries)) < 0) {
+			      len, mad_get_timeout(port, rpc->timeout),
+			      retries)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return NULL;
 	}
diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c
index c9a093a..fad1e5b 100644
--- a/libibmad/src/serv.c
+++ b/libibmad/src/serv.c
@@ -73,7 +73,8 @@ int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
 	}
 
 	if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass],
-		      umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) {
+		      umad, IB_MAD_SIZE, mad_get_timeout(srcport, rpc->timeout),
+			0) < 0) {
 		IBWARN("send failed; %m");
 		return -1;
 	}
@@ -155,7 +156,7 @@ int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus,
 
 	if (umad_send
 	    (srcport->port_id, srcport->class_agents[rpc.mgtclass], umad,
-	     IB_MAD_SIZE, rpc.timeout, 0) < 0) {
+	     IB_MAD_SIZE, mad_get_timeout(srcport, rpc.timeout), 0) < 0) {
 		DEBUG("send failed; %m");
 		return -1;
 	}
@@ -174,7 +175,8 @@ void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport)
 	int agent;
 	int length = IB_MAD_SIZE;
 
-	if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) {
+	if ((agent = umad_recv(srcport->port_id, mad, &length,
+			mad_get_timeout(srcport, timeout))) < 0) {
 		if (!umad)
 			umad_free(mad);
 		DEBUG("recv failed: %m");
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Aug  6 18:37:16 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 6 Aug 2009 18:37:16 -0700
Subject: [ofa-general] [PATCH] opensm/complib: account for nsec overflow in
	timeout values
Message-ID: <20090806183716.c08bbea3.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 6 Aug 2009 18:31:46 -0700
Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 opensm/complib/cl_event.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
index d14b2f4..4bc8d37 100644
--- a/opensm/complib/cl_event.c
+++ b/opensm/complib/cl_event.c
@@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
 	} else {
 		/* Get the current time */
 		if (gettimeofday(&curtime, NULL) == 0) {
-			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
-			timeout.tv_nsec =
-			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
+			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
+						* 1000;
+			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
+						+ (n_sec % 1000000000);
+			timeout.tv_nsec = n_sec % 1000000000;
 
 			wait_ret = pthread_cond_timedwait(&p_event->condvar,
 							  &p_event->mutex,
-- 
1.5.4.5
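
For reference, a minimal sketch of the normalization this kind of fix
needs, assuming a helper (deadline_after, a hypothetical name) that
turns "now plus wait_us" into an absolute struct timespec. The carry
into tv_sec has to come from a division; the remainder is what stays in
tv_nsec, which must end up below 1000000000 for
pthread_cond_timedwait().

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Absolute deadline = now + wait_us, with tv_nsec kept in [0, 1e9). */
struct timespec deadline_after(struct timeval now, uint32_t wait_us)
{
	struct timespec ts;
	/* Sub-second parts only: at most 999999 + 999999 microseconds. */
	uint64_t usec = (uint64_t)now.tv_usec + (wait_us % 1000000);

	/* Whole seconds of the wait, plus the carry out of the sum above. */
	ts.tv_sec = now.tv_sec + (wait_us / 1000000) +
		    (time_t)(usec / 1000000);
	/* What is left after the carry, converted to nanoseconds. */
	ts.tv_nsec = (long)(usec % 1000000) * 1000;
	return ts;
}
```

Example: now = 100 s + 900000 us and wait_us = 1200000 gives
tv_sec = 102 and tv_nsec = 100000000.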


From eli at dev.mellanox.co.il  Thu Aug  6 20:26:29 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 7 Aug 2009 06:26:29 +0300
Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device
	personality from node type to port type
In-Reply-To: <69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com>
References: <20090805082808.GB5599@mtls03>
	<73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com>
	<20090806172035.GB32301@mtls03>
	<69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com>
Message-ID: <20090807032629.GA20589@mtls03>

On Thu, Aug 06, 2009 at 10:34:19AM -0700, Sean Hefty wrote:
> 
> Does the implementation allow this?  Right now PDs, CQs, etc are allocated per
> device, not per port.  I'm not immediately concerned about QP failover.
> However, I believe there needs to be some level of coordination between the
> Infiniband side of the CM and the Ethernet side of the CM, since QPs are
> associated with CA GUIDs.  I'm just trying to understand the impact of this
> coordination.
> 

There is nothing in the implementation to prevent it. We did not see a
reason to. The ports share a common node GUID but each one has its own
GIDs.


From eli at dev.mellanox.co.il  Thu Aug  6 20:29:01 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 7 Aug 2009 06:29:01 +0300
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for
	RDMAoE ports
In-Reply-To: <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com>
References: <20090805082910.GE5599@mtls03>
	<376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com>
Message-ID: <20090807032901.GB20589@mtls03>

On Thu, Aug 06, 2009 at 11:05:47AM -0700, Sean Hefty wrote:
> 
> Is there a need to expose QP1 to user space?  The CM is in the kernel, and
> there's not an SA.
> 

Good point. There seems to be no reason to expose it. Will fix.


From eli at dev.mellanox.co.il  Thu Aug  6 20:36:05 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 7 Aug 2009 06:36:05 +0300
Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1
In-Reply-To: <9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com>
References: <20090805082854.GD5599@mtls03>
	<9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com>
Message-ID: <20090807033605.GC20589@mtls03>

On Thu, Aug 06, 2009 at 10:52:34AM -0700, Sean Hefty wrote:
> >+	/* Validate device and port */
> >+	port_priv = ib_get_mad_port(device, port_num);
> >+	if (!port_priv) {
> >+		ret = ERR_PTR(-ENODEV);
> >+		goto error1;
> >+	}
> >+
> >+	if (!port_priv->qp_info[qp_type].qp)
> >+		return NULL;
> 
> It seems odd that the first if has 'goto error1', but the second if simply
> returns NULL. 
> 

The original intention was to release the caller from the need to
decide whether to register the mad agent or not, so the NULL
returned would not be treated as an error. Thinking it over, I realize
that it would be better to let the caller decide (according to the
port protocol) whether or not to register the mad agent. Will fix.


> >@@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent)
> > 	struct ib_mad_agent_private *mad_agent_priv;
> > 	struct ib_mad_snoop_private *mad_snoop_priv;
> >
> >+	if (!mad_agent)
> >+		return 0;
> 
> Why would a kernel client call ib_unregister_mad_agent with a NULL pointer?
> 

Same as above. Goes away after the fix.
> >
> > 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
> >+	has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB;
> >+	if (has_smi)
> >+		cq_size *= 2;
> 
> cq_size is doubled twice
> 

This is a bug - I'll fix it  - thanks.

> I really wish there were a cleaner way to add this support that didn't involve
> adding so many checks throughout the code.  It's hard to know if checks were
> added in all the places that were needed.  I can't think of a clever way to
> handle QP 0.

The fix discussed above will eliminate a good portion of these checks.


From bart.vanassche at gmail.com  Fri Aug  7 00:26:33 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Fri, 7 Aug 2009 09:26:33 +0200
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <adafxc4g1e7.fsf@cisco.com>
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	<adaocqsg339.fsf@cisco.com>
	<e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>
	<adafxc4g1e7.fsf@cisco.com>
Message-ID: <e2e108260908070026s10658adl2c4a9a5b3eba1a08@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:58 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > Are you sure that this indicates a shortcoming in the kobject
>  > debugging code ? The most recent messages related to the message "does
>  > not have a release() function, it is broken and must be fixed" I could
>  > find on the LKML date from July 16, 2009
>  > (http://lkml.org/lkml/2009/7/16/306 and
>  > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH
>  > acknowledges that if this message is logged that this indicates a
>  > problem that should be fixed.
>
> I'm not sure -- I just assume that the core module unloading code is
> working OK, since it is so heavily tested.  If there were really a "must
> be fixed" problem with module unloading then someone would surely have
> hit more than a warning message.

(added Greg KH and the LKML in CC)

I tried to look up more information about kobjects. The comment of
commit 7a6a41615bfb2f03ce797bc24104c50b42c935e5 suggests that in the
past the function kobject_cleanup() did not free the memory allocated
for static kobject names but that this was the responsibility of the
release() function. This should have been fixed in the current version
of kobject_cleanup(). So I'm wondering whether the message that
kobjects that do not have a release() function are broken still makes
sense ?

See also
* http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a6a41615bfb2f03ce797bc24104c50b42c935e5.
* http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.30.y.git;a=blob;f=lib/kobject.c

Bart.


From bart.vanassche at gmail.com  Fri Aug  7 01:31:18 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Fri, 7 Aug 2009 10:31:18 +0200
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
In-Reply-To: <ada1vnohmc0.fsf@cisco.com>
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
	<4A7A949B.60408@panasas.com> <ada1vnohmc0.fsf@cisco.com>
Message-ID: <e2e108260908070131r49dd2d37s8bb36c9365d991e8@mail.gmail.com>

On Thu, Aug 6, 2009 at 7:41 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > Specifically scmnd->host_scribble can just be Zero.
>
> I see at last, thanks!
>
> The issue is that SRP is using host_scribble to hold an index, and index
> 0 is valid for us.
>
> I guess the fix is a bit complex, but basically we should use
> host_scribble to point to the request, and if we don't find a request in
> reset_device we should allocate one.

A fix like the one below ?

--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c	2009-08-03 12:13:11.000000000 +0200
+++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c	2009-08-07 10:23:27.000000000 +0200
@@ -1371,16 +1371,27 @@ out:
 	return -1;
 }

+/**
+ * Look up the struct srp_request that has been associated with the specified
+ * SCSI command by srp_queuecommand().
+ *
+ * Returns 0 upon success and -1 upon failure.
+ */
 static int srp_find_req(struct srp_target_port *target,
 			struct scsi_cmnd *scmnd,
 			struct srp_request **req)
 {
-	if (scmnd->host_scribble == (void *) -1L)
-		return -1;
+	/*
+	 * The code below will only work if SRP_RQ_SIZE is a power of two,
+	 * so check this first.
+	 */
+	BUILD_BUG_ON((SRP_RQ_SIZE ^ (SRP_RQ_SIZE - 1))
+		     != (SRP_RQ_SIZE | (SRP_RQ_SIZE - 1)));

-	*req = &target->req_ring[(long) scmnd->host_scribble];
+	*req = &target->req_ring[(long)scmnd->host_scribble
+				 & (SRP_RQ_SIZE - 1)];

-	return 0;
+	return (*req)->scmnd == scmnd ? 0 : -1;
 }

 static int srp_abort(struct scsi_cmnd *scmnd)
@@ -1423,8 +1434,15 @@ static int srp_reset_device(struct scsi_

 	if (target->qp_in_error)
 		return FAILED;
-	if (srp_find_req(target, scmnd, &req))
-		return FAILED;
+	if (srp_find_req(target, scmnd, &req)) {
+		/*
+		 * scmnd has not yet been queued -- queue it now. This can
+		 * happen e.g. when a SG_SCSI_RESET ioctl has been issued.
+		 */
+		if (srp_queuecommand(scmnd, scmnd->scsi_done)
+		    || srp_find_req(target, scmnd, &req))
+			return FAILED;
+	}
 	if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET))
 		return FAILED;
 	if (req->tsk_status)
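
The lookup idea in the patch above can be tried outside the kernel with
the sketch below. Everything in it is a stand-in: RING_SIZE plays the
role of SRP_RQ_SIZE (the value 64 is chosen for illustration), and
cmd_sketch / req_sketch replace struct scsi_cmnd and struct
srp_request. The index is masked into the ring and the slot is only
accepted if it points back at the command, so a command whose scribble
field is still zero is reported as "not found" instead of being
mistaken for slot 0.

```c
#include <stddef.h>

#define RING_SIZE 64	/* stand-in for SRP_RQ_SIZE (assumed value) */

/* The identity used in the BUILD_BUG_ON() above: n and n - 1 share no
 * bits exactly when n is a power of two (it also holds for n == 0). */
#define IS_POW2(n) (((n) ^ ((n) - 1)) == ((n) | ((n) - 1)))

_Static_assert(IS_POW2(RING_SIZE), "RING_SIZE must be a power of two");

struct cmd_sketch {
	long scribble;		/* ring index stored when queued */
};

struct req_sketch {
	struct cmd_sketch *cmd;	/* command that owns this slot, or NULL */
};

/* Returns 0 and sets *req to the slot owned by cmd, or -1 if the slot
 * the index points at belongs to another command or to none. */
int find_req_sketch(struct req_sketch *ring, struct cmd_sketch *cmd,
		    struct req_sketch **req)
{
	*req = &ring[cmd->scribble & (RING_SIZE - 1)];
	return (*req)->cmd == cmd ? 0 : -1;
}
```

The mask only keeps the index inside the ring; the ownership check is
what tells a queued command apart from one that was never queued.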


From bart.vanassche at gmail.com  Fri Aug  7 02:58:11 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Fri, 7 Aug 2009 11:58:11 +0200
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion 
	dependency detected
In-Reply-To: <adaws5gg71x.fsf@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
Message-ID: <e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>

On Thu, Aug 6, 2009 at 7:56 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > After having applied this patch it took somewhat longer before a
>  > locking inversion report was generated, but unfortunately there still
>  > was a locking inversion report generated (see also
>  > http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details):
>
> ummm, yikes...
>
> can you apply the hack patch I sent originally to take priv->lock from
> an interrupt ASAP and try that along with the fix patch to drop
> priv->lock before calling ipoib_send()?  That might make the lockdep
> trace understandable.

The lockdep report I obtained this morning with a 2.6.30.4 kernel and
the two patches applied has been attached to the kernel bugzilla
entry. This lockdep report was generated while testing the SRPT target
software. I have double checked that the SRPT target implementation
does not hold any spinlocks or mutexes while calling functions in the
IB core. This means that the SRPT target code cannot have caused any
of the reported lock cycles.

By the way, I noticed that while many subsystems in the Linux kernel
use event queues to report information to higher software layers,
the IB core makes extensive use of callback functions. The combination
of nested locking and callback functions can easily lead to lock
inversion. This effect is well known in the operating system world --
see e.g. the talk by John Ousterhout about multithreaded versus
event-driven software (http://home.pacbell.net/ouster/threads.pdf,
1996).

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.30.4-scst-debug #2
---------------------------------------------------------
[ ... ]
stack backtrace:
Pid: 26040, comm: cc1 Not tainted 2.6.30.4-scst-debug #2
Call Trace:
 <IRQ>  [<ffffffff80272bec>] print_irq_inversion_bug+0x14c/0x1c0
 [<ffffffff80272cdd>] check_usage_forwards+0x7d/0xc0
 [<ffffffff80271faf>] mark_lock+0x20f/0x6a0
 [<ffffffff80272c60>] ? check_usage_forwards+0x0/0xc0
 [<ffffffff802743e4>] __lock_acquire+0xce4/0x1c80
 [<ffffffff802713bd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff80249305>] ? release_console_sem+0x1e5/0x230
 [<ffffffff80249919>] ? vprintk+0x2e9/0x480
 [<ffffffff80275488>] lock_acquire+0x108/0x150
 [<ffffffffa043f5a2>] ? ib_cm_notify+0x102/0x2c0 [ib_cm]
 [<ffffffff80515371>] _spin_lock_irqsave+0x41/0x60
 [<ffffffffa043f5a2>] ? ib_cm_notify+0x102/0x2c0 [ib_cm]
 [<ffffffffa043f5a2>] ib_cm_notify+0x102/0x2c0 [ib_cm]
 [<ffffffffa06a6e1e>] srpt_qp_event+0x4e/0x140 [ib_srpt]
 [<ffffffffa02656aa>] mlx4_ib_qp_event+0x7a/0xf0 [mlx4_ib]
 [<ffffffffa04c5e0f>] mlx4_qp_event+0x6f/0xe0 [mlx4_core]
 [<ffffffffa04bd659>] mlx4_eq_int+0x289/0x2e0 [mlx4_core]
 [<ffffffffa04bd79a>] mlx4_msi_x_interrupt+0x6a/0x90 [mlx4_core]
 [<ffffffff8028bf35>] handle_IRQ_event+0x95/0x200
 [<ffffffff8028e3d8>] handle_edge_irq+0xc8/0x170
 [<ffffffff8020eeef>] handle_irq+0x1f/0x30
 [<ffffffff8020e5fe>] do_IRQ+0x6e/0xf0
 [<ffffffff8020c913>] ret_from_intr+0x0/0xf
 <EOI> <6>

Bart.


From vlad at lists.openfabrics.org  Fri Aug  7 03:07:20 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri,  7 Aug 2009 03:07:20 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090807-0200 daily build status
Message-ID: <20090807100720.50569E61CFA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hnrose at comcast.net  Fri Aug  7 04:08:11 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 7 Aug 2009 07:08:11 -0400
Subject: [ofa-general] [PATCHv3] opensm: Parallelize (Stripe) LFT sets across
	switches
Message-ID: <20090807110811.GA23431@comcast.net>


Currently, MADs are pipelined to a single switch at a time which
effectively serializes these requests due to processing at the SMA.
This patch pipelines (stripes) them across the switches first before
proceeding with successive blocks. As a result of this striping,
multiple switches can process the set and respond concurrently
which results in an improvement to the subnet initialization time.

All unicast routing protocols are updated for this.

A similar subsequent change will do this for MFTs.

Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:

The test cluster was small: 17 IS4 switches and 11 HCAs. To
artificially enlarge it, an LMC of 7 was used, including the
EnhancedSwitchPort 0 LMC.

With the new code, LFT configuration is more than twice as
fast as with the old code :)
The current ucast manager ran for ~250 msec on average; with the
new code it ran for 110-120 msec.

Routing calculation phase of the ucast manager took ~1200 usec,
the rest was sending the blocks and waiting for no more pending
transactions.

Here are some detailed results of different executions (the
number on the left is timer value in usec):

Current ucast manager (w/o the optimization):

000000 [LFT]: osm_ucast_mgr_process() - START
001131 [LFT]: ucast_mgr_process_tbl() - START
032251 [LFT]: ucast_mgr_process_tbl() - END
032263 [LFT]: osm_ucast_mgr_process() - END
253416 [LFT]: Done wait_for_pending_transactions()

New algorithm:

001417 [LFT]: osm_ucast_mgr_process() - START
002690 [LFT]: ucast_mgr_process_tbl() - START
032946 [LFT]: ucast_mgr_process_tbl() - END
032948 [LFT]: osm_ucast_pipeline_tbl() - START
033846 [LFT]: osm_ucast_pipeline_tbl() - END
033858 [LFT]: osm_ucast_mgr_process() - END
108203 [LFT]: Done wait_for_pending_transactions()

With IS3 based Qlogic switches, which do not handle DR packet forwarding
in HW, with a fabric of ~1100 HCAs, ~280 switches:

Current OSM configures LFTs in ~2 seconds.
New algorithm does the same job in 1.4-1.6 seconds (20%-30% less time).

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v2:
Eliminated max_smps_per_node
Moved LFTs pushing up to ucast_mgr_route level from the individual routing engines

Changes since v1:
Added Yevgeny's performance data
No change to actual patch

diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
index a040476..4ef045c 100644
--- a/opensm/include/opensm/osm_ucast_mgr.h
+++ b/opensm/include/opensm/osm_ucast_mgr.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -242,16 +242,12 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN struct osm_sm * sm);
 *
 * SYNOPSIS
 */
-int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
-				IN osm_switch_t * const p_sw);
+void osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr);
 /*
 * PARAMETERS
 *	p_mgr
 *		[in] Pointer to an osm_ucast_mgr_t object.
 *
-*	p_mgr
-*		[in] Pointer to an osm_switch_t object.
-*
 * SEE ALSO
 *	Unicast Manager
 *********/
diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
index 216b496..30a3c1d 100644
--- a/opensm/opensm/osm_ucast_cache.c
+++ b/opensm/opensm/osm_ucast_cache.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2008      Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -1085,9 +1085,10 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr)
 			memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
 		}
 
-		osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
 	}
 
+	osm_ucast_mgr_set_fwd_table(p_mgr);
+
 	return 0;
 }
 
diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c
index 2505c46..5b73ca5 100644
--- a/opensm/opensm/osm_ucast_file.c
+++ b/opensm/opensm/osm_ucast_file.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2008      Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -167,9 +167,6 @@ static int do_ucast_file_load(void *context)
 				"skipping parsing. Using default "
 				"routing algorithm\n");
 		} else if (!strncmp(p, "Unicast lids", 12)) {
-			if (p_sw)
-				osm_ucast_mgr_set_fwd_table(&p_osm->sm.
-							    ucast_mgr, p_sw);
 			q = strstr(p, " guid 0x");
 			if (!q) {
 				OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
@@ -220,7 +217,7 @@ static int do_ucast_file_load(void *context)
 				return -1;
 			}
 			p = q;
-			/* additionally try to exract guid */
+			/* additionally try to extract guid */
 			q = strstr(p, " portguid 0x");
 			if (!q) {
 				OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
@@ -246,9 +243,6 @@ static int do_ucast_file_load(void *context)
 		}
 	}
 
-	if (p_sw)
-		osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
-
 	fclose(file);
 	return 0;
 }
diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index bde6dbd..6ec6bc7 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -2,7 +2,7 @@
  * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -1905,8 +1905,6 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
 	ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
 
 	p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
-	osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
-				    p_sw->p_osm_sw);
 }
 
 /***************************************************
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index b3107f0..0a567b3 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
@@ -990,7 +990,6 @@ static void populate_fwd_tbls(lash_t * p_lash)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
 	osm_subn_t *p_subn = &p_lash->p_osm->subn;
-	osm_opensm_t *p_osm = p_lash->p_osm;
 	osm_switch_t *p_sw, *p_next_sw, *p_dst_sw;
 	osm_port_t *port;
 	uint16_t max_lid_ho, lid;
@@ -1054,7 +1053,6 @@ static void populate_fwd_tbls(lash_t * p_lash)
 					physical_egress_port);
 			}
 		}		/* for */
-		osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
 	}
 	OSM_LOG_EXIT(p_log);
 }
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 78a7031..e28752a 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -315,16 +315,13 @@ Exit:
 
 /**********************************************************************
  **********************************************************************/
-int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
-				IN osm_switch_t * p_sw)
+static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw)
 {
 	osm_node_t *p_node;
 	osm_dr_path_t *p_path;
 	osm_madw_context_t context;
 	ib_api_status_t status;
 	ib_switch_info_t si;
-	uint16_t block_id_ho = 0;
-	uint8_t block[IB_SMP_DATA_SIZE];
 	boolean_t set_swinfo_require = FALSE;
 	uint16_t lin_top;
 	uint8_t life_state;
@@ -382,48 +379,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
 				ib_get_err_str(status));
 	}
 
-	/*
-	   Send linear forwarding table blocks to the switch
-	   as long as the switch indicates it has blocks needing
-	   configuration.
-	 */
-
-	context.lft_context.node_guid = osm_node_get_node_guid(p_node);
-	context.lft_context.set_method = TRUE;
-
-	if (!p_sw->new_lft) {
-		/* any routing should provide the new_lft */
-		CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
-			  p_mgr->cache_valid && !p_sw->need_update);
-		goto Exit;
-	}
-
-	for (block_id_ho = 0;
-	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
-	     block_id_ho++) {
-		if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
-		    !memcmp(block,
-			    p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
-			    IB_SMP_DATA_SIZE))
-			continue;
-
-		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
-			"Writing FT block %u\n", block_id_ho);
-
-		status = osm_req_set(p_mgr->sm, p_path,
-				     p_sw->new_lft +
-				     block_id_ho * IB_SMP_DATA_SIZE,
-				     sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL,
-				     cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
-				     &context);
-
-		if (status != IB_SUCCESS)
-			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
-				"Sending linear fwd. tbl. block failed (%s)\n",
-				ib_get_err_str(status));
-	}
-
-Exit:
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return 0;
 }
@@ -508,7 +463,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 		}
 	}
 
-	osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
+	set_fwd_tbl_top(p_mgr, p_sw);
 
 	if (p_mgr->p_subn->opt.lmc)
 		free_ports_priv(p_mgr);
@@ -516,6 +471,101 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 	OSM_LOG_EXIT(p_mgr->p_log);
 }
 
+static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item,
+				  IN void *context)
+{
+	osm_ucast_mgr_t *p_mgr = context;
+	osm_switch_t *const p_sw = (osm_switch_t *) p_map_item;
+
+	set_fwd_tbl_top(p_mgr, p_sw);
+}
+
+static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm,
+				    IN uint8_t * p_block,
+				    IN osm_dr_path_t * p_path,
+				    IN uint16_t block_id_ho,
+				    IN osm_madw_context_t * p_context)
+{
+	ib_api_status_t status;
+	boolean_t sts;
+
+	OSM_LOG_ENTER(p_sm->p_log);
+
+	for (;
+	     (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block));
+	     block_id_ho++) {
+		if (!p_sw->need_update && !p_sm->p_subn->need_update &&
+		    !memcmp(p_block,
+			    p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
+			    IB_SMP_DATA_SIZE))
+			continue;
+
+		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+			"Writing FT block %u to switch 0x%" PRIx64 "\n",
+			block_id_ho,
+			cl_ntoh64(p_context->lft_context.node_guid));
+
+		status = osm_req_set(p_sm, p_path,
+				     p_sw->new_lft +
+				     block_id_ho * IB_SMP_DATA_SIZE,
+				     IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
+				     cl_hton32(block_id_ho),
+				     CL_DISP_MSGID_NONE, p_context);
+
+		if (status != IB_SUCCESS)
+			OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: "
+				"Sending linear fwd. tbl. block failed (%s)\n",
+				ib_get_err_str(status));
+		break;
+	}
+
+	OSM_LOG_EXIT(p_sm->p_log);
+	return sts;
+}
+
+static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw,
+					 IN osm_ucast_mgr_t *p_mgr,
+					 IN uint16_t block_id_ho)
+{
+	osm_dr_path_t *p_path;
+	osm_madw_context_t context;
+	uint8_t block[IB_SMP_DATA_SIZE];
+	boolean_t status;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+
+	CL_ASSERT(p_sw && p_sw->p_node);
+
+	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+		"Processing switch 0x%" PRIx64 "\n",
+		cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
+
+	/*
+	   Send linear forwarding table blocks to the switch
+	   as long as the switch indicates it has blocks needing
+	   configuration.
+	 */
+	if (!p_sw->new_lft) {
+		/* any routing should provide the new_lft */
+		CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
+			  p_mgr->cache_valid && !p_sw->need_update);
+		status = FALSE;
+		goto Exit;
+	}
+
+	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
+
+	context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node);
+	context.lft_context.set_method = TRUE;
+
+	status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path,
+				    block_id_ho, &context);
+
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+	return status;
+}
+
 /**********************************************************************
  **********************************************************************/
 static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
@@ -731,7 +781,6 @@ static int ucast_mgr_setup_all_switches(osm_subn_t * p_subn)
 
 /**********************************************************************
  **********************************************************************/
-
 static int add_guid_to_order_list(void *ctx, uint64_t guid, char *p)
 {
 	osm_ucast_mgr_t *m = ctx;
@@ -870,6 +919,30 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m)
 		add_sw_endports_to_order_list(s[i], m);
 }
 
+static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr)
+{
+	cl_qmap_t *p_sw_tbl;
+	osm_switch_t *p_sw;
+	uint16_t block_id_ho = 0;
+	int sws_notdone;
+	boolean_t sts;
+
+	p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
+	while (1) {
+		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+		sws_notdone = 0;		
+		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
+			sts = pipeline_next_lft_block(p_sw, p_mgr, block_id_ho);
+			if (sts)
+				sws_notdone++;
+			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
+		}
+		if (!sws_notdone)
+			break;
+		block_id_ho++;
+	}
+}
+
 static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 {
 	cl_qlist_init(&p_mgr->port_order_list);
@@ -904,6 +977,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
 			   p_mgr);
 
+	ucast_mgr_pipeline_fwd_tbl(p_mgr);
+
 	cl_qlist_remove_all(&p_mgr->port_order_list);
 
 	return 0;
@@ -911,6 +986,16 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 
 /**********************************************************************
  **********************************************************************/
+void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr)
+{
+	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
+			   ucast_mgr_process_top, p_mgr);
+
+	ucast_mgr_pipeline_fwd_tbl(p_mgr);
+}
+
+/**********************************************************************
+ **********************************************************************/
 static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
 {
 	int ret;
@@ -940,6 +1025,9 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
 
 	osm->routing_engine_used = osm_routing_engine_type(r->name);
 
+	if (r->ucast_build_fwd_tables)
+		osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
+
 	return 0;
 }
 

From dzieko at wcss.pl  Fri Aug  7 04:25:26 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Fri, 7 Aug 2009 13:25:26 +0200
Subject: [ofa-general] ib0: multicast join failed
Message-ID: <20090807112526.GD21691@cefeid.wcss.wroc.pl>

Hi,

today I got the following:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11

and connection to Lustre was lost.

I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
address.
There is plenty of free mem so this is not the oom-killer case.
There are no other noticeable problems with this host.

Is it a hardware problem with IB iface?
regards, P

# ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:                      0030:487e:0c06:0000
        sys_image_guid:                 0030:487e:0c06:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25204
        hw_ver:                 0xA0
        board_id:               SM_0000000003
        phys_port_cnt:                  1
                port:   1
                  state:                        PORT_INIT (2)
                  max_mtu:              2048 (4)
                  active_mtu:           2048 (4)
                  sm_lid:                       1
                  port_lid:             448
                  port_lmc:             0x00


# ofed_info 
OFED-1.3.1
libibverbs:
git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
libmthca:
git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
commit 9501e698d257949acfab2edc90812602966dbcc9
libmlx4:
git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
commit 3869d6dab7e12fe452270ca641f7dd7082b42482
libehca:
git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3
commit fd898180cfa3b737f893f432a80b91bac3396325
libipathverbs:
git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3
commit 82be4d81859d1fd2edf830220fe65a9923b80a46
libcxgb3:
git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3
commit 6f7485feb244d8571fcab2292ef92c97bea48df0
libnes:
git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3
commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3
libibcm:
git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3
commit 53ec35f544bbc1838bbadc2210909c25a954a5e2
librdmacm:
git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3
commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1
dapl1:
git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3
commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779
dapl2:
git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3
commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177
libsdp:
git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3
commit c8102dccc502930442b23de658674d386456b350
sdpnetstat:
git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3
commit 3341620a7259c4f7bdd4180864b98e260c3dc223
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d
perftest:
git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3
commit 6321b5468f7293088cc003809049c02b176130d8
qlvnictools:
git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3
commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd
tvflash:
git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3
commit f5e7407a7f2058448df5e5320d9843f944427429
mstflint:
git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3
commit 78bbd3d521a9078553a991111ffb6f76665b9ee9
qperf:
git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3
commit 6221aabd038df0b7033e035378ca190641ed2295
management:
git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3
commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf
ibutils:
git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3
commit 7daf94fab6eaf307316326f3f49704e6080a1508
ibsim:
git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3
commit 55113d9f919709c7c97ea41d29991941b9c8be70

ofa_kernel-1.3.1:
Git:
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c

# MPI
mvapich-1.0.1-2533.src.rpm
mvapich2-1.0.3-1.src.rpm
openmpi-1.2.6-1.src.rpm
mpitests-3.0-773.src.rpm


-- 
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl


From yosefe at voltaire.com  Fri Aug  7 05:04:25 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Fri, 07 Aug 2009 15:04:25 +0300
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <20090807112526.GD21691@cefeid.wcss.wroc.pl>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl>
Message-ID: <4A7C1849.6030500@voltaire.com>

On 07/08/09 14:25, Pawel Dziekonski wrote:
> Hi,
> 
> today I got the following:
> 
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> 
> and connection to Lustre was lost.
> 
> I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
> address.
> There is plenty of free mem so this is not the oom-killer case.
> There are no other noticeable problems with this host.
> 
> Is it a hardware problem with IB iface?
> regards, P
> 


Is your SM alive?


From dzieko at wcss.pl  Fri Aug  7 05:12:51 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Fri, 7 Aug 2009 14:12:51 +0200
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <4A7C1849.6030500@voltaire.com>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl>
	<4A7C1849.6030500@voltaire.com>
Message-ID: <20090807121251.GE21691@cefeid.wcss.wroc.pl>

On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
> On 07/08/09 14:25, Pawel Dziekonski wrote:
> > Hi,
> > 
> > today I got the following:
> > 
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > 
> > and connection to Lustre was lost.
> > 
> > I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
> > address.
> > There is plenty of free mem so this is not the oom-killer case.
> > There are no other noticeable problems with this host.
> > 
> > Is it a hardware problem with IB iface?
> 
> Is your SM alive?

Well, this is a good question.

My SM is on Voltaire ISR2012 switch. Today I lost contact with its web
interface - I don't know why. CLI works fine. Net itself works too. So
I assume that SM works.

L:ISR2012-0004(utilities)# sminfo -m -e
[1249654331:530476][32061] => _do_madrpc: timeout after 3 retries, 600 ms
sm_lid:..........................1
sm_guid:.........................0x8f10500000007
sm_key:..........................0x0
sm_activity:.....................574044927
sm_priority:.....................14
sm_state:........................SMINFO_MASTER
nodeip:..........................
nodename:........................
node_guid:.......................0x8f10500000007
devid:...........................0x5a37
vendor:..........................0x8f1
node_desc:.......................ISR2012 Voltaire sFB-2012
node_type:.......................Switch
localport:.......................0

L:ISR2012-0004(utilities)# port-verify -b
[1249653810:282614][26657] => _do_madrpc: timeout after 3 retries, 600 ms
[1249653810:283115][26657] => madrpc: failed class 129 method 1 attr 17 DR Path: 0,18,24,13
[1249653810:283585][26657] => discover: Nodeinfo on 0,18,24,13 port 13 failed, skipping port
#
# Topology file: generated on Fri Aug  7 14:03:34 2009
#
Printing Chassis 1 (chassis guid 0x0008f10500000004)

devid=0x5a38
switchguids=0x8f104003f680a Chassis ISR2012 1 Line  9  Chip 1
Switch  24 "S-0008f104003f680a"         # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 192
[13][ext 13] "S-0008f10400413b08"[11] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

devid=0x5a30
switchguids=0x8f104004136c0
Switch  24 "S-0008f104004136c0"         # "ISR9024D Voltaire" smalid 209
[22] "S-0008f104003f680a"[22] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

devid=0x5a30
switchguids=0x8f104004136b0
Switch  24 "S-0008f104004136b0"         # "ISR9024D Voltaire" smalid 204
[13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs
errs.sym:........................752 <- Alert !!!
[24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs
errs.sym:........................1 <- Alert !!!
errs.rcv:........................1 <- Alert !!!

devid=0x5a30
switchguids=0x8f10400413b08
Switch  24 "S-0008f10400413b08"         # "ISR9024D Voltaire" smalid 224
[13]     Alert -> Could not access this port Remote Peer.

devid=0xb924
switchguids=0xb8cffff002cc7
Switch  24 "S-000b8cffff002cc7"         # "MT47396 Infiniscale-III Mellanox Technologies" smalid 246
[12] "S-0008f104004136b0"[13] width 4X speed 5.0 Gbs
errs.sym:........................4 <- Alert !!!

devid=0x6732
hcaguids=0x2c90300031878
Hca     2 "H-0002c90300031878"          # "oss1 HCA-1"
[1] "S-0008f104004136c0"[9]     # lid 169 lmc 0 width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

SUMMARY: ALARM [found - 5 bad_nodes and 6 bad_ports].

(VL15Dropped errors masked out)

-- 
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl


From yosefe at voltaire.com  Fri Aug  7 06:34:20 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Fri, 07 Aug 2009 16:34:20 +0300
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <20090807121251.GE21691@cefeid.wcss.wroc.pl>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl>	<4A7C1849.6030500@voltaire.com>
	<20090807121251.GE21691@cefeid.wcss.wroc.pl>
Message-ID: <4A7C2D5C.1030005@voltaire.com>

On 07/08/09 15:12, Pawel Dziekonski wrote:
> On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
>> On 07/08/09 14:25, Pawel Dziekonski wrote:
>>> Hi,
>>>
>>> today I got the following:
>>>
>>> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>>
>>> and connection to Lustre was lost.
>>>
>>> I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
>>> address.
>>> There is plenty of free mem so this is not the oom-killer case.
>>> There are no other noticeable problems with this host.
>>>
>>> Is it a hardware problem with IB iface?
>> Is your SM alive?
> 
> Well, this is a good question.
> 
> My SM is on Voltaire ISR2012 switch. Today I lost contact with its web
> interface - I don't know why. CLI works fine. Net itself works too. So
> I assume that SM works.
> 

I guess there is some physical problem in the fabric (cables?) because the host
cannot reach the SM - the port is in INIT state.


From hnrose at comcast.net  Fri Aug  7 06:43:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 7 Aug 2009 09:43:05 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c:
	osm_mcast_tbl_get_block returns boolean
Message-ID: <20090807134305.GA30766@comcast.net>


so use TRUE/FALSE rather than IB_INVALID_PARAMETER

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 82850be..38c06c1 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
 	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
-		return (IB_INVALID_PARAMETER);
+		return (TRUE);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
 		p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];


From weiny2 at llnl.gov  Fri Aug  7 09:07:03 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 7 Aug 2009 09:07:03 -0700
Subject: [ofa-general] [PATCH] libibnetdisc: fix potential memory leak of
	port object
Message-ID: <20090807090703.2b857dea.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 7 Aug 2009 09:05:44 -0700
Subject: [PATCH] libibnetdisc: fix potential memory leak of port object

	NOTE: This moves the port allocation below the port array allocation
	failure check, rather than freeing the port after the port array
	allocation fails.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index b9e89d9..27ae9f3 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -446,14 +446,6 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd
 {
 	struct ibnd_port *port;
 
-	port = malloc(sizeof(*port));
-	if (!port)
-		return NULL;
-
-	memcpy(port, temp, sizeof(*port));
-	port->port.node = (ibnd_node_t *)node;
-	port->port.ext_portnum = 0;
-
 	if (node->node.ports == NULL) {
 		node->node.ports = calloc(sizeof(*node->node.ports), node->node.numports + 1);
 		if (!node->node.ports) {
@@ -462,6 +454,14 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd
 		}
 	}
 
+	port = malloc(sizeof(*port));
+	if (!port)
+		return NULL;
+
+	memcpy(port, temp, sizeof(*port));
+	port->port.node = (ibnd_node_t *)node;
+	port->port.ext_portnum = 0;
+
 	node->node.ports[temp->port.portnum] = (ibnd_port_t *)port;
 
 	add_to_portguid_hash(port, fabric->portstbl);
-- 
1.5.4.5
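
The ordering idea in the NOTE above can be sketched outside the library.
The types and the add_port() name below are simplified, hypothetical
stand-ins for the ibnetdisc structures; the point is only that the
allocation which can fail without cleanup is done first:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the ibnetdisc types. */
struct port {
	int portnum;
};

struct node {
	struct port **ports;   /* lazily allocated, indexed by port number */
	int numports;
};

/*
 * Allocate the per-node array first.  If that fails, no port object
 * exists yet, so there is nothing to free on the error path.
 * Assumes 0 <= temp->portnum <= node->numports.
 */
static struct port *add_port(struct node *node, const struct port *temp)
{
	struct port *port;

	if (node->ports == NULL) {
		node->ports = calloc(node->numports + 1, sizeof(*node->ports));
		if (!node->ports)
			return NULL;
	}

	port = malloc(sizeof(*port));
	if (!port)
		return NULL;

	memcpy(port, temp, sizeof(*port));
	node->ports[temp->portnum] = port;
	return port;
}
```

The alternative (allocate the port first and free it when the array
allocation fails) is equally leak-free; reordering just keeps the error
paths free of cleanup code.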


From hnrose at comcast.net  Fri Aug  7 09:41:27 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 7 Aug 2009 12:41:27 -0400
Subject: [ofa-general] [PATCH] opensm: Parallelize (Stripe) MFT sets across
	switches
Message-ID: <20090807164127.GA795@comcast.net>


Similar to previous patch to "Parallelize (Stripe) LFT sets across switches".
Currently, MADs are pipelined to a single switch first which effectively
serializes these requests. This patch pipelines the MFT set MADs across
switches first (before cycling to the next MFT block) so that multiple
switches can be responding concurrently. Speedup is dependent on number
of MFT blocks in use (number of MLIDs) which is dependent on the number
of multicast groups.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 7ce28c5..e281842 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -103,6 +103,8 @@ typedef struct osm_switch {
 	uint8_t *lft;
 	uint8_t *new_lft;
 	osm_mcast_tbl_t mcast_tbl;
+	uint32_t mft_block_num;
+	uint32_t mft_position;
 	unsigned endport_links;
 	unsigned need_update;
 	void *priv;
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index 4dbbaa0..f91c6b6 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  *
@@ -325,15 +325,12 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
 {
 	osm_node_t *p_node;
 	osm_dr_path_t *p_path;
-	osm_madw_context_t mad_context;
+	osm_madw_context_t context;
 	ib_api_status_t status;
-	uint32_t block_id_ho = 0;
-	int16_t block_num = 0;
-	uint32_t position = 0;
-	uint32_t max_position;
+	uint32_t block_id_ho;
 	osm_mcast_tbl_t *p_tbl;
 	ib_net16_t block[IB_MCAST_BLOCK_SIZE];
-	int ret = 0;
+	int ret = -1;
 
 	CL_ASSERT(sm);
 
@@ -353,36 +350,34 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
 	   configuration.
 	 */
 
-	mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node);
-	mad_context.mft_context.set_method = TRUE;
+	context.mft_context.node_guid = osm_node_get_node_guid(p_node);
+	context.mft_context.set_method = TRUE;
 
 	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
-	max_position = p_tbl->max_position;
 
-	while (osm_mcast_tbl_get_block(p_tbl, block_num,
-				       (uint8_t) position, block)) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-			"Writing MFT block 0x%X\n", block_id_ho);
+	if (p_sw->mft_position <= p_tbl->max_position &&
+	    osm_mcast_tbl_get_block(p_tbl, p_sw->mft_block_num,
+				    (uint8_t) p_sw->mft_position, block)) {
+
+		block_id_ho = p_sw->mft_block_num + (p_sw->mft_position << 28);
 
-		block_id_ho = block_num + (position << 28);
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Writing MFT block %u position %u to switch 0x%" PRIx64 "\n",
+			p_sw->mft_block_num, p_sw->mft_position,
+			cl_ntoh64(context.lft_context.node_guid));
 
 		status = osm_req_set(sm, p_path, (void *)block, sizeof(block),
 				     IB_MAD_ATTR_MCAST_FWD_TBL,
 				     cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
-				     &mad_context);
+				     &context);
 
-		if (status != IB_SUCCESS) {
+		if (status != IB_SUCCESS)
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: "
-				"Sending multicast fwd. tbl. block failed (%s)\n",
+				"Sending MFT block failed (%s)\n",
 				ib_get_err_str(status));
-			ret = -1;
-		}
 
-		if (++position > max_position) {
-			position = 0;
-			block_num++;
-		}
-	}
+	} else
+		ret = 0;
 
 	OSM_LOG_EXIT(sm->p_log);
 	return ret;
@@ -1077,7 +1072,8 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 	cl_qmap_t *p_sw_tbl;
 	cl_qlist_t *p_list = &sm->mgrp_list;
 	osm_mgrp_t *p_mgrp;
-	int i, ret = 0;
+	osm_mcast_tbl_t *p_tbl;
+	int sws_notdone, i, ret = 0;
 
 	OSM_LOG_ENTER(sm->p_log);
 
@@ -1114,11 +1110,30 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 	 */
 	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
 	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
-		if (mcast_mgr_set_tbl(sm, p_sw))
-			ret = -1;
+		p_sw->mft_block_num = 0;
+		p_sw->mft_position = 0;
 		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
 	}
 
+	while (1) {
+		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+		sws_notdone = 0;
+		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
+			if (mcast_mgr_set_tbl(sm, p_sw))
+				sws_notdone++;
+			p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
+			if (++p_sw->mft_position > p_tbl->max_position) {
+				p_sw->mft_position = 0;
+				p_sw->mft_block_num++;
+			}
+			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
+		}
+		if (!sws_notdone) {
+			ret = -1;
+			break;
+		}
+	}
+
 	while (!cl_is_qlist_empty(p_list)) {
 		cl_list_item_t *p = cl_qlist_remove_head(p_list);
 		free(p);
@@ -1142,9 +1157,10 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 	osm_switch_t *p_sw;
 	cl_qmap_t *p_sw_tbl;
 	osm_mgrp_t *p_mgrp;
+	osm_mcast_tbl_t *p_tbl;
 	ib_net16_t mlid;
 	osm_mcast_mgr_ctxt_t *ctx;
-	int ret = 0;
+	int sws_notdone, ret = 0;
 
 	OSM_LOG_ENTER(sm->p_log);
 
@@ -1195,11 +1211,30 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 	p_sw_tbl = &sm->p_subn->sw_guid_tbl;
 	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
 	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
-		if (mcast_mgr_set_tbl(sm, p_sw))
-			ret = -1;
+		p_sw->mft_block_num = 0;
+		p_sw->mft_position = 0;
 		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
 	}
 
+	while (1) {
+		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+		sws_notdone = 0;
+		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
+			if (mcast_mgr_set_tbl(sm, p_sw))
+				sws_notdone++;
+			p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
+			if (++p_sw->mft_position > p_tbl->max_position) {
+				p_sw->mft_position = 0;
+				p_sw->mft_block_num++;
+			}
+			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
+		}
+		if (!sws_notdone) {
+			ret = -1;
+			break;
+		}
+	}
+
 	osm_dump_mcast_routes(sm->p_subn->p_osm);
 
 exit:
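
The send order described in the patch text (one MFT block per switch per
pass, rather than all blocks of one switch first) can be sketched as a
small round-robin. This is a simplified model under stated assumptions:
it tracks only a block cursor per switch and ignores the position
dimension, and struct sw, struct send and stripe_order() are hypothetical
names, not OpenSM code:

```c
#include <stddef.h>

/* Hypothetical per-switch state: total blocks and a cursor. */
struct sw {
	unsigned nblocks;
	unsigned next;
};

/* One recorded send: which switch, which block. */
struct send {
	unsigned sw;
	unsigned block;
};

/*
 * Round-robin over the switches, emitting at most one block per switch
 * per pass, until a full pass emits nothing.  Returns the number of
 * sends recorded in out (never more than cap).
 */
static size_t stripe_order(struct sw *sws, size_t nsw,
			   struct send *out, size_t cap)
{
	size_t n = 0;
	int sent;

	for (size_t i = 0; i < nsw; i++)
		sws[i].next = 0;

	do {
		sent = 0;
		for (size_t i = 0; i < nsw; i++) {
			if (sws[i].next >= sws[i].nblocks)
				continue;        /* this switch is finished */
			if (n == cap)
				return n;
			out[n].sw = (unsigned)i;
			out[n].block = sws[i].next++;
			n++;
			sent = 1;
		}
	} while (sent);

	return n;
}
```

For three switches with 2, 1 and 3 blocks, the recorded order is block 0
of each switch, then block 1 of the switches that still have one, and so
on, so consecutive requests go to different switches wherever possible.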


From rdreier at cisco.com  Fri Aug  7 11:22:10 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 11:22:10 -0700
Subject: [ofa-general] Re: [PATCH] mlx4_core: map sufficient ICM memory for
	EQs
In-Reply-To: <20090730130434.GA21428@mtls03> (Eli Cohen's message of "Thu, 30
	Jul 2009 16:04:34 +0300")
References: <20090730130434.GA21428@mtls03>
Message-ID: <adaeirneb71.fsf@cisco.com>

Thanks, applied with a few cleanups:
  ilog2(roundup_pow_of_two())  ->  order_base_2()
  xxx * (1 << yy)  ->  xxx << yy
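
For reference, the first cleanup relies on ilog2(roundup_pow_of_two(n))
and order_base_2(n) both being ceil(log2(n)). The functions below are
userspace stand-ins written for illustration (the _u32 names are
hypothetical, not the kernel macros), valid for 1 <= n <= 2^31:

```c
#include <stdint.h>

/* floor(log2(n)) for n > 0. */
static unsigned ilog2_u32(uint32_t n)
{
	unsigned r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Smallest power of two >= n, for 1 <= n <= 2^31. */
static uint32_t roundup_pow2_u32(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* ceil(log2(n)) for 1 <= n <= 2^31, the quantity order_base_2() yields. */
static unsigned order_base2_u32(uint32_t n)
{
	unsigned r = 0;

	while (((uint32_t)1 << r) < n)
		r++;
	return r;
}
```

The second cleanup is plain arithmetic: multiplying by (1 << yy) is the
same as shifting left by yy, as long as the result does not overflow.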


From hal.rosenstock at gmail.com  Fri Aug  7 11:38:26 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 7 Aug 2009 14:38:26 -0400
Subject: [ofa-general] osm_link_mgr.c:link_mgr_get_smsl question
Message-ID: <f0e08f230908071138m10e9a574g8d84623a99f89527@mail.gmail.com>

Hi Sasha,

osm_link_mgr.c:link_mgr_get_smsl has the following:

        /* Find osm_port of the source = p_physp */
        slid = osm_physp_get_base_lid(p_physp);
        p_src_port =
            cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, cl_ntoh16(slid));

        /* Call lash to find proper SL */
        sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port);

It may be that this code is invoked prior to the LID being assigned so
getting the p_src_port based on the LID yields NULL and then calling
osm_get_lash_sl causes a seg fault.

I can see two ways to fix this:
1. Replace with port GUID search
2. Have osm_get_lash_sl handle NULL for p_src_port
Maybe you see other ways to deal with this.

Do you have a preferred approach ?

-- Hal
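
Option 2 above could look roughly like the sketch below. Everything here
is a hypothetical stand-in (struct port, port_by_lid(), get_sl() and the
fallback of SL 0 are assumptions for illustration); what OpenSM should
actually return for an unassigned LID is exactly the open question:

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_SL 0   /* hypothetical fallback when no port is known */

/* Hypothetical port object carrying a precomputed SL. */
struct port {
	uint8_t sl;
};

/* LID-indexed lookup: unassigned or out-of-range LIDs yield NULL. */
static struct port *port_by_lid(struct port **tbl, size_t tbl_size,
				uint16_t lid)
{
	return lid < tbl_size ? tbl[lid] : NULL;
}

/* SL lookup that tolerates a NULL source port instead of crashing. */
static uint8_t get_sl(const struct port *src)
{
	if (src == NULL)
		return DEFAULT_SL;
	return src->sl;
}
```

The guard avoids the seg fault, but it also hides the fact that the
lookup ran before LID assignment, which is an argument for option 1.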


From swise at opengridcomputing.com  Fri Aug  7 12:28:11 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 07 Aug 2009 14:28:11 -0500
Subject: [ofa-general] [PATCH v2 1/2] RDMA/cxgb3: Don't free the endpoint
	early.
Message-ID: <20090807192811.14821.11554.stgit@build.ogc.int>

- Keep ref on connection request endpoints until either accepted or
rejected so it doesn't get freed early.

- Endpoint flags now need to be set via atomic bitops because they can
be set on both the iw_cxgb3 workqueue thread and user disconnect threads.

- Don't move out of CLOSING too early due to multiple calls to
iwch_ep_disconnect.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   52 ++++++++++++++++++---------------
 drivers/infiniband/hw/cxgb3/iwch_cm.h |    9 +++---
 2 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 52d7bb0..7f22f17 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -286,7 +286,7 @@ void __free_ep(struct kref *kref)
 	ep = container_of(container_of(kref, struct iwch_ep_common, kref),
 			  struct iwch_ep, com);
 	PDBG("%s ep %p state %s\n", __func__, ep, states[state_read(&ep->com)]);
-	if (ep->com.flags & RELEASE_RESOURCES) {
+	if (test_bit(RELEASE_RESOURCES, &ep->com.flags)) {
 		cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid);
 		dst_release(ep->dst);
 		l2t_release(L2DATA(ep->com.tdev), ep->l2t);
@@ -297,7 +297,7 @@ void __free_ep(struct kref *kref)
 static void release_ep_resources(struct iwch_ep *ep)
 {
 	PDBG("%s ep %p tid %d\n", __func__, ep, ep->hwtid);
-	ep->com.flags |= RELEASE_RESOURCES;
+	set_bit(RELEASE_RESOURCES, &ep->com.flags);
 	put_ep(&ep->com);
 }
 
@@ -786,10 +786,12 @@ static void connect_request_upcall(struct iwch_ep *ep)
 	event.private_data_len = ep->plen;
 	event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
 	event.provider_data = ep;
-	if (state_read(&ep->parent_ep->com) != DEAD)
+	if (state_read(&ep->parent_ep->com) != DEAD) {
+		get_ep(&ep->com);
 		ep->parent_ep->com.cm_id->event_handler(
 						ep->parent_ep->com.cm_id,
 						&event);
+	}
 	put_ep(&ep->parent_ep->com);
 	ep->parent_ep = NULL;
 }
@@ -1156,8 +1158,7 @@ static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 	 * We get 2 abort replies from the HW.  The first one must
 	 * be ignored except for scribbling that we need one more.
 	 */
-	if (!(ep->com.flags & ABORT_REQ_IN_PROGRESS)) {
-		ep->com.flags |= ABORT_REQ_IN_PROGRESS;
+	if (!test_and_set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) {
 		return CPL_RET_BUF_DONE;
 	}
 
@@ -1480,7 +1481,6 @@ static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 		 * rejects the CR.
 		 */
 		__state_set(&ep->com, CLOSING);
-		get_ep(&ep->com);
 		break;
 	case MPA_REP_SENT:
 		__state_set(&ep->com, CLOSING);
@@ -1561,8 +1561,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 	 * We get 2 peer aborts from the HW.  The first one must
 	 * be ignored except for scribbling that we need one more.
 	 */
-	if (!(ep->com.flags & PEER_ABORT_IN_PROGRESS)) {
-		ep->com.flags |= PEER_ABORT_IN_PROGRESS;
+	if (!test_and_set_bit(PEER_ABORT_IN_PROGRESS, &ep->com.flags)) {
 		return CPL_RET_BUF_DONE;
 	}
 
@@ -1591,7 +1590,6 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 		 * the reference on it until the ULP accepts or
 		 * rejects the CR.
 		 */
-		get_ep(&ep->com);
 		break;
 	case MORIBUND:
 	case CLOSING:
@@ -1797,6 +1795,7 @@ int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
 		err = send_mpa_reject(ep, pdata, pdata_len);
 		err = iwch_ep_disconnect(ep, 0, GFP_KERNEL);
 	}
+	put_ep(&ep->com);
 	return 0;
 }
 
@@ -1810,8 +1809,10 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
 
 	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-	if (state_read(&ep->com) == DEAD)
-		return -ECONNRESET;
+	if (state_read(&ep->com) == DEAD) {
+		err = -ECONNRESET;
+		goto err;
+	}
 
 	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
 	BUG_ON(!qp);
@@ -1819,7 +1820,8 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) ||
 	    (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) {
 		abort_connection(ep, NULL, GFP_KERNEL);
-		return -EINVAL;
+		err = -EINVAL;
+		goto err;
 	}
 
 	cm_id->add_ref(cm_id);
@@ -1836,8 +1838,6 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 
 	PDBG("%s %d ird %d ord %d\n", __func__, __LINE__, ep->ird, ep->ord);
 
-	get_ep(&ep->com);
-
 	/* bind QP to EP and move to RTS */
 	attrs.mpa_attr = ep->mpa_attr;
 	attrs.max_ird = ep->ird;
@@ -1855,30 +1855,31 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	err = iwch_modify_qp(ep->com.qp->rhp,
 			     ep->com.qp, mask, &attrs, 1);
 	if (err)
-		goto err;
+		goto err1;
 
 	/* if needed, wait for wr_ack */
 	if (iwch_rqes_posted(qp)) {
 		wait_event(ep->com.waitq, ep->com.rpl_done);
 		err = ep->com.rpl_err;
 		if (err)
-			goto err;
+			goto err1;
 	}
 
 	err = send_mpa_reply(ep, conn_param->private_data,
 			     conn_param->private_data_len);
 	if (err)
-		goto err;
+		goto err1;
 
 
 	state_set(&ep->com, FPDU_MODE);
 	established_upcall(ep);
 	put_ep(&ep->com);
 	return 0;
-err:
+err1:
 	ep->com.cm_id = NULL;
 	ep->com.qp = NULL;
 	cm_id->rem_ref(cm_id);
+err:
 	put_ep(&ep->com);
 	return err;
 }
@@ -2097,14 +2098,17 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
 			ep->com.state = CLOSING;
 			start_ep_timer(ep);
 		}
+		set_bit(CLOSE_SENT, &ep->com.flags);
 		break;
 	case CLOSING:
-		close = 1;
-		if (abrupt) {
-			stop_ep_timer(ep);
-			ep->com.state = ABORTING;
-		} else
-			ep->com.state = MORIBUND;
+		if (!test_and_set_bit(CLOSE_SENT, &ep->com.flags)) {
+			close = 1;
+			if (abrupt) {
+				stop_ep_timer(ep);
+				ep->com.state = ABORTING;
+			} else
+				ep->com.state = MORIBUND;
+		}
 		break;
 	case MORIBUND:
 	case ABORTING:
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
index 43c0aea..b9efadf 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.h
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -145,9 +145,10 @@ enum iwch_ep_state {
 };
 
 enum iwch_ep_flags {
-	PEER_ABORT_IN_PROGRESS	= (1 << 0),
-	ABORT_REQ_IN_PROGRESS	= (1 << 1),
-	RELEASE_RESOURCES	= (1 << 2),
+	PEER_ABORT_IN_PROGRESS	= 0,
+	ABORT_REQ_IN_PROGRESS	= 1,
+	RELEASE_RESOURCES	= 2,
+	CLOSE_SENT		= 3,
 };
 
 struct iwch_ep_common {
@@ -162,7 +163,7 @@ struct iwch_ep_common {
 	wait_queue_head_t waitq;
 	int rpl_done;
 	int rpl_err;
-	u32 flags;
+	unsigned long flags;
 };
 
 struct iwch_listen_ep {
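
The enum change from masks to bit numbers follows from how the bitops
take their argument: a bit index plus a pointer to an unsigned long. The
sketch below mimics that in userspace with C11 atomics; test_and_set_flag()
and test_flag() are hypothetical stand-ins, not the kernel's
test_and_set_bit()/test_bit():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers, not masks, matching the enum after the patch. */
enum ep_flags {
	PEER_ABORT_IN_PROGRESS = 0,
	ABORT_REQ_IN_PROGRESS  = 1,
	RELEASE_RESOURCES      = 2,
	CLOSE_SENT             = 3,
};

/*
 * Atomically set bit nr and report whether it was already set.
 * Only one of several concurrent callers sees "false".
 */
static bool test_and_set_flag(int nr, atomic_ulong *flags)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Read bit nr without modifying the word. */
static bool test_flag(int nr, atomic_ulong *flags)
{
	return (atomic_load(flags) >> nr) & 1UL;
}
```

This is the property the CLOSING case relies on: the first caller sees
the bit clear and does the close, and any later caller sees it set and
leaves the state alone.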


From swise at opengridcomputing.com  Fri Aug  7 12:28:17 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 07 Aug 2009 14:28:17 -0500
Subject: [ofa-general] [PATCH v2 2/2] RDMA/cxgb3: wake up any waiters on peer
	close/abort.
In-Reply-To: <20090807192811.14821.11554.stgit@build.ogc.int>
References: <20090807192811.14821.11554.stgit@build.ogc.int>
Message-ID: <20090807192817.14821.70876.stgit@build.ogc.int>

A close/abort while waiting for a wr_ack during connection migration
can cause a hung process in iwch_accept_cr/iwch_reject_cr.

The fix is to set rpl_err/rpl_done and wake up the waiters when we
get a close/abort while in MPA_REQ_RCVD state.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 7f22f17..66b4135 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1478,9 +1478,14 @@ static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 		/*
 		 * We're gonna mark this puppy DEAD, but keep
 		 * the reference on it until the ULP accepts or
-		 * rejects the CR.
+		 * rejects the CR. Also wake up anyone waiting
+		 * in rdma connection migration (see iwch_accept_cr()).
 		 */
 		__state_set(&ep->com, CLOSING);
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
 		break;
 	case MPA_REP_SENT:
 		__state_set(&ep->com, CLOSING);
@@ -1588,8 +1593,13 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 		/*
 		 * We're gonna mark this puppy DEAD, but keep
 		 * the reference on it until the ULP accepts or
-		 * rejects the CR.
+		 * rejects the CR. Also wake up anyone waiting
+		 * in rdma connection migration (see iwch_accept_cr()).
 		 */
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
 		break;
 	case MORIBUND:
 	case CLOSING:
@@ -1828,8 +1838,6 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	ep->com.cm_id = cm_id;
 	ep->com.qp = qp;
 
-	ep->com.rpl_done = 0;
-	ep->com.rpl_err = 0;
 	ep->ird = conn_param->ird;
 	ep->ord = conn_param->ord;
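
One plausible reading of why the reset is dropped from iwch_accept_cr():
if the close/abort arrives before the accept path runs, clearing
rpl_done afterwards would discard the wakeup. The single-threaded model
below only illustrates that ordering; struct ep and both functions are
hypothetical stand-ins, and no real waitqueue is involved:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical subset of the endpoint state used by the wait. */
struct ep {
	int rpl_done;
	int rpl_err;
};

/* What the close/abort handlers do after the patch. */
static void peer_close_event(struct ep *ep)
{
	ep->rpl_done = 1;
	ep->rpl_err = -ECONNRESET;
}

/*
 * Models the wait condition on the accept path.  reset_first stands for
 * the two assignments the patch removes; returns true if the caller
 * would go to sleep waiting for rpl_done.
 */
static bool accept_would_block(struct ep *ep, bool reset_first)
{
	if (reset_first) {
		ep->rpl_done = 0;
		ep->rpl_err = 0;
	}
	return !ep->rpl_done;
}
```

With the reset in place, a close that arrived earlier is forgotten and
the accept path would sleep; without it, the wait condition is already
true and the stored -ECONNRESET is returned.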
 

From rdreier at cisco.com  Fri Aug  7 13:58:40 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 13:58:40 -0700
Subject: [ofa-general] [PATCH v2 2/2] RDMA/cxgb3: wake up any waiters on
	peer close/abort.
In-Reply-To: <20090807192817.14821.70876.stgit@build.ogc.int> (Steve Wise's
	message of "Fri, 07 Aug 2009 14:28:17 -0500")
References: <20090807192811.14821.11554.stgit@build.ogc.int>
	<20090807192817.14821.70876.stgit@build.ogc.int>
Message-ID: <ada3a83e3y7.fsf@cisco.com>

thanks for respinning, got em both.


From roel.kluin at gmail.com  Fri Aug  7 14:02:34 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Fri, 07 Aug 2009 23:02:34 +0200
Subject: [ofa-general] [PATCH] IB/mthca: Read buffer overflow
Message-ID: <4A7C966A.3040005@gmail.com>

If the QP was found in MGM in the first iteration, and we break out of
the loop, i == 0 and we read and write mgm->qp[-1].

Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
---
Not entirely sure whether it can happen

diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c
index d4c8105..fd72665 100644
--- a/drivers/infiniband/hw/mthca/mthca_mcg.c
+++ b/drivers/infiniband/hw/mthca/mthca_mcg.c
@@ -272,8 +272,10 @@ int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		goto out;
 	}
 
-	mgm->qp[loc]   = mgm->qp[i - 1];
-	mgm->qp[i - 1] = 0;
+	if (i != 0) {
+		mgm->qp[loc]   = mgm->qp[i - 1];
+		mgm->qp[i - 1] = 0;
+	}
 
 	err = mthca_WRITE_MGM(dev, index, mailbox, &status);
 	if (err)


From rdreier at cisco.com  Fri Aug  7 14:08:51 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 14:08:51 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: Read buffer overflow
In-Reply-To: <4A7C966A.3040005@gmail.com> (Roel Kluin's message of "Fri, 07
	Aug 2009 23:02:34 +0200")
References: <4A7C966A.3040005@gmail.com>
Message-ID: <aday6pvcows.fsf@cisco.com>


 > If the QP was found in MGM in the first iteration, and we break out of
 > the loop, i == 0 and we read and write mgm->qp[-1].
 > 
 > Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
 > ---
 > Not entirely sure whether it can happen

I don't think it can happen.  The loop and following code is:

	for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) {
		if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31)))
			loc = i;
		if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
			break;
	}

	if (loc == -1) {
		mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num);
		err = -EINVAL;
		goto out;
	}

	mgm->qp[loc]   = mgm->qp[i - 1];

and you're worried that i == 0 at that last bit.  For i == 0 there, we
need to break out of the loop on the first iteration, ie hit

		if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
			break;

with i == 0, meaning (mgm->qp[0] & cpu_to_be32(1 << 31)) == 0.

But to get past the loc == -1 test that returns from the function, we
must also hit the loc = i assignment on the first iteration of the loop,
so we must have (mgm->qp[0] == cpu_to_be32(ibqp->qp_num | (1 << 31))) be
true, which would mean in particular that mgm->qp[0] would have to have
that high order bit set.  Which contradicts the conclusion we just
reached.

So the bad case of accessing index -1 can never happen just from the
structure of the code.

 - R.
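
The argument above can also be checked by brute force. The sketch below
copies the loop structure into userspace; QP_PER_MGM is shrunk to 4,
byte-order conversion is left out (it is applied to both sides of each
comparison, so it does not change the logic), and scan_mgm() and
struct scan are hypothetical names:

```c
#include <stdint.h>

#define QP_PER_MGM 4         /* small stand-in for MTHCA_QP_PER_MGM */
#define VALID (1u << 31)     /* high bit marks an entry as in use */

/* Final values of the two loop variables. */
struct scan {
	int loc;
	int i;
};

/* Same two tests, in the same order, as the driver loop. */
static struct scan scan_mgm(const uint32_t qp[QP_PER_MGM], uint32_t qp_num)
{
	struct scan s = { -1, 0 };

	for (s.i = 0; s.i < QP_PER_MGM; ++s.i) {
		if (qp[s.i] == (qp_num | VALID))
			s.loc = s.i;
		if (!(qp[s.i] & VALID))
			break;
	}
	return s;
}
```

Enumerating every table built from "empty", "matching but not valid",
"matching and valid" and "other valid" entries never produces loc != -1
together with i == 0, which is the case the patch guards against.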


From rdreier at cisco.com  Fri Aug  7 14:14:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 14:14:03 -0700
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in
	the SRP initiator
In-Reply-To: <e2e108260908070131r49dd2d37s8bb36c9365d991e8@mail.gmail.com>
	(Bart Van Assche's message of "Fri, 7 Aug 2009 10:31:18 +0200")
References: <e2e108260908060039x7718577yf932d8a9188fe0cb@mail.gmail.com>
	<4A7A949B.60408@panasas.com> <ada1vnohmc0.fsf@cisco.com>
	<e2e108260908070131r49dd2d37s8bb36c9365d991e8@mail.gmail.com>
Message-ID: <adatz0jcoo4.fsf@cisco.com>


 > A fix like the one below ?

I think this gets us part of the way, but not quite.

 > --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c	2009-08-03
 > 12:13:11.000000000 +0200
 > +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c	2009-08-07
 > 10:23:27.000000000 +0200
 > @@ -1371,16 +1371,27 @@ out:
 >  	return -1;
 >  }
 > 
 > +/**
 > + * Look up the struct srp_request that has been associated with the specified
 > + * SCSI command by srp_queuecommand().
 > + *
 > + * Returns 0 upon success and -1 upon failure.
 > + */
 >  static int srp_find_req(struct srp_target_port *target,
 >  			struct scsi_cmnd *scmnd,
 >  			struct srp_request **req)
 >  {
 > -	if (scmnd->host_scribble == (void *) -1L)
 > -		return -1;
 > +	/*
 > +	 * The code below will only work if SRP_RQ_SIZE is a power of two,
 > +	 * so check this first.
 > +	 */
 > +	BUILD_BUG_ON((SRP_RQ_SIZE ^ (SRP_RQ_SIZE - 1))
 > +		     != (SRP_RQ_SIZE | (SRP_RQ_SIZE - 1)));

could this be BUILD_BUG_ON(!is_power_of_2(SRP_RQ_SIZE)) ?
 > 
 > -	*req = &target->req_ring[(long) scmnd->host_scribble];
 > +	*req = &target->req_ring[(long)scmnd->host_scribble
 > +				 & (SRP_RQ_SIZE - 1)];
 > 
 > -	return 0;
 > +	return (*req)->scmnd == scmnd ? 0 : -1;
 >  }
 > 
 >  static int srp_abort(struct scsi_cmnd *scmnd)
 > @@ -1423,8 +1434,15 @@ static int srp_reset_device(struct scsi_
 > 
 >  	if (target->qp_in_error)
 >  		return FAILED;
 > -	if (srp_find_req(target, scmnd, &req))
 > -		return FAILED;
 > +	if (srp_find_req(target, scmnd, &req)) {
 > +		/*
 > +		 * scmnd has not yet been queued -- queue it now. This can
 > +		 * happen e.g. when a SG_SCSI_RESET ioctl has been issued.
 > +		 */
 > +		if (srp_queuecommand(scmnd, scmnd->scsi_done)
 > +		    || srp_find_req(target, scmnd, &req))
 > +			return FAILED;

I don't think we can just pass the command to srp_queuecommand() here.
For one thing queuecommand requires some locking, and second, we don't
actually want to queue the command -- in fact I'm not sure it is set up
properly with an opcode etc to execute the command.

What I think needs to happen is we need to allocate a request for the
command the same way srp_queuecommand() does, and in fact maybe that
code could be factored out to avoid duplication.

 -R .
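
On the BUILD_BUG_ON question above: the XOR/OR expression and a plain
power-of-two test agree for every nonzero value and differ only at 0.
The sketch below is userspace C with hypothetical helper names; is_pow2()
uses the usual "nonzero with exactly one bit set" test, and ring_index()
shows the masking from the quoted patch:

```c
#include <stdbool.h>

/* Nonzero with exactly one bit set. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* The condition from the quoted patch: true means the build is rejected. */
static bool xor_or_rejects(unsigned long n)
{
	return (n ^ (n - 1)) != (n | (n - 1));
}

/* Index masking as in the quoted patch; needs ring_size to be a power of two. */
static unsigned long ring_index(unsigned long cookie, unsigned long ring_size)
{
	return cookie & (ring_size - 1);
}
```

Note that with masking, a cookie of -1 maps to the last slot instead of
being rejected up front, so the scmnd comparison on the looked-up request
is what has to catch that case.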


From mdidomenico4 at gmail.com  Fri Aug  7 14:37:55 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 7 Aug 2009 17:37:55 -0400
Subject: [ofa-general] sun x4100 with IB
Message-ID: <e75d22a90908071437t7b195d43h318f1cd06ac0ffb@mail.gmail.com>

I have several Sun x4100 with Infiniband servers which appear to be
running at 400MB/sec instead of 800MB/sec.  It's a freshly reformatted
cluster converting from solaris to linux.  We also reset the bios
settings with "load optimal defaults". Does anyone know which bios
setting I changed to dump the BW?

x4100
mellanox ib
ofed-1.4.1-rc6 w/ openmpi


From robertacummins at gmail.com  Fri Aug  7 15:37:51 2009
From: robertacummins at gmail.com (Robert Cummins)
Date: Fri, 07 Aug 2009 16:37:51 -0600
Subject: [ofa-general] sun x4100 with IB
In-Reply-To: <e75d22a90908071437t7b195d43h318f1cd06ac0ffb@mail.gmail.com>
References: <e75d22a90908071437t7b195d43h318f1cd06ac0ffb@mail.gmail.com>
Message-ID: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com>

Can you send the output from lspci -vvv?   What card are you using?  Is
it an Infinihost III SDR card?  What does ibdiagnet -lw 4x -ls 5 return?

On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote:
> I have several Sun x4100 with Infiniband servers which appear to be
> running at 400MB/sec instead of 800MB/sec.  It's a freshly reformatted
> cluster converting from solaris to linux.  We also reset the bios
> settings with "load optimal defaults". Does anyone know which bios
> setting I changed to dump the BW?
> 
> x4100
> mellanox ib
> ofed-1.4.1-rc6 w/ openmpi


From gregkh at suse.de  Fri Aug  7 20:48:17 2009
From: gregkh at suse.de (Greg KH)
Date: Fri, 7 Aug 2009 20:48:17 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <e2e108260908070026s10658adl2c4a9a5b3eba1a08@mail.gmail.com>
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	<adaocqsg339.fsf@cisco.com>
	<e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>
	<adafxc4g1e7.fsf@cisco.com>
	<e2e108260908070026s10658adl2c4a9a5b3eba1a08@mail.gmail.com>
Message-ID: <20090808034817.GA30697@suse.de>

On Fri, Aug 07, 2009 at 09:26:33AM +0200, Bart Van Assche wrote:
> On Thu, Aug 6, 2009 at 9:58 PM, Roland Dreier<rdreier at cisco.com> wrote:
> >
> >  > Are you sure that this indicates a shortcoming in the kobject
> >  > debugging code ? The most recent messages related to the message "does
> >  > not have a release() function, it is broken and must be fixed" I could
> >  > find on the LKML date from July 16, 2009
> >  > (http://lkml.org/lkml/2009/7/16/306 and
> >  > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH
> >  > acknowledges that if this message is logged that this indicates a
> >  > problem that should be fixed.
> >
> > I'm not sure -- I just assume that the core module unloading code is
> > working OK, since it is so heavily tested.  If there were really a "must
> > be fixed" problem with module unloading then someone would surely have
> > hit more than a warning message.
> 
> (added Greg KH and the LKML in CC)
> 
> I tried to look up more information about kobjects. The comment of
> commit 7a6a41615bfb2f03ce797bc24104c50b42c935e5 suggests that in the
> past the function kobject_cleanup() did not free the memory allocated
> for static kobject names but that this was the responsibility of the
> release() function. This should have been fixed in the current version
> of kobject_cleanup(). So I'm wondering whether the message that
> kobjects that do not have a release() function are broken still makes
> sense ?

No, it still makes sense :)

thanks,

greg k-h


From vlad at lists.openfabrics.org  Sat Aug  8 03:00:39 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  8 Aug 2009 03:00:39 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090808-0200 daily build status
Message-ID: <20090808100039.D9115E28264@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From mdidomenico4 at gmail.com  Sat Aug  8 07:22:20 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Sat, 8 Aug 2009 10:22:20 -0400
Subject: [ofa-general] sun x4100 with IB
In-Reply-To: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com>
References: <e75d22a90908071437t7b195d43h318f1cd06ac0ffb@mail.gmail.com>
	<1249684671.13945.68.camel@rockymtn.cumminsconsultants.com>
Message-ID: <e75d22a90908080722y184fe037wa6c8d8fbf80470e7@mail.gmail.com>

Yes, it's an InfiniHost III. I believe it's an MT23208, but don't quote
me on that; I'm not at the machine currently.

Is there something specific you want to see in lspci -vvv? I can't
easily cut and paste from the machine.

On Fri, Aug 7, 2009 at 6:37 PM, Robert Cummins<robertacummins at gmail.com> wrote:
> Can you send the output from lspci -vvv?   What card are you using?  Is
> it an Infinihost III SDR card?  What does ibdiagnet -lw 4x -ls 5 return?
>
> On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote:
>> I have several Sun x4100 with Infiniband servers which appear to be
>> running at 400MB/sec instead of 800MB/sec.  It's a freshly reformatted
>> cluster converting from solaris to linux.  We also reset the bios
>> settings with "load optimal defaults". Does anyone know which bios
>> setting I changed to dump the BW?
>>
>> x4100
>> mellanox ib
>> ofed-1.4.1-rc6 w/ openmpi
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From bart.vanassche at gmail.com  Sat Aug  8 10:49:22 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sat, 8 Aug 2009 19:49:22 +0200
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has
	not allocated
Message-ID: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>

Hello,

Has anyone ever encountered a message like the one below? This message was
generated while booting a 2.6.30.4 kernel with CONFIG_DMA_API_DEBUG=y and
before any out-of-tree kernel modules were loaded.

------------[ cut here ]------------
WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0()
Hardware name: P5Q DELUXE
mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x0000000139482000] [size=4096 bytes]
Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801 rtc_core hid_belkin mlx4_core(+) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx xor raid0 sd_mod crc_t10dif ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal processor thermal_sys hwmon
Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6
Call Trace:
 [<ffffffff8039bc7c>] ? check_sync+0x47c/0x4b0
 [<ffffffff80248b48>] warn_slowpath_common+0x78/0xd0
 [<ffffffff80248bfc>] warn_slowpath_fmt+0x3c/0x40
 [<ffffffff80517769>] ? _spin_lock_irqsave+0x49/0x60
 [<ffffffff8039b8ab>] ? check_sync+0xab/0x4b0
 [<ffffffff8039bc7c>] check_sync+0x47c/0x4b0
 [<ffffffff802724ac>] ? mark_held_locks+0x6c/0x90
 [<ffffffff8039be1d>] debug_dma_sync_single_for_cpu+0x1d/0x20
 [<ffffffffa024a969>] mlx4_write_mtt+0x159/0x1e0 [mlx4_core]
 [<ffffffffa0243c02>] mlx4_create_eq+0x222/0x650 [mlx4_core]
 [<ffffffff8027281d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa02441f5>] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core]
 [<ffffffffa0248b08>] mlx4_setup_hca+0x98/0x550 [mlx4_core]
 [<ffffffffa0249891>] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core]
 [<ffffffffa0249331>] __mlx4_init_one+0x371/0x920 [mlx4_core]
 [<ffffffffa024df18>] mlx4_init_one+0x22/0x44 [mlx4_core]
 [<ffffffff8025cd90>] ? do_work_for_cpu+0x0/0x30
 [<ffffffff803a43e2>] local_pci_probe+0x12/0x20
 [<ffffffff8025cda3>] do_work_for_cpu+0x13/0x30
 [<ffffffff802613e6>] kthread+0x56/0x90
 [<ffffffff8020cffa>] child_rip+0xa/0x20
 [<ffffffff8020c9c0>] ? restore_args+0x0/0x30
 [<ffffffff80261390>] ? kthread+0x0/0x90
 [<ffffffff8020cff0>] ? child_rip+0x0/0x20
---[ end trace 4480af29bc755c6a ]---

Bart.

From vlad at lists.openfabrics.org  Sun Aug  9 03:01:46 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  9 Aug 2009 03:01:46 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090809-0200 daily build status
Message-ID: <20090809100146.50D46E2814C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From vlad at dev.mellanox.co.il  Sun Aug  9 08:49:49 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 09 Aug 2009 18:49:49 +0300
Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.21 release
In-Reply-To: <1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com>
References: <1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com>
Message-ID: <4A7EF01D.8090907@dev.mellanox.co.il>

Arlin Davis wrote:
>
> Vlad, please pull new v2 package into OFED 1.5 and install the following:
>
> NOTE: the reorder... v2 first and then v1
>  
> dapl-2.0.21-1 
> dapl-utils-2.0.21-1 
> dapl-devel-2.0.21-1 
> dapl-debuginfo-2.0.21-1 
> compat-dapl-1.2.14-1 
> compat-dapl-devel-1.2.14-1 
>
> See http://www.openfabrics.org/downloads/dapl/ for more details.
>
> -arlin
Done,

Regards,
Vladimir


From worleys at gmail.com  Sun Aug  9 10:09:15 2009
From: worleys at gmail.com (Chris Worley)
Date: Sun, 9 Aug 2009 11:09:15 -0600
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually
	hangs
Message-ID: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>

I'm running a target comprising RHEL5.2/2.6.18-92.el5 (fresh off
the CD, never updated) and its embedded IB stack (not the latest
OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info").

I'm running a W2008S (fully patched) initiator w/
MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.

Using Mellanox QDR cards/switch.

Writes over SRP, as measured from the initiator using IOMeter, get
proper performance (i.e. 1.2GB/s).

Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s).
And while reading, IOMeter eventually hangs the system (Windows
becomes unresponsive to GUI interaction).  In this state, I see iostat
reporting transfers at the same low read rate from the target... so
there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it
acts like a "skipping record" (sorry if you young folks don't
know what that is... but I can't think of another way to describe it),
never moving on to the next benchmark, just endlessly repeating
the same I/O over and over again.  If I unload then reload the mlx4_ib
driver on the target, then the Windows system quickly recovers, but
IOMeter remains hung and needs to be killed.

So, I have a lot of experimentation to do on the target in 1)
upgrading the target or changing the distro altogether and 2) using
OFED instead of built-in IB stack on the target to try to see if I can
budge this issue.

But, I was wondering if somebody might have a hint on this _or_ have a
known target distro/kernel setup that works reliably w/ Windows-based
SRP initiators.

Thanks,

Chris


From landman at scalableinformatics.com  Sun Aug  9 10:26:29 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 09 Aug 2009 13:26:29 -0400
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually
	hangs
In-Reply-To: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
Message-ID: <4A7F06C5.7030203@scalableinformatics.com>

Chris Worley wrote:
> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off
> the CD.. never updated) and its embedded IB stack (not the latest
> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info").
> 
> I'm running a W2008S (fully patched) initiator w/
> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.
> 
> Using Mellanox QDR cards/switch.
> 
> Writes over SRP, as measured from the initiator using IOMeter, get
> proper performance (i.e. 1.2GB/s).
> 
> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s).

Chris:

   What is the backing store capable of?  That is, if you are doing, say 
dd's streaming from disk, what rate do you see?  Or are you doing this 
with a RAMDISK to check protocol performance?

   The dd's should look something like this on the RHEL machine

	write:
	
		dd if=/dev/zero of=/path/to/target bs=1M count=32k

(make sure the product of count * bs is greater than 2x system ram)

	read:

		dd if=/path/to/target of=/dev/null bs=1M count=32k

If you are not getting 1.6 GB/s out of the file system locally, you 
won't get it out of the target over the network.  The backing store is 
usually one of the slower aspects.

For our units, this is what we are seeing:

dd if=/dev/zero of=/data/big.file ...
10240+0 records in
10240+0 records out
171798691840 bytes (172 GB) copied, 94.8258 seconds, 1.8 GB/s

[root@jr5 ~]# dd if=/data/big.file of=/dev/null bs=16M
10240+0 records in
10240+0 records out
171798691840 bytes (172 GB) copied, 76.6224 seconds,  2.2 GB/s

So our writes and reads through SCST should be less than 1.8 and 2.2 
GB/s respectively.

> And while reading, IOMeter eventually hangs the system (Windows
> becomes unresponsive to GUI interaction).  In this state, I see iostat

Hmmm....  We had IOMeter running continuously over a 10GbE link to an
SCST-based target at SC09.  The backing store could provide ~700 MB/s,
and we saw 500 MB/s for ~4 days, with benchmarks running continuously
all day long.

> reporting transfers at the same low read rate from the target... so
> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it
> acts like it's a "skipping record" (sorry if you young folks don't
> know what that is... but I can't think of another way to describe it)
> and never moving on to the next benchmark, just endlessly repeating
> the same I/O over and over again.  If I unload then reload the mlx4_ib
> driver on the target, then the Windows system quickly returns, but
> IOMeter remains hung and needs killed.
> 
> So, I have a lot of experimentation to do on the target in 1)
> upgrading the target or changing the distro altogether and 2) using
> OFED instead of built-in IB stack on the target to try to see if I can
> budge this issue.
> 
> But, I was wondering if somebody might have a hint on this _or_ have a
> known target distro/kernel setup that works reliably w/ Windows-based
> SRP initiators.

SCST works (the versions we have used, 1.0.0, 1.0.1, ...) reliably with 
Windows initiator for XP, XP64, 2003, and 2008.  Look in the windows 
error log, and see if you are getting driver timeouts.  See if you have 
an updated driver.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From worleys at gmail.com  Sun Aug  9 11:19:23 2009
From: worleys at gmail.com (Chris Worley)
Date: Sun, 9 Aug 2009 12:19:23 -0600
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and 
	eventually hangs
In-Reply-To: <4A7F06C5.7030203@scalableinformatics.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
	<4A7F06C5.7030203@scalableinformatics.com>
Message-ID: <f3177b9e0908091119v3bab3068o196aaed403c1defa@mail.gmail.com>

On Sun, Aug 9, 2009 at 11:26 AM, Joe
Landman<landman at scalableinformatics.com> wrote:
> Chris Worley wrote:
>>
>> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off
>> the CD.. never updated) and its embedded IB stack (not the latest
>> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info").
>>
>> I'm running a W2008S (fully patched) initiator w/
>> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.
>>
>> Using Mellanox QDR cards/switch.
>>
>> Writes over SRP, as measured from the initiator using IOMeter, get
>> proper performance (i.e. 1.2GB/s).
>>
>> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s).
>
> Chris:
>
>  What is the backing store capable of?  That is, if you are doing, say dd's
> streaming from disk, what rate do you see?  Or are you doing this with a
> RAMDISK to check protocol performance?

I tested my local performance before testing SRP.

These are ioDrives.  I'm running two, so the local performance is
1.6GB/s for reads.  I've run up to four ioDrives through one QDR IB
link w/ Linux host and initiator, and got 2.7GB/s to the initiator.
That was using an upgraded distro on the target; here I'm testing on
someone else's machine and don't have permission to upgrade it yet.
The problem could also be the rev of WinOF.

>
>  The dd's should look something like this on the RHEL machine
>
>        write:
>
>                dd if=/dev/zero of=/path/to/target bs=1M count=32k
>
> (make sure the product of count * bs is greater than 2x system ram)
>
>        read:
>
>                dd if=/path/to/target of=/dev/null bs=1M count=32k
>
> If you are not getting 1.6 GB/s out of the file system locally, you won't
> get it out of the target over the network.

1.6GB/s out of two ioDrives is no problem locally.

>  The backing store is usually one
> of the slower aspects.
>
> For our units, this is what we are seeing:
>
> dd if=/dev/zero of=/data/big.file ...
> 10240+0 records in
> 10240+0 records out
> 171798691840 bytes (172 GB) copied, 94.8258 seconds, 1.8 GB/s
>
> [root@jr5 ~]# dd if=/data/big.file of=/dev/null bs=16M
> 10240+0 records in
> 10240+0 records out
> 171798691840 bytes (172 GB) copied, 76.6224 seconds,  2.2 GB/s
>
> So our writes and reads through SCST should be less than 1.8 and 2.2 GB/s
> respectively.
>
>> And while reading, IOMeter eventually hangs the system (Windows
>> becomes unresponsive to GUI interaction).  In this state, I see iostat
>
> Hmmm....  We had IOMeter running continuously over a 10GbE link to a
> SCST-based target at SC09.  The backing store could provide ~700 MB/s, and
> we saw 500 MB/s for ~4 days  running during the day (running benchmarks
> continuously all day long).
>
>> reporting transfers at the same low read rate from the target... so
>> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it
>> acts like it's a "skipping record" (sorry if you young folks don't
>> know what that is... but I can't think of another way to describe it)
>> and never moving on to the next benchmark, just endlessly repeating
>> the same I/O over and over again.  If I unload then reload the mlx4_ib
>> driver on the target, then the Windows system quickly returns, but
>> IOMeter remains hung and needs killed.
>>
>> So, I have a lot of experimentation to do on the target in 1)
>> upgrading the target or changing the distro altogether and 2) using
>> OFED instead of built-in IB stack on the target to try to see if I can
>> budge this issue.
>>
>> But, I was wondering if somebody might have a hint on this _or_ have a
>> known target distro/kernel setup that works reliably w/ Windows-based
>> SRP initiators.
>
> SCST works (the versions we have used, 1.0.0, 1.0.1, ...) reliably with
> Windows initiator for XP, XP64, 2003, and 2008.  Look in the windows error
> log, and see if you are getting driver timeouts.  See if you have an updated
> driver.

I'm worried more about the underlying IB stack and kernel on the
target side.  It would be best to know exactly which distro, kernel,
and OFED revisions (unless you're using the distro's built-in IB
stack) you're using on the target.  The WinOF version you're using on
the Windows side would be helpful info too. Can you relay these?

Thanks,

Chris
>
> Regards,
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
>       http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
>


From landman at scalableinformatics.com  Sun Aug  9 12:06:03 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 09 Aug 2009 15:06:03 -0400
Subject: Re: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads
	and eventually hangs
In-Reply-To: <f3177b9e0908091119v3bab3068o196aaed403c1defa@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>	
	<4A7F06C5.7030203@scalableinformatics.com>
	<f3177b9e0908091119v3bab3068o196aaed403c1defa@mail.gmail.com>
Message-ID: <4A7F1E1B.5010300@scalableinformatics.com>

Chris Worley wrote:

> I'm worried more about the underlying IB stack and kernel on the
> target side.  It would be best to know exactly which distro, kernel,

We see good performance with Centos/RedHat and Ubuntu.  We build our own 
kernel (due to performance/stability issues we see under load with 
distro kernels).  Ours is a 2.6.28.7.  We are testing some 2.6.30.x 
kernels now as well.

OFED 1.4 now, we used 1.3.x last year.

> and OFED revisions (unless you're using the distro's built-in IB
> stack) you're using on the target.  The WinOF version you're using on
> the Windows side would be helpful info too. Can you relay these?

WinOF version ... whatever was default installed on the units.

> 
> Thanks,
> 
> Chris
>> Regards,
>>
>> Joe
>>
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics, Inc.
>> email: landman at scalableinformatics.com
>> web  : http://scalableinformatics.com
>>       http://scalableinformatics.com/jackrabbit
>> phone: +1 734 786 8423 x121
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
>>


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From marcin.slusarz at gmail.com  Sun Aug  9 12:54:05 2009
From: marcin.slusarz at gmail.com (Marcin Slusarz)
Date: Sun,  9 Aug 2009 21:54:05 +0200
Subject: [ofa-general] [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
Message-ID: <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>

Signed-off-by: Marcin Slusarz <marcin.slusarz at gmail.com>
Cc: Roland Dreier <rolandd at cisco.com>
Cc: Sean Hefty <sean.hefty at intel.com>
Cc: Hal Rosenstock <hal.rosenstock at gmail.com>
Cc: general at lists.openfabrics.org
---
 drivers/infiniband/hw/cxgb3/iwch.c |    4 +---
 drivers/infiniband/hw/mlx4/main.c  |    6 +-----
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 26fc0a4..9cc99df 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -105,11 +105,9 @@ static void rnic_init(struct iwch_dev *rnicp)
 static void open_rnic_dev(struct t3cdev *tdev)
 {
 	struct iwch_dev *rnicp;
-	static int vers_printed;
 
 	PDBG("%s t3cdev %p\n", __func__,  tdev);
-	if (!vers_printed++)
-		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+	printk_once(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
 		       DRV_VERSION);
 	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
 	if (!rnicp) {
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..0b2f77a 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
 
 static void *mlx4_ib_add(struct mlx4_dev *dev)
 {
-	static int mlx4_ib_version_printed;
 	struct mlx4_ib_dev *ibdev;
 	int num_ports = 0;
 	int i;
 
-	if (!mlx4_ib_version_printed) {
-		printk(KERN_INFO "%s", mlx4_ib_version);
-		++mlx4_ib_version_printed;
-	}
+	printk_once(KERN_INFO "%s", mlx4_ib_version);
 
 	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
 		num_ports++;
-- 
1.6.3.3


From nashwath at gmail.com  Sun Aug  9 19:07:57 2009
From: nashwath at gmail.com (Ashwath Narasimhan)
Date: Sun, 9 Aug 2009 22:07:57 -0400
Subject: [ofa-general] Setting the Credits.
Message-ID: <ed1288770908091907p5dc48a64jcfbd96e8a23bca29@mail.gmail.com>

Hi Jason/All,

Thank you for your response. Do you mean the link-layer flow control
(VLs), or the end-to-end flow control credits of the transport layer?
How do I set the end-to-end flow control credits? I looked at the driver
source code (drivers/infiniband/hw/ipath), and the file ipath_qp.c caught
my interest. There the credits are calculated from the difference between
the head and tail pointers of the QP's receive queue. Should I change the
size of these queues? Am I even looking at the right file?

regards,
Ashwath.

On Thu, Aug 6, 2009 at 5:12 PM, Jason Gunthorpe <
jgunthorpe at obsidianresearch.com> wrote:

> On Wed, Aug 05, 2009 at 08:03:04PM -0400, Ashwath Narasimhan wrote:
>
> > The reason why I need such small rates is because I interface the
> > Infiniband HCA to an FPGA via an Infiniband physical link.  Imagine
> > the FPGA as a simple repeater that simply forwards the infiniband
> > signals to the Target HCA. The FPGA cannot handle such a high data
> > rate and neither do I have as much memory as required to buffer it
> > on the FPGA (I might drop packets if the buffer becomes full). Hence
> > I wish to limit the rate to say 100Mbps instead of 2.5Gbps.
>
> The correct thing to do is manage the flow control credits you are
> giving to the IB network so you don't lose packets.
>
> Jason
>


-- 
regards,
Ashwath

From rdreier at cisco.com  Sun Aug  9 22:00:31 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 09 Aug 2009 22:00:31 -0700
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
	(Marcin Slusarz's message of "Sun, 9 Aug 2009 21:54:05 +0200")
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
	<1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
Message-ID: <adaocqo9sb4.fsf@cisco.com>


 >  drivers/infiniband/hw/cxgb3/iwch.c |    4 +---
 >  drivers/infiniband/hw/mlx4/main.c  |    6 +-----

 > --- a/drivers/infiniband/hw/mlx4/main.c
 > +++ b/drivers/infiniband/hw/mlx4/main.c
 > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
 >  
 >  static void *mlx4_ib_add(struct mlx4_dev *dev)
 >  {
 > -	static int mlx4_ib_version_printed;
 >  	struct mlx4_ib_dev *ibdev;
 >  	int num_ports = 0;
 >  	int i;
 >  
 > -	if (!mlx4_ib_version_printed) {
 > -		printk(KERN_INFO "%s", mlx4_ib_version);
 > -		++mlx4_ib_version_printed;
 > -	}
 > +	printk_once(KERN_INFO "%s", mlx4_ib_version);
 >  
 >  	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
 >  		num_ports++;

Looks fine but there is near-identical code in
drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert
too.

Thanks,
  Roland


From jackm at dev.mellanox.co.il  Sun Aug  9 23:36:26 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 10 Aug 2009 09:36:26 +0300
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <adaocqo9sb4.fsf@cisco.com>
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
	<1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
	<adaocqo9sb4.fsf@cisco.com>
Message-ID: <200908100936.26963.jackm@dev.mellanox.co.il>

I'm a bit nervous about this one.  
printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h
(i.e., when the kernel is configured).  Otherwise, it gets defined to printk --
and it will always print in this case.
(see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272).

Do you think that distributions will ALWAYS have CONFIG_PRINTK defined?

I would prefer to wait until printk_once printing only once is not config-dependent.

-Jack
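
For reference, the pattern under discussion (a static flag guarding a
one-time message) can be sketched in user-space C as below. This is only
an illustration of the open-coded mlx4 logic quoted further down: the
names print_once, emit_count and add_device are made up for the example,
and it is not the kernel's definition of printk_once, whose behaviour
without CONFIG_PRINTK is exactly the question raised above.

```c
#include <assert.h>
#include <stdio.h>

/* Counts how many times the message was really emitted (illustration only). */
static int emit_count;

/*
 * User-space stand-in for a print-once helper: a static flag, private to
 * each expansion site, guards the output the same way the hand-written
 * mlx4_ib_version_printed check does.
 */
#define print_once(...)                          \
	do {                                     \
		static int printed_;             \
		if (!printed_) {                 \
			printed_ = 1;            \
			emit_count++;            \
			printf(__VA_ARGS__);     \
		}                                \
	} while (0)

/* Hypothetical probe routine, called once per device that is found. */
static void add_device(const char *version)
{
	assert(version != NULL);
	print_once("%s", version);
}
```

Calling add_device() several times emits the version string on the first
call only; the guard does not depend on any build-time option here, which
is the property being asked for.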

On Monday 10 August 2009 08:00, Roland Dreier wrote:
> 
>  >  drivers/infiniband/hw/cxgb3/iwch.c |    4 +---
>  >  drivers/infiniband/hw/mlx4/main.c  |    6 +-----
> 
>  > --- a/drivers/infiniband/hw/mlx4/main.c
>  > +++ b/drivers/infiniband/hw/mlx4/main.c
>  > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
>  >  
>  >  static void *mlx4_ib_add(struct mlx4_dev *dev)
>  >  {
>  > -	static int mlx4_ib_version_printed;
>  >  	struct mlx4_ib_dev *ibdev;
>  >  	int num_ports = 0;
>  >  	int i;
>  >  
>  > -	if (!mlx4_ib_version_printed) {
>  > -		printk(KERN_INFO "%s", mlx4_ib_version);
>  > -		++mlx4_ib_version_printed;
>  > -	}
>  > +	printk_once(KERN_INFO "%s", mlx4_ib_version);
>  >  
>  >  	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
>  >  		num_ports++;
> 
> Looks fine but there is near-identical code in
> drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert
> too.
> 
> Thanks,
>   Roland


From eli at dev.mellanox.co.il  Mon Aug 10 01:45:27 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 10 Aug 2009 11:45:27 +0300
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it
	has not allocated
In-Reply-To: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
Message-ID: <20090810084527.GA2446@mtls03>

Looking at mlx4_write_mtt_chunk(), I see that it calls
mlx4_table_find() with a pointer to a single dma_addr_t, dma_handle,
while the DMA addresses for the ICM memory are actually a list of
different addresses, possibly covering different sizes. I think
mlx4_table_find() should be changed to support that, and then we can
call dma_sync_single_for_cpu()/dma_sync_single_for_device()
with the correct DMA addresses.
Roland, what do you think?
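
To make the idea concrete, here is a minimal user-space sketch of the
lookup such a change would need: given a byte offset into a table backed
by several separately mapped pieces, find the DMA address that belongs to
that offset. The struct dma_piece type and find_dma_addr() are
hypothetical names for illustration; this is not the mlx4 ICM code and
says nothing about how mlx4_table_find() is actually laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one DMA-mapped piece of a table's backing memory. */
struct dma_piece {
	uint64_t dma_addr; /* bus address of the start of this piece */
	size_t   len;      /* length of this piece in bytes */
};

/*
 * Walk the pieces until the one containing `offset` is found and store the
 * matching DMA address in *dma. Returns 1 on success, 0 if the offset lies
 * beyond the last piece (in which case *dma is left untouched).
 */
static int find_dma_addr(const struct dma_piece *p, size_t n,
			 size_t offset, uint64_t *dma)
{
	size_t i;

	assert(p != NULL && dma != NULL);
	for (i = 0; i < n; i++) {
		if (offset < p[i].len) {
			*dma = p[i].dma_addr + offset;
			return 1;
		}
		offset -= p[i].len;
	}
	return 0;
}
```

A sync call would then be issued with the address returned for the piece
that was really mapped, rather than with one handle assumed to cover the
whole table.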

On Sat, Aug 08, 2009 at 07:49:22PM +0200, Bart Van Assche wrote:
> Hello,
> 
> Has anyone ever encountered a message like the one below ? This message was
> generated while booting a 2.6.30.4 kernel with CONFIG_DMA_API_DEBUG=y and
> before any out-of-tree kernel modules were loaded.
> 
> ------------[ cut here ]------------
> WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0()
> Hardware name: P5Q DELUXE
> mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it
> has not allocated [device address=0x0000000139482000] [size=4096 bytes]
> Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel
> snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801
> rtc_core hid_belkin mlx4_core(
> +) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev
> serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx
> xor raid0 sd_mod crc_t10dif
> ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic
> ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal
> processor thermal_sys hwmon
> Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6
> Call Trace:
>  [<ffffffff8039bc7c>] ? check_sync+0x47c/0x4b0
>  [<ffffffff80248b48>] warn_slowpath_common+0x78/0xd0
>  [<ffffffff80248bfc>] warn_slowpath_fmt+0x3c/0x40
>  [<ffffffff80517769>] ? _spin_lock_irqsave+0x49/0x60
>  [<ffffffff8039b8ab>] ? check_sync+0xab/0x4b0
>  [<ffffffff8039bc7c>] check_sync+0x47c/0x4b0
>  [<ffffffff802724ac>] ? mark_held_locks+0x6c/0x90
>  [<ffffffff8039be1d>] debug_dma_sync_single_for_cpu+0x1d/0x20
>  [<ffffffffa024a969>] mlx4_write_mtt+0x159/0x1e0 [mlx4_core]
>  [<ffffffffa0243c02>] mlx4_create_eq+0x222/0x650 [mlx4_core]
>  [<ffffffff8027281d>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffffa02441f5>] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core]
>  [<ffffffffa0248b08>] mlx4_setup_hca+0x98/0x550 [mlx4_core]
>  [<ffffffffa0249891>] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core]
>  [<ffffffffa0249331>] __mlx4_init_one+0x371/0x920 [mlx4_core]
>  [<ffffffffa024df18>] mlx4_init_one+0x22/0x44 [mlx4_core]
>  [<ffffffff8025cd90>] ? do_work_for_cpu+0x0/0x30
>  [<ffffffff803a43e2>] local_pci_probe+0x12/0x20
>  [<ffffffff8025cda3>] do_work_for_cpu+0x13/0x30
>  [<ffffffff802613e6>] kthread+0x56/0x90
>  [<ffffffff8020cffa>] child_rip+0xa/0x20
>  [<ffffffff8020c9c0>] ? restore_args+0x0/0x30
>  [<ffffffff80261390>] ? kthread+0x0/0x90
>  [<ffffffff8020cff0>] ? child_rip+0x0/0x20
> ---[ end trace 4480af29bc755c6a ]---
> 
> Bart.



From vlad at lists.openfabrics.org  Mon Aug 10 03:00:05 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 10 Aug 2009 03:00:05 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090810-0200 daily build status
Message-ID: <20090810100005.E7592E61E29@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one':
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class'
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From bart.vanassche at gmail.com  Mon Aug 10 03:40:48 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Mon, 10 Aug 2009 12:40:48 +0200
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and 
	eventually hangs
In-Reply-To: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
Message-ID: <e2e108260908100340p71efed9u72cf996be0843edd@mail.gmail.com>

On Sun, Aug 9, 2009 at 7:09 PM, Chris Worley <worleys at gmail.com> wrote:

> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off
> the CD, never updated) and its embedded IB stack (not the latest
> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info").
>
> I'm running a W2008S (fully patched) initiator w/
> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.
>
> Using Mellanox QDR cards/switch.
>
> Writes over SRP, as measured from the initiator using IOMeter, get
> proper performance (i.e. 1.2GB/s).
>
> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s).
> And while reading, IOMeter eventually hangs the system (Windows
> becomes unresponsive to GUI interaction).  In this state, I see iostat
> reporting transfers at the same low read rate from the target... so
> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it
> acts like it's a "skipping record" (sorry if you young folks don't
> know what that is... but I can't think of another way to describe it)
> and never moving on to the next benchmark, just endlessly repeating
> the same I/O over and over again.  If I unload then reload the mlx4_ib
> driver on the target, then the Windows system quickly returns, but
> IOMeter remains hung and needs killed.
>

The throughput of the SRP protocol strongly depends on the block size used
for I/O. The results I obtained with IOmeter are:
* For a block size of 32 KB: 396 MB/s for reading and 321 MB/s for writing.
* For a block size of 1 MB: 1383 MB/s for reading and 1151 MB/s for writing.
These results are about 90% of the throughput obtained with dd.
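
One way to read these figures is as request rates rather than bandwidth.
The helper below is only a back-of-the-envelope sketch: it assumes that
1 MB/s means 10^6 bytes per second (the unit IOmeter uses is not stated
here), and requests_per_second() is a name invented for the example.

```c
#include <assert.h>

/*
 * Requests completed per second for a given throughput and block size.
 * Assumes 1 MB/s = 10^6 bytes/s, which is an assumption, not a measured fact.
 */
static double requests_per_second(double mb_per_s, double block_bytes)
{
	assert(block_bytes > 0.0);
	return mb_per_s * 1e6 / block_bytes;
}
```

Under that assumption the 32 KB read case completes roughly nine times as
many requests per second as the 1 MB case while moving less data, which
would be consistent with a per-request cost dominating at small block
sizes.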

Setup details:
* Two Mellanox ConnectX DDR cards connected back to back, operating in PCIe
2.0 mode.
* Target: vanilla 2.6.30.4 kernel + SCST patches + the two patches attached
to http://bugzilla.kernel.org/show_bug.cgi?id=13757 + SCST r1030.
* Initiator: openSUSE 11.0 (contains a patched 2.6.27.25 kernel) with
openSUSE 11.0 OFED components + Linux version of IOmeter's dynamo + IOmeter
GUI running in a virtual machine.
* I/O-scheduler used by SRP initiator: noop.

Bart.

From hnrose at comcast.net  Mon Aug 10 06:13:20 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 10 Aug 2009 09:13:20 -0400
Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: In
	sm_mad_ctrl_send_err_cb, set
	init failure on PKeyTable and QoS initialization failure
Message-ID: <20090810131319.GA14915@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c
index 791c848..f0bc407 100644
--- a/opensm/opensm/osm_sm_mad_ctrl.c
+++ b/opensm/opensm/osm_sm_mad_ctrl.c
@@ -723,7 +723,10 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
 	    (p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ||
 	     p_smp->attr_id == IB_MAD_ATTR_MCAST_FWD_TBL ||
 	     p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO ||
-	     p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL)) {
+	     p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL ||
+	     p_smp->attr_id == IB_MAD_ATTR_P_KEY_TABLE ||
+	     p_smp->attr_id == IB_MAD_ATTR_SLVL_TABLE ||
+	     p_smp->attr_id == IB_MAD_ATTR_VL_ARBITRATION)) {
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3119: "
 			"Set method failed for attribute 0x%X (%s)\n",
 			cl_ntoh16(p_smp->attr_id),


From hal.rosenstock at gmail.com  Mon Aug 10 06:45:58 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 09:45:58 -0400
Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support
In-Reply-To: <20090805082751.GA5599@mtls03>
References: <20090805082751.GA5599@mtls03>
Message-ID: <f0e08f230908100645u47d20435q81a774c1ea8097f6@mail.gmail.com>

On Wed, Aug 5, 2009 at 4:27 AM, Eli Cohen <eli at dev.mellanox.co.il> wrote:

> RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using
> Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet
> fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned
> Ethertype, a GRH, unmodified IB transport headers and payload.  IB subnet
> management and SA services are not required for RDMAoE operation; Ethernet
> management practices are used instead. RDMAoE encodes IP addresses into its
> GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs,
> standard IP to MAC mappings apply.
>
> To support RDMAoE, a new transport protocol was added to the IB core. An RDMA
> device can have ports with different transports, which are identified by a port
> transport attribute.  The RDMA Verbs API is syntactically unmodified. When
> referring to RDMAoE ports, Address handles are required to contain GIDs while
> LID fields are ignored. The Ethernet L2 information is subsequently obtained by
> the vendor-specific driver (both in kernel- and user-space) while modifying QPs
> to RTR and creating address handles.  As there is no SA in RDMAoE, the CMA code
> is modified to fill the necessary path record attributes locally before sending
> CM packets. Similarly, the CMA provides to the user the required address handle
> attributes when processing SIDR requests and joining multicast groups.
>
> In this patch set, an RDMAoE port is currently assigned a single GID, encoding
> the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code
> temporarily uses IPv6 link-local addresses as GIDs instead of the IP address
> provided by the user, thereby supporting any IP address. In addition, multicast
> packets currently use the broadcast MAC.
>
> To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib
> drivers must be loaded, and the netdevice for the corresponding RDMAoE port
> must be running. Individual ports of a multi port HCA can be independently
> configured as Ethernet (with support for RDMAoE) or IB, as is already the case.
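
As an aside on the GID encoding described above: an IPv6 link-local
address is conventionally formed from a MAC address with the modified
EUI-64 construction (fe80::/64 prefix, ff:fe inserted in the middle, and
the universal/local bit flipped). The sketch below shows that standard
mapping in plain C. It assumes the patch set follows the same convention;
mac_to_link_local() is an illustrative name and not the helper from the
patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a 16-byte IPv6 link-local address (fe80::/64 plus a modified
 * EUI-64 interface identifier) from a 6-byte Ethernet MAC address.
 */
static void mac_to_link_local(uint8_t gid[16], const uint8_t mac[6])
{
	assert(gid != NULL && mac != NULL);
	memset(gid, 0, 16);
	gid[0] = 0xfe;
	gid[1] = 0x80;
	gid[8] = mac[0] ^ 0x02;	/* flip the universal/local bit */
	gid[9] = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;		/* ff:fe filler between the two MAC halves */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}
```

With this mapping the GID of a port is a pure function of its MAC
address, which is why no SA lookup is needed to derive the remote GID
once the destination MAC is known.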


How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to boot
time or dynamic ?

-- Hal

<snip...>

From hal.rosenstock at gmail.com  Mon Aug 10 06:56:43 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 09:56:43 -0400
Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding
In-Reply-To: <20090805082929.GG5599@mtls03>
References: <20090805082929.GG5599@mtls03>
Message-ID: <f0e08f230908100656n424348d1mfd02f12e6190ebfe@mail.gmail.com>

On Wed, Aug 5, 2009 at 4:29 AM, Eli Cohen <eli at mellanox.co.il> wrote:

> Add support for RDMAoE device binding and IP --> GID resolution. Path resolving
> and multicast joining are implemented within cma.c by filling the responses and
> pushing the callbacks to the cma work queue. IP->GID resolution always yields
> IPv6 link local addresses - remote GIDs are derived from the destination MAC
> address of the remote port. Multicast GIDs are always mapped to broadcast MAC
> (all FFs). Some helper functions are added to ib_addr.h.
>
> Signed-off-by: Eli Cohen <eli at mellanox.co.il>
> ---
>  drivers/infiniband/core/cma.c  |  150 ++++++++++++++++++++++++++++++++++++++-
>  drivers/infiniband/core/ucma.c |   25 +++++--
>  include/rdma/ib_addr.h         |   87 +++++++++++++++++++++++
>  3 files changed, 251 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 866ff7f..8f5675b 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL");
>  #define CMA_CM_RESPONSE_TIMEOUT 20
>  #define CMA_MAX_CM_RETRIES 15
>  #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
> +#define RDMAOE_PACKET_LIFETIME 18
>
>  static void cma_add_one(struct ib_device *device);
>  static void cma_remove_one(struct ib_device *device);
> @@ -174,6 +175,12 @@ struct cma_ndev_work {
>        struct rdma_cm_event    event;
>  };
>
> +struct rdmaoe_mcast_work {
> +       struct work_struct       work;
> +       struct rdma_id_private  *id;
> +       struct cma_multicast    *mc;
> +};
> +
>  union cma_ip_addr {
>        struct in6_addr ip6;
>        struct {
> @@ -348,6 +355,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
>                        case RDMA_TRANSPORT_IWARP:
>                                iw_addr_get_sgid(dev_addr, &gid);
>                                break;
> +                       case RDMA_TRANSPORT_RDMAOE:
> +                               rdmaoe_addr_get_sgid(dev_addr, &gid);
> +                               break;
>                        default:
>                                return -ENODEV;
>                        }
> @@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
>  {
>        struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
>        int ret;
> +       u16 pkey;
> +
> +        if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) ==
> +           RDMA_TRANSPORT_IB)
> +               pkey = ib_addr_get_pkey(dev_addr);
> +       else
> +               pkey = 0xffff;
>
>        ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
> -                                 ib_addr_get_pkey(dev_addr),
> -                                 &qp_attr->pkey_index);
> +                                 pkey, &qp_attr->pkey_index);
>        if (ret)
>                return ret;
>
> @@ -609,6 +625,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
>        id_priv = container_of(id, struct rdma_id_private, id);
>        switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
>        case RDMA_TRANSPORT_IB:
> +       case RDMA_TRANSPORT_RDMAOE:
>                if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps))
>                        ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
>                else
> @@ -836,7 +853,9 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
>                mc = container_of(id_priv->mc_list.next,
>                                  struct cma_multicast, list);
>                list_del(&mc->list);
> -               ib_sa_free_multicast(mc->multicast.ib);
> +               if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) ==
> +                   RDMA_TRANSPORT_IB)
> +                       ib_sa_free_multicast(mc->multicast.ib);
>                kref_put(&mc->mcref, release_mc);
>        }
>  }
> @@ -855,6 +874,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
>                mutex_unlock(&lock);
>                switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
>                case RDMA_TRANSPORT_IB:
> +               case RDMA_TRANSPORT_RDMAOE:
>                        if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
>                                ib_destroy_cm_id(id_priv->cm_id.ib);
>                        break;
> @@ -1512,6 +1532,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
>        if (id->device) {
>                switch (rdma_port_get_transport(id->device, id->port_num)) {
>                case RDMA_TRANSPORT_IB:
> +               case RDMA_TRANSPORT_RDMAOE:
>                        ret = cma_ib_listen(id_priv);
>                        if (ret)
>                                goto err;
> @@ -1727,6 +1748,65 @@ static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
>        return 0;
>  }
>
> +static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv)
> +{
> +       struct rdma_route *route = &id_priv->id.route;
> +       struct rdma_addr *addr = &route->addr;
> +       struct cma_work *work;
> +       int ret;
> +       struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr;
> +       struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr;
> +
> +       if (src_addr->sin_family != dst_addr->sin_family)
> +               return -EINVAL;
> +
> +       work = kzalloc(sizeof *work, GFP_KERNEL);
> +       if (!work)
> +               return -ENOMEM;
> +
> +       work->id = id_priv;
> +       INIT_WORK(&work->work, cma_work_handler);
> +
> +       route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL);
> +       if (!route->path_rec) {
> +               ret = -ENOMEM;
> +               goto err;
> +       }
> +
> +       route->num_paths = 1;
> +
> +       rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr);
> +       rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr);
> +
> +       route->path_rec->hop_limit = 2;


Does HopLimit need to be 2 ? Isn't this all subnet local ?


>
> +       route->path_rec->reversible = 1;
> +       route->path_rec->pkey = cpu_to_be16(0xffff);
> +       route->path_rec->mtu_selector = 2;
> +       route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu);
> +       route->path_rec->rate_selector = 2;
> +       route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev);
> +       route->path_rec->packet_life_time_selector = 2;
> +       route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME;
> +
> +       work->old_state = CMA_ROUTE_QUERY;
> +       work->new_state = CMA_ROUTE_RESOLVED;
> +       if (!route->path_rec->mtu || !route->path_rec->rate) {
> +               work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
> +               work->event.status = -1;
> +       } else {
> +               work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
> +               work->event.status = 0;
> +       }
> +
> +       queue_work(cma_wq, &work->work);
> +
> +       return 0;
> +
> +err:
> +       kfree(work);
> +       return ret;
> +}
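
A note on the MTU field filled in above: the route is reported as an
error when rdmaoe_get_mtu() yields zero, so that helper presumably maps
the netdev MTU onto one of the IB MTU sizes after allowing for the IB
headers carried inside the Ethernet frame. Its body is not quoted in
this part of the thread, so the following is only a sketch under
assumptions: the GRH (40 bytes) and BTH (12 bytes) sizes are the standard
ones, the extra 28-byte allowance is an assumed figure, the function
returns a byte count rather than an enum value, and eth_mtu_to_ib_mtu()
is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* IB path MTU sizes in bytes, largest first. */
static const int ib_mtu_bytes[] = { 4096, 2048, 1024, 512, 256 };

/*
 * Header bytes assumed to travel inside the Ethernet payload:
 * GRH (40) + BTH (12) + an assumed 28-byte allowance for extended headers.
 */
enum { HDR_OVERHEAD = 40 + 12 + 28 };

/*
 * Return the largest IB MTU (in bytes) that still fits in the Ethernet MTU
 * once the headers are subtracted, or 0 if not even 256 bytes fit.
 */
static int eth_mtu_to_ib_mtu(int eth_mtu)
{
	int payload;
	size_t i;

	assert(eth_mtu >= 0);
	payload = eth_mtu - HDR_OVERHEAD;
	for (i = 0; i < sizeof ib_mtu_bytes / sizeof ib_mtu_bytes[0]; i++)
		if (payload >= ib_mtu_bytes[i])
			return ib_mtu_bytes[i];
	return 0;
}
```

Under these assumptions a standard 1500-byte Ethernet MTU ends up with a
1024-byte IB MTU, and a jumbo-frame MTU is needed before 2048 or 4096
become usable.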
> +
>  int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
>  {
>        struct rdma_id_private *id_priv;
> @@ -1744,6 +1824,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
>        case RDMA_TRANSPORT_IWARP:
>                ret = cma_resolve_iw_route(id_priv, timeout_ms);
>                break;
> +       case RDMA_TRANSPORT_RDMAOE:
> +               ret = cma_resolve_rdmaoe_route(id_priv);
> +               break;
>        default:
>                ret = -ENOSYS;
>                break;
> @@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
>
>        switch (rdma_port_get_transport(id->device, id->port_num)) {
>        case RDMA_TRANSPORT_IB:
> +       case RDMA_TRANSPORT_RDMAOE:
>                if (cma_is_ud_ps(id->ps))
>                        ret = cma_resolve_ib_udp(id_priv, conn_param);
>                else
> @@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
>
>        switch (rdma_port_get_transport(id->device, id->port_num)) {
>        case RDMA_TRANSPORT_IB:
> +       case RDMA_TRANSPORT_RDMAOE:
>                if (cma_is_ud_ps(id->ps))
>                        ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
>                                                conn_param->private_data,
> @@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
>
>        switch (rdma_port_get_transport(id->device, id->port_num)) {
>        case RDMA_TRANSPORT_IB:
> +       case RDMA_TRANSPORT_RDMAOE:
>                if (cma_is_ud_ps(id->ps))
>                        ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
>                                                private_data, private_data_len);
> @@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
>
>        switch (rdma_port_get_transport(id->device, id->port_num)) {
>        case RDMA_TRANSPORT_IB:
> +       case RDMA_TRANSPORT_RDMAOE:
>                ret = cma_modify_qp_err(id_priv);
>                if (ret)
>                        goto out;
> @@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct rdma_id_private *id_priv,
>        return 0;
>  }
>
> +
> +static void rdmaoe_mcast_work_handler(struct work_struct *work)
> +{
> +       struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work);
> +       struct cma_multicast *mc = mw->mc;
> +       struct ib_sa_multicast *m = mc->multicast.ib;
> +
> +       mc->multicast.ib->context = mc;
> +       cma_ib_mc_handler(0, m);
> +       kfree(m);
> +       kfree(mw);
> +}
> +
> +static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv,
> +                                    struct cma_multicast *mc)
> +{
> +       struct rdmaoe_mcast_work *work;
> +       struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
> +
> +       if (cma_zero_addr((struct sockaddr *)&mc->addr))
> +               return -EINVAL;
> +
> +       work = kzalloc(sizeof *work, GFP_KERNEL);
> +       if (!work)
> +               return -ENOMEM;
> +
> +       mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL);
> +       if (!mc->multicast.ib) {
> +               kfree(work);
> +               return -ENOMEM;
> +       }
> +
> +       cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib->rec.mgid);
> +       mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff);
> +       if (id_priv->id.ps == RDMA_PS_UDP)
> +               mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY);
> +       mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev);
> +       mc->multicast.ib->rec.hop_limit = 1;


Similar to the unicast comment above, is HopLimit 1 needed for multicast ?

-- Hal


>
> +       mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu);
> +       rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
> +       work->id = id_priv;
> +       work->mc = mc;
> +       INIT_WORK(&work->work, rdmaoe_mcast_work_handler);
> +
> +       queue_work(cma_wq, &work->work);
> +
> +       return 0;
> +}
> +
>  int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
>                        void *context)
>  {
> @@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
>        case RDMA_TRANSPORT_IB:
>                ret = cma_join_ib_multicast(id_priv, mc);
>                break;
> +       case RDMA_TRANSPORT_RDMAOE:
> +               ret = cma_rdmaoe_join_multicast(id_priv, mc);
> +               break;
>        default:
>                ret = -ENOSYS;
>                break;
> @@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct
> sockaddr *addr,
>                spin_unlock_irq(&id_priv->lock);
>                kfree(mc);
>        }
> +
>        return ret;
>  }
>  EXPORT_SYMBOL(rdma_join_multicast);
> @@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
>                                ib_detach_mcast(id->qp,
>                                                &mc->multicast.ib->rec.mgid,
>                                                mc->multicast.ib->rec.mlid);
> -                       ib_sa_free_multicast(mc->multicast.ib);
> +                       if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) ==
> +                           RDMA_TRANSPORT_IB)
> +                               ib_sa_free_multicast(mc->multicast.ib);
>                        kref_put(&mc->mcref, release_mc);
>                        return;
>                }
> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> index 24d9510..c7c9e92 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file *file,
>  }
>
>  static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
> -                              struct rdma_route *route)
> +                              struct rdma_route *route,
> +                              enum rdma_transport_type tt)
>  {
>        struct rdma_dev_addr *dev_addr;
>
> @@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
>        switch (route->num_paths) {
>        case 0:
>                dev_addr = &route->addr.dev_addr;
> -               ib_addr_get_dgid(dev_addr,
> -                                (union ib_gid *) &resp->ib_route[0].dgid);
> -               ib_addr_get_sgid(dev_addr,
> -                                (union ib_gid *) &resp->ib_route[0].sgid);
> +               if (tt == RDMA_TRANSPORT_IB) {
> +                       ib_addr_get_dgid(dev_addr,
> +                                        (union ib_gid *) &resp->ib_route[0].dgid);
> +                       ib_addr_get_sgid(dev_addr,
> +                                        (union ib_gid *) &resp->ib_route[0].sgid);
> +               } else {
> +                       rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid,
> +                                        dev_addr->dst_dev_addr);
> +                       rdmaoe_addr_get_sgid(dev_addr,
> +                                        (union ib_gid *) &resp->ib_route[0].sgid);
> +               }
>                resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr));
>                break;
>        case 2:
> @@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
>        struct ucma_context *ctx;
>        struct sockaddr *addr;
>        int ret = 0;
> +       enum rdma_transport_type tt;
>
>        if (out_len < sizeof(resp))
>                return -ENOSPC;
> @@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file *file,
>
>        resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
>        resp.port_num = ctx->cm_id->port_num;
> -       switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) {
> +       tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num);
> +       switch (tt) {
>        case RDMA_TRANSPORT_IB:
> -               ucma_copy_ib_route(&resp, &ctx->cm_id->route);
> +       case RDMA_TRANSPORT_RDMAOE:
> +               ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt);
>                break;
>        default:
>                break;
> diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
> index 483057b..66a848e 100644
> --- a/include/rdma/ib_addr.h
> +++ b/include/rdma/ib_addr.h
> @@ -39,6 +39,8 @@
>  #include <linux/netdevice.h>
>  #include <linux/socket.h>
>  #include <rdma/ib_verbs.h>
> +#include <linux/ethtool.h>
> +#include <rdma/ib_pack.h>
>
>  struct rdma_addr_client {
>        atomic_t refcount;
> @@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr,
>        memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid);
>  }
>
> +static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac)
> +{
> +       memset(gid->raw, 0, 16);
> +       *((u32 *)gid->raw) = cpu_to_be32(0xfe800000);
> +       gid->raw[12] = 0xfe;
> +       gid->raw[11] = 0xff;
> +       memcpy(gid->raw + 13, mac + 3, 3);
> +       memcpy(gid->raw + 8, mac, 3);
> +       gid->raw[8] ^= 2;
> +}
> +
> +static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr,
> +                                       union ib_gid *gid)
> +{
> +       rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr);
> +}
> +
> +static inline enum ib_mtu rdmaoe_get_mtu(int mtu)
> +{
> +       /*
> +        * reduce IB headers from effective RDMAoE MTU. 28 stands for
> +        * atomic header which is the biggest possible header after BTH
> +        */
> +       mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28;
> +
> +       if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096))
> +               return IB_MTU_4096;
> +       else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048))
> +               return IB_MTU_2048;
> +       else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024))
> +               return IB_MTU_1024;
> +       else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512))
> +               return IB_MTU_512;
> +       else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256))
> +               return IB_MTU_256;
> +       else
> +               return 0;
> +}
> +
> +static inline int rdmaoe_get_rate(struct net_device *dev)
> +{
> +       struct ethtool_cmd cmd;
> +
> +       if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings ||
> +           dev->ethtool_ops->get_settings(dev, &cmd))
> +               return IB_RATE_PORT_CURRENT;
> +
> +       if (cmd.speed >= 40000)
> +               return IB_RATE_40_GBPS;
> +       else if (cmd.speed >= 30000)
> +               return IB_RATE_30_GBPS;
> +       else if (cmd.speed >= 20000)
> +               return IB_RATE_20_GBPS;
> +       else if (cmd.speed >= 10000)
> +               return IB_RATE_10_GBPS;
> +       else
> +               return IB_RATE_PORT_CURRENT;
> +}
> +
> +static inline int rdma_link_local_addr(struct in6_addr *addr)
> +{
> +       if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) &&
> +           addr->s6_addr32[1] == 0)
> +               return 1;
> +       else
> +               return 0;
> +}
> +
> +static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
> +{
> +       memcpy(mac, &addr->s6_addr[8], 3);
> +       memcpy(mac + 3, &addr->s6_addr[13], 3);
> +       mac[0] ^= 2;
> +}
> +
> +static inline int rdma_is_multicast_addr(struct in6_addr *addr)
> +{
> +       return addr->s6_addr[0] == 0xff ? 1 : 0;
> +}
> +
> +static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
> +{
> +       memset(mac, 0xff, 6);
> +}
> +
>  #endif /* IB_ADDR_H */
> --
> 1.6.3.3
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
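For readers following the ib_addr.h hunk above: rdmaoe_mac_to_ll() and
rdma_get_ll_mac() are inverses of each other. The first builds a link-local
GID (fe80::/64 prefix plus a modified EUI-64 interface ID) from a 48-bit MAC,
and the second recovers the MAC again. A minimal user-space sketch of that
round trip follows. The names mac_to_ll_gid and ll_gid_to_mac and the plain
byte arrays are illustrative stand-ins, not the kernel types used in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Build a link-local GID from a MAC (illustrative port of rdmaoe_mac_to_ll). */
static void mac_to_ll_gid(uint8_t gid[16], const uint8_t mac[6])
{
    memset(gid, 0, 16);
    gid[0] = 0xfe;                 /* fe80::/64 link-local prefix */
    gid[1] = 0x80;
    memcpy(gid + 8, mac, 3);       /* upper three MAC bytes */
    gid[11] = 0xff;                /* ff:fe filler in the middle */
    gid[12] = 0xfe;
    memcpy(gid + 13, mac + 3, 3);  /* lower three MAC bytes */
    gid[8] ^= 2;                   /* flip the universal/local bit */
}

/* Recover the MAC from such a GID (illustrative port of rdma_get_ll_mac). */
static void ll_gid_to_mac(uint8_t mac[6], const uint8_t gid[16])
{
    memcpy(mac, gid + 8, 3);
    memcpy(mac + 3, gid + 13, 3);
    mac[0] ^= 2;                   /* undo the universal/local flip */
}
```

For the MAC 00:02:c9:01:02:03 this gives the GID fe80::202:c9ff:fe01:203, and
converting back returns the original MAC.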
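The rdmaoe_get_mtu() arithmetic in the quoted patch can be checked by hand.
Assuming IB_GRH_BYTES is 40 and IB_BTH_BYTES is 12 (their values in ib_pack.h,
as far as I know), a standard 1500-byte Ethernet MTU leaves
1500 - 40 - 12 - 28 = 1420 bytes, which rounds down to an IB MTU of 1024.
The sketch below returns the size in bytes rather than the enum ib_mtu value
the patch returns, and all names in it are illustrative only.

```c
#include <stddef.h>

/* Header sizes assumed for this sketch; see the note above. */
enum { GRH_BYTES = 40, BTH_BYTES = 12, ATOMIC_ETH_BYTES = 28 };

/* Largest IB MTU (in bytes) that fits in an Ethernet MTU, or 0 if none fits. */
static int rdmaoe_payload_mtu(int eth_mtu)
{
    static const int ib_mtus[] = { 4096, 2048, 1024, 512, 256 };
    int room = eth_mtu - GRH_BYTES - BTH_BYTES - ATOMIC_ETH_BYTES;
    size_t i;

    /* Walk from the largest IB MTU down and take the first that fits. */
    for (i = 0; i < sizeof ib_mtus / sizeof ib_mtus[0]; i++)
        if (room >= ib_mtus[i])
            return ib_mtus[i];
    return 0;
}
```

Under the same assumptions, a 9000-byte jumbo frame leaves 8920 bytes and maps
to 4096.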

From hal.rosenstock at gmail.com  Mon Aug 10 07:01:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 10:01:54 -0400
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE 
	ports
In-Reply-To: <20090807032901.GB20589@mtls03>
References: <20090805082910.GE5599@mtls03>
	<376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com>
	<20090807032901.GB20589@mtls03>
Message-ID: <f0e08f230908100701t5dc3e49al9a2c4de4bedd0a00@mail.gmail.com>

On Thu, Aug 6, 2009 at 11:29 PM, Eli Cohen <eli at dev.mellanox.co.il> wrote:

> On Thu, Aug 06, 2009 at 11:05:47AM -0700, Sean Hefty wrote:
> >
> > Is there a need to expose QP1 to user space?  The CM is in the kernel,
> and
> > there's not an SA.
> >
>
> Good point. There seems to be no reason to expose it.


Might there be some GS service to expose ? Vendor MADs perhaps ? If not,
then not exposing QP1 should be OK.

-- Hal


> Will fix.

From gmpc at sanger.ac.uk  Mon Aug 10 07:30:27 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon, 10 Aug 2009 15:30:27 +0100
Subject: [ofa-general] ofed kernel  config.mk /  BACKPORT_INCLUDES
Message-ID: <4A802F03.2000507@sanger.ac.uk>

Hi all,

I am trying to build lustre 1.8.1 against OFED 1.4.2 and have uncovered a couple
of bugs regarding how BACKPORT_INCLUDES is handled in the ofa-kernel config.mk file.

The ofed_patch.sh script in the ofa_kernel package is incorrectly escaped, and
results in a mangled BACKPORT_INCLUDES path.

The lustre ./configure script is also broken, and prepends an extra "-I"
in front of the BACKPORT_INCLUDES path.

Patches for both are attached.

Cheers,

Guy


-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed_patch.diff
Type: text/x-patch
Size: 563 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090810/6ad80838/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure.patch
Type: text/x-patch
Size: 480 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090810/6ad80838/attachment-0001.bin>

From rgmiller at ornl.gov  Mon Aug 10 07:37:25 2009
From: rgmiller at ornl.gov (Miller, Ross G.)
Date: Mon, 10 Aug 2009 10:37:25 -0400
Subject: [ofa-general] Baseboard Management API
Message-ID: <C6A5A8E5.6EC%rgmiller@ornl.gov>

I read the posts back in March where the bm_call_via API was discussed and accepted.  I'm trying to write a simple utility that uses that function to query several IB switches, but all I get back is an error code.  Has anyone else used this function, and is there any sample code available that I could reference?

Thanks very much,

Ross G. Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory


From hal.rosenstock at gmail.com  Mon Aug 10 07:49:01 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 10:49:01 -0400
Subject: [ofa-general] Baseboard Management API
In-Reply-To: <C6A5A8E5.6EC%rgmiller@ornl.gov>
References: <C6A5A8E5.6EC%rgmiller@ornl.gov>
Message-ID: <f0e08f230908100749l7b4e61dcj4287e3331f1f5e3e@mail.gmail.com>

On Mon, Aug 10, 2009 at 10:37 AM, Miller, Ross G. <rgmiller at ornl.gov> wrote:

> I read the posts back in March where the bm_call_via API was discussed and
> accepted.  I'm trying to write a simple utility that uses that function to
> query several IB switches,


Do those switches have BMAs ?


> but all I get back is an error code.


What error code ?


>   Has anyone else used this function, and is there any sample code
> available that I could reference?


AFAIK no code using this has been posted but you can look at ibping or
vendstat which use vendor MADs but should be similar.

-- Hal


>
> Thanks very much,
>
> Ross G. Miller
> Systems Integration Programmer
> National Center for Computational Sciences
> Oak Ridge National Laboratory

From eli at dev.mellanox.co.il  Mon Aug 10 07:56:54 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 10 Aug 2009 17:56:54 +0300
Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support
In-Reply-To: <f0e08f230908100645u47d20435q81a774c1ea8097f6@mail.gmail.com>
References: <20090805082751.GA5599@mtls03>
	<f0e08f230908100645u47d20435q81a774c1ea8097f6@mail.gmail.com>
Message-ID: <20090810145654.GA8688@mtls03>

On Mon, Aug 10, 2009 at 09:45:58AM -0400, Hal Rosenstock wrote:
> 
> How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to boot
> time or dynamic ?
> 

mlx4 allows changing port designation dynamically by writing to the
sysfs.


From hal.rosenstock at gmail.com  Mon Aug 10 08:03:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 11:03:36 -0400
Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support
In-Reply-To: <20090810145654.GA8688@mtls03>
References: <20090805082751.GA5599@mtls03>
	<f0e08f230908100645u47d20435q81a774c1ea8097f6@mail.gmail.com>
	<20090810145654.GA8688@mtls03>
Message-ID: <f0e08f230908100803k74a383bewb75f21128078058a@mail.gmail.com>

On Mon, Aug 10, 2009 at 10:56 AM, Eli Cohen <eli at dev.mellanox.co.il> wrote:

> On Mon, Aug 10, 2009 at 09:45:58AM -0400, Hal Rosenstock wrote:
> >
> > How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to
> boot
> > time or dynamic ?
> >
>
> mlx4 allows changing port designation dynamically by writing to the
> sysfs.
>

Nice feature :-) I think that currently some of the kernel components (in
addition to user space handling) will need some change to support this. I
don't think that was included in your patch series (unless I missed it which
is entirely possible).

From brian at sun.com  Mon Aug 10 08:17:32 2009
From: brian at sun.com (Brian J. Murrell)
Date: Mon, 10 Aug 2009 11:17:32 -0400
Subject: [ofa-general] ofed kernel  config.mk /  BACKPORT_INCLUDES
In-Reply-To: <4A802F03.2000507@sanger.ac.uk>
References: <4A802F03.2000507@sanger.ac.uk>
Message-ID: <1249917452.7132.192.camel@pc.interlinx.bc.ca>

[ Any further responses to this thread should drop the
general at lists.openfabrics.org list.  I only kept it here so that anyone
on that list that wishes to follow this thread knows that it will not be
continued on the openfabrics list ]

On Mon, 2009-08-10 at 15:30 +0100, Guy Coates wrote:
> Hi all,

Hi Guy,

> The lustre ./configure script is also broken, and prepends an extra "-I"
> in front of the BACKPORT_INCLUDES path.

I don't think so, but xtrace output from configure would verify.

> --- lustre-1.8.1/configure      2009-07-24 23:28:51.000000000 +0100
> +++ configure   2009-08-10 15:08:22.316488430 +0100
> @@ -5595,7 +5595,7 @@
>                 fi
>                 if test -n "$BACKPORT_INCLUDES"; then
>                         OFED_BACKPORT_PATH=`echo $BACKPORT_INCLUDES | sed "s#.*/src/ofa_kernel/#$O2IBPATH/#"`
> -                       EXTRA_LNET_INCLUDE="-I$OFED_BACKPORT_PATH $EXTRA_LNET_INCLUDE"
> +                       EXTRA_LNET_INCLUDE="$OFED_BACKPORT_PATH $EXTRA_LNET_INCLUDE"
>                         echo "$as_me:$LINENO: result: yes" >&5
>  echo "${ECHO_T}yes" >&6
>                 else

Notice that it's "-I$OFED_BACKPORT_PATH" that we are adding to
$EXTRA_LNET_INCLUDE, not "-I$BACKPORT_INCLUDES".  Further notice what
$OFED_BACKPORT_PATH actually is:

OFED_BACKPORT_PATH=`echo $BACKPORT_INCLUDES | sed "s#.*/src/ofa_kernel/#$O2IBPATH/#"`

Your patch failed to include enough context, but notice that configure
sources config.mk:

. $O2IBPATH/config.mk

and then uses the $BACKPORT_INCLUDES to derive an $OFED_BACKPORT_PATH
with a sed expression:

sed "s#.*/src/ofa_kernel/#$O2IBPATH/#"

Which would turn an example $BACKPORT_INCLUDES of
"-I/usr/src/ofa_kernel/kernel_addons/backport/2.6.18-EL5.3/include/"
into "foobar/kernel_addons/backport/2.6.18-EL5.3/include/" assuming
$O2IBPATH="foobar".

So as you can see, when adding $OFED_BACKPORT_PATH to gcc as an include
path, you need to prefix it with "-I".

Please do let me know if your experience is any different from my
explanation.  Please include some xtrace output if so.

b.


From gmpc at sanger.ac.uk  Mon Aug 10 08:35:04 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon, 10 Aug 2009 16:35:04 +0100
Subject: [ofa-general] ofed kernel  config.mk /  BACKPORT_INCLUDES
In-Reply-To: <1249917452.7132.192.camel@pc.interlinx.bc.ca>
References: <4A802F03.2000507@sanger.ac.uk>
	<1249917452.7132.192.camel@pc.interlinx.bc.ca>
Message-ID: <4A803E28.7010504@sanger.ac.uk>

Hi Brian;

A fresh build of the ofa-kernel package gives the following in the build dir:


cat config.mk
BACKPORT_INCLUDES=-I${CWD}/kernel_addons/backport/2.6.22/include/

That is obviously wrong.


If I run the lustre configure, I get:


./configure --with-o2ib=/usr/src/modules/ofa-kernel
--with-linux=/scratch/linux-2.6.22.19

<snip>
checking whether to enable OpenIB gen2 support... no
configure: error: can't compile with OpenIB gen2 headers under
/usr/src/modules/ofa-kernel


config.log has the following set for LNET_INCLUDES:


EXTRA_LNET_INCLUDE='-I-I/kernel_addons/backport/2.6.22/include/
-I/usr/src/modules/ofa-kernel/include'


which results in:

configure:6885: cp conftest.c build && make -d modules  CC=gcc -f
/tmp/lustre-1.8.1/build/Makefile
LUSTRE_LINUX_CONFIG=/scratch/linux-2.6.22.19/.config
LINUXINCLUDE=-I-I/usr/src/modules/kernel_addons/backport/2.6.22/include/
-I/usr/src/modules/ofa-kernel/include
-I/scratch/linux-2.6.22.19/include -I/scratch/linux-2.6.22.19/include
-I/scratch/linux-2.6.22.19/include2 -include
include/linux/autoconf.h -o tmp_include_depends -o scripts -o
include/config/MARKER -C /scratch/linux-2.6.22.19
EXTRA_CFLAGS=-Werror-implicit-function-declaration -g
-I/tmp/lustre-1.8.1/lnet/include -I/tmp/lustre-1.8.1/lnet/include
-I/tmp/lustre-1.8.1/lustre/include
-I/usr/src/modules/ofa-kernel/include  M=/tmp/lustre-1.8.1/build
In file included from /usr/src/modules/ofa-kernel/include/rdma/ib_addr.h:41,
                 from /usr/src/modules/ofa-kernel/include/rdma/rdma_cm.h:39,
                 from /tmp/lustre-1.8.1/build/conftest.c:36:
/usr/src/modules/ofa-kernel/include/rdma/ib_verbs.h:1724: warning: 'struct
dma_attrs' declared inside parameter list
/usr/src/modules/ofa-kernel/include/rdma/ib_verbs.h:1724: warning: its scope is
only this definition or declaration, which is probably not what you want


If I fix config.mk so that the correct path is present:


cat config.mk
BACKPORT_INCLUDES=-I/usr/src/modules/kernel_addons/backport/2.6.22/include/

configure still fails with:

checking whether to enable OpenIB gen2 support... no
configure: error: can't compile with OpenIB gen2 headers under
/usr/src/modules/ofa-kernel

config.log has the following incorrect EXTRA_LNET_INCLUDE:

EXTRA_LNET_INCLUDE='-I-I/usr/src/modules/kernel_addons/backport/2.6.22/include/
 -I/usr/src/modules/ofa-kernel/include'


With the two patches previously sent, everything builds (modulo a separate bug
in the OFED 2.6.22 backport includes).

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802




From rgmiller at ornl.gov  Mon Aug 10 08:44:08 2009
From: rgmiller at ornl.gov (Miller, Ross G.)
Date: Mon, 10 Aug 2009 11:44:08 -0400
Subject: [ofa-general] Baseboard Management API
In-Reply-To: <f0e08f230908100749l7b4e61dcj4287e3331f1f5e3e@mail.gmail.com>
Message-ID: <C6A5B888.6FA%rgmiller@ornl.gov>

Supposedly, the switches have BMAs, but I guess I'd better check with the sysadmins and find out for sure.

Regarding the error code:

If I set the method to IB_MAD_METHOD_GET and the attrid to IB_BM_ATTR_BKEYINFO, then I get an error out of mad_rpc saying the status field was 0xC.  If I'm reading the architecture spec right, 0xC means that method/attrid combination isn't supported.
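A sketch of how the 0xC status decodes, assuming the common MAD status layout as I understand it from the architecture spec (bit 0 busy, bit 1 redirect required, bits 2-4 an invalid-field code). The function and constant names are made up for illustration and are not part of libibmad.

```c
#include <stdint.h>

/* Flag bits in the low part of the MAD status (assumed layout, see above). */
enum { MAD_STATUS_BUSY = 0x0001, MAD_STATUS_REDIRECT = 0x0002 };

/* Extract the 3-bit invalid-field code from bits 2-4 of the status. */
static unsigned mad_status_code(uint16_t status)
{
    return (status >> 2) & 0x7;
}

/* Human-readable name for an invalid-field code. */
static const char *mad_status_code_name(unsigned code)
{
    switch (code) {
    case 0: return "no invalid fields";
    case 1: return "bad version";
    case 2: return "method not supported";
    case 3: return "method/attribute combination not supported";
    case 7: return "invalid attribute or attribute modifier value";
    default: return "reserved";
    }
}
```

Under that layout, 0xC has neither flag bit set and carries code 3, which matches the reading that the method/attrid combination isn't supported.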

If I try IB_MAD_SEND and IB_BM_ATTR_GET_MODULE_STATUS, I receive no errors back, but I also receive no data.

FWIW: The code I've written is based loosely on vendstat.c, but for simplicity I've stripped it down quite a bit and just hard-coded the parameters (method, attrid, LID, etc...)  I'm trying to write a simple utility that will query the status of the redundant power supplies on the switches so our admins don't have to physically check each morning looking for failures.  Byte 5 of IB_BM_ATTR_GET_MODULE_STATUS should tell me exactly what I need, I think.  The only reason for trying the BKEYINFO was just to see if I could get anything to work at all.


Thanks,

Ross G. Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory

On 8/10/09 10:49 AM, "Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:


On Mon, Aug 10, 2009 at 10:37 AM, Miller, Ross G. <rgmiller at ornl.gov> wrote:
I read the posts back in March where the bm_call_via API was discussed and accepted.  I'm trying to write a simple utility that uses that function to query several IB switches,

Do those switches have BMAs ?

but all I get back is an error code.

What error code ?

  Has anyone else used this function, and is there any sample code available that I could reference?

AFAIK no code using this has been posted but you can look at ibping or vendstat which use vendor MADs but should be similar.

-- Hal


Thanks very much,

Ross G. Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory


From marcin.slusarz at gmail.com  Mon Aug 10 09:07:25 2009
From: marcin.slusarz at gmail.com (Marcin Slusarz)
Date: Mon, 10 Aug 2009 18:07:25 +0200
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <adaocqo9sb4.fsf@cisco.com>
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>	<1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
	<adaocqo9sb4.fsf@cisco.com>
Message-ID: <4A8045BD.8010803@gmail.com>

Roland Dreier wrote:
>  >  drivers/infiniband/hw/cxgb3/iwch.c |    4 +---
>  >  drivers/infiniband/hw/mlx4/main.c  |    6 +-----
> 
>  > --- a/drivers/infiniband/hw/mlx4/main.c
>  > +++ b/drivers/infiniband/hw/mlx4/main.c
>  > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
>  >  
>  >  static void *mlx4_ib_add(struct mlx4_dev *dev)
>  >  {
>  > -	static int mlx4_ib_version_printed;
>  >  	struct mlx4_ib_dev *ibdev;
>  >  	int num_ports = 0;
>  >  	int i;
>  >  
>  > -	if (!mlx4_ib_version_printed) {
>  > -		printk(KERN_INFO "%s", mlx4_ib_version);
>  > -		++mlx4_ib_version_printed;
>  > -	}
>  > +	printk_once(KERN_INFO "%s", mlx4_ib_version);
>  >  
>  >  	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
>  >  		num_ports++;
> 
> Looks fine but there is near-identical code in
> drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert
> too.

Thanks for the hint. Updated patch below.

---
From: Marcin Slusarz <marcin.slusarz at gmail.com>
Date: Mon, 10 Aug 2009 18:01:49 +0200
Subject: [PATCH 10/14 v2] infiniband: use printk_once

Signed-off-by: Marcin Slusarz <marcin.slusarz at gmail.com>
Cc: Roland Dreier <rolandd at cisco.com>
Cc: Sean Hefty <sean.hefty at intel.com>
Cc: Hal Rosenstock <hal.rosenstock at gmail.com>
Cc: general at lists.openfabrics.org
---
 drivers/infiniband/hw/cxgb3/iwch.c       |    4 +---
 drivers/infiniband/hw/mlx4/main.c        |    6 +-----
 drivers/infiniband/hw/mthca/mthca_main.c |    6 +-----
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 26fc0a4..9cc99df 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -105,11 +105,9 @@ static void rnic_init(struct iwch_dev *rnicp)
 static void open_rnic_dev(struct t3cdev *tdev)
 {
 	struct iwch_dev *rnicp;
-	static int vers_printed;
 
 	PDBG("%s t3cdev %p\n", __func__,  tdev);
-	if (!vers_printed++)
-		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+	printk_once(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
 		       DRV_VERSION);
 	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
 	if (!rnicp) {
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..0b2f77a 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
 
 static void *mlx4_ib_add(struct mlx4_dev *dev)
 {
-	static int mlx4_ib_version_printed;
 	struct mlx4_ib_dev *ibdev;
 	int num_ports = 0;
 	int i;
 
-	if (!mlx4_ib_version_printed) {
-		printk(KERN_INFO "%s", mlx4_ib_version);
-		++mlx4_ib_version_printed;
-	}
+	printk_once(KERN_INFO "%s", mlx4_ib_version);
 
 	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
 		num_ports++;
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 13da9f1..2e4e043 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1215,15 +1215,11 @@ int __mthca_restart_one(struct pci_dev *pdev)
 static int __devinit mthca_init_one(struct pci_dev *pdev,
 				    const struct pci_device_id *id)
 {
-	static int mthca_version_printed = 0;
 	int ret;
 
 	mutex_lock(&mthca_device_mutex);
 
-	if (!mthca_version_printed) {
-		printk(KERN_INFO "%s", mthca_version);
-		++mthca_version_printed;
-	}
+	printk_once(KERN_INFO "%s", mthca_version);
 
 	if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) {
 		printk(KERN_ERR PFX "%s has invalid driver data %lx\n",
-- 
1.6.3.3


From nashwath at gmail.com  Mon Aug 10 09:11:22 2009
From: nashwath at gmail.com (Ashwath Narasimhan)
Date: Mon, 10 Aug 2009 12:11:22 -0400
Subject: [ofa-general] Manipulating Credits in Infiniband
Message-ID: <ed1288770908100911h46524f4ch34cc6582bb1c03b@mail.gmail.com>

Hi,
I looked into the InfiniBand driver files. As I understand it, to limit the
data rate we manipulate the credits on either end. Since the number of credits
available depends on the size of the receiver's receive work queue, I decided
to limit the queue size to, say, 5 instead of 8192 (reference: ipoib.h,
IPOIB_MAX_QUEUE_SIZE, set to, say, 3), since my higher-layer protocol is IPoIB.
I just want to confirm whether I am doing the right thing.

-- 
regards,
Ashwath

From mdidomenico4 at gmail.com  Mon Aug 10 09:12:45 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 10 Aug 2009 12:12:45 -0400
Subject: [ofa-general] sun x4100 with IB
In-Reply-To: <e75d22a90908080722y184fe037wa6c8d8fbf80470e7@mail.gmail.com>
References: <e75d22a90908071437t7b195d43h318f1cd06ac0ffb@mail.gmail.com>
	<1249684671.13945.68.camel@rockymtn.cumminsconsultants.com>
	<e75d22a90908080722y184fe037wa6c8d8fbf80470e7@mail.gmail.com>
Message-ID: <e75d22a90908100912t509bf645x3cf652144a056762@mail.gmail.com>

The cards are MT23108.

ibdiagnet -lw 4x -ls 2.5 shows no fabric errors, which I also confirmed on my
SilverStorm switch.

The rack contains X4100 and X4100 M2 servers.  The M2 servers do not
exhibit this behavior; they're showing 750MB/sec local loopback on an
IMB pingpong run.

Errata 56CLK is enabled in the BIOS, and nearly all other PCI-X or HT
Bridge settings are at their defaults (i.e. auto).

I fiddled with a few of the settings, but they made it neither worse
nor better.  So I'm really not sure what option got changed when I
reset the BIOS to optimal defaults, but I wish I did...


On Sat, Aug 8, 2009 at 10:22 AM, Michael Di
Domenico<mdidomenico4 at gmail.com> wrote:
> Yes, its an infinihost III, i believe its MT23208, but dont quote me
> on that, i'm not at the machine currently
>
> Is there something specific you want to see in lspci -vvv, i can't
> easily cut and paste from the machine
>
> On Fri, Aug 7, 2009 at 6:37 PM, Robert Cummins<robertacummins at gmail.com> wrote:
>> Can you send the output from lspci -vvv?   What card are you using?  Is
>> it an Infinihost III SDR card?  What does ibdiagnet -lw 4x -ls 5 return?
>>
>> On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote:
>>> I have several Sun x4100 with Infiniband servers which appear to be
>>> running at 400MB/sec instead of 800MB/sec.  It's a freshly reformatted
>>> cluster converting from solaris to linux.  We also reset the bios
>>> settings with "load optimal defaults". Does anyone know which bios
>>> setting I changed to dump the BW?
>>>
>>> x4100
>>> mellanox ib
>>> ofed-1.4.1-rc6 w/ openmpi
>


From brian at sun.com  Mon Aug 10 09:28:25 2009
From: brian at sun.com (Brian J. Murrell)
Date: Mon, 10 Aug 2009 12:28:25 -0400
Subject: [ofa-general] ofed kernel  config.mk /  BACKPORT_INCLUDES
In-Reply-To: <4A803E28.7010504@sanger.ac.uk>
References: <4A802F03.2000507@sanger.ac.uk>
	<1249917452.7132.192.camel@pc.interlinx.bc.ca>
	<4A803E28.7010504@sanger.ac.uk>
Message-ID: <1249921705.7132.306.camel@pc.interlinx.bc.ca>

On Mon, 2009-08-10 at 16:35 +0100, Guy Coates wrote:
> Hi Brian;

Hi Guy,

> cat config.mk
> BACKPORT_INCLUDES=-I${CWD}/kernel_addons/backport/2.6.22/include/

I believe that is wrong and is a result of the first patch in your
previous e-mail.

Certainly in the 1.4.1 build I did here for all of my testing, I have:

$ cat config.mk 
BACKPORT_INCLUDES=-I/usr/src/ofa_kernel/kernel_addons/backport/2.6.18-EL5.3/include/

And of course, once the first issue is fixed, your second issue, with
the lustre configure script, will go away.

> That is obviously wrong.
> 
> 
> If I run the lustre configure, I get:
> 
> 
> ./configure --with-o2ib=/usr/src/modules/ofa-kernel
> --with-linux=/scratch/linux-2.6.22.19
> 
> <snip>
> checking whether to enable OpenIB gen2 support... no
> configure: error: can't compile with OpenIB gen2 headers under
> /usr/src/modules/ofa-kernel

Of course you do, because the config.mk is wrong.

> EXTRA_LNET_INCLUDE='-I-I/kernel_addons/backport/2.6.22/include/
> -I/usr/src/modules/ofa-kernel/include'

Right.  Because the sed failed to accomplish its replacement and took
the value from the config.mk verbatim.  As I said, once the root issue
with config.mk is fixed, the lustre configure issue will also resolve.

> If I fix config.mk so that the correct path is present:
> 
> 
> cat config.mk
> BACKPORT_INCLUDES=-I/usr/src/modules/kernel_addons/backport/2.6.22/include/
                               ^^^^^^^
That's because you are relocating the sources during your ofa_kernel
build to something other than the default.  The code in the lustre
configure is assuming the default location.  Arguably lustre's configure
should handle this.  Please file a bug.

> With the two patches previously sent, everything builds

For you.  It still does not accomplish the goals of the original design
of that code in the configure script.

But the lustre configure discussion really does not belong on this list.
After you have filed your bug you should summarize in a followup to
lustre-discuss (removing this list from your followup) given that it was
not included on the CC list of this message.

b.


From sean.hefty at intel.com  Mon Aug 10 09:32:25 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 10 Aug 2009 09:32:25 -0700
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE
	ports
In-Reply-To: <f0e08f230908100701t5dc3e49al9a2c4de4bedd0a00@mail.gmail.com>
References: <20090805082910.GE5599@mtls03>	
	<376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com>	
	<20090807032901.GB20589@mtls03>
	<f0e08f230908100701t5dc3e49al9a2c4de4bedd0a00@mail.gmail.com>
Message-ID: <1BE4788B64784417A2043CE63F6B09A3@amr.corp.intel.com>

>Might there be some GS service to expose ? Vendor MADs perhaps ? If not, then
>not exposing QP1 should be OK.

At some point, exposing QP1 may make sense.  I was thinking more along the lines
of limiting the user space interfaces until things can be standardized. 


From rdreier at cisco.com  Mon Aug 10 10:42:18 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 10:42:18 -0700
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <200908100936.26963.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 10 Aug 2009 09:36:26 +0300")
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
	<1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
	<adaocqo9sb4.fsf@cisco.com>
	<200908100936.26963.jackm@dev.mellanox.co.il>
Message-ID: <ada7hxba7lx.fsf@cisco.com>


 > I'm a bit nervous about this one.  
 > printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h
 > (i.e., when the kernel is configured).  Otherwise, it gets defined to printk --
 > and it will always print in this case.
 > (see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272).

Umm... if CONFIG_PRINTK is turned off nothing prints, right?

 > Do you think that distributions will ALWAYS have CONFIG_PRINTK defined?

Yes, I suspect they do want to get kernel messages.

 - R.


From sean.hefty at intel.com  Mon Aug 10 12:31:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 10 Aug 2009 12:31:19 -0700
Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding
In-Reply-To: <20090805082929.GG5599@mtls03>
References: <20090805082929.GG5599@mtls03>
Message-ID: <55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com>

>@@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private
>*id_priv,
> {
> 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
> 	int ret;
>+	u16 pkey;
>+
>+        if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)
>==

nit: It looks like the if is indented by spaces, instead of a tab.

>+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv)
>+{
>+	struct rdma_route *route = &id_priv->id.route;
>+	struct rdma_addr *addr = &route->addr;
>+	struct cma_work *work;
>+	int ret;
>+	struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr;
>+	struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr;
>+
>+	if (src_addr->sin_family != dst_addr->sin_family)
>+		return -EINVAL;
>+
>+	work = kzalloc(sizeof *work, GFP_KERNEL);
>+	if (!work)
>+		return -ENOMEM;
>+
>+	work->id = id_priv;
>+	INIT_WORK(&work->work, cma_work_handler);
>+
>+	route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL);
>+	if (!route->path_rec) {
>+		ret = -ENOMEM;
>+		goto err;
>+	}
>+
>+	route->num_paths = 1;
>+
>+	rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr);
>+	rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr);
>+
>+	route->path_rec->hop_limit = 2;
>+	route->path_rec->reversible = 1;
>+	route->path_rec->pkey = cpu_to_be16(0xffff);
>+	route->path_rec->mtu_selector = 2;
>+	route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu);
>+	route->path_rec->rate_selector = 2;
>+	route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev);
>+	route->path_rec->packet_life_time_selector = 2;
>+	route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME;
>+
>+	work->old_state = CMA_ROUTE_QUERY;
>+	work->new_state = CMA_ROUTE_RESOLVED;
>+	if (!route->path_rec->mtu || !route->path_rec->rate) {
>+		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
>+		work->event.status = -1;

Any reason not to fail immediately here and leave the id state unchanged?

>+	} else {
>+		work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
>+		work->event.status = 0;
>+	}
>+
>+	queue_work(cma_wq, &work->work);
>+
>+	return 0;
>+
>+err:
>+	kfree(work);
>+	return ret;
>+}
>+
> int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
> {
> 	struct rdma_id_private *id_priv;
>@@ -1744,6 +1824,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int
>timeout_ms)
> 	case RDMA_TRANSPORT_IWARP:
> 		ret = cma_resolve_iw_route(id_priv, timeout_ms);
> 		break;
>+	case RDMA_TRANSPORT_RDMAOE:
>+		ret = cma_resolve_rdmaoe_route(id_priv);
>+		break;
> 	default:
> 		ret = -ENOSYS;
> 		break;
>@@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct
>rdma_conn_param *conn_param)
>
> 	switch (rdma_port_get_transport(id->device, id->port_num)) {
> 	case RDMA_TRANSPORT_IB:
>+	case RDMA_TRANSPORT_RDMAOE:
> 		if (cma_is_ud_ps(id->ps))
> 			ret = cma_resolve_ib_udp(id_priv, conn_param);
> 		else
>@@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct
>rdma_conn_param *conn_param)
>
> 	switch (rdma_port_get_transport(id->device, id->port_num)) {
> 	case RDMA_TRANSPORT_IB:
>+	case RDMA_TRANSPORT_RDMAOE:
> 		if (cma_is_ud_ps(id->ps))
> 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
> 						conn_param->private_data,
>@@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void
>*private_data,
>
> 	switch (rdma_port_get_transport(id->device, id->port_num)) {
> 	case RDMA_TRANSPORT_IB:
>+	case RDMA_TRANSPORT_RDMAOE:
> 		if (cma_is_ud_ps(id->ps))
> 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
> 						private_data, private_data_len);
>@@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
>
> 	switch (rdma_port_get_transport(id->device, id->port_num)) {
> 	case RDMA_TRANSPORT_IB:
>+	case RDMA_TRANSPORT_RDMAOE:
> 		ret = cma_modify_qp_err(id_priv);
> 		if (ret)
> 			goto out;
>@@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct rdma_id_private
>*id_priv,
> 	return 0;
> }
>
>+
>+static void rdmaoe_mcast_work_handler(struct work_struct *work)
>+{
>+	struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work);
>+	struct cma_multicast *mc = mw->mc;
>+	struct ib_sa_multicast *m = mc->multicast.ib;
>+
>+	mc->multicast.ib->context = mc;
>+	cma_ib_mc_handler(0, m);
>+	kfree(m);
>+	kfree(mw);
>+}
>+
>+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv,
>+				     struct cma_multicast *mc)
>+{
>+	struct rdmaoe_mcast_work *work;
>+	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
>+
>+	if (cma_zero_addr((struct sockaddr *)&mc->addr))
>+		return -EINVAL;
>+
>+	work = kzalloc(sizeof *work, GFP_KERNEL);
>+	if (!work)
>+		return -ENOMEM;
>+
>+	mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL);
>+	if (!mc->multicast.ib) {
>+		kfree(work);
>+		return -ENOMEM;
>+	}

nit: I'd prefer to goto a common cleanup area to make it easier to add changes
in the future. 

>+
>+	cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib->rec.mgid);
>+	mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff);
>+	if (id_priv->id.ps == RDMA_PS_UDP)
>+		mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY);
>+	mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev);
>+	mc->multicast.ib->rec.hop_limit = 1;
>+	mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu);

Do we need to check the rate/mtu here, like in resolve route?  Or should we be
good since we could successfully resolve the route?  Actually, can we just read
the data from the path record that gets stored with the id?
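
For concreteness, here is a user-space sketch of the MTU calculation from the
quoted rdmaoe_get_mtu().  It assumes IB_GRH_BYTES is 40 and IB_BTH_BYTES is 12;
the enum and the sketch_ names are local stand-ins, not the kernel definitions.
It shows that a small netdev MTU yields 0, which is the case the check in
resolve route guards against and which the join path would store unchecked.

```c
#include <assert.h>

/* Local stand-ins for the kernel's enum ib_mtu values (assumption). */
enum sketch_mtu {
	SK_MTU_NONE = 0,
	SK_MTU_256  = 1,
	SK_MTU_512  = 2,
	SK_MTU_1024 = 3,
	SK_MTU_2048 = 4,
	SK_MTU_4096 = 5
};

#define SK_GRH_BYTES	40	/* assumed value of IB_GRH_BYTES */
#define SK_BTH_BYTES	12	/* assumed value of IB_BTH_BYTES */
#define SK_ATOMIC_BYTES	28	/* the "28" from the patch comment */

/* Map a netdev MTU in bytes to the largest IB MTU that still fits. */
static enum sketch_mtu sketch_get_mtu(int eth_mtu)
{
	int payload = eth_mtu - SK_GRH_BYTES - SK_BTH_BYTES - SK_ATOMIC_BYTES;

	assert(eth_mtu >= 0);
	if (payload >= 4096)
		return SK_MTU_4096;
	else if (payload >= 2048)
		return SK_MTU_2048;
	else if (payload >= 1024)
		return SK_MTU_1024;
	else if (payload >= 512)
		return SK_MTU_512;
	else if (payload >= 256)
		return SK_MTU_256;
	else
		return SK_MTU_NONE;	/* caller has to treat this as an error */
}
```

With a standard 1500-byte Ethernet MTU the payload is 1420 bytes, so the result
is the 1024 enum; anything below 336 bytes yields 0.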

>+	rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
>+	work->id = id_priv;
>+	work->mc = mc;
>+	INIT_WORK(&work->work, rdmaoe_mcast_work_handler);
>+
>+	queue_work(cma_wq, &work->work);
>+
>+	return 0;
>+}
>+
> int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
> 			void *context)
> {
>@@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct
>sockaddr *addr,
> 	case RDMA_TRANSPORT_IB:
> 		ret = cma_join_ib_multicast(id_priv, mc);
> 		break;
>+	case RDMA_TRANSPORT_RDMAOE:
>+		ret = cma_rdmaoe_join_multicast(id_priv, mc);
>+		break;
> 	default:
> 		ret = -ENOSYS;
> 		break;
>@@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct
>sockaddr *addr,
> 		spin_unlock_irq(&id_priv->lock);
> 		kfree(mc);
> 	}
>+
> 	return ret;
> }
> EXPORT_SYMBOL(rdma_join_multicast);
>@@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct
>sockaddr *addr)
> 				ib_detach_mcast(id->qp,
> 						&mc->multicast.ib->rec.mgid,
> 						mc->multicast.ib->rec.mlid);
>-			ib_sa_free_multicast(mc->multicast.ib);
>+			if (rdma_port_get_transport(id_priv->cma_dev->device,
>id_priv->id.port_num) ==
>+			    RDMA_TRANSPORT_IB)
>+				ib_sa_free_multicast(mc->multicast.ib);
> 			kref_put(&mc->mcref, release_mc);
> 			return;
> 		}
>diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
>index 24d9510..c7c9e92 100644
>--- a/drivers/infiniband/core/ucma.c
>+++ b/drivers/infiniband/core/ucma.c
>@@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file *file,
> }
>
> static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
>-			       struct rdma_route *route)
>+			       struct rdma_route *route,
>+			       enum rdma_transport_type tt)
> {
> 	struct rdma_dev_addr *dev_addr;
>
>@@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct
>rdma_ucm_query_route_resp *resp,
> 	switch (route->num_paths) {
> 	case 0:
> 		dev_addr = &route->addr.dev_addr;
>-		ib_addr_get_dgid(dev_addr,
>-				 (union ib_gid *) &resp->ib_route[0].dgid);
>-		ib_addr_get_sgid(dev_addr,
>-				 (union ib_gid *) &resp->ib_route[0].sgid);
>+		if (tt == RDMA_TRANSPORT_IB) {
>+			ib_addr_get_dgid(dev_addr,
>+					 (union ib_gid *) &resp->ib_route[0].dgid);
>+			ib_addr_get_sgid(dev_addr,
>+					 (union ib_gid *) &resp->ib_route[0].sgid);
>+		} else {
>+			rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid,
>+					 dev_addr->dst_dev_addr);
>+			rdmaoe_addr_get_sgid(dev_addr,
>+					 (union ib_gid *) &resp->ib_route[0].sgid);
>+		}
> 		resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr));
> 		break;
> 	case 2:
>@@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
> 	struct ucma_context *ctx;
> 	struct sockaddr *addr;
> 	int ret = 0;
>+	enum rdma_transport_type tt;
>
> 	if (out_len < sizeof(resp))
> 		return -ENOSPC;
>@@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file *file,
>
> 	resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
> 	resp.port_num = ctx->cm_id->port_num;
>-	switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) {
>+	tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num);
>+	switch (tt) {
> 	case RDMA_TRANSPORT_IB:
>-		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
>+	case RDMA_TRANSPORT_RDMAOE:
>+		ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt);

It seems simpler to just add a new call ucma_copy_rdmaoe_route, rather than
merging those two transports into a single copy function that then branches
based on the transport.

> 		break;
> 	default:
> 		break;
>diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
>index 483057b..66a848e 100644
>--- a/include/rdma/ib_addr.h
>+++ b/include/rdma/ib_addr.h
>@@ -39,6 +39,8 @@
> #include <linux/netdevice.h>
> #include <linux/socket.h>
> #include <rdma/ib_verbs.h>
>+#include <linux/ethtool.h>
>+#include <rdma/ib_pack.h>
>
> struct rdma_addr_client {
> 	atomic_t refcount;
>@@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr
>*dev_addr,
> 	memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid);
> }
>
>+static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac)
>+{
>+	memset(gid->raw, 0, 16);
>+	*((u32 *)gid->raw) = cpu_to_be32(0xfe800000);
>+	gid->raw[12] = 0xfe;
>+	gid->raw[11] = 0xff;
>+	memcpy(gid->raw + 13, mac + 3, 3);
>+	memcpy(gid->raw + 8, mac, 3);
>+	gid->raw[8] ^= 2;
>+}
>+
>+static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr,
>+					union ib_gid *gid)
>+{
>+	rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr);
>+}
>+
>+static inline enum ib_mtu rdmaoe_get_mtu(int mtu)
>+{
>+	/*
>+	 * reduce IB headers from effective RDMAoE MTU. 28 stands for
>+	 * atomic header which is the biggest possible header after BTH
>+	 */
>+	mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28;
>+
>+	if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096))
>+		return IB_MTU_4096;
>+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048))
>+		return IB_MTU_2048;
>+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024))
>+		return IB_MTU_1024;
>+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512))
>+		return IB_MTU_512;
>+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256))
>+		return IB_MTU_256;
>+	else
>+		return 0;
>+}
>+
>+static inline int rdmaoe_get_rate(struct net_device *dev)
>+{
>+	struct ethtool_cmd cmd;
>+
>+	if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings ||
>+	    dev->ethtool_ops->get_settings(dev, &cmd))
>+		return IB_RATE_PORT_CURRENT;
>+
>+	if (cmd.speed >= 40000)
>+		return IB_RATE_40_GBPS;
>+	else if (cmd.speed >= 30000)
>+		return IB_RATE_30_GBPS;
>+	else if (cmd.speed >= 20000)
>+		return IB_RATE_20_GBPS;
>+	else if (cmd.speed >= 10000)
>+		return IB_RATE_10_GBPS;
>+	else
>+		return IB_RATE_PORT_CURRENT;
>+}
>+
>+static inline int rdma_link_local_addr(struct in6_addr *addr)
>+{
>+	if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) &&
>+	    addr->s6_addr32[1] == 0)
>+		return 1;
>+	else
>+		return 0;
>+}

just replace the 'if' with 'return'

>+
>+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
>+{
>+	memcpy(mac, &addr->s6_addr[8], 3);
>+	memcpy(mac + 3, &addr->s6_addr[13], 3);
>+	mac[0] ^= 2;
>+}
>+
>+static inline int rdma_is_multicast_addr(struct in6_addr *addr)
>+{
>+	return addr->s6_addr[0] == 0xff ? 1 : 0;
>+}
>+
>+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
>+{
>+	memset(mac, 0xff, 6);
>+}

I don't think we want all of these inline, in particular rdmaoe_mac_to_ll,
rdmaoe_get_mtu, and rdmaoe_get_rate.
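
For anyone following the address helpers in the patch, here is a user-space
sketch of the MAC to link-local GID mapping and its inverse.  It mirrors the
byte layout of the quoted rdmaoe_mac_to_ll() and rdma_get_ll_mac(), but works
on plain byte arrays; the sketch_ names are hypothetical and this is not the
kernel code.

```c
#include <assert.h>
#include <string.h>

/* MAC -> link-local GID, following the byte layout in the quoted patch. */
static void sketch_mac_to_ll(unsigned char gid[16], const unsigned char mac[6])
{
	assert(gid && mac);
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80::/64 link-local prefix */
	gid[1] = 0x80;
	memcpy(gid + 8, mac, 3);	/* first three MAC bytes */
	gid[11] = 0xff;			/* ff:fe filler in the middle */
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);	/* last three MAC bytes */
	gid[8] ^= 2;			/* flip the universal/local bit */
}

/* Link-local GID -> MAC, undoing the steps above. */
static void sketch_ll_to_mac(const unsigned char gid[16], unsigned char mac[6])
{
	assert(gid && mac);
	memcpy(mac, gid + 8, 3);
	memcpy(mac + 3, gid + 13, 3);
	mac[0] ^= 2;			/* flip the bit back */
}
```

As an example, the MAC 00:02:c9:01:02:03 maps to fe80::202:c9ff:fe01:203, and
converting that GID back gives the original MAC.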


From rdreier at cisco.com  Mon Aug 10 13:30:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 13:30:44 -0700
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has
	not allocated
In-Reply-To: <20090810084527.GA2446@mtls03> (Eli Cohen's message of "Mon, 10
	Aug 2009 11:45:27 +0300")
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	<20090810084527.GA2446@mtls03>
Message-ID: <adaljlr8l8r.fsf@cisco.com>


 > Looking at mlx4_write_mtt_chunk() I see that it calls
 > mlx4_table_find() with a pointer to single dma_addr_t - dma_handle -
 > while the dma addresses for the ICM memory is actually a list of
 > different addresses covering possibly different sizes. I think
 > mlx4_table_find() should be changed to support that, and then we can
 > use calls to dma_sync_single_for_cpu()/dma_sync_single_for_device()
 > with the correct dma addresses.

No, I think we're careful that we write MTT ranges that don't cross a
page so there shouldn't be any problem.


From rdreier at cisco.com  Mon Aug 10 13:32:32 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 13:32:32 -0700
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has
	not allocated
In-Reply-To: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	(Bart Van Assche's message of "Sat, 8 Aug 2009 19:49:22 +0200")
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
Message-ID: <adahbwf8l5r.fsf@cisco.com>


 > Has anyone ever encountered a message like the one below ? This message was
 > generated while booting a 2.6.30.4 kernel with CONFIG_DMA_API_DEBUG=y and
 > before any out-of-tree kernel modules were loaded.
 > 
 > ------------[ cut here ]------------
 > WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0()
 > Hardware name: P5Q DELUXE
 > mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it
 > has not allocated [device address=0x0000000139482000] [size=4096 bytes]
 > Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel
 > snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801
 > rtc_core hid_belkin mlx4_core(
 > +) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev
 > serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx
 > xor raid0 sd_mod crc_t10dif
 > ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic
 > ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal
 > processor thermal_sys hwmon
 > Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6
 > Call Trace:
 >  [<ffffffff8039bc7c>] ? check_sync+0x47c/0x4b0
 >  [<ffffffff80248b48>] warn_slowpath_common+0x78/0xd0
 >  [<ffffffff80248bfc>] warn_slowpath_fmt+0x3c/0x40
 >  [<ffffffff80517769>] ? _spin_lock_irqsave+0x49/0x60
 >  [<ffffffff8039b8ab>] ? check_sync+0xab/0x4b0
 >  [<ffffffff8039bc7c>] check_sync+0x47c/0x4b0
 >  [<ffffffff802724ac>] ? mark_held_locks+0x6c/0x90
 >  [<ffffffff8039be1d>] debug_dma_sync_single_for_cpu+0x1d/0x20
 >  [<ffffffffa024a969>] mlx4_write_mtt+0x159/0x1e0 [mlx4_core]

I think the problem is that there really isn't any way truly supported
by the DMA API to do a partial sync on something we mapped with
"map_sg".  I guess we really should just give up on virtual mapping etc
and use dma_map_single to map the ICM memory; I doubt it has any
measurable downside, even on platforms where dma_sync_single is a NOP.

 - R.


From rdreier at cisco.com  Mon Aug 10 13:33:40 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 13:33:40 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: <20090808034817.GA30697@suse.de> (Greg KH's message of "Fri, 7
	Aug 2009 20:48:17 -0700")
References: <e2e108260908060943u344bbe03k2baab01b204c9cca@mail.gmail.com>
	<adad478hmi8.fsf@cisco.com>
	<e2e108260908061146y47ae45f5j6b8085d44cd1c45b@mail.gmail.com>
	<adaocqsg339.fsf@cisco.com>
	<e2e108260908061229v2c605aabp7cf66cbe568d6755@mail.gmail.com>
	<adafxc4g1e7.fsf@cisco.com>
	<e2e108260908070026s10658adl2c4a9a5b3eba1a08@mail.gmail.com>
	<20090808034817.GA30697@suse.de>
Message-ID: <adad4738l3v.fsf@cisco.com>


 > No, it still makes sense :)

So what's the fix for this?  If even you have trouble understanding
kobject lifetimes and the requirement for a release function, is there
hope for anyone else?

 - R.


From rdreier at cisco.com  Mon Aug 10 13:48:28 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 13:48:28 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion
	dependency detected
In-Reply-To: <e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	(Bart Van Assche's message of "Fri, 7 Aug 2009 11:58:11 +0200")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
Message-ID: <ada8whr8kf7.fsf@cisco.com>


 > The lockdep report I obtained this morning with a 2.6.30.4 kernel and
 > the two patches applied has been attached to the kernel bugzilla
 > entry. This lockdep report was generated while testing the SRPT target
 > software. I have double checked that the SRPT target implementation
 > does not hold any spinlocks or mutexes while calling functions in the
 > IB core. This means that the SRPT target code cannot have caused any
 > of the reported lock cycles.

Lockdep is not quite so simple as what you checked, but yes, in this
case it does appear to be pointing at a real (albeit spectacularly
unlikely) deadlock in the core IB stack:

  ib_cm takes cm_id_priv->lock and calls ib_post_send_mad()
  from there, ib_mad takes mad_agent_priv->lock

  in another context, ib_mad takes mad_agent_priv->lock and does
  cancel_delayed_work(&mad_agent_priv->timed_work) (and internally
  cancel_delayed_work() does del_timer_sync())

  finally, in another context a communication established event can
  occur and generate a callback (in interrupt context) to ib_cm where it
  takes cm_id_priv->lock

So there can be a chain that deadlocks: if the timer for the timed_work
is running on a CPU, and the interrupt for the communication established
event occurs while the timer is running, then that interrupt handler can
try to take cm_id_priv->lock.

However on another CPU, someone could already be holding
cm_id_priv->lock and call into ib_post_send_mad(), and spinning on
mad_agent_priv->lock, while on yet another CPU, someone could be holding
mad_agent_priv->lock and doing cancel_delayed_work().

And that will deadlock waiting in del_timer_sync() since the timer has
been interrupted by an interrupt handler that will spin on a spinlock
that is part of this chain.
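
One way to picture the chain is as a wait-for graph in which each context waits
on at most one other.  The following is only a user-space model, not kernel
code: the context names and on_wait_cycle() are made up for illustration, and
the second table assumes a variant where the cancel is done without holding
mad_agent_priv->lock.

```c
#include <assert.h>

/* The four contexts from the description above (illustrative names). */
enum { CPU_A, CPU_B, TIMER, IRQ, NUM_CTX };
#define NOT_WAITING (-1)

/*
 * Reported chain:
 *   CPU_A holds cm_id_priv->lock and spins on mad_agent_priv->lock (CPU_B),
 *   CPU_B holds mad_agent_priv->lock and waits in del_timer_sync() (TIMER),
 *   TIMER cannot finish until the interrupt on its CPU returns (IRQ),
 *   IRQ spins on cm_id_priv->lock (CPU_A).
 */
static const int reported_chain[NUM_CTX] = {
	[CPU_A] = CPU_B, [CPU_B] = TIMER, [TIMER] = IRQ, [IRQ] = CPU_A
};

/* Hypothetical variant: CPU_B cancels without holding the mad lock. */
static const int cancel_outside_lock[NUM_CTX] = {
	[CPU_A] = NOT_WAITING, [CPU_B] = TIMER, [TIMER] = IRQ, [IRQ] = CPU_A
};

/* Return 1 if following the wait-for edges from 'start' leads back to it. */
static int on_wait_cycle(const int waits_for[], int n, int start)
{
	int cur = start;
	int steps;

	assert(start >= 0 && start < n);
	for (steps = 0; steps < n; steps++) {
		cur = waits_for[cur];
		if (cur == NOT_WAITING)
			return 0;	/* someone can make progress */
		if (cur == start)
			return 1;	/* closed loop: nobody can proceed */
	}
	return 0;
}
```

on_wait_cycle(reported_chain, NUM_CTX, CPU_A) returns 1, while the variant
table has no closed loop from any starting context.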

I'm not sure what the right fix is.  It does seem to me that this should
be fixed within the ib_mad module, since doing del_timer_sync() within a
spinlocked region seems like the fundamental problem.  However I'm not
sure what the best way to rewrite the ib_mad usage is.

 > By the way, I noticed that while many subsystems in the Linux kernel
 > use event queues to report information to higher software layers, that
 > the IB core makes extensive use of callback functions. The combination
 > of nested locking and callback functions can easily lead to lock
 > inversion. This effect is well known in the operating system world --
 > see e.g. the talk by John Ousterhout about multithreaded versus
 > event-driven software (http://home.pacbell.net/ouster/threads.pdf,
 > 1996).

I'm not sure what you mean by this.  What would be an example of a
subsystem that uses event queues to report information?  I think the
design of the RDMA stack is quite parallel to most other Linux
subsystems, and we don't have anything as deadlock prone as, say, the
network stack's rtnl.

Trying to queue events up instead of calling back from interrupt context
is not all that simple, since one cannot reliably allocate memory, and
one must deal with synchronization with the consuming context etc.  It's
probably at least as deadlock-prone to try and queue as it is to just
call back.

Ousterhout's talk certainly makes sense for a certain class of userspace
apps, but he explicitly says that event driven programming only uses one
CPU, and of course userspace doesn't have hard interrupt handlers or
anything like that.  So the kernel is more complex just because the
environment it runs under is a little trickier than what the kernel
provides for userspace.

 - R.


From sean.hefty at intel.com  Mon Aug 10 16:03:30 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 10 Aug 2009 16:03:30 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock
	inversion	dependency detected
In-Reply-To: <ada8whr8kf7.fsf@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>	<adavdm0weue.fsf@cisco.com>	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>	<adar5wow9r7.fsf@cisco.com>	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>	<adaws6bt8lf.fsf@cisco.com>	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>	<adatz0mi03d.fsf@cisco.com>	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>	<adaws5gg71x.fsf@cisco.com>	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
Message-ID: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>

>And that will deadlock waiting in del_timer_sync() since the timer has
>been interrupted by an interrupt handler that will spin on a spinlock
>that is part of this chain.
>
>I'm not sure what the right fix is.  It does seem to me that this should
>be fixed within the ib_mad module, since doing del_timer_sync() within a
>spinlocked region seems like the fundamental problem.  However I'm not
>sure what the best way to rewrite the ib_mad usage is.

If I followed this correctly, will moving calls to cancel_delayed_work() outside
of any spinlocks fix this?  (If so, it's not immediately obvious to me what the
best fix is either.)

- Sean


From rdreier at cisco.com  Mon Aug 10 18:59:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Aug 2009 18:59:03 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock
	inversion	dependency detected
In-Reply-To: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> (Sean
	Hefty's message of "Mon, 10 Aug 2009 16:03:30 -0700")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
Message-ID: <adazla76rh4.fsf@cisco.com>


 > >And that will deadlock waiting in del_timer_sync() since the timer has
 > >been interrupted by an interrupt handler that will spin on a spinlock
 > >that is part of this chain.
 > >
 > >I'm not sure what the right fix is.  It does seem to me that this should
 > >be fixed within the ib_mad module, since doing del_timer_sync() within a
 > >spinlocked region seems like the fundamental problem.  However I'm not
 > >sure what the best way to rewrite the ib_mad usage is.
 > 
 > If I followed this correctly, will moving calls to cancel_delayed_work() outside
 > of any spinlocks fix this?  (If so, it's not immediately obvious to me what the
 > best fix is either.)

Yes, I think that if cancel_delayed_work() and hence del_timer_sync() is
outside of any other locks then there is no deadlock -- you can think of
del_timer_sync() as being like a lock (which is how lockdep tracks it too).
But of course we can't really do that because that leaves the timeout
tracking unlocked and racy in the mad module.

The best idea I can come up with so far is to move to an explicit timer
in the mad module, so that we can do mod_timer() inside the lock rather
than having to do the equivalent of del_timer_sync() + add_timer()
(implicitly through the delayed work API).  But that unfortunately is
somewhat invasive surgery for the mad module... definitely doable but
ideally there would be an easier way.

I guess we could add a "requeue_delayed_work()" API to the kernel
workqueue stuff that does mod_timer() instead of adding it, but it might
be tricky to get the interface to that right.

 - R.


From jackm at dev.mellanox.co.il  Tue Aug 11 00:17:21 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 11 Aug 2009 10:17:21 +0300
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <ada7hxba7lx.fsf@cisco.com>
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
	<200908100936.26963.jackm@dev.mellanox.co.il>
	<ada7hxba7lx.fsf@cisco.com>
Message-ID: <200908111017.22153.jackm@dev.mellanox.co.il>

On Monday 10 August 2009 20:42, Roland Dreier wrote:
> 
>  > I'm a bit nervous about this one.  
>  > printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h
>  > (i.e., when the kernel is configured).  Otherwise, it gets defined to printk --
>  > and it will always print in this case.
>  > (see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272).
> 
> Umm... if CONFIG_PRINTK is turned off nothing prints, right?
Jiri Slaby pointed that out to me -- i.e., that printk itself is defined to do nothing but
return 0 if CONFIG_PRINTK is not defined. (I missed that when looking at file kernel.h).

I thought no answer was needed (sorry about that) -- Jiri was so obviously correct.

I've got no problem with the patch.
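
For the archive, a user-space sketch of the print-once idea under discussion:
a static flag per call site, so the message goes out on the first pass only.
The sketch_ names and the counter are made up for illustration; this is not
the kernel macro.  If the underlying print function is a no-op (as printk is
without CONFIG_PRINTK), nothing is printed either way.

```c
#include <assert.h>
#include <stdio.h>

static int messages_printed;	/* counts calls that really printed */

/* Stand-in for printk(): print to stderr and count the call. */
static void sketch_printk(const char *msg)
{
	assert(msg);
	messages_printed++;
	fputs(msg, stderr);
}

/* Print the message only the first time this call site is reached. */
#define sketch_printk_once(msg)			\
	do {					\
		static int print_once = 1;	\
		if (print_once) {		\
			print_once = 0;		\
			sketch_printk(msg);	\
		}				\
	} while (0)

/* Example caller: may run many times, but warns only once. */
static void warn_path(void)
{
	sketch_printk_once("example: reached warn path\n");
}
```

Calling warn_path() three times leaves messages_printed at 1.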

-Jack


From jackm at dev.mellanox.co.il  Tue Aug 11 00:21:01 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 11 Aug 2009 10:21:01 +0300
Subject: [ofa-general] [PATCH V2] mlx4: Do not allow ib userspace open while
	device is being removed
Message-ID: <200908111021.01612.jackm@dev.mellanox.co.il>

Userspace apps are supposed to release all ib device resources if
they receive a fatal async event (IBV_EVENT_DEVICE_FATAL).  However,
the app has no way of knowing when the device has come back up, except
to repeatedly attempt ibv_open_device() until it succeeds.

However, currently there is no protection against open succeeding when
the device is in the midst of the removal following the fatal event.
In this case, the open will succeed, but as a result the device waits
in the middle of its removal until the new app releases its ib resources
 -- and the new app will not do so, since the open succeeded at a point
following the fatal event generation.

This patch adds an "active" flag to the device. The active flag is set to
false (in the fatal event flow) before the "fatal" event is generated,
so any subsequent ibv_open_device() call to the device will fail until the
device comes back up, thus preventing the above deadlock.

V2: move active flag from net to hw/mlx4, and use only for fatal event flow.
(per feedback from Roland).

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

---
Roland,
this is a continuation of thread:
http://lists.openfabrics.org/pipermail/general/2009-July/060668.html

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..4effc19 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -342,6 +342,9 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev,
 	struct mlx4_ib_alloc_ucontext_resp resp;
 	int err;
 
+	if (!dev->ib_active)
+		return ERR_PTR(-EAGAIN);
+
 	resp.qp_tab_size      = dev->dev->caps.num_qps;
 	resp.bf_reg_size      = dev->dev->caps.bf_reg_size;
 	resp.bf_regs_per_page = dev->dev->caps.bf_regs_per_page;
@@ -673,6 +676,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 			goto err_reg;
 	}
 
+	ibdev->ib_active = 1;
+
 	return ibdev;
 
 err_reg:
@@ -729,6 +734,7 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 		break;
 
 	case MLX4_DEV_EVENT_CATASTROPHIC_ERROR:
+		ibdev->ib_active = 0;
 		ibev.event = IB_EVENT_DEVICE_FATAL;
 		break;
 
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 8a7dd67..b22df97 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -175,6 +175,7 @@ struct mlx4_ib_dev {
 	spinlock_t		sm_lock;
 
 	struct mutex		cap_mask_mutex;
+	int			ib_active;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)


From vlad at lists.openfabrics.org  Tue Aug 11 03:01:35 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 11 Aug 2009 03:01:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090811-0200 daily build status
Message-ID: <20090811100136.03E06E4026E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:300: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:311: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From ofedrnicuser at yahoo.com  Tue Aug 11 03:58:51 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Tue, 11 Aug 2009 03:58:51 -0700 (PDT)
Subject: [ofa-general] which ofed-1.4 module ensure compliance with NSMR for
	having same size pbl entries?
Message-ID: <272817.39129.qm@web111203.mail.gq1.yahoo.com>

Hi,

I am trying to figure out which module (user/kernel) ensures compliance with RDMA iWARP verbs specification section 9.2.6.2 (Register Non-Shared MR).

ibv_reg_mr() of libibverbs takes address and length.
So someone has to ensure that the given address/length region is divided exactly into smaller regions of the same size before it reaches the lowest-layer iWARP/IB drivers. (I guess that size is the page size!)

After looking at ib_uverbs_reg_mr() below

if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
         return -EINVAL;

It looks like this is achieved by having a page-aligned start address.
As the start address is page aligned, all the incoming blocks of memory will be the same size (provided the length is a multiple of the page size).
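
For reference, the check quoted above compares only the offsets within a page of the two addresses. A minimal user-space sketch of that comparison follows, assuming a 4096-byte page; the names SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK, page_offset and offsets_match are hypothetical and are not part of libibverbs or the kernel.

```c
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE/PAGE_MASK. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Offset of an address within its page (what `addr & ~PAGE_MASK` yields). */
uint64_t page_offset(uint64_t addr)
{
	return addr & ~SKETCH_PAGE_MASK;
}

/* Mirrors the quoted test: nonzero when both addresses share a page offset. */
int offsets_match(uint64_t start, uint64_t hca_va)
{
	return page_offset(start) == page_offset(hca_va);
}
```

Under these assumptions, offsets_match(0x1234, 0x5234) holds even though neither address is page aligned, so the quoted test on its own may not be what enforces alignment.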

1. Should the caller of ibv_reg_mr() ensure that it allocates page-aligned memory?

2. Should the caller of ibv_reg_mr() ensure that the length is always a multiple of the page size?

3. Assuming yes to Q-2, which function/module in the kernel checks that the length is a multiple of the page size?

Regards,
Bill


From rdreier at cisco.com  Tue Aug 11 09:23:11 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Aug 2009 09:23:11 -0700
Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open
	while device is being removed
In-Reply-To: <200908111021.01612.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 11 Aug 2009 10:21:01 +0300")
References: <200908111021.01612.jackm@dev.mellanox.co.il>
Message-ID: <adaprb2721c.fsf@cisco.com>


 > this is a continuation of thread:
 > http://lists.openfabrics.org/pipermail/general/2009-July/060668.html

Thanks for the pointer... it lets me reload my context.  I see you
didn't answer the question about mthca -- does it suffer from this
problem as well?

 - R.


From rdreier at cisco.com  Tue Aug 11 09:40:19 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Aug 2009 09:40:19 -0700
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To: <4A8045BD.8010803@gmail.com> (Marcin Slusarz's message of "Mon,
	10 Aug 2009 18:07:25 +0200")
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com>
	<1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
	<adaocqo9sb4.fsf@cisco.com> <4A8045BD.8010803@gmail.com>
Message-ID: <adaljlq718s.fsf@cisco.com>

thanks, applied.


From weiny2 at llnl.gov  Tue Aug 11 10:04:28 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 11 Aug 2009 10:04:28 -0700
Subject: [ofa-general] [PATCH] libibmad: clear packet buffer correctly before
 formating and sending
Message-ID: <20090811100428.c4fb6c5e.weiny2@llnl.gov>

I found this bug a while back but forgot to submit a patch.

I don't think this will affect the issues Mr. Miller was having with BM, as I believe the BM stuff he was trying all expected a response (thereby calling mad_rpc instead).  But it could be worth a try.

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Tue, 11 Aug 2009 10:00:25 -0700
Subject: [PATCH] libibmad: clear packet buffer correctly before formatting and sending


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 libibmad/src/serv.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c
index fad1e5b..4d557c2 100644
--- a/libibmad/src/serv.c
+++ b/libibmad/src/serv.c
@@ -59,7 +59,7 @@ int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
 	uint8_t pktbuf[1024];
 	void *umad = pktbuf;
 
-	memset(pktbuf, 0, umad_size());
+	memset(pktbuf, 0, umad_size() + IB_MAD_SIZE);
 
 	DEBUG("rmpp %p data %p", rmpp, data);
 
-- 
1.5.4.5
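
A minimal sketch of why the length passed to memset() matters here. It assumes a hypothetical header size SKETCH_UMAD_HDR standing in for umad_size() and SKETCH_MAD_SIZE standing in for IB_MAD_SIZE; both values are illustrative assumptions, not taken from the libraries. Clearing only the header leaves stale stack bytes in the MAD payload that follows it.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_UMAD_HDR 64   /* assumed stand-in for umad_size() */
#define SKETCH_MAD_SIZE 256  /* assumed stand-in for IB_MAD_SIZE */

/* Fill a buffer with a junk pattern, then clear the first clear_len bytes
 * (clear_len must not exceed 1024). Returns how many payload bytes, i.e.
 * the bytes right after the header, still hold junk. */
size_t stale_payload_bytes(size_t clear_len)
{
	uint8_t buf[1024];
	size_t i, stale = 0;

	memset(buf, 0xAA, sizeof(buf));   /* simulate leftover stack data */
	memset(buf, 0, clear_len);        /* the clearing step under test */

	for (i = SKETCH_UMAD_HDR; i < SKETCH_UMAD_HDR + SKETCH_MAD_SIZE; i++)
		if (buf[i] != 0)
			stale++;
	return stale;
}
```

In this model, clearing SKETCH_UMAD_HDR bytes leaves the whole payload stale, while clearing SKETCH_UMAD_HDR + SKETCH_MAD_SIZE bytes leaves none.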


From hnrose at comcast.net  Tue Aug 11 11:40:11 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 11 Aug 2009 14:40:11 -0400
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: Some minor
	simplifications
Message-ID: <20090811184011.GA5666@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 7826578..febd7f6 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -135,10 +135,8 @@ osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t *p_physp)
 {
 	osm_qos_port_t *p =
 	    (osm_qos_port_t *) calloc(1, sizeof(osm_qos_port_t));
-	if (!p)
-		return NULL;
-
-	p->p_physp = p_physp;
+	if (p)
+		p->p_physp = p_physp;
 	return p;
 }
 
@@ -149,11 +147,8 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create()
 {
 	osm_qos_port_group_t *p =
 	    (osm_qos_port_group_t *) calloc(1, sizeof(osm_qos_port_group_t));
-	if (!p)
-		return NULL;
-
-	cl_qmap_init(&p->port_map);
-
+	if (p)
+		cl_qmap_init(&p->port_map);
 	return p;
 }
 
@@ -192,14 +187,12 @@ osm_qos_vlarb_scope_t *osm_qos_policy_vlarb_scope_create()
 {
 	osm_qos_vlarb_scope_t *p =
 	    (osm_qos_vlarb_scope_t *) calloc(1, sizeof(osm_qos_vlarb_scope_t));
-	if (!p)
-		return NULL;
-
-	cl_list_init(&p->group_list, 10);
-	cl_list_init(&p->across_list, 10);
-	cl_list_init(&p->vlarb_high_list, 10);
-	cl_list_init(&p->vlarb_low_list, 10);
-
+	if (p) {
+		cl_list_init(&p->group_list, 10);
+		cl_list_init(&p->across_list, 10);
+		cl_list_init(&p->vlarb_high_list, 10);
+		cl_list_init(&p->vlarb_low_list, 10);
+	}
 	return p;
 }
 
@@ -236,13 +229,11 @@ osm_qos_sl2vl_scope_t *osm_qos_policy_sl2vl_scope_create()
 {
 	osm_qos_sl2vl_scope_t *p =
 	    (osm_qos_sl2vl_scope_t *) calloc(1, sizeof(osm_qos_sl2vl_scope_t));
-	if (!p)
-		return NULL;
-
-	cl_list_init(&p->group_list, 10);
-	cl_list_init(&p->across_from_list, 10);
-	cl_list_init(&p->across_to_list, 10);
-
+	if (p) {
+		cl_list_init(&p->group_list, 10);
+		cl_list_init(&p->across_from_list, 10);
+		cl_list_init(&p->across_to_list, 10);
+	}
 	return p;
 }
 
@@ -276,8 +267,6 @@ osm_qos_level_t *osm_qos_policy_qos_level_create()
 {
 	osm_qos_level_t *p =
 	    (osm_qos_level_t *) calloc(1, sizeof(osm_qos_level_t));
-	if (!p)
-		return NULL;
 	return p;
 }
 
@@ -355,14 +344,12 @@ osm_qos_match_rule_t *osm_qos_policy_match_rule_create()
 {
 	osm_qos_match_rule_t *p =
 	    (osm_qos_match_rule_t *) calloc(1, sizeof(osm_qos_match_rule_t));
-	if (!p)
-		return NULL;
-
-	cl_list_init(&p->source_list, 10);
-	cl_list_init(&p->source_group_list, 10);
-	cl_list_init(&p->destination_list, 10);
-	cl_list_init(&p->destination_group_list, 10);
-
+	if (p) {
+		cl_list_init(&p->source_list, 10);
+		cl_list_init(&p->source_group_list, 10);
+		cl_list_init(&p->destination_list, 10);
+		cl_list_init(&p->destination_group_list, 10);
+	}
 	return p;
 }
 

From bart.vanassche at gmail.com  Tue Aug 11 13:29:42 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 11 Aug 2009 22:29:42 +0200
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion 
	dependency detected
In-Reply-To: <ada8whr8kf7.fsf@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
Message-ID: <e2e108260908111329t3528eacdw9f61b6a67451fcf5@mail.gmail.com>

On Mon, Aug 10, 2009 at 10:48 PM, Roland Dreier<rdreier at cisco.com> wrote:
>
>  > The lockdep report I obtained this morning with a 2.6.30.4 kernel and
>  > the two patches applied has been attached to the kernel bugzilla
>  > entry. This lockdep report was generated while testing the SRPT target
>  > software. I have double checked that the SRPT target implementation
>  > does not hold any spinlocks or mutexes while calling functions in the
>  > IB core. This means that the SRPT target code cannot have caused any
>  > of the reported lock cycles.
>
> Lockdep is not quite so simple as what you checked, but yes, in this
> case it does appear to be pointing a real (albeit spectacularly
> unlikely) deadlock in the core IB stack:
>
>  ib_cm takes cm_id_priv->lock and calls ib_post_send_mad()
>  from there, ib_mad takes mad_agent_priv->lock
>
>  in another context, ib_mad takes mad_agent_priv->lock and does
>  cancel_delayed_work(&mad_agent_priv->timed_work) (and internally
>  cancel_delayed_work() does del_timer_sync())
>
>  finally, in another context a communication established event can
>  occur and generate a callback (in interrupt context) to ib_cm where it
>  takes cm_id_priv->lock
>
> So there can be a chain that deadlocks: if the timer for the timed_work
> is running on a CPU, and the interrupt for the communication established
> event occurs while the timer is running, then that interrupt handler can
> try to take cm_id_priv->lock.
>
> However on another CPU, someone could already be holding
> cm_id_priv->lock and call into ib_post_send_mad(), and spinning on
> mad_agent_priv->lock, while on yet another CPU, someone could be holding
> mad_agent_priv->lock and doing cancel_delayed_work().
>
> And that will deadlock waiting in del_timer_sync() since the timer has
> been interrupted by an interrupt handler that will spin on a spinlock
> that is part of this chain.
>
> I'm not sure what the right fix is.  It does seem to me that this should
> be fixed within the ib_mad module, since doing del_timer_sync() within a
> spinlocked region seems like the fundamental problem.  However I'm not
> sure what the best way to rewrite the ib_mad usage is.

It's already good news that the potential lock cycle has been deduced
from the lockdep reports. I know that it can take a lot of work to
analyze such reports.

Even if it is really unlikely that this lock cycle would cause a
deadlock, it would be great if this lock cycle could be removed. I'm
not the only developer of kernel modules who runs tests with lockdep
enabled, and it is unpractical to analyze long logfiles full of known
lock cycles to find a single lock cycle caused by newly added or
recently modified code.

>  > By the way, I noticed that while many subsystems in the Linux kernel
>  > use event queues to report information to higher software layers, that
>  > the IB core makes extensive use of callback functions. The combination
>  > of nested locking and callback functions can easily lead to lock
>  > inversion. This effect is well known in the operating system world --
>  > see e.g. the talk by John Ousterhout about multithreaded versus
>  > event-driven software (http://home.pacbell.net/ouster/threads.pdf,
>  > 1996).
>
> I'm not sure what you mean by this.  What would be an example of a
> subsystem that uses event queues to report information?  I think the
> design of the RDMA stack is quite parallel to most other Linux
> subsystems, and we don't have anything as deadlock prone as, say, the
> network stack's rtnl.

What I had in mind as an example is the netlink socket mechanism,
although this is a mechanism for sending notifications from the kernel
to userspace.

> Trying to queue events up instead of calling back from interrupt context
> is not all that simple, since one cannot reliably allocate memory, and
> one must deal with synchronization with the consuming context etc.  It's
> probably at least as deadlock-prone to try and queue as it is to just
> call back.

One possible approach when having to queue events from interrupt
context is to queue these events in a fixed size queue that has been
allocated outside interrupt context, and make it possible for the
event consumer to detect the queue overflow condition. When a queue
overflow happens it is the responsibility of the event consumer to
query the state of the event producer. This is a more complex approach
than callback functions but has the advantage that there never can be
a lock cycle involving locks of both the event producer and the event
consumer.

I'm not inventing anything new here -- this is exactly how netlink sockets work.
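
The approach described above can be sketched roughly as follows. This is a single-threaded illustration only: the names evq, evq_push, evq_pop and evq_test_and_clear_overflow are hypothetical, and real kernel code would additionally need locking or a lock-free ring with memory barriers between the interrupt-context producer and the consumer.

```c
#include <stddef.h>

#define EVQ_CAPACITY 8  /* fixed size, chosen arbitrarily for the sketch */

struct evq {
	int events[EVQ_CAPACITY];
	size_t head;      /* next slot to read */
	size_t count;     /* number of queued events */
	int overflowed;   /* set when an event had to be dropped */
};

/* Set up an empty queue; done once, outside interrupt context. */
void evq_init(struct evq *q)
{
	q->head = 0;
	q->count = 0;
	q->overflowed = 0;
}

/* Producer side: never allocates; on a full queue it records the overflow. */
void evq_push(struct evq *q, int event)
{
	if (q->count == EVQ_CAPACITY) {
		q->overflowed = 1;
		return;
	}
	q->events[(q->head + q->count) % EVQ_CAPACITY] = event;
	q->count++;
}

/* Consumer side: returns 1 and stores an event, or 0 if the queue is empty. */
int evq_pop(struct evq *q, int *event)
{
	if (q->count == 0)
		return 0;
	*event = q->events[q->head];
	q->head = (q->head + 1) % EVQ_CAPACITY;
	q->count--;
	return 1;
}

/* Consumer reads and clears the overflow flag; when it was set, the
 * consumer is expected to re-query the producer's state itself. */
int evq_test_and_clear_overflow(struct evq *q)
{
	int was = q->overflowed;

	q->overflowed = 0;
	return was;
}
```

The producer never calls into the consumer, so in this sketch no producer lock is held across consumer code; the price is that the consumer has to handle the overflow case explicitly.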

Bart.


From halves at linux.vnet.ibm.com  Tue Aug 11 13:37:54 2009
From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves)
Date: Tue, 11 Aug 2009 17:37:54 -0300
Subject: [ofa-general] Chelsio cards
Message-ID: <1250023074.16631.1.camel@halves-ltc>

Hi guys, 

I have a question: which OFED features are supported by Chelsio cards?

Thanks, 

-- 
Higor Aparecido Vieira Alves
Software Engineer
Linux Technology Center 
IBM Systems & Technology Group


From swise at opengridcomputing.com  Tue Aug 11 14:15:42 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 11 Aug 2009 16:15:42 -0500
Subject: [ofa-general] Chelsio cards
In-Reply-To: <1250023074.16631.1.camel@halves-ltc>
References: <1250023074.16631.1.camel@halves-ltc>
Message-ID: <4A81DF7E.6020907@opengridcomputing.com>

Higor Aparecido Vieira Alves wrote:
> Hi guys, 
>
> I have a doubt, which ofed features are supported by Chelsio cards?
>
> Thanks, 
>
>   
The following are supported via rdma on chelsio cards with ofed-1.4.x 
and 1.5:

User mode:
openmpi
mvapich2
udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP 
MPI, Scali MPI)
rdmacm (required for connection setup)
ibverbs (RC QP only)
perftest (ib_rdma_bw and ib_rdma_lat only)
qperf
rping

Kernel mode:
core verbs (RC QP only)
rdmacm (required for connection setup)
nfsrdma


Steve.


From sashak at voltaire.com  Tue Aug 11 14:22:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 00:22:21 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status
 from do_ucast_file_load when file name is not provided
In-Reply-To: <20090806181928.GA21698@comcast.net>
References: <20090806181928.GA21698@comcast.net>
Message-ID: <20090811212221.GG25501@me>

Hi Hal,

On 14:19 Thu 06 Aug     , Hal Rosenstock wrote:
> @@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context)
>  		OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
>  			"LFTs file name is not given; "
>  			"using default routing algorithm\n");
> -		return 1;
> +		return -1;

This "fix" is not correct. Routing engine method returns "> 0" value
when fallback to default is requested. In particular in case of 'file'
engine it is legal to provide only LFTs file and not provide LID matrix
file.
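
The convention described above might be read by a caller roughly as follows. This is a sketch only: the enum and classify_engine_result are hypothetical names, and treating 0 as success and negative values as hard errors is an assumption made for illustration, not something taken from the OpenSM source.

```c
/* Hypothetical outcome labels for the sketch. */
enum route_outcome { ROUTE_OK, ROUTE_FALLBACK, ROUTE_ERROR };

/* Interpret a routing engine's return value: "> 0" requests fallback to
 * the default routing (as stated above); 0 as success and < 0 as a hard
 * error are assumptions of this sketch. */
enum route_outcome classify_engine_result(int ret)
{
	if (ret > 0)
		return ROUTE_FALLBACK;
	if (ret < 0)
		return ROUTE_ERROR;
	return ROUTE_OK;
}
```

Under this reading, changing "return 1" to "return -1" would turn a request for fallback into a failure.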

Sasha


From hal.rosenstock at gmail.com  Tue Aug 11 14:33:20 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 11 Aug 2009 17:33:20 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090811212221.GG25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
Message-ID: <f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>

Hi Sasha,

On Tue, Aug 11, 2009 at 5:22 PM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> Hi Hal,
>
> On 14:19 Thu 06 Aug     , Hal Rosenstock wrote:
> > @@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context)
> >               OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
> >                       "LFTs file name is not given; "
> >                       "using default routing algorithm\n");
> > -             return 1;
> > +             return -1;
>
> This "fix" is not correct. Routing engine method returns "> 0" value
> when fallback to default is requested. In particular in case of 'file'
> engine it is legal to provide only LFTs file and not provide LID matrix
> file.


Is it supposed to use the file engine when no files (LFT or LID matrix) are supplied?
That's what seems to happen (with no fallback).

-- Hal


>
>
> Sasha

From worleys at gmail.com  Tue Aug 11 14:52:13 2009
From: worleys at gmail.com (Chris Worley)
Date: Tue, 11 Aug 2009 15:52:13 -0600
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and 
	eventually hangs
In-Reply-To: <e2e108260908100340p71efed9u72cf996be0843edd@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
	<e2e108260908100340p71efed9u72cf996be0843edd@mail.gmail.com>
Message-ID: <f3177b9e0908111452k3d531657tcff3d2cfee030196@mail.gmail.com>

On Mon, Aug 10, 2009 at 4:40 AM, Bart Van
Assche<bart.vanassche at gmail.com> wrote:
> On Sun, Aug 9, 2009 at 7:09 PM, Chris Worley <worleys at gmail.com> wrote:
>>
>> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off
>> the CD.. never updated) and it's embedded IB stack (not the latest
>> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info").
>>
>> I'm running a W2008S (fully patched) initiator w/
>> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.
>>
>> Using Mellanox QDR cards/switch.
>>
>> Writes over SRP, as measured from the initiator using IOMeter, get
>> proper performance (i.e. 1.2GB/s).
>>
>> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s).
>> And while reading, IOMeter eventually hangs the system (Windows
>> becomes unresponsive to GUI interaction).  In this state, I see iostat
>> reporting transfers at the same low read rate from the target... so
>> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it
>> acts like it's a "skipping record" (sorry of you young folks don't
>> know what that is... but I can't think of another way to describe it)
>> and never moving on to the next benchmark, just endlessly repeating
>> the same I/O over and over again.  If I unload then reload the mlx4_ib
>> driver on the target, then the Windows system quickly returns, but
>> IOMeter remains hung and needs killed.
>
> The throughput of the SRP protocol strongly depends on the block size used
> for I/O. The results I obtained with IOmeter are:
> * For a block size of 32 KB: 396 MB/s for reading and 321 MB/s for writing.
> * For a block size of 1 MB: 1383 MB/s for reading and 1151 MB/s for writing.
> These results are about 90% of the throughput obtained with dd.
>
> Setup details:
> * Two Mellanox ConnectX DDR cards connected back to back, operating in PCIe
> 2.0 mode.
> * Target: vanilla 2.6.30.4 kernel + SCST patches + the two patches attached
> to http://bugzilla.kernel.org/show_bug.cgi?id=13757 + SCST r1030.
> * Initiator: openSUSE 11.0 (contains a patched 2.6.27.25 kernel) with
> openSUSE 11.0 OFED components + Linux version of IOmeter's dynamo + IOmeter
> GUI running in a virtual machine.
> * I/O-scheduler used by SRP initiator: noop.

Thanks for the recommendations (to both Bart and Joe).

I set up my target exactly as you prescribe... but my initiator is
still Windows (version of WinOF at top): performance as relayed by
IOMeter starts high and the average slowly decreases.  Watching the
instantaneous throughput, there seem to be longer and longer lags of
poor performance between moments of good performance.  I need to run
this against a Linux initiator to see if the problems are w/ WinOF.

Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the
performance was steady and getting close to acceptable.  In a 15 hour
test that cycles through sequential and random LBAs and R/W mixes
with block sizes from 1MB down to 512B, it worked well and got decent
performance until it hit 1KB sequential reads which hung IOMeter; no
messages on the Linux side (all looked okay).  IBSRP on the Windows
side just said "a reset to device was issued" every 15 to 30 seconds
after the problem started. I reloaded the IB stack on the Linux side,
and was able to get it restarted.

Still a lot of combinations to test.

Thanks,

Chris
>
> Bart.
>


From chenyon1 at iit.edu  Fri Aug  7 12:55:38 2009
From: chenyon1 at iit.edu (Yong Chen)
Date: Fri, 07 Aug 2009 19:55:38 +0000 (GMT)
Subject: [ofa-general] [hpc-announce] Call for Attendance: P2S2-2009 Workshop
Message-ID: <fb1fdafe5f54.4a7c86ba@iit.edu>

Dear Colleagues,

The Second International Workshop on Parallel Programming Models and
Systems Software for High-End Computing (P2S2) will be held in Vienna,
Austria, on Sept. 22nd, 2009 in conjunction with The 38th
International Conference on Parallel Processing (ICPP-2009). The
workshop program has been finalized and can be found here:
http://www.mcs.anl.gov/events/workshops/p2s2/pro.html (listed below
for your reference).

We welcome you to attend the P2S2-2009 workshop and look forward to
seeing you in Vienna, Austria!

===============================================================================

Session 1: Opening 
Time: 09:00 - 10:30, Location: Room F3 (89), Chair: Pavan Balaji, Argonne
National Laboratory

Opening Remarks (D. K. Panda, Pavan Balaji and Abhinav Vishnu)

Invited Keynote by Dr. Pete Beckman, Argonne National Laboratory, "Challenges
for System Software on Exascale Platforms"

10:30 - 11:00 Coffee Break

Session 2: Software for Large-scale Systems
Time: 11:00 - 12:30, Location: Room F3 (89), Chair: Tom Peterka, Argonne
National Laboratory
	1. "Characterizing the Performance of Big Memory on Blue Gene Linux"
	Kazutomo Yoshii, Kamil Iskra, P. Chris Broekema, Harish Naik and Pete Beckman

	2. "Optimization of Preconditioned Parallel Iterative Solvers for Finite-Element Applications using Hybrid Parallel Programming Models on T2K Open Supercomputer (Todai Combined Cluster)"
	Kengo Nakajima

	3. "Analyzing Checkpointing Trends for Applications on Peta-scale Systems"
	Harish Naik, Rinku Gupta and Pete Beckman

12:30 - 14:00 Lunch

Session 3: Communication and I/O
Time: 14:00 - 15:30, Location: Room F3 (89), Chair: Abhinav Vishnu, Pacific
Northwest National Laboratory
	1. "Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand"
	Tejus Gangadharappa, Matthew Koop and Dhabaleswar K Panda

	2. "CkDirect: Unsynchronized One-Sided Communication in a Message-Driven Paradigm"
	Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele and Laxmikant Kale

	3. "Exploiting Latent I/O Asynchrony in Petascale Science Applications"
	Patrick Widener, Matthew Wolf, Hasan Abbasi, Scott McManus, Mary Payne, Patrick Bridges and Karsten Schwan

	4. "Gears4Net - An Asynchronous Programming Model"
	Martin Saternus, Torben Weis, Sebastian Holzapfel and Arno Wacker

15:30 - 16:00 Coffee Break

Session 4: Software for Multicore Architectures
Time: 16:00 - 17:30, Location: Room F3 (89), Chair: Ron Brightwell, Sandia
National Laboratory
	1. "Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms"
	Changjun Hu, Yali Liu and Jianjiang Li

	2. "Open Source Software Support for the OpenMP Runtime API for Profiling"
	Oscar Hernandez, Van Bui, Richard Kufrin and Barbara Chapman

	3. "Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E."
	Pieter Bellens, Rosa Badia and Jesus Labarta


From niftyompi at niftyegg.com  Tue Aug 11 19:37:59 2009
From: niftyompi at niftyegg.com (Nifty Tom Mitchell)
Date: Tue, 11 Aug 2009 19:37:59 -0700
Subject: [ofa-general] Manipulating Credits in Infiniband
In-Reply-To: <ed1288770908100911h46524f4ch34cc6582bb1c03b@mail.gmail.com>
References: <ed1288770908100911h46524f4ch34cc6582bb1c03b@mail.gmail.com>
Message-ID: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>

On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote:
> 
>    I looked into the infiniband driver files. As I understand, in order to
>    limit the data rate we manipulate the credits on either ends. Since the
>    number of credits available depends on the receiver's work receive
>    queue size, I decided to limit the queue size to say 5 instead of 8192
>    (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my higher
>    layer protocol is ipoib). I just want to confirm if I am doing the
>    right thing?

Data rate is not manipulated by credits.
Credits and queue sizes are different and have different purposes.

Visit the InfiniBand Trade Association web site and grab the IB
specifications to understand some of the hardware level parts.

	http://www.infinibandta.org/

InfiniBand offers credit-based flow control, and given the nature of
modern IB switches and processors a very small credit count can still
result in full data rate.  Having said that, flow control is the lowest
level throttle in the system.  Reducing the credit count forces the
higher levels in the protocol stack to source or sink the data through
the hardware before any more can be delivered.  Thus flow control can
simplify the implementation of higher level protocols.  It can also be used
to reduce the cost of, or simplify, hardware design (smaller hardware buffers).
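
A toy model of the point about small credit counts, offered as a rough illustration rather than a description of the actual IB link layer. All names here (struct link, link_send, link_drain, link_run) are hypothetical. Each packet sent consumes one credit, and the receiver hands a credit back whenever it drains a packet from its buffer.

```c
/* Toy link state shared by a sender and a receiver. */
struct link {
	unsigned credits;   /* credits currently held by the sender */
	unsigned buffered;  /* packets sitting in the receiver's buffer */
	unsigned delivered; /* packets drained by the receiver */
};

/* Sender: transmit one packet if a credit is available. Returns 1 if sent. */
int link_send(struct link *l)
{
	if (l->credits == 0)
		return 0;       /* back-pressure: must wait for a credit */
	l->credits--;
	l->buffered++;
	return 1;
}

/* Receiver: drain one packet and hand its credit back to the sender. */
int link_drain(struct link *l)
{
	if (l->buffered == 0)
		return 0;
	l->buffered--;
	l->delivered++;
	l->credits++;
	return 1;
}

/* Run `steps` rounds in which the sender tries once and the receiver
 * drains once; returns how many packets were sent. */
unsigned link_run(struct link *l, unsigned steps)
{
	unsigned sent = 0;

	while (steps--) {
		sent += (unsigned)link_send(l);
		link_drain(l);
	}
	return sent;
}
```

In this model a sender that starts with a single credit still gets a packet out on every round as long as the receiver keeps draining; it only stalls when the receiver stops returning credits.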

The IB specifications are way too long.  Start with this FAQ.

       http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf

The IB specification is way too full of optional features.  A vendor may
have XYZ working fine and dandy on one card and since it is optional not
at all on another.

The various queue sizes for the various protocols built on top of
IB establish transfer behavior in keeping with system interrupt,
system process time slice, system kernel activity loads and needs.
It is counter intuitive but in some cases small queues result in
more responsive and agile systems, especially in the presence of errors.

Since there are often multiple protocols on the IB stack, all protocols
will be impacted by credit tinkering.  Most vendors know their hardware,
so most drivers will already have well-tuned credit-related code.

In the case of TCP/IP the interaction between IB bandwidths&MTU (IPoIB),
ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU can 
be "interesting" depending on host names, subnets, routing etc.   TCP/IP
has lots of tuning flags well above the IB driver.   I see 500+ net.*
sysctl knobs on this system.

As you change things do make the changes on all the moving parts, benchmark
and keep a log.   Since there are multiple IB hardware vendors 
it is important to track hardware specifics.  "lspci" is a good tool
to gather chip info.   With some cards you also need specifics about 
the active firmware.

So go forth (RPN forever) and conquer.


-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?


From rdreier at cisco.com  Tue Aug 11 20:34:04 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Aug 2009 20:34:04 -0700
Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion
	dependency detected
In-Reply-To: <e2e108260908111329t3528eacdw9f61b6a67451fcf5@mail.gmail.com>
	(Bart Van Assche's message of "Tue, 11 Aug 2009 22:29:42 +0200")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<e2e108260908111329t3528eacdw9f61b6a67451fcf5@mail.gmail.com>
Message-ID: <ada8whp7ljn.fsf@cisco.com>


 > Even if it is really unlikely that this lock cycle would cause a
 > deadlock, it would be great if this lock cycle could be removed. I'm
 > not the only developer of kernel modules who runs tests with lockdep
 > enabled, and it is unpractical to analyze long logfiles full of known
 > lock cycles to find a single lock cycle caused by newly added or
 > recently modified code.

I agree that we should fix all lockdep issues -- the impact of them is
even worse than you realize, because once a single cycle is detected,
lockdep must turn itself off until a reboot, because of course you can't
detect new cycles in a graph that already has a cycle.

 > One possible approach when having to queue events from interrupt
 > context is to queue these events in a fixed size queue that has been
 > allocated outside interrupt context, and make it possible for the
 > event consumer to detect the queue overflow condition. When a queue
 > overflow happens it is the responsibility of the event consumer to
 > query the state of the event producer. This is a more complex approach
 > than callback functions but has the advantage that there never can be
 > a lock cycle involving locks of both the event producer and the event
 > consumer.

I think in most cases dealing with queue overflow is going to lead to
way more bugs than callbacks in interrupt context.  Of course, when
passing events on to userspace, we don't have the luxury of being able
to call userspace in interrupt context, so we have to look for the next
best thing.  But within the kernel we can take the simpler, more robust
approach.
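
For reference, the fixed-size queue with overflow detection described in
the quoted proposal could look roughly like the userspace C sketch below.
All names (evq, EVQ_SIZE, evq_push and so on) are hypothetical, and
locking between producer and consumer is deliberately left out; a kernel
version would need it.  The point is only to show the extra state and the
extra consumer duty (re-querying the producer after an overflow) that the
approach brings with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EVQ_SIZE 4   /* hypothetical capacity, fixed when the queue is set up */

struct evq {
	int    events[EVQ_SIZE];
	size_t head;        /* index of the oldest queued event */
	size_t count;       /* number of queued events */
	bool   overflowed;  /* set when an event had to be dropped */
};

/* Producer side: never allocates, so it can run where allocation cannot. */
static void evq_push(struct evq *q, int event)
{
	assert(q != NULL);
	if (q->count == EVQ_SIZE) {
		q->overflowed = true;   /* drop the event, remember that we did */
		return;
	}
	q->events[(q->head + q->count) % EVQ_SIZE] = event;
	q->count++;
}

/* Consumer side: returns true and stores an event, or false when empty. */
static bool evq_pop(struct evq *q, int *event)
{
	assert(q != NULL && event != NULL);
	if (q->count == 0)
		return false;
	*event = q->events[q->head];
	q->head = (q->head + 1) % EVQ_SIZE;
	q->count--;
	return true;
}

/* Consumer checks and clears the overflow flag; when this returns true,
 * events were lost and the producer's state has to be queried again. */
static bool evq_test_and_clear_overflow(struct evq *q)
{
	bool lost;

	assert(q != NULL);
	lost = q->overflowed;
	q->overflowed = false;
	return lost;
}
```

Every consumer has to get the "overflow seen, now resynchronize" path
right, which is the part that tends to be exercised least.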

 - R.


From alekseys at voltaire.com  Tue Aug 11 23:13:35 2009
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 12 Aug 2009 09:13:35 +0300
Subject: [ofa-general] Chelsio cards
In-Reply-To: <4A81DF7E.6020907@opengridcomputing.com>
References: <1250023074.16631.1.camel@halves-ltc>
	<4A81DF7E.6020907@opengridcomputing.com>
Message-ID: <4A825D8F.8030102@voltaire.com>

> 
> User mode:
> openmpi
> mvapich2
> udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP 
> MPI, Scali MPI)
> rdmacm (required for connection setup)
> ibverbs (RC QP only)
> perftest (ib_rdma_bw and ib_rdma_lat only)
> qperf
> rping
Another two binaries in user space
ib_rdma_lat
ib_rdma_bw


From bart.vanassche at gmail.com  Tue Aug 11 23:15:03 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 12 Aug 2009 08:15:03 +0200
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and 
	eventually hangs
In-Reply-To: <f3177b9e0908111452k3d531657tcff3d2cfee030196@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
	<e2e108260908100340p71efed9u72cf996be0843edd@mail.gmail.com>
	<f3177b9e0908111452k3d531657tcff3d2cfee030196@mail.gmail.com>
Message-ID: <e2e108260908112315y3e902d7ay775d65c16d8e561e@mail.gmail.com>

On Tue, Aug 11, 2009 at 11:52 PM, Chris Worley<worleys at gmail.com> wrote:
> I setup my target exactly as you prescribe... but my initiator is
> still Windows (version of WinOF at top): performance as relayed by
> IOMeter starts high and the average slowly decreases.  Watching the
> instantaneous throughput, there seem to be longer and longer lags of
> poor performance between moments of good performance.  I need to run
> this against a Linux initiator to see if the problems are w/ WinOF.
>
> Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the
> performance was steady and getting close to acceptable.  In a 15 hour
> test that cycles through sequential and random LBA's and R/W mixes
> from block sizes from 1MB to 512B, it worked well and got decent
> performance until it hit 1KB sequential reads which hung IOMeter; no
> messages on the Linux side (all looked okay).  IBSRP on the Windows
> side just said "a reset to device was issued" every 15 to 30 seconds
> after the problem started. I reloaded the IB stack on the Linux side,
> and was able to get it restarted.
>
> Still a lot of combinations to test.

Which trace settings are you using on the target? Enabling the proper
trace settings via /proc/scsi_tgt/trace_level might reveal whether you
are e.g. hitting the QUEUE_FULL condition. See also scst/README.

Bart.


From eli at dev.mellanox.co.il  Tue Aug 11 23:17:49 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 12 Aug 2009 09:17:49 +0300
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it
	has not allocated
In-Reply-To: <adaljlr8l8r.fsf@cisco.com>
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	<20090810084527.GA2446@mtls03> <adaljlr8l8r.fsf@cisco.com>
Message-ID: <20090812061749.GA20719@mtls03>

On Mon, Aug 10, 2009 at 01:30:44PM -0700, Roland Dreier wrote:
> 
>  > Looking at mlx4_write_mtt_chunk() I see that it calls
>  > mlx4_table_find() with a pointer to single dma_addr_t - dma_handle -
>  > while the dma addresses for the ICM memory is actually a list of
>  > different addresses covering possibly different sizes. I think
>  > mlx4_table_find() should be changed to support that, and then we can
>  > use calls to dma_sync_single_for_cpu()/dma_sync_single_for_device()
>  > with the correct dma addresses.
> 
> No, I think we're careful that we write MTT ranges that don't cross a
> page so there shouldn't be any problem.

But a contiguous ICM memory does not map, in general, to a contiguous
DMA memory, so if dma_sync_single_for_*() does not harm anything,
then it does not do anything useful either.


From rdreier at cisco.com  Tue Aug 11 23:23:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Aug 2009 23:23:48 -0700
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has
	not allocated
In-Reply-To: <20090812061749.GA20719@mtls03> (Eli Cohen's message of "Wed, 12
	Aug 2009 09:17:49 +0300")
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	<20090810084527.GA2446@mtls03> <adaljlr8l8r.fsf@cisco.com>
	<20090812061749.GA20719@mtls03>
Message-ID: <ada4osd7dor.fsf@cisco.com>


 > But a contiguous ICM memory does not map, in general, to a contiguous
 > DMA memory, so if dma_sync_single_for_*() does not harm anything,
 > then it does not do anything useful either.

Maybe I'm missing your point, but mlx4_table_find() does go to some
trouble to find the right DMA address for the object being looked up.
Of course it could be buggy but I still don't see why we would need a
list of DMA addresses when we know we are only going to sync part of one
page?

I think the right thing to do to fix this is to switch to using
map_single instead of map_sg, and then use dma_sync_single_range_for_xxx
to sync the subset we care about.

 - R.


From eli at dev.mellanox.co.il  Wed Aug 12 00:05:58 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 12 Aug 2009 10:05:58 +0300
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it
	has not allocated
In-Reply-To: <ada4osd7dor.fsf@cisco.com>
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	<20090810084527.GA2446@mtls03> <adaljlr8l8r.fsf@cisco.com>
	<20090812061749.GA20719@mtls03> <ada4osd7dor.fsf@cisco.com>
Message-ID: <20090812070558.GA23123@mtls03>

On Tue, Aug 11, 2009 at 11:23:48PM -0700, Roland Dreier wrote:
> 
> Maybe I'm missing your point, but mlx4_table_find() does go to some
> trouble to find the right DMA address for the object being looked up.
> Of course it could be buggy but I still don't see why we would need a
> list of DMA addresses when we know we are only going to sync part of one
> page?

Is this always true? What if you allocated an ICM buffer that uses
non-adjacent pages? In this case you would need more than one call to
dma_sync_single_for_*(), wouldn't you?
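
To make the question concrete, here is a small userspace C sketch that
counts how many separate sync calls a byte range would need when a
logically contiguous buffer is backed by several separately mapped
pieces.  The struct and function names are hypothetical and do not come
from mlx4; no kernel DMA API is called.  It shows both sides: a range
that stays inside one piece needs a single call, while a range that
straddles two pieces needs one call per piece.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped piece of an ICM buffer. */
struct chunk {
	uint64_t dma_addr;  /* bus address of the piece */
	size_t   len;       /* length of the piece in bytes */
};

/*
 * Count how many separate sync calls the byte range [off, off + len) of the
 * logically contiguous buffer needs: one per chunk the range touches.
 * Returns 0 if the range is empty or extends past the end of the buffer.
 */
static int sync_calls_needed(const struct chunk *c, size_t n,
			     size_t off, size_t len)
{
	size_t start = 0;        /* logical offset where chunk i begins */
	size_t end = off + len;  /* one past the last byte of the range */
	size_t covered = 0;      /* bytes of the range found in chunks */
	int calls = 0;

	assert(c != NULL);
	for (size_t i = 0; i < n && len > 0; i++) {
		size_t cend = start + c[i].len;

		if (off < cend && end > start) {
			size_t lo = off > start ? off : start;
			size_t hi = end < cend ? end : cend;

			covered += hi - lo;
			calls++;
		}
		start = cend;
	}
	return covered == len ? calls : 0;
}
```

With two 4096-byte pieces at unrelated bus addresses, a 64-byte range at
offset 4000 touches one piece, while a 16-byte range at offset 4090
touches both.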

> 
> I think the right thing to do to fix this is to switch to using
> map_single instead of map_sg, and then use dma_sync_single_range_for_xxx
> to sync the subset we care about.
> 
>  - R.


From eli at dev.mellanox.co.il  Wed Aug 12 01:20:45 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 12 Aug 2009 11:20:45 +0300
Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding
In-Reply-To: <55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com>
References: <20090805082929.GG5599@mtls03>
	<55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com>
Message-ID: <20090812082045.GB23123@mtls03>

On Mon, Aug 10, 2009 at 12:31:19PM -0700, Sean Hefty wrote:
> >@@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private
> >*id_priv,
> > {
> > 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
> > 	int ret;
> >+	u16 pkey;
> >+
> >+        if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)
> >==
> 
> nit: It looks like the if is indented by spaces, instead of a tab.

Will fix, thanks.

> 
> >+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv)
> >+{
> >+	work->old_state = CMA_ROUTE_QUERY;
> >+	work->new_state = CMA_ROUTE_RESOLVED;
> >+	if (!route->path_rec->mtu || !route->path_rec->rate) {
> >+		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
> >+		work->event.status = -1;
> 
> Any reason not to fail immediately here and leave the id state unchanged?

No real reason. Immediate failure is just as good as a deferred one. I
will change that and also change the rule to omit
"!route->path_rec->rate".

> 
> >+	} else {
> >+		work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
> >+		work->event.status = 0;
> >+	}
> >+
> >+	queue_work(cma_wq, &work->work);
> >+
> >+	kfree(mw);
> >+}
> >+
> >+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv,
> >+				     struct cma_multicast *mc)
> >+{
> >+	struct rdmaoe_mcast_work *work;
> >+	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
> >+
> >+	if (cma_zero_addr((struct sockaddr *)&mc->addr))
> >+		return -EINVAL;
> >+
> >+	work = kzalloc(sizeof *work, GFP_KERNEL);
> >+	if (!work)
> >+		return -ENOMEM;
> >+
> >+	mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL);
> >+	if (!mc->multicast.ib) {
> >+		kfree(work);
> >+		return -ENOMEM;
> >+	}
> 
> nit: I'd prefer to goto a common cleanup area to make it easier to add changes
> in the future. 

Will change that.
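
A rough userspace sketch of the common-cleanup shape being asked for,
using calloc/free in place of kzalloc/kfree.  The struct names, the
function name and the fail_rec test hook are all hypothetical and are not
part of the patch; the sketch only shows the single unwind label that
later additions can hook into.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the work item and the multicast record. */
struct work  { int id; };
struct mcast { struct work *work; int *rec; };

/*
 * Allocate both objects; on failure, unwind through one exit path.
 * fail_rec is a test hook (not in the patch) that forces the second
 * allocation to fail.
 */
static int join_sketch(struct mcast *mc, int fail_rec)
{
	struct work *work;
	int ret;

	assert(mc != NULL);

	work = calloc(1, sizeof *work);
	if (!work)
		return -ENOMEM;

	mc->rec = fail_rec ? NULL : calloc(1, sizeof *mc->rec);
	if (!mc->rec) {
		ret = -ENOMEM;
		goto err_free_work;
	}

	mc->work = work;
	return 0;

err_free_work:
	free(work);
	return ret;
}
```

A later check (for example on the mtu) can then set ret and jump to a new
label placed above err_free_work instead of repeating the frees inline.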
> 
> >+
> >+	cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib-
> >>rec.mgid);
> >+	mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff);
> >+	if (id_priv->id.ps == RDMA_PS_UDP)
> >+		mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY);
> >+	mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev);
> >+	mc->multicast.ib->rec.hop_limit = 1;
> >+	mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu);
> 
> Do we need to check the rate/mtu here, like in resolve route?  Or should we be
> good since we could successfully resolve the route?  Actually, can we just read
> the data from the path record that gets stored with the id?
> 
I believe that querying the mtu again to get an up-to-date value will
be better, plus adding a check that the mtu is nonzero and otherwise
returning immediately with -EINVAL.

> >+	rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
> >+	work->id = id_priv;
> >+	work->mc = mc;
> >+	INIT_WORK(&work->work, rdmaoe_mcast_work_handler);
> >+
> >+	queue_work(cma_wq, &work->work);
> >+
> >+	return 0;
> >+}
> >+
> > int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
> > 			void *context)
> > {
> >@@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct
> >sockaddr *addr,
> > 	case RDMA_TRANSPORT_IB:
> > 		ret = cma_join_ib_multicast(id_priv, mc);
> > 		break;
> >+	case RDMA_TRANSPORT_RDMAOE:
> >+		ret = cma_rdmaoe_join_multicast(id_priv, mc);
> >+		break;
> > 	default:
> > 		ret = -ENOSYS;
> > 		break;
> >@@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct
> >sockaddr *addr,
> > 		spin_unlock_irq(&id_priv->lock);
> > 		kfree(mc);
> > 	}
> >+
> > 	return ret;
> > }
> > EXPORT_SYMBOL(rdma_join_multicast);
> >>port_num)) {
> >+	tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num);
> >+	switch (tt) {
> > 	case RDMA_TRANSPORT_IB:
> >-		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
> >+	case RDMA_TRANSPORT_RDMAOE:
> >+		ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt);
> 
> It seems simpler to just add a new call ucma_copy_rdmaoe_route, rather than
> merging those two transports into a single copy function that then branches
> based on the transport.

Agree, will change.
> 
> > 		break;
> > 	default:
> > 		break;
> >diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
> >index 483057b..66a848e 100644
> >--- a/include/rdma/ib_addr.h
> >+++ b/include/rdma/ib_addr.h
> >@@ -39,6 +39,8 @@
> > #include <linux/netdevice.h>
> > #include <linux/socket.h>
> > #include <rdma/ib_verbs.h>
> >+#include <linux/ethtool.h>
> >+#include <rdma/ib_pack.h>
> >
> > struct rdma_addr_client {
> > 	atomic_t refcount;
> >@@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr
> >*dev_addr,
> > 	memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid);
> > }
> >
> >+

> >+static inline int rdma_link_local_addr(struct in6_addr *addr)
> >+{
> >+	if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) &&
> >+	    addr->s6_addr32[1] == 0)
> >+		return 1;
> >+	else
> >+		return 0;
> >+}
> 
> just replace the 'if' with 'return'
Will do.
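
For illustration, the simplified form could look like the userspace
sketch below.  The function name is hypothetical, and memcpy is used
instead of the kernel's s6_addr32 member so that it builds outside the
kernel; the comparison itself is the same as in the quoted helper.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Same test as the quoted helper, with the comparison returned directly. */
static int link_local_sketch(const struct in6_addr *addr)
{
	uint32_t w[2];

	assert(addr != NULL);
	memcpy(w, addr->s6_addr, sizeof w);   /* first 64 bits, network order */
	return w[0] == htonl(0xfe800000) && w[1] == 0;
}
```

The && expression already evaluates to 1 or 0 in C, so the if/else adds
nothing.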

> 
> >+
> >+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
> >+{
> >+	memcpy(mac, &addr->s6_addr[8], 3);
> >+	memcpy(mac + 3, &addr->s6_addr[13], 3);
> >+	mac[0] ^= 2;
> >+}
> >+
> >+static inline int rdma_is_multicast_addr(struct in6_addr *addr)
> >+{
> >+	return addr->s6_addr[0] == 0xff ? 1 : 0;
> >+}
> >+
> >+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
> >+{
> >+	memset(mac, 0xff, 6);
> >+}
> 
> I don't think we want all of these inline, in particular rdmaoe_mac_to_ll,
> rdmaoe_get_mtu , rdmaoe_get_rate.

They're quite simple functions - what would you prefer, export them?
Why?



From sashak at voltaire.com  Wed Aug 12 01:51:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 11:51:11 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
Message-ID: <20090812085111.GH25501@me>

On 17:33 Tue 11 Aug     , Hal Rosenstock wrote:
> 
> Is it supposed to use file when no files (LFT or LID matrix) are supplied ?

No.

> That's what seems to happen (with no fallback).

Look at how ucast_mgr_route() works in osm_ucast_mgr.c - it will run
the default ucast mgr methods if 'file' returns '1'.

Sasha


From sashak at voltaire.com  Wed Aug 12 02:03:29 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 12:03:29 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c:
 osm_mcast_tbl_get_block returns boolean
In-Reply-To: <20090807134305.GA30766@comcast.net>
References: <20090807134305.GA30766@comcast.net>
Message-ID: <20090812090329.GI25501@me>

On 09:43 Fri 07 Aug     , Hal Rosenstock wrote:
> 
> so use TRUE/FALSE rather than IB_INVALID_PARAMETER
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
> index 82850be..38c06c1 100644
> --- a/opensm/opensm/osm_mcast_tbl.c
> +++ b/opensm/opensm/osm_mcast_tbl.c
> @@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
>  	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
>  
>  	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
> -		return (IB_INVALID_PARAMETER);
> +		return (TRUE);

In this case the p_block array is not initialized, so just returning
'TRUE' is not a good idea.

Actually, if we are hitting this case it can indicate an inconsistent
mcast_tbl initialization - I would suggest reworking this part and
likely dropping this check altogether.

Sasha


From sashak at voltaire.com  Wed Aug 12 02:10:22 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 12:10:22 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_qos_policy.c: Some minor
	simplifications
In-Reply-To: <20090811184011.GA5666@comcast.net>
References: <20090811184011.GA5666@comcast.net>
Message-ID: <20090812091022.GL25501@me>

On 14:40 Tue 11 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From jackm at dev.mellanox.co.il  Wed Aug 12 02:15:39 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 12 Aug 2009 12:15:39 +0300
Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open
	while device is being removed
In-Reply-To: <adaprb2721c.fsf@cisco.com>
References: <200908111021.01612.jackm@dev.mellanox.co.il>
	<adaprb2721c.fsf@cisco.com>
Message-ID: <200908121215.39767.jackm@dev.mellanox.co.il>

On Tuesday 11 August 2009 19:23, Roland Dreier wrote:
> 
>  > this is a continuation of thread:
>  > http://lists.openfabrics.org/pipermail/general/2009-July/060668.html
> 
> I see you
> didn't answer the question about mthca -- does it suffer from this
> problem as well?
> 
Sorry about that.  Yes, mthca also suffers from this problem.
I'm sending a patch right now:
mthca: Do not allow ib userspace open following device internal error

-Jack


From jackm at dev.mellanox.co.il  Wed Aug 12 02:15:46 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 12 Aug 2009 12:15:46 +0300
Subject: [ofa-general] [PATCH] mthca: Do not allow ib userspace open
	following device internal error
Message-ID: <200908121215.46221.jackm@dev.mellanox.co.il>

Userspace apps are supposed to release all ib device resources if
they receive a fatal async event (IBV_EVENT_DEVICE_FATAL).  However,
the app has no way of knowing when the device has come back up, except
to repeatedly attempt ibv_open_device() until it succeeds.

However, currently there is no protection against open succeeding when
the device is in the midst of the removal following the fatal event.
In this case, the open will succeed, but as a result the device waits
in the middle of its removal until the new app releases its ib resources
 -- and the new app will not do so, since the open succeeded at a point
following the fatal event generation.

This patch adds an "active" flag to the device. The active flag is set to
false (in the fatal event flow) before the "fatal" event is generated,
so any subsequent ibv_open_device() call to the device will fail until the
device comes back up, thus preventing the above deadlock.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

---
Roland,
You are right, mthca also needs such a patch.

This will prevent user-level apps from allocating a device context following
a device internal catastrophic error.

BTW, if the administrator has disabled device reset on fatal (by default, it is
enabled), user-apps will simply need to wait for admin intervention
(rmmod and insmod on low-level driver).  IMHO, this is OK -- following an internal
error, the device must be reset anyway, so there is no point in allowing new apps
to attempt to run.

diff --git a/drivers/infiniband/hw/mthca/mthca_catas.c b/drivers/infiniband/hw/mthca/mthca_catas.c
index 65ad359..ad8b26b 100644
--- a/drivers/infiniband/hw/mthca/mthca_catas.c
+++ b/drivers/infiniband/hw/mthca/mthca_catas.c
@@ -88,6 +88,7 @@ static void handle_catas(struct mthca_dev *dev)
 	event.device = &dev->ib_dev;
 	event.event  = IB_EVENT_DEVICE_FATAL;
 	event.element.port_num = 0;
+	dev->active = 0;
 
 	ib_dispatch_event(&event);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index 9ef611f..c1e2bcb 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -357,6 +357,7 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+	int		      active;
 };
 
 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 13da9f1..118a386 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1116,6 +1116,8 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 	pci_set_drvdata(pdev, mdev);
 	mdev->hca_type = hca_type;
 
+	mdev->active = 1;
+
 	return 0;
 
 err_unregister:
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 87ad889..bcf7a40 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -334,6 +334,9 @@ static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev,
 	struct mthca_ucontext           *context;
 	int                              err;
 
+	if (!(to_mdev(ibdev)->active))
+		return ERR_PTR(-EAGAIN);
+
 	memset(&uresp, 0, sizeof uresp);
 
 	uresp.qp_tab_size = to_mdev(ibdev)->limits.num_qps;


From vlad at lists.openfabrics.org  Wed Aug 12 03:03:46 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 12 Aug 2009 03:03:46 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090812-0200 daily build status
Message-ID: <20090812100346.B57A9E61D61@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:300: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:311: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hal.rosenstock at gmail.com  Wed Aug 12 03:06:48 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:06:48 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812085111.GH25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
Message-ID: <f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>

On Wed, Aug 12, 2009 at 4:51 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 17:33 Tue 11 Aug     , Hal Rosenstock wrote:
> >
> > Is it supposed to use file when no files (LFT or LID matrix) are supplied ?
>
> No.
>
> > That's what seems to happen (with no fallback).
>
> Look at how ucast_mgr_route() works in osm_ucast_mgr.c - it will run
> the default ucast mgr methods if 'file' returns '1'.

Yes, I had looked at that.

What I see is the following when no files are specified:
osm_ucast_mgr_process: file tables configured on all switches

so file doesn't appear to be falling back in this case.

-- Hal

>
>
>
> Sasha


From sashak at voltaire.com  Wed Aug 12 03:25:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:25:01 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
Message-ID: <20090812102501.GN25501@me>

On 06:06 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> What I see is the following when no files are specified:
> osm_ucast_mgr_process: file tables configured on all switches
> 
> so file doesn't appear to be falling back in this case.

	if (!r->build_lid_matrices ||
	    (ret = r->build_lid_matrices(r->context)) > 0)
		ret = osm_ucast_mgr_build_lid_matrices(&osm->sm.ucast_mgr);

So when the method is defined and returns a positive value (file name is
not specified), OpenSM will build lid matrices using the default algorithm
and will continue with the LFT file. This is how things were supposed to
work.
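
The same fallback rule, pulled out into a standalone C sketch.  The
struct, the counter and the three helper functions are hypothetical
stand-ins, not OpenSM code; only the shape of the if condition is taken
from the snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down routing engine: one optional method plus context. */
struct engine {
	int (*build_lid_matrices)(void *context);
	void *context;
};

static int default_calls;   /* counts runs of the default algorithm */

static int default_build_lid_matrices(void)
{
	default_calls++;
	return 0;
}

/* Method that reports "nothing for me to do" with a positive value. */
static int no_file_given(void *context)
{
	(void)context;
	return 1;
}

/* Method that did its own work successfully. */
static int loaded_from_file(void *context)
{
	(void)context;
	return 0;
}

static int build_matrices(const struct engine *r)
{
	int ret = 0;

	assert(r != NULL);
	/* Missing method, or method returned > 0: fall back to the default. */
	if (!r->build_lid_matrices ||
	    (ret = r->build_lid_matrices(r->context)) > 0)
		ret = default_build_lid_matrices();
	return ret;
}
```

A method that returns 0 suppresses the default, a missing method or a
positive return value triggers it, and a negative value is passed through
as an error.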

Sasha


From sashak at voltaire.com  Wed Aug 12 03:26:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:26:32 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: clear packet buffer correctly
 before formating and sending
In-Reply-To: <20090811100428.c4fb6c5e.weiny2@llnl.gov>
References: <20090811100428.c4fb6c5e.weiny2@llnl.gov>
Message-ID: <20090812102632.GO25501@me>

On 10:04 Tue 11 Aug     , Ira Weiny wrote:
> I found this bug a while back but forgot to submit a patch.
> 
> I don't think this will affect the issues Mr. Miller was having with BM, as I believe the BM stuff he was trying all expected a response (thereby calling mad_rpc instead).  But it could be worth a try.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Tue, 11 Aug 2009 10:00:25 -0700
> Subject: [PATCH] libibmad: clear packet buffer correctly before formating and sending
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Wed Aug 12 03:27:56 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:27:56 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: 
	osm_mcast_tbl_get_block returns boolean
In-Reply-To: <20090812090329.GI25501@me>
References: <20090807134305.GA30766@comcast.net> <20090812090329.GI25501@me>
Message-ID: <f0e08f230908120327y39bb2274v7fdc6a85f7bb0414@mail.gmail.com>

On Wed, Aug 12, 2009 at 5:03 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 09:43 Fri 07 Aug     , Hal Rosenstock wrote:
> >
> > so use TRUE/FALSE rather than IB_INVALID_PARAMETER
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> > ---
> > diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
> > index 82850be..38c06c1 100644
> > --- a/opensm/opensm/osm_mcast_tbl.c
> > +++ b/opensm/opensm/osm_mcast_tbl.c
> > @@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
> >       mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
> >
> >       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
> > -             return (IB_INVALID_PARAMETER);
> > +             return (TRUE);
>
> In this case p_block array is not initialized, so just returning 'TRUE'
> is not a good idea.


That's how it's handled now in the code.


>
>
> Actually if we are hitting this case it can indicate an inconsistent
> mcast_tbl initialization


That's one possibility. The other would be a SA query for a specific
block that is not validated first.


> - I would suggest reworking this part and likely dropping this check altogether.


Dropping it should be OK as long as all callers validate prior to calling
(and it looks like they do).

-- Hal


>
>
> Sasha

From hal.rosenstock at gmail.com  Wed Aug 12 03:36:22 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:36:22 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812102501.GN25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
Message-ID: <f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>

On Wed, Aug 12, 2009 at 6:25 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:06 Wed 12 Aug     , Hal Rosenstock wrote:
> >
> > What I see is the following when no files are specified:
> > osm_ucast_mgr_process: file tables configured on all switches
> >
> > so file doesn't appear to be falling back in this case.
>
>        if (!r->build_lid_matrices ||
>            (ret = r->build_lid_matrices(r->context)) > 0)
>                ret = osm_ucast_mgr_build_lid_matrices(&osm->sm.ucast_mgr);
>
> So when the method is defined and it returns a positive value (file name is
> not specified), OpenSM will build the lid matrices using the default algorithm
> and will continue with the LFT file. This is how things were supposed to
> work.


Does that really make sense for the case where no files are supplied but
the file routing engine is specified?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Wed Aug 12 03:42:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:42:39 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c:
	osm_mcast_tbl_get_block returns boolean
In-Reply-To: <f0e08f230908120327y39bb2274v7fdc6a85f7bb0414@mail.gmail.com>
References: <20090807134305.GA30766@comcast.net> <20090812090329.GI25501@me>
	<f0e08f230908120327y39bb2274v7fdc6a85f7bb0414@mail.gmail.com>
Message-ID: <20090812104239.GP25501@me>

On 06:27 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> That's one possibility. The other would be a SA query for a specific
> block that is not validated first.

And there is no reason to return a "success" status with garbage
data.

Sasha


From sashak at voltaire.com  Wed Aug 12 03:46:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:46:49 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
Message-ID: <20090812104649.GQ25501@me>

On 06:36 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> Does that really make sense for the case where no files are supplied but
> the file routing engine is specified?

This is for the case when one of the files (or both) is *not* specified,
like this:

   opensm ... -R file -U lft-file

, which is pretty useful.

Sasha


From hal.rosenstock at gmail.com  Wed Aug 12 03:48:30 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:48:30 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812104649.GQ25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
Message-ID: <f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>

On Wed, Aug 12, 2009 at 6:46 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:36 Wed 12 Aug     , Hal Rosenstock wrote:
> >
> > Does that really make sense for the case where no files are supplied
> > but the file routing engine is specified?
>
> This is for the case when one of the files (or both) is *not* specified,
> like this:
>
>   opensm ... -R file -U lft-file
>
> , which is pretty useful.


I'm asking about the utility of:

opensm -R file

(no -U and no -M)

Why shouldn't that case fall back?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Wed Aug 12 03:57:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:57:34 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
	<f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
Message-ID: <20090812105734.GR25501@me>

On 06:48 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> I'm asking about the utility of:
> 
> opensm -R file
> 
> (no -U and no -M)

(but then the discussion is not related to the proposed patch, since this
breaks the '-R file -U lft-file' case)

> 
> Why shouldn't that case fall back?

Why should it, if the user is asking to run such a configuration?

Sasha


From hal.rosenstock at gmail.com  Wed Aug 12 03:59:53 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:59:53 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812105734.GR25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
	<f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
	<20090812105734.GR25501@me>
Message-ID: <f0e08f230908120359p5aa99733taa97af689f52b0e9@mail.gmail.com>

On Wed, Aug 12, 2009 at 6:57 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:48 Wed 12 Aug     , Hal Rosenstock wrote:
> >
> > I'm asking about the utility of:
> >
> > opensm -R file
> >
> > (no -U and no -M)
>
> (but then the discussion is not related to the proposed patch, since this
> breaks the '-R file -U lft-file' case)


Right; I was discussing the original motivation for the patch.


>
>
> >
> > Why shouldn't that case fall back?
>
> Why should it, if the user is asking to run such a configuration?


Because it's a broken request (there is no warning that nothing useful is going
to be done). Don't we try to fall back in broken scenarios?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Wed Aug 12 04:16:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 14:16:12 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908120359p5aa99733taa97af689f52b0e9@mail.gmail.com>
References: <20090811212221.GG25501@me>
	<f0e08f230908111433p7a5e36edtf3b0212208fcf1b7@mail.gmail.com>
	<20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
	<f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
	<20090812105734.GR25501@me>
	<f0e08f230908120359p5aa99733taa97af689f52b0e9@mail.gmail.com>
Message-ID: <20090812111612.GS25501@me>

On 06:59 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> Because it's a broken request (there is no warning that nothing useful is going
> to be done).

There are warnings about unspecified files.

> Don't we try to fall back in broken scenarios?

It is not obvious in this case. For instance, one may want to run OpenSM
to fetch the LFTs as a template, modify them, and reload them by just adding
'-U', etc. Anyway, it is the user's decision there.

Sasha


From hal.rosenstock at gmail.com  Wed Aug 12 04:26:31 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 07:26:31 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return 
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812111612.GS25501@me>
References: <20090811212221.GG25501@me> <20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
	<f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
	<20090812105734.GR25501@me>
	<f0e08f230908120359p5aa99733taa97af689f52b0e9@mail.gmail.com>
	<20090812111612.GS25501@me>
Message-ID: <f0e08f230908120426j4ae0cdb3h79e886d297fc0d22@mail.gmail.com>

On Wed, Aug 12, 2009 at 7:16 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 06:59 Wed 12 Aug     , Hal Rosenstock wrote:
> >
> > Because it's a broken request (there is no warning that nothing useful is
> > going to be done).
>
> There are warnings about unspecified files.


These messages are VERBOSE in log level so they're not normally seen. INFO
might be better.


>
>
> > Don't we try to fall back in broken scenarios?
>
> It is not obvious in this case. For instance, one may want to run OpenSM
> to fetch the LFTs as a template, modify them, and reload them by just adding
> '-U', etc.


dumping LFTs in case of no files doesn't make for a very good template to
start with.


> Anyway, it is the user's decision there.


Perhaps but we usually help with poor decisions.

-- Hal


>
> Sasha
>

From halves at linux.vnet.ibm.com  Wed Aug 12 05:17:53 2009
From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves)
Date: Wed, 12 Aug 2009 09:17:53 -0300
Subject: [ofa-general] Chelsio cards
In-Reply-To: <4A825D8F.8030102@voltaire.com>
References: <1250023074.16631.1.camel@halves-ltc>
	<4A81DF7E.6020907@opengridcomputing.com>
	<4A825D8F.8030102@voltaire.com>
Message-ID: <1250079473.7238.4.camel@halves-ltc>

Hi guys,

Thanks a lot. 

On Wed, 2009-08-12 at 09:13 +0300, Aleksey Senin wrote:
> > 
> > User mode:
> > openmpi
> > mvapich2
> > udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP 
> > MPI, Scali MPI)
> > rdmacm (required for connection setup)
> > ibverbs (RC QP only)
> > perftest (ib_rdma_bw and ib_rdma_lat only)
> > qperf
> > rping
> Another two binaries in user space
> ib_rdma_lat
> ib_rdma_bw
-- 
Higor Aparecido Vieira Alves
Software Engineer
Linux Technology Center 
IBM Systems & Technology Group


From hnrose at comcast.net  Wed Aug 12 06:22:47 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 12 Aug 2009 09:22:47 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c: In
	osm_mcast_tbl_get_block, eliminate unneeded check
Message-ID: <20090812132247.GA15084@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 82850be..029a735 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -272,9 +272,6 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
-		return (IB_INVALID_PARAMETER);
-
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
 		p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];
 

From sashak at voltaire.com  Wed Aug 12 08:53:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 18:53:10 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return
	status from do_ucast_file_load when file name is not provided
In-Reply-To: <f0e08f230908120426j4ae0cdb3h79e886d297fc0d22@mail.gmail.com>
References: <20090812085111.GH25501@me>
	<f0e08f230908120306t2a79ad79n74c2decad702542a@mail.gmail.com>
	<20090812102501.GN25501@me>
	<f0e08f230908120336g4db38ce3j2630e9dafce87867@mail.gmail.com>
	<20090812104649.GQ25501@me>
	<f0e08f230908120348j6232938cu5dc834cc4ad4246d@mail.gmail.com>
	<20090812105734.GR25501@me>
	<f0e08f230908120359p5aa99733taa97af689f52b0e9@mail.gmail.com>
	<20090812111612.GS25501@me>
	<f0e08f230908120426j4ae0cdb3h79e886d297fc0d22@mail.gmail.com>
Message-ID: <20090812155310.GU25501@me>

On 07:26 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> These messages are VERBOSE in log level so they're not normally seen. INFO
> might be better.

VERBOSE is more than enough IMO.

> dumping LFTs in case of no files doesn't make for a very good template to
> start with.

Why not? It will be typical minhop.

> Perhaps but we usually help with poor decisions.

We can indicate an invalid decision, but in this case it is *not* invalid
(maybe useless, but we cannot know for sure).

Sasha


From rdreier at cisco.com  Wed Aug 12 08:52:36 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Aug 2009 08:52:36 -0700
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has
	not allocated
In-Reply-To: <20090812070558.GA23123@mtls03> (Eli Cohen's message of "Wed, 12
	Aug 2009 10:05:58 +0300")
References: <e2e108260908081049xdf7b78fp80e1e23748b3b5c1@mail.gmail.com>
	<20090810084527.GA2446@mtls03> <adaljlr8l8r.fsf@cisco.com>
	<20090812061749.GA20719@mtls03> <ada4osd7dor.fsf@cisco.com>
	<20090812070558.GA23123@mtls03>
Message-ID: <adazla558sb.fsf@cisco.com>


 > Is this always true? What if you allocated an ICM buffer that uses
 > non-adjacent pages? In this case you would need more than one call to
 > dma_sync_single_for_*(), wouldn't you?

As I said originally, the code only writes one page at a time from CPU
into ICM, so you never need more than one dma_sync at a time.

 - R.


From hnrose at comcast.net  Wed Aug 12 10:06:14 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 12 Aug 2009 13:06:14 -0400
Subject: [ofa-general] [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes
Message-ID: <20090812170614.GA16298@comcast.net>


IB/mad: Allow tuning of QP0 and QP1 sizes

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..7e553c3 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
 MODULE_AUTHOR("Sean Hefty");
 
+int mad_sendq_size = IB_MAD_QP_SEND_SIZE;
+int mad_recvq_size = IB_MAD_QP_RECV_SIZE;
+
+module_param_named(send_queue_size, mad_sendq_size, int, 0444);
+MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
+module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
+MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 	qp_init_attr.send_cq = qp_info->port_priv->cq;
 	qp_init_attr.recv_cq = qp_info->port_priv->cq;
 	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
-	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
-	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
+	qp_init_attr.cap.max_send_wr = mad_sendq_size;
+	qp_init_attr.cap.max_recv_wr = mad_recvq_size;
 	qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
 	qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
 	qp_init_attr.qp_type = qp_type;
@@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 		goto error;
 	}
 	/* Use minimum queue sizes unless the CQ is resized */
-	qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE;
-	qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE;
+	qp_info->send_queue.max_active = mad_sendq_size;
+	qp_info->recv_queue.max_active = mad_recvq_size;
 	return 0;
 
 error:
@@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
-	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	cq_size = (mad_sendq_size + mad_recvq_size) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void)
 {
 	int ret;
 
+	mad_recvq_size = roundup_pow_of_two(mad_recvq_size);
+	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
+	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
+
+	mad_sendq_size = roundup_pow_of_two(mad_sendq_size);
+	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
+	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
+
 	spin_lock_init(&ib_mad_port_list_lock);
 
 	ib_mad_cache = kmem_cache_create("ib_mad",
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 05ce331..9430ab4 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -49,6 +50,8 @@
 /* QP and CQ parameters */
 #define IB_MAD_QP_SEND_SIZE	128
 #define IB_MAD_QP_RECV_SIZE	512
+#define IB_MAD_QP_MIN_SIZE	64
+#define IB_MAD_QP_MAX_SIZE	8192
 #define IB_MAD_SEND_REQ_MAX_SG	2
 #define IB_MAD_RECV_REQ_MAX_SG	1
 

From sean.hefty at intel.com  Wed Aug 12 10:09:58 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Aug 2009 10:09:58 -0700
Subject: [ofa-general] RE: [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes
In-Reply-To: <20090812170614.GA16298@comcast.net>
References: <20090812170614.GA16298@comcast.net>
Message-ID: <81E288C717914E548EE8023C330E10C3@amr.corp.intel.com>

>+	mad_recvq_size = roundup_pow_of_two(mad_recvq_size);
>+	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
>+	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
>+
>+	mad_sendq_size = roundup_pow_of_two(mad_sendq_size);
>+	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
>+	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);

Why round up to a power of two or have min/max restrictions?


From hal.rosenstock at gmail.com  Wed Aug 12 10:13:37 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 13:13:37 -0400
Subject: [ofa-general] RE: [PATCH] IB/mad: Allow tuning of QP0 and QP1 
	sizes
In-Reply-To: <81E288C717914E548EE8023C330E10C3@amr.corp.intel.com>
References: <20090812170614.GA16298@comcast.net>
	<81E288C717914E548EE8023C330E10C3@amr.corp.intel.com>
Message-ID: <f0e08f230908121013x69d78d6cod6356cb91ed9386b@mail.gmail.com>

On Wed, Aug 12, 2009 at 1:09 PM, Sean Hefty <sean.hefty at intel.com> wrote:

> >+      mad_recvq_size = roundup_pow_of_two(mad_recvq_size);
> >+      mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
> >+      mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
> >+
> >+      mad_sendq_size = roundup_pow_of_two(mad_sendq_size);
> >+      mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
> >+      mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
>
> Why round up to a power of two or have min/max restrictions?


The power of two is arbitrary and could be removed. The min is also somewhat
arbitrary, but I didn't want to allow a size much smaller than the current one
(the default for this patch). The max truly is a maximum, as create QP fails
with a larger size (I didn't try this across all HCAs, though).



From rdreier at cisco.com  Wed Aug 12 10:17:15 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Aug 2009 10:17:15 -0700
Subject: [ofa-general] Re: [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes
In-Reply-To: <20090812170614.GA16298@comcast.net> (Hal Rosenstock's message of
	"Wed, 12 Aug 2009 13:06:14 -0400")
References: <20090812170614.GA16298@comcast.net>
Message-ID: <adavdkt54v8.fsf@cisco.com>


 > IB/mad: Allow tuning of QP0 and QP1 sizes

-ENOCHANGELOG

Why allow this tuning?  In a couple years, when I'm reading the kernel
changelog, what am I going to need to know to see why we did this?

 - R.


From hnrose at comcast.net  Wed Aug 12 10:22:51 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 12 Aug 2009 13:22:51 -0400
Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes
Message-ID: <20090812172251.GA16446@comcast.net>


IB/mad: Allow tuning of QP0 and QP1 sizes

MADs are UD and can be dropped if there are no receives posted.
Send side tuning is done for symmetry with receive.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Added changelog

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..7e553c3 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
 MODULE_AUTHOR("Sean Hefty");
 
+int mad_sendq_size = IB_MAD_QP_SEND_SIZE;
+int mad_recvq_size = IB_MAD_QP_RECV_SIZE;
+
+module_param_named(send_queue_size, mad_sendq_size, int, 0444);
+MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
+module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
+MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 	qp_init_attr.send_cq = qp_info->port_priv->cq;
 	qp_init_attr.recv_cq = qp_info->port_priv->cq;
 	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
-	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
-	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
+	qp_init_attr.cap.max_send_wr = mad_sendq_size;
+	qp_init_attr.cap.max_recv_wr = mad_recvq_size;
 	qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
 	qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
 	qp_init_attr.qp_type = qp_type;
@@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 		goto error;
 	}
 	/* Use minimum queue sizes unless the CQ is resized */
-	qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE;
-	qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE;
+	qp_info->send_queue.max_active = mad_sendq_size;
+	qp_info->recv_queue.max_active = mad_recvq_size;
 	return 0;
 
 error:
@@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
-	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	cq_size = (mad_sendq_size + mad_recvq_size) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void)
 {
 	int ret;
 
+	mad_recvq_size = roundup_pow_of_two(mad_recvq_size);
+	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
+	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
+
+	mad_sendq_size = roundup_pow_of_two(mad_sendq_size);
+	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
+	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
+
 	spin_lock_init(&ib_mad_port_list_lock);
 
 	ib_mad_cache = kmem_cache_create("ib_mad",
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 05ce331..9430ab4 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -49,6 +50,8 @@
 /* QP and CQ parameters */
 #define IB_MAD_QP_SEND_SIZE	128
 #define IB_MAD_QP_RECV_SIZE	512
+#define IB_MAD_QP_MIN_SIZE	64
+#define IB_MAD_QP_MAX_SIZE	8192
 #define IB_MAD_SEND_REQ_MAX_SG	2
 #define IB_MAD_RECV_REQ_MAX_SG	1
 

From suri at baymicrosystems.com  Wed Aug 12 11:48:43 2009
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Wed, 12 Aug 2009 14:48:43 -0400
Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes
In-Reply-To: <20090812172251.GA16446@comcast.net>
References: <20090812172251.GA16446@comcast.net>
Message-ID: <9985926A1C2A496B89AC63ACE13093D5@md.baymicrosystems.com>

Hal:

1. Aren't you going to remove the power_of_two?
2. Also, don't you need permissions to be 644?

-Suri

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf
> Of Hal Rosenstock
> Sent: Wednesday, August 12, 2009 1:23 PM
> To: rdreier at cisco.com; sean.hefty at intel.com
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes
> 
> 
> IB/mad: Allow tuning of QP0 and QP1 sizes
> 
> MADs are UD and can be dropped if there are no receives posted.
> Send side tuning is done for symmetry with receive.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added changelog
> 
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index de922a0..7e553c3 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2005 Intel Corporation.  All rights reserved.
>   * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API");
>  MODULE_AUTHOR("Hal Rosenstock");
>  MODULE_AUTHOR("Sean Hefty");
> 
> +int mad_sendq_size = IB_MAD_QP_SEND_SIZE;
> +int mad_recvq_size = IB_MAD_QP_RECV_SIZE;
> +
> +module_param_named(send_queue_size, mad_sendq_size, int, 0444);
> +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
> +module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
> +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
> +
>  static struct kmem_cache *ib_mad_cache;
> 
>  static struct list_head ib_mad_port_list;
> @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
>  	qp_init_attr.send_cq = qp_info->port_priv->cq;
>  	qp_init_attr.recv_cq = qp_info->port_priv->cq;
>  	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> -	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
> -	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
> +	qp_init_attr.cap.max_send_wr = mad_sendq_size;
> +	qp_init_attr.cap.max_recv_wr = mad_recvq_size;
>  	qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
>  	qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
>  	qp_init_attr.qp_type = qp_type;
> @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
>  		goto error;
>  	}
>  	/* Use minimum queue sizes unless the CQ is resized */
> -	qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE;
> -	qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE;
> +	qp_info->send_queue.max_active = mad_sendq_size;
> +	qp_info->recv_queue.max_active = mad_recvq_size;
>  	return 0;
> 
>  error:
> @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device,
>  	init_mad_qp(port_priv, &port_priv->qp_info[0]);
>  	init_mad_qp(port_priv, &port_priv->qp_info[1]);
> 
> -	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
> +	cq_size = (mad_sendq_size + mad_recvq_size) * 2;
>  	port_priv->cq = ib_create_cq(port_priv->device,
>  				     ib_mad_thread_completion_handler,
>  				     NULL, port_priv, cq_size, 0);
> @@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void)
>  {
>  	int ret;
> 
> +	mad_recvq_size = roundup_pow_of_two(mad_recvq_size);
> +	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
> +	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
> +
> +	mad_sendq_size = roundup_pow_of_two(mad_sendq_size);
> +	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
> +	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
> +
>  	spin_lock_init(&ib_mad_port_list_lock);
> 
>  	ib_mad_cache = kmem_cache_create("ib_mad",
> diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
> index 05ce331..9430ab4 100644
> --- a/drivers/infiniband/core/mad_priv.h
> +++ b/drivers/infiniband/core/mad_priv.h
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2005 Intel Corporation. All rights reserved.
>   * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -49,6 +50,8 @@
>  /* QP and CQ parameters */
>  #define IB_MAD_QP_SEND_SIZE	128
>  #define IB_MAD_QP_RECV_SIZE	512
> +#define IB_MAD_QP_MIN_SIZE	64
> +#define IB_MAD_QP_MAX_SIZE	8192
>  #define IB_MAD_SEND_REQ_MAX_SG	2
>  #define IB_MAD_RECV_REQ_MAX_SG	1
> 


From hnrose at comcast.net  Wed Aug 12 12:27:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 12 Aug 2009 15:27:05 -0400
Subject: [ofa-general] [PATCHv3] IB/mad: Allow tuning of QP0 and QP1 sizes
Message-ID: <20090812192705.GA16704@comcast.net>


IB/mad: Allow tuning of QP0 and QP1 sizes

MADs are UD and can be dropped if there are no receives posted.
Send side tuning is done for symmetry with receive.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v2:
Removed roundup_pow_of_two of receive and send sizes
Changed module parameter permissions to 0644

Changes since v1:
Added changelog

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..ff9bc22 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
 MODULE_AUTHOR("Sean Hefty");
 
+int mad_sendq_size = IB_MAD_QP_SEND_SIZE;
+int mad_recvq_size = IB_MAD_QP_RECV_SIZE;
+
+module_param_named(send_queue_size, mad_sendq_size, int, 0644);
+MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
+module_param_named(recv_queue_size, mad_recvq_size, int, 0644);
+MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 	qp_init_attr.send_cq = qp_info->port_priv->cq;
 	qp_init_attr.recv_cq = qp_info->port_priv->cq;
 	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
-	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
-	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
+	qp_init_attr.cap.max_send_wr = mad_sendq_size;
+	qp_init_attr.cap.max_recv_wr = mad_recvq_size;
 	qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
 	qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
 	qp_init_attr.qp_type = qp_type;
@@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 		goto error;
 	}
 	/* Use minimum queue sizes unless the CQ is resized */
-	qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE;
-	qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE;
+	qp_info->send_queue.max_active = mad_sendq_size;
+	qp_info->recv_queue.max_active = mad_recvq_size;
 	return 0;
 
 error:
@@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
-	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	cq_size = (mad_sendq_size + mad_recvq_size) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2984,6 +2993,12 @@ static int __init ib_mad_init_module(void)
 {
 	int ret;
 
+	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
+	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
+
+	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
+	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
+
 	spin_lock_init(&ib_mad_port_list_lock);
 
 	ib_mad_cache = kmem_cache_create("ib_mad",
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 05ce331..9430ab4 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -49,6 +50,8 @@
 /* QP and CQ parameters */
 #define IB_MAD_QP_SEND_SIZE	128
 #define IB_MAD_QP_RECV_SIZE	512
+#define IB_MAD_QP_MIN_SIZE	64
+#define IB_MAD_QP_MAX_SIZE	8192
 #define IB_MAD_SEND_REQ_MAX_SG	2
 #define IB_MAD_RECV_REQ_MAX_SG	1
 

From rdreier at cisco.com  Wed Aug 12 14:20:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Aug 2009 14:20:02 -0700
Subject: [ofa-general] [PATCHv3] IB/mad: Allow tuning of QP0 and QP1 sizes
In-Reply-To: <20090812192705.GA16704@comcast.net> (Hal Rosenstock's message of
	"Wed, 12 Aug 2009 15:27:05 -0400")
References: <20090812192705.GA16704@comcast.net>
Message-ID: <adahbwc6871.fsf@cisco.com>


 > Changed module parameter permissions to 0644

Does it really work if someone changes the module parameter at runtime
after the module is loaded?

 - R.
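
The concern can be illustrated with a small userspace sketch. It assumes, as the patch shows, that the clamp runs only once in ib_mad_init_module(), and that a parameter with 0644 permissions can later be written without passing through that code. Everything below (clamp_qp_size, struct fake_qp, fake_sysfs_write and the other fake_* helpers) is hypothetical and only models the patch's logic; it is not kernel code.

```c
#include <assert.h>

#define IB_MAD_QP_MIN_SIZE 64
#define IB_MAD_QP_MAX_SIZE 8192

/* Same clamp as the min()/max() pair the patch adds to module init. */
static int clamp_qp_size(int requested)
{
	int size = requested;

	if (size > IB_MAD_QP_MAX_SIZE)
		size = IB_MAD_QP_MAX_SIZE;
	if (size < IB_MAD_QP_MIN_SIZE)
		size = IB_MAD_QP_MIN_SIZE;
	assert(size >= IB_MAD_QP_MIN_SIZE && size <= IB_MAD_QP_MAX_SIZE);
	return size;
}

/* Hypothetical stand-in for a QP whose depth is fixed at creation. */
struct fake_qp {
	int max_recv_wr;
};

static int mad_recvq_size = 512;	/* stand-in for the module parameter */

/* Models module init: clamp once, before any port is opened. */
static void fake_module_init(void)
{
	mad_recvq_size = clamp_qp_size(mad_recvq_size);
}

/* Models create_mad_qp(): the QP copies the value current at that time. */
static struct fake_qp fake_create_qp(void)
{
	struct fake_qp qp = { .max_recv_wr = mad_recvq_size };
	return qp;
}

/* Models a runtime write to a writable parameter: no clamp is applied. */
static void fake_sysfs_write(int value)
{
	mad_recvq_size = value;
}
```

In this model, a write after load leaves already-created QPs at their creation-time depth, while the variable itself now holds an unclamped value that any port opened later would pick up.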


From akepner at sgi.com  Wed Aug 12 15:59:26 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Aug 2009 15:59:26 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
Message-ID: <20090812225926.GD24786@sgi.com>


We have a customer who has repeatedly had system panics with 
the following signature:


Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
<ffffffff882c2c5c>{:ib_cm:ib_cm_init_qp_attr+580}
PGD 3a2db6067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/node_guid
CPU 4
Modules linked in: i2c_dev sg sd_mod crc32c libcrc32c iscsi_tcp libiscsi
scsi_transport_iscsi rdma_ucm rdma_cm
iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_cxgb3 cxgb3
firmware_class mlx4_ib ib_mthca ib_mad
 ib_core loop numatools xpmem worm mlx4_core libata i2c_i801 scsi_mod i2c_core
shpchp pci_hotplug nfs lockd nfs
_acl af_packet sunrpc e1000
Pid: 3256, comm: star Tainted: G     U 2.6.16.60-0.34-smp #1
RIP: 0010:[<ffffffff882c2c5c>]
<ffffffff882c2c5c>{:ib_cm:ib_cm_init_qp_attr+580}
RSP: 0018:ffff810369d09d38  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810419678c00 RCX: 0000000000000008
RDX: 0000000000000246 RSI: ffff810419678d18 RDI: ffff810369d09e70
RBP: ffff810369d09e18 R08: 000000030000003d R09: 0000000000000000
R10: ffff810369d09e18 R11: 0000000000000088 R12: ffff810369d09d88
R13: 0000000000000000 R14: ffff810419678c80 R15: 00000000403500b0
FS:  0000000040354940(0063) GS:ffff810420ffbbc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000039f0c4000 CR4: 00000000000006e0
Process star (pid: 3256, threadinfo ffff810369d08000, task ffff8103b81b5830)
Stack: ffff810419678a00 ffff810369d09d88 ffff810369d09e18 ffff810369d09e18
       0000000040143430 ffffffff882fb6d5 ffff810376261540 ffff81040bea4740
       ffff810376261540 ffffffff88309285
Call Trace: <ffffffff882fb6d5>{:rdma_cm:rdma_init_qp_attr+209}
       <ffffffff88309285>{:rdma_ucm:ucma_init_qp_attr+160}
       <ffffffff802ea55a>{thread_return+0}
<ffffffff8830832e>{:rdma_ucm:ucma_write+115}
       <ffffffff80186662>{vfs_write+215} <ffffffff80186c2b>{sys_write+69}
      <ffffffff8010adba>{system_call+126}

Code: 8a 40 10 88 85 85 00 00 00 8b 83 38 01 00 00 66 89 45 7a 8a
RIP <ffffffff882c2c5c>{:ib_cm:ib_cm_init_qp_attr+580} RSP <ffff810369d09d38>


From a crash dump, I determined that we died in cm_init_qp_rts_attr()
(it's inline, so it doesn't show up in the traceback) on the line 
labeled below:


static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
                               struct ib_qp_attr *qp_attr,
                               int *qp_attr_mask)
{
        ........
        if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
                .....
        } else {
               *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
               qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <-die


cm_id_priv->alt_av.port is NULL, so it looks as if there's a race 
initializing 'alt_av'.
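
The fault address in the oops (CR2 = 0x10 with RAX = 0) is consistent with reading a one-byte field at offset 0x10 through a NULL pointer. A minimal sketch, assuming a 64-bit layout in which port_num follows two pointers: the struct definitions below are simplified stand-ins, not the actual ib_cm definitions, and the guard function is a hypothetical illustration rather than a proposed fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the CM port object (layout is an assumption). */
struct fake_cm_port {
	void *cm_dev;
	void *mad_agent;
	uint8_t port_num;
};

/* Simplified stand-in for the address vector embedded in the cm_id. */
struct fake_cm_av {
	struct fake_cm_port *port;
};

/* Hypothetical guard: refuse to load an alternate path without a port. */
static int fake_get_alt_port_num(const struct fake_cm_av *alt_av,
				 uint8_t *port_num)
{
	assert(alt_av != NULL && port_num != NULL);
	if (!alt_av->port)
		return -EINVAL;	/* this is where the NULL dereference occurred */
	*port_num = alt_av->port->port_num;
	return 0;
}
```

With 8-byte pointers, offsetof(struct fake_cm_port, port_num) is 16 (0x10), which matches the faulting address.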


They are running quite old code (OFED 1.3.1), but I'm not aware of 
anything which would change this behavior in more recent versions, 
though I certainly may have missed something.

Anyone seen similar? Any ideas for a fix, or workaround?

-- 
Arthur


From sean.hefty at intel.com  Wed Aug 12 16:20:41 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Aug 2009 16:20:41 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To: <20090812225926.GD24786@sgi.com>
References: <20090812225926.GD24786@sgi.com>
Message-ID: <F59ABB373B6F43E187BB9E63625CC261@amr.corp.intel.com>

>Call Trace: <ffffffff882fb6d5>{:rdma_cm:rdma_init_qp_attr+209}
>       <ffffffff88309285>{:rdma_ucm:ucma_init_qp_attr+160}
>       <ffffffff802ea55a>{thread_return+0}
><ffffffff8830832e>{:rdma_ucm:ucma_write+115}
>       <ffffffff80186662>{vfs_write+215} <ffffffff80186c2b>{sys_write+69}
>      <ffffffff8010adba>{system_call+126}

The rdma_cm is being used, so alternate path information is not used.

>static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
>                               struct ib_qp_attr *qp_attr,
>                               int *qp_attr_mask)
>{
>        ........
>        if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
>                .....
>        } else {
>               *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
>               qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <-die

The rdma_cm should always send us through the if portion, and I would expect
alt_av to be NULL.  Maybe the cm_id is corrupted..?  Is there any chance that
the remote side is trying to load an alternate path?  Getting the value of the
lap_state may help, to see if it's at least a valid lap_state value.

- Sean


From akepner at sgi.com  Wed Aug 12 16:14:08 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Aug 2009 16:14:08 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To: <F59ABB373B6F43E187BB9E63625CC261@amr.corp.intel.com>
References: <20090812225926.GD24786@sgi.com>
	<F59ABB373B6F43E187BB9E63625CC261@amr.corp.intel.com>
Message-ID: <20090812231408.GE24786@sgi.com>

On Wed, Aug 12, 2009 at 04:20:41PM -0700, Sean Hefty wrote:
> ....
> The rdma_cm should always send us through the if portion, and I would expect
> alt_av to be NULL.  Maybe the cm_id is corrupted..?  Is there any chance that
> the remote side is trying to load an alternate path?  Getting the value of the
> lap_state may help, to see if it's at least a valid lap_state value.
> 

Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.

-- 
Arthur


From sean.hefty at intel.com  Wed Aug 12 16:29:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Aug 2009 16:29:28 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To: <20090812231408.GE24786@sgi.com>
References: <20090812225926.GD24786@sgi.com>
	<F59ABB373B6F43E187BB9E63625CC261@amr.corp.intel.com>
	<20090812231408.GE24786@sgi.com>
Message-ID: <F5502132348443A9BB4AACAF1C609D2B@amr.corp.intel.com>

>Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.

Errr... not sure how that happened.  I don't know if ofed 1.3 has this feature
or not, but can you cat:

/sys/class/infiniband_cm/<device>/<port_num>/cm_tx_msgs/lap

if it exists?  Are both sides using the rdma_cm to communicate?  Does anything
in the app (either side) try to do something with alternate paths?

- Sean


From weiny2 at llnl.gov  Wed Aug 12 16:53:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 12 Aug 2009 16:53:20 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/libibnetdisc: remove all
 IBPANIC's and clean up error handling
Message-ID: <20090812165320.66ea08a5.weiny2@llnl.gov>

This patch applies after:

	libibnetdisc: fix potential memory leak of port object

which I sent last week, but I don't think it has made it upstream yet.

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 12 Aug 2009 16:13:56 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/src/chassis.c   |  124 ++++++++++++++++---------
 infiniband-diags/libibnetdisc/src/chassis.h   |    2 +-
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   75 ++++++++++-----
 3 files changed, 132 insertions(+), 69 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 76a02a6..efa4ed5 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -323,7 +323,7 @@ char spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0
 char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
 /*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
 
-static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
+static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;
 
@@ -345,12 +345,14 @@ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
 		n->ch_slotnum = spine4_slot_2_slb[lineport->portnum];
 		n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		node->node.guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			node->node.guid);
+		return (-1);
 	}
+	return (0);
 }
 
-static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
+static int get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;
 	uint64_t guessnum = 0;
@@ -385,12 +387,14 @@ static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		spineport->node->guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			spineport->node->guid);
+		return (-1);
 	}
+	return (0);
 }
 
-static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
+static int get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
 {
 	n->ch_slot = LINE_CS;
 	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
@@ -410,9 +414,11 @@ static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		spineport->node->guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			spineport->node->guid);
+		return (-1);
 	}
+	return (0);
 }
 
 /* forward declare this */
@@ -422,7 +428,7 @@ static void voltaire_portmap(ibnd_port_t *port);
 	It could be optimized so, but time overhead is very small
 	and its only diag.util
 */
-static void fill_voltaire_chassis_record(struct ibnd_node *node)
+static int fill_voltaire_chassis_record(struct ibnd_node *node)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;
 	int p = 0;
@@ -430,7 +436,7 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 	struct ibnd_node *remnode = 0;
 
 	if (node->ch_found) /* somehow this node has already been passed */
-		return;
+		return (0);
 	node->ch_found = 1;
 
 	/* node is router only in case of using unique lid */
@@ -456,7 +462,8 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 			}
 			if (!n->ch_type)
 				/* we assume here that remoteport belongs to line */
-				get_sfb_slot(node, port->remoteport);
+				if (get_sfb_slot(node, port->remoteport))
+					return (-1);
 
 				/* we could break here, but need to find if more routers connected */
 		}
@@ -467,7 +474,8 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 			if (!port || port->portnum > 12 || !port->remoteport)
 				continue;
 			/* we assume here that remoteport belongs to spine */
-			get_slb_slot(n, port->remoteport);
+			if (get_slb_slot(n, port->remoteport))
+				return (-1);
 			break;
 		}
 	}
@@ -480,15 +488,17 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 		voltaire_portmap(port);
 	}
 
-	return;
+	return (0);
 }
 
 static int get_line_index(ibnd_node_t *node)
 {
 	int retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum;
 
-	if (retval > LINES_MAX_NUM || retval < 1)
-		IBPANIC("Internal error");
+	if (retval > LINES_MAX_NUM || retval < 1) {
+		IBND_ERROR("Internal error\n");
+		return (-1);
+	}
 	return retval;
 }
 
@@ -501,34 +511,44 @@ static int get_spine_index(ibnd_node_t *node)
 	else
 		retval = node->ch_slotnum;
 
-	if (retval > SPINES_MAX_NUM || retval < 1)
-		IBPANIC("Internal error");
+	if (retval > SPINES_MAX_NUM || retval < 1) {
+		IBND_ERROR("Internal error\n");
+		return (-1);
+	}
 	return retval;
 }
 
-static void insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis)
+static int insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis)
 {
 	int i = get_line_index(node);
 
+	if (i < 0)
+		return (i);
+
 	if (chassis->linenode[i])
-		return;		/* already filled slot */
+		return (0);		/* already filled slot */
 
 	chassis->linenode[i] = node;
 	node->chassis = chassis;
+	return (0);
 }
 
-static void insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis)
+static int insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis)
 {
 	int i = get_spine_index(node);
 
+	if (i < 0)
+		return (i);
+
 	if (chassis->spinenode[i])
-		return;		/* already filled slot */
+		return (0);		/* already filled slot */
 
 	chassis->spinenode[i] = node;
 	node->chassis = chassis;
+	return (0);
 }
 
-static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis)
+static int pass_on_lines_catch_spines(ibnd_chassis_t *chassis)
 {
 	ibnd_node_t *node, *remnode;
 	ibnd_port_t *port;
@@ -549,12 +569,14 @@ static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis)
 
 			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
 				continue;	/* some error - spine not initialized ? FIXME */
-			insert_spine(remnode, chassis);
+			if (insert_spine(remnode, chassis))
+				return (-1);
 		}
 	}
+	return (0);
 }
 
-static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis)
+static int pass_on_spines_catch_lines(ibnd_chassis_t *chassis)
 {
 	ibnd_node_t *node, *remnode;
 	ibnd_port_t *port;
@@ -572,9 +594,11 @@ static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis)
 
 			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
 				continue;	/* some error - line/router not initialized ? FIXME */
-			insert_line_router(remnode, chassis);
+			if (insert_line_router(remnode, chassis))
+				return (-1);
 		}
 	}
+	return (0);
 }
 
 /*
@@ -602,14 +626,15 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t *chassis)
 	in that chassis
 	chassis structure = structure of one standalone chassis
 */
-static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
+static int build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
 {
 	int p = 0;
 	struct ibnd_node *remnode = 0;
 	ibnd_port_t *port = 0;
 
 	/* we get here with node = chassis_spine */
-	insert_spine((ibnd_node_t *)node, chassis);
+	if (insert_spine((ibnd_node_t *)node, chassis))
+		return (-1);
 
 	/* loop: pass on all ports of node */
 	for (p = 1; p <= node->node.numports; p++ ) {
@@ -624,17 +649,23 @@ static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
 		insert_line_router(&(remnode->node), chassis);
 	}
 
-	pass_on_lines_catch_spines(chassis);
+	if (pass_on_lines_catch_spines(chassis))
+		return (-1);
 	/* this pass needed for to catch routers, since routers connected only */
 	/* to spines in slot 1 or 4 and we could miss them first time */
-	pass_on_spines_catch_lines(chassis);
+	if (pass_on_spines_catch_lines(chassis))
+		return (-1);
 
 	/* additional 2 passes needed for to overcome a problem of pure "in-chassis" */
 	/* connectivity - extra pass to ensure that all related chips/modules */
 	/* inserted into the chassis */
-	pass_on_lines_catch_spines(chassis);
-	pass_on_spines_catch_lines(chassis);
+	if (pass_on_lines_catch_spines(chassis))
+		return (-1);
+	if (pass_on_spines_catch_lines(chassis))
+		return (-1);
 	pass_on_spines_interpolate_chguid(chassis);
+
+	return (0);
 }
 
 /*========================================================*/
@@ -724,10 +755,12 @@ voltaire_portmap(ibnd_port_t *port)
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }
 
-static void add_chassis(struct ibnd_fabric *fabric)
+static int add_chassis(struct ibnd_fabric *fabric)
 {
-	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t))))
-		IBPANIC("out of mem");
+	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
+		IBND_ERROR("OOM: failed to allocate chassis object\n");
+		return (-1);
+	}
 
 	if (fabric->first_chassis == NULL) {
 		fabric->first_chassis = fabric->current_chassis;
@@ -736,6 +769,7 @@ static void add_chassis(struct ibnd_fabric *fabric)
 		fabric->last_chassis->next = fabric->current_chassis;
 		fabric->last_chassis = fabric->current_chassis;
 	}
+	return (0);
 }
 
 static void
@@ -756,10 +790,9 @@ add_node_to_chassis(ibnd_chassis_t *chassis, ibnd_node_t *node)
 	3. pass on non Voltaire nodes (SystemImageGUID based grouping)
 	4. now group non Voltaire nodes by SystemImageGUID
 	Returns:
-	Pointer to the first chassis in a NULL terminated list of chassis in
-	the fabric specified.
+	0 on success, -1 on failure
 */
-ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
+int group_nodes(struct ibnd_fabric *fabric)
 {
 	struct ibnd_node *node;
 	int dist;
@@ -776,7 +809,8 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
-				fill_voltaire_chassis_record(node);
+				if (fill_voltaire_chassis_record(node))
+					return (-1);
 		}
 	}
 
@@ -791,9 +825,11 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 					|| (node->node.chassis && node->node.chassis->chassisnum)
 					|| !is_spine(node))
 				continue;
-			add_chassis(fabric);
+			if (add_chassis(fabric))
+				return (-1);
 			fabric->current_chassis->chassisnum = ++chassisnum;
-			build_chassis(node, fabric->current_chassis);
+			if (build_chassis(node, fabric->current_chassis))
+				return (-1);
 		}
 	}
 
@@ -809,7 +845,8 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 					chassis->nodecount++;
 				else {
 					/* Possible new chassis */
-					add_chassis(fabric);
+					if (add_chassis(fabric))
+						return (-1);
 					fabric->current_chassis->chassisguid =
 							get_chassisguid((ibnd_node_t *)node);
 					fabric->current_chassis->nodecount = 1;
@@ -842,5 +879,6 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 			dist++;
 	}
 
-	return (fabric->first_chassis);
+	fabric->fabric.chassis = fabric->first_chassis;
+	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h
index 16dad49..ecb21c9 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.h
+++ b/infiniband-diags/libibnetdisc/src/chassis.h
@@ -80,6 +80,6 @@
 enum ibnd_chassis_type { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT };
 enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };
 
-ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric);
+int group_nodes(struct ibnd_fabric *fabric);
 
 #endif	/* _CHASSIS_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 27ae9f3..6c31300 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -121,12 +121,13 @@ static int
 query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 	   struct ibnd_node *inode, struct ibnd_port *iport, ib_portid_t *portid)
 {
+	int rc = 0;
 	ibnd_node_t *node = &(inode->node);
 	ibnd_port_t *port = &(iport->port);
 	void *nd = inode->node.nodedesc;
 
-	if (query_node_info(ibmad_port, fabric, inode, portid))
-		return -1;
+	if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0)
+		return rc;
 
 	port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F);
 	port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F);
@@ -169,8 +170,10 @@ query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 static int
 add_port_to_dpath(ib_dr_path_t *path, int nextport)
 {
-	if (path->cnt+2 >= sizeof(path->p))
+	if (path->cnt+2 >= sizeof(path->p)) {
+		IBND_ERROR("DR path has grown too long\n");
 		return -1;
+	}
 	++path->cnt;
 	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
@@ -186,8 +189,10 @@ extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
 		/* If we were LID routed we need to set up the drslid */
 		if (!f->selfportid.lid)
 			if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
-					ibmad_port) < 0)
+					ibmad_port) < 0) {
+				IBND_ERROR("Failed to resolve self\n");
 				return -1;
+			}
 
 		portid->drpath.drslid = (uint16_t) f->selfportid.lid;
 		portid->drpath.drdlid = 0xFFFF;
@@ -413,8 +418,10 @@ create_node(struct ibnd_fabric *fabric, struct ibnd_node *temp, ib_portid_t *pat
 	struct ibnd_node *node;
 
 	node = malloc(sizeof(*node));
-	if (!node)
-		IBPANIC("OOM: node creation failed\n");
+	if (!node) {
+		IBND_ERROR("OOM: node creation failed\n");
+		return (NULL);
+	}
 
 	memcpy(node, temp, sizeof(*node));
 	node->node.dist = dist;
@@ -455,8 +462,10 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd
 	}
 
 	port = malloc(sizeof(*port));
-	if (!port)
+	if (!port) {
+		IBND_ERROR("Failed to allocate port\n");
 		return NULL;
+	}
 
 	memcpy(port, temp, sizeof(*port));
 	port->port.node = (ibnd_node_t *)node;
@@ -489,6 +498,7 @@ get_remote_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 		struct ibnd_node *node, struct ibnd_port *port, ib_portid_t *path,
 		int portnum, int dist)
 {
+	int rc = 0;
 	struct ibnd_node node_buf;
 	struct ibnd_port port_buf;
 	struct ibnd_node *remotenode, *oldnode;
@@ -501,43 +511,51 @@ get_remote_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 
 	if (mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F)
 			!= IB_PORT_PHYS_STATE_LINKUP)
-		return -1;
+		return 1; /* positive == non-fatal error */
 
 	if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
 		return -1;
 
 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
-		IBND_DEBUG("NodeInfo on %s failed, skipping port",
+		IBND_ERROR("Query remote node (%s) failed, skipping port\n",
 			   portid2str(path));
 		path->drpath.cnt--;	/* restore path */
-		return -1;
+		return 1; /* positive == non-fatal error */
 	}
 
 	oldnode = find_existing_node(fabric, &node_buf);
 	if (oldnode)
 		remotenode = oldnode;
-	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1)))
-		IBPANIC("no memory");
+	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) {
+		rc = -1;
+		goto error;
+	}
 
 	oldport = find_existing_port_node(remotenode, &port_buf);
 	if (oldport) {
 		remoteport = oldport;
-	} else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf)))
-		IBPANIC("no memory");
+	} else if (!(remoteport = add_port_to_node(fabric, remotenode,
+			&port_buf))) {
+		IBND_ERROR("OOM failed to add port to node\n");
+		rc = -1;
+		goto error;
+	}
 
 	dump_endnode(path, oldnode ? "known remote" : "new remote",
 			remotenode, remoteport);
 
 	link_ports(node, port, remotenode, remoteport);
 
+error:
 	path->drpath.cnt--;	/* restore path */
-	return 0;
+	return (rc);
 }
 
 ibnd_fabric_t *
 ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 			ib_portid_t *from, int hops)
 {
+	int rc = 0;
 	struct ibnd_fabric *fabric = NULL;
 	ib_portid_t my_portid = {0};
 	struct ibnd_node node_buf;
@@ -563,8 +581,10 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 
 	fabric = malloc(sizeof(*fabric));
 
-	if (!fabric)
-		IBPANIC("OOM: failed to malloc ibnd_fabric_t\n");
+	if (!fabric) {
+		IBND_ERROR("OOM: failed to malloc ibnd_fabric_t\n");
+		return (NULL);
+	}
 
 	memset(fabric, 0, sizeof(*fabric));
 
@@ -586,11 +606,14 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 
 	port = add_port_to_node(fabric, node, &port_buf);
 	if (!port)
-		IBPANIC("out of memory");
+		goto error;
 
-	if(get_remote_node(ibmad_port, fabric, node, port, from,
+	rc = get_remote_node(ibmad_port, fabric, node, port, from,
 				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
-				0) < 0)
+				0);
+	if (rc < 0)
+		goto error;
+	if (rc > 0) /* non-fatal error, nothing more to be done */
 		return ((ibnd_fabric_t *)fabric);
 
 	for (dist = 0; dist <= max_hops; dist++) {
@@ -608,7 +631,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 					continue;
 
 				if (get_port_info(ibmad_port, fabric, &port_buf, i, path)) {
-					IBND_DEBUG("can't reach node %s port %d", portid2str(path), i);
+					IBND_ERROR("can't reach node %s port %d", portid2str(path), i);
 					continue;
 				}
 
@@ -618,7 +641,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 
 				port = add_port_to_node(fabric, node, &port_buf);
 				if (!port)
-					IBPANIC("out of memory");
+					goto error;
 
 				/* If switch, set port GUID to node port GUID */
 				if (node->node.type == IB_NODE_SWITCH) {
@@ -626,13 +649,15 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 								0, IB_NODE_PORT_GUID_F);
 				}
 
-				get_remote_node(ibmad_port, fabric, node, port,
-						path, i, dist);
+				if (get_remote_node(ibmad_port, fabric, node, port,
+						path, i, dist) < 0)
+					goto error;
 			}
 		}
 	}
 
-	fabric->fabric.chassis = group_nodes(fabric);
+	if (group_nodes(fabric))
+		goto error;
 
 	return ((ibnd_fabric_t *)fabric);
 error:
-- 
1.5.4.5
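
The return-code convention the patch introduces in get_remote_node() (negative for a fatal error, positive for a non-fatal one, zero for success) can be sketched in isolation. This is a minimal illustration under stated assumptions: probe_port(), discover() and enum probe_result are hypothetical names, not part of libibnetdisc.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical outcomes of probing one port. */
enum probe_result { PROBE_OK, PROBE_LINK_DOWN, PROBE_OOM };

/* Returns 0 on success, > 0 for a skippable error, < 0 for a fatal one. */
static int probe_port(enum probe_result simulated)
{
	switch (simulated) {
	case PROBE_OK:
		return 0;
	case PROBE_LINK_DOWN:
		return 1;	/* non-fatal: caller skips this port */
	default:
		fprintf(stderr, "OOM: probe failed\n");
		return -1;	/* fatal: caller unwinds */
	}
}

/* Walks the ports; returns how many were probed, or -1 on a fatal error. */
static int discover(const enum probe_result *ports, int nports)
{
	int probed = 0;
	int i;

	assert(ports != NULL && nports >= 0);
	for (i = 0; i < nports; i++) {
		int rc = probe_port(ports[i]);

		if (rc < 0)
			return -1;	/* analogous to "goto error" */
		if (rc > 0)
			continue;	/* analogous to skipping the port */
		probed++;
	}
	return probed;
}
```

The point of the split is that a down link or an unreachable node no longer aborts discovery, while an allocation failure still propagates to the caller instead of terminating the process.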


From nashwath at gmail.com  Wed Aug 12 23:41:37 2009
From: nashwath at gmail.com (Ashwath Narasimhan)
Date: Thu, 13 Aug 2009 02:41:37 -0400
Subject: [ofa-general] Manipulating Credits in Infiniband
In-Reply-To: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>
References: <ed1288770908100911h46524f4ch34cc6582bb1c03b@mail.gmail.com>
	<20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>
Message-ID: <ed1288770908122341k33937aco5428ec7465acf90d@mail.gmail.com>

Dear Tom/all,
  I understand the end-to-end credit-based flow control at the link layer,
where a 32-bit flow control packet is sent for each VL (with FCCL and
FCTBS fields), but I fail to understand where this scheme is implemented
in the driver (OFED Linux 1.4 stack, hw-mthca). I can see a file with a
credit table mapped to different credit counts and another that computes
the AETH based on this credit table.
1. Is this the place where the flow control packets are formulated?
2. If yes, I don't see them computing this for each VL. Why? If no, is it a
mid-layer flow control?
3. And that's why I have this basic question: is the link layer implemented
as part of the OFED stack at all, or does it go into the HCA hardware as
firmware? As I understand it, the hardware vendor only provides verbs to
communicate with the HCA.
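
For orientation, here is a minimal sketch of the per-VL transmit check that
the FCTBS and FCCL fields feed. It is illustrative only and is not code from
OFED or mthca. The helper names (fc_blocks, fc_can_send) are made up, and the
constants (12-bit counters, 64-byte blocks, a 2048-block window) are
assumptions to verify against the IBA specification's link-layer flow
control section.

```c
#include <assert.h>
#include <stdint.h>

#define FC_MOD        4096u	/* FCTBS and FCCL are assumed to be 12-bit counters */
#define FC_WINDOW     2048u	/* assumed window size, in blocks */
#define FC_BLOCK_SIZE 64u	/* assumed credit unit: one 64-byte block */

/* Number of blocks a packet of len_bytes occupies, rounded up. */
uint32_t fc_blocks(uint32_t len_bytes)
{
	return (len_bytes + FC_BLOCK_SIZE - 1) / FC_BLOCK_SIZE;
}

/*
 * Transmit check for one VL.
 *   fctbs:      total blocks sent on this VL so far, modulo 4096
 *   fccl:       last credit limit received from the peer for this VL
 *   pkt_blocks: size of the packet we want to send, in blocks
 * Returns 1 if the packet may be sent, 0 if the sender must wait.
 */
int fc_can_send(uint32_t fctbs, uint32_t fccl, uint32_t pkt_blocks)
{
	uint32_t cr;

	assert(fctbs < FC_MOD && fccl < FC_MOD);
	cr = (fctbs + pkt_blocks) % FC_MOD;	/* blocks sent if we transmit */
	/* modular distance from cr up to the credit limit */
	return ((fccl + FC_MOD - cr) % FC_MOD) <= FC_WINDOW;
}
```

With these assumptions a 2048-byte packet needs 32 blocks, so
fc_can_send(0, 32, 32) allows it and fc_can_send(0, 31, 32) does not.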

Pardon me if I am burdening you all with a lot of questions. I am new to
all this and I am trying my best to understand the stack.
Thank you,
Ashwath


On Tue, Aug 11, 2009 at 10:37 PM, Nifty Tom Mitchell
<niftyompi at niftyegg.com> wrote:

> On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote:
> >
> >    I looked into the infiniband driver files. As I understand, in order to
> >    limit the data rate we manipulate the credits on either ends. Since the
> >    number of credits available depends on the receiver's work receive
> >    queue size, I decided to limit the queue size to say 5 instead of 8192
> >    (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my higher
> >    layer protocol is ipoib). I just want to confirm if I am doing the
> >    right thing?
>
> Data rate is not manipulated by credits.
> Credits and queue sizes are different and have different purposes.
>
> Visit the Infiniband Trade Association web site and grab the IB
> specifications to understand some of the hardware level parts.
>
>        http://www.infinibandta.org/
>
> InfiniBand offers credit based flow control and given the nature of
> modern IB switches and processors a very small credit count can still
> result in full data rate.    Having said that flow control is the lowest
> level throttle in the system.   Reducing the credit count forces the
> higher levels in the protocol stack to source or sink the data through
> the hardware before any more can be delivered.   Thus flow control can
> simplify the implementation of higher level protocols.   It can also be
> used to cost reduce or simplify hardware design (smaller hardware buffers).
>
> The IB specifications are way too long.  Start with this FAQ.
>
>       http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
>
> The IB specification is way too full of optional features.  A vendor may
> have XYZ working fine and dandy on one card and since it is optional not
> at all on another.
>
> The various queue sizes for the various protocols built on top of
> IB establish transfer behavior in keeping with system interrupt,
> system process time slice, system kernel activity loads and needs.
> It is counter intuitive but in some cases small queues result in
> more responsive and agile systems, especially in the presence of errors.
>
> Since there are often multiple protocols on the IB stack all protocols
> will be impacted by credit tinkering.  Most vendors know their hardware
> so most drivers will have credit related code optimum.
>
> In the case of TCP/IP the interaction between IB bandwidths&MTU (IPoIB),
> ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU can
> be "interesting" depending on host names, subnets, routing etc.   TCP/IP
> has lots of tuning flags well above the IB driver.   I see 500+ net.*
> sysctl knobs on this system.
>
> As you change things do make the changes on all the moving parts, benchmark
> and keep a log.   Since there are multiple IB hardware vendors
> it is important to track hardware specifics.  "lspci" is a good tool
> to gather chip info.   With some cards you also need specifics about
> the active firmware.
>
> So go forth (RPN forever) and conquer.
>
>
> --
>        T o m  M i t c h e l l
>        Found me a new hat, now what?
>
>


-- 
regards,
Ashwath

From vlad at lists.openfabrics.org  Thu Aug 13 03:05:38 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 13 Aug 2009 03:05:38 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090813-0200 daily build status
Message-ID: <20090813100538.76A16E28273@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Thu Aug 13 04:36:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:36:20 +0300
Subject: [ofa-general] Re: [PATCH] opensm/complib: account for nsec overflow
 in timeout values
In-Reply-To: <20090806183716.c08bbea3.weiny2@llnl.gov>
References: <20090806183716.c08bbea3.weiny2@llnl.gov>
Message-ID: <20090813113620.GV25501@me>

Hi Ira,

On 18:37 Thu 06 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 6 Aug 2009 18:31:46 -0700
> Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  opensm/complib/cl_event.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
> index d14b2f4..4bc8d37 100644
> --- a/opensm/complib/cl_event.c
> +++ b/opensm/complib/cl_event.c
> @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
>  	} else {
>  		/* Get the current time */
>  		if (gettimeofday(&curtime, NULL) == 0) {
> -			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> -			timeout.tv_nsec =
> -			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> +			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))

Do you really need fixed size (uint32_t) variable here?

> +						* 1000;
> +			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
> +						+ (n_sec % 1000000000);

Did you mean (n_sec / 1000000000)?

Sasha

> +			timeout.tv_nsec = n_sec % 1000000000;
>  
>  			wait_ret = pthread_cond_timedwait(&p_event->condvar,
>  							  &p_event->mutex,
> -- 
> 1.5.4.5
> 


From sashak at voltaire.com  Thu Aug 13 04:41:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:41:04 +0300
Subject: [ofa-general] Re: [PATCH] libibnetdisc: fix potential memory leak of
	port object
In-Reply-To: <20090807090703.2b857dea.weiny2@llnl.gov>
References: <20090807090703.2b857dea.weiny2@llnl.gov>
Message-ID: <20090813114104.GW25501@me>

On 09:07 Fri 07 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Fri, 7 Aug 2009 09:05:44 -0700
> Subject: [PATCH] libibnetdisc: fix potential memory leak of port object
> 
> 	NOTE: This moves the port allocation below the port array allocation
> 	failure rather than free the port allocation after port array
> 	allocation fails.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Aug 13 04:49:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:49:10 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/libibnetdisc: remove all
 IBPANIC's and clean up error handling
In-Reply-To: <20090812165320.66ea08a5.weiny2@llnl.gov>
References: <20090812165320.66ea08a5.weiny2@llnl.gov>
Message-ID: <20090813114910.GX25501@me>

On 16:53 Wed 12 Aug     , Ira Weiny wrote:
> This patch applies after:
> 
> 	libibnetdisc: fix potential memory leak of port object
> 
> Which I sent last week but I don't think has made it up stream.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 12 Aug 2009 16:13:56 -0700
> Subject: [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Aug 13 05:28:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 15:28:11 +0300
Subject: [ofa-general] [PATCH] libibnetdisc/ibnetdisc.c: typo fix
In-Reply-To: <20090812165320.66ea08a5.weiny2@llnl.gov>
References: <20090812165320.66ea08a5.weiny2@llnl.gov>
Message-ID: <20090813122811.GY25501@me>


Fix statement completion typo ',' -> ';'.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 6c31300..b4bf52d 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -213,7 +213,7 @@ dump_endnode(ib_portid_t *path, char *prompt,
 	if (!show_progress)
 		return;
 
-	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)),
+	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int));
 
 	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
 		portid2str(path), prompt, type,
-- 
1.6.4
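
The typo compiled without complaint because a trailing ',' is the C comma
operator: the call before it and the printf() after it were parsed as one
expression statement, evaluated left to right, so behavior was unchanged.
A small standalone sketch of the same effect (side_effect() and demo() are
made-up names, not libibnetdisc code):

```c
#include <assert.h>

int calls;	/* counts how many times side_effect() ran */

int side_effect(void)
{
	return ++calls;
}

int demo(void)
{
	int last = 0;

	/*
	 * Note the ',' where a ';' was intended.  The comma operator
	 * evaluates the first call, discards its value, then evaluates
	 * the assignment, all as a single statement.
	 */
	side_effect(),
	last = side_effect();

	assert(calls >= 2);	/* both calls ran, in order */
	return last;
}
```

Since both operands still run in order, the fix is about readability and
about not surprising the next person who inserts a line between them.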


From sashak at voltaire.com  Thu Aug 13 05:51:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 15:51:25 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: In
 osm_mcast_tbl_get_block, eliminate unneeded check
In-Reply-To: <20090812132247.GA15084@comcast.net>
References: <20090812132247.GA15084@comcast.net>
Message-ID: <20090813125125.GZ25501@me>

On 09:22 Wed 12 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From weiny2 at llnl.gov  Thu Aug 13 09:06:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 09:06:02 -0700
Subject: [ofa-general] Re: [PATCH v2] opensm/complib: account for nsec
 overflow in timeout values
In-Reply-To: <20090813113620.GV25501@me>
References: <20090806183716.c08bbea3.weiny2@llnl.gov>
	<20090813113620.GV25501@me>
Message-ID: <20090813090602.226b2695.weiny2@llnl.gov>

On Thu, 13 Aug 2009 14:36:20 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 18:37 Thu 06 Aug     , Ira Weiny wrote:
> > 
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Thu, 6 Aug 2009 18:31:46 -0700
> > Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
> > 
> > 
> > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> > ---
> >  opensm/complib/cl_event.c |    8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
> > index d14b2f4..4bc8d37 100644
> > --- a/opensm/complib/cl_event.c
> > +++ b/opensm/complib/cl_event.c
> > @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
> >  	} else {
> >  		/* Get the current time */
> >  		if (gettimeofday(&curtime, NULL) == 0) {
> > -			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> > -			timeout.tv_nsec =
> > -			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> > +			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
> 
> Do you really need fixed size (uint32_t) variable here?

Well, I need at least int32_t.  I chose unsigned because we are not trying to
go back in time.  I don't like leaving this as "int".  As rare as it might be,
a compiler could choose 16 bits for an int, and that is not big enough, right?

> 
> > +						* 1000;
> > +			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
> > +						+ (n_sec % 1000000000);
> 
> Did you mean (n_sec / 1000000000)?

<sigh> yes... :-(

New patch below,
Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 6 Aug 2009 18:31:46 -0700
Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 opensm/complib/cl_event.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
index d14b2f4..3f17262 100644
--- a/opensm/complib/cl_event.c
+++ b/opensm/complib/cl_event.c
@@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
 	} else {
 		/* Get the current time */
 		if (gettimeofday(&curtime, NULL) == 0) {
-			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
-			timeout.tv_nsec =
-			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
+			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
+						* 1000;
+			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
+						+ (n_sec / 1000000000);
+			timeout.tv_nsec = n_sec % 1000000000;
 
 			wait_ret = pthread_cond_timedwait(&p_event->condvar,
 							  &p_event->mutex,
-- 
1.5.4.5
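
The carry logic in the hunk above can be exercised in isolation. This is a
minimal sketch, not the complib code: abs_timeout() is a hypothetical helper
name, and wait_us is assumed to be a 32-bit microsecond count.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper: absolute timespec that lies wait_us microseconds
 * after *now, with any nanosecond overflow carried into tv_sec.
 */
struct timespec abs_timeout(const struct timeval *now, uint32_t wait_us)
{
	struct timespec ts;
	/*
	 * tv_usec is below 1e6 and so is (wait_us % 1000000), so the sum is
	 * below 2e6 us, i.e. below 2e9 ns, which still fits in 32 bits.
	 */
	uint32_t n_sec = ((uint32_t)now->tv_usec + (wait_us % 1000000)) * 1000;

	ts.tv_sec = now->tv_sec + (wait_us / 1000000) + (n_sec / 1000000000);
	ts.tv_nsec = n_sec % 1000000000;

	/* pthread_cond_timedwait() rejects tv_nsec outside [0, 1e9) */
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000);
	return ts;
}
```

For example, starting at 100 s + 900000 us and waiting 2300000 us gives
1200000 us of fractional time, which normalizes to 103 s + 200000000 ns.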


From sean.hefty at intel.com  Thu Aug 13 11:19:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 11:19:00 -0700
Subject: [ofa-general] will opensm respond to requests that do not originate
	from qp1
Message-ID: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>

Does anyone know off the top of their heads if opensm will respond correctly to
SA MADs that are not sent from QP1?

- Sean


From hal.rosenstock at gmail.com  Thu Aug 13 12:41:11 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 13 Aug 2009 15:41:11 -0400
Subject: [ofa-general] will opensm respond to requests that do not 
	originate from qp1
In-Reply-To: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
Message-ID: <f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>

On 8/13/09, Sean Hefty <sean.hefty at intel.com> wrote:
>
> Does anyone know off the top of their heads if opensm will respond
> correctly to
> SA MADs that are not sent from QP1?


I don't have the code in front of me right now (I can validate tomorrow) but
don't think that should be a problem as for responses it just takes the
incoming source QP and uses that for the dest QP. Are you suspecting some
issue here ?

-- Hal

> - Sean

From sean.hefty at intel.com  Thu Aug 13 12:51:52 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 12:51:52 -0700
Subject: [ofa-general] will opensm respond to requests that do not
	originate from qp1
In-Reply-To: <f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
Message-ID: <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>

>I don't have the code in front of me right now (I can validate tomorrow) but
>don't think that should be a problem as for responses it just takes the
>incoming source QP and uses that for the dest QP. Are you suspecting some issue
>here ?

I just wanted to verify it before going too far down the path of sending MADs to
the SA on a different QP.  I have nothing that indicates any issue.  Not being
familiar with the opensm code, nothing jumped out at me as the place to look,
but between your reply and Jim's, I'm good assuming that it'll work.  Thanks.

- Sean


From jgunthorpe at obsidianresearch.com  Thu Aug 13 13:00:23 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 13 Aug 2009 14:00:23 -0600
Subject: [ofa-general] will opensm respond to requests that do not
	originate from qp1
In-Reply-To: <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
	<6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
Message-ID: <20090813200023.GO16677@obsidianresearch.com>

On Thu, Aug 13, 2009 at 12:51:52PM -0700, Sean Hefty wrote:
> >I don't have the code in front of me right now (I can validate tomorrow) but
> >don't think that should be a problem as for responses it just takes the
> >incoming source QP and uses that for the dest QP. Are you suspecting some issue
> >here ?
> 
> I just wanted to verify it before going too far down the path of sending MADs to
> the SA on a different QP.  I have nothing that indicates any issue.  Not being
> familiar with the opensm code, nothing jumped out at me as the place to look,
> but between your reply and Jim's, I'm good assuming that it'll work.  Thanks.

Speaking of which, do we have an API to get the node's SM_Key for SA
packet construction?

Jason


From sean.hefty at intel.com  Thu Aug 13 13:14:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 13:14:19 -0700
Subject: [ofa-general] will opensm respond to requests that do
	not	originate from qp1
In-Reply-To: <20090813200023.GO16677@obsidianresearch.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
	<6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
	<20090813200023.GO16677@obsidianresearch.com>
Message-ID: <F4A03316B00D4CBD9334B87E8C51A8B2@amr.corp.intel.com>

>Speaking of which, do we have an API to get the node's SM_Key for SA
>packet construction?

Not that I'm aware of.  The ib-diags take the smkey as a command line option.

- Sean


From rdreier at cisco.com  Thu Aug 13 13:25:04 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 13 Aug 2009 13:25:04 -0700
Subject: [ofa-general] Re: [PATCH v3] ib/core: fix for send multicast group
	send leave retry
In-Reply-To: <48A06A66.7070605@Voltaire.COM> (Yossi Etigin's message of "Mon, 
	11 Aug 2008 19:35:50 +0300")
References: <48A06A66.7070605@Voltaire.COM>
Message-ID: <adaskfv4g2n.fsf@cisco.com>

thanks, applied at long long last.


From rdreier at cisco.com  Thu Aug 13 13:29:18 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 13 Aug 2009 13:29:18 -0700
Subject: [ofa-general] [PATCH] uverbs: return ENOSYS for unimplemented
	commands (not EINVAL)
In-Reply-To: <200812021943.44732.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 2 Dec 2008 19:43:44 +0200")
References: <200812021943.44732.jackm@dev.mellanox.co.il>
Message-ID: <adaocqj4fvl.fsf@cisco.com>

after meditating about this, I really think this is the right approach.
So I applied this patch.
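
A minimal sketch of the distinction the subject line draws, assuming a
dispatcher that checks a per-device command mask. All names here (cmd_table,
dispatch, CMD_QUERY, CMD_RESIZE) are hypothetical and this is not the uverbs
code itself:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum { CMD_QUERY, CMD_RESIZE, CMD_MAX };

typedef int (*cmd_handler)(void);

int do_query(void)  { return 0; }
int do_resize(void) { return 0; }

/* Every command the core knows about, whether or not a device supports it. */
cmd_handler cmd_table[CMD_MAX] = { do_query, do_resize };

/*
 * dev_cmd_mask has bit N set if the device implements command N.
 * Unknown command numbers are a caller error (-EINVAL); known commands
 * that this device lacks are legal but unimplemented (-ENOSYS).
 */
int dispatch(uint64_t dev_cmd_mask, unsigned int cmd)
{
	assert(CMD_MAX <= 64);	/* command numbers must fit in the mask */

	if (cmd >= CMD_MAX || !cmd_table[cmd])
		return -EINVAL;
	if (!(dev_cmd_mask & (1ULL << cmd)))
		return -ENOSYS;
	return cmd_table[cmd]();
}
```

The point of separating the two codes is that userspace can tell "you
passed garbage" apart from "this hardware cannot do that" and fall back
accordingly.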


From jgunthorpe at obsidianresearch.com  Thu Aug 13 14:09:24 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 13 Aug 2009 15:09:24 -0600
Subject: [ofa-general] will opensm respond to requests that do not
	originate from qp1
In-Reply-To: <F4A03316B00D4CBD9334B87E8C51A8B2@amr.corp.intel.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
	<6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
	<20090813200023.GO16677@obsidianresearch.com>
	<F4A03316B00D4CBD9334B87E8C51A8B2@amr.corp.intel.com>
Message-ID: <20090813210924.GQ16677@obsidianresearch.com>

On Thu, Aug 13, 2009 at 01:14:19PM -0700, Sean Hefty wrote:
> >Speaking of which, do we have an API to get the node's SM_Key for SA
> >packet construction?
> 
> Not that I'm aware of.  The ib-diags take the smkey as a command line option.

Hmm, and the kernel wires it to zero. That's uncool.

So, any process that can create a QP can alter, say, the node's
multicast group membership.

That's a bit of a security problem.

I admit though, I haven't been able to discern what the SM_Key should
be set to from the spec.

-- 
Jason Gunthorpe <jgunthorpe at obsidianresearch.com>        (780)4406067x832
Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada


From weiny2 at llnl.gov  Thu Aug 13 20:42:36 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:42:36 -0700
Subject: [ofa-general] [PATCH 0/5] Further clean up of libibnetdisc interface
Message-ID: <20090813204236.36a161f3.weiny2@llnl.gov>

The following patches clean up the libibnetdisc interface.  There are three
main reasons for these changes.

   1) there were some problems with having the structures split between
      internal and external data.  (I thought I was being clever but it is
      not worth it.)
   2) I have, waiting in the wings, a multi-threaded implementation which
      further improves performance, especially on a fabric with problems
      (unresponsive nodes etc).  These patches lay the groundwork for some of
      the changes I will need for this implementation.
   3) I would really like to get the interface changed before this goes out
      with OFED 1.5 or any infiniband-diags release.

I have split the patches up into chunks which I think are pretty manageable.
Let me know if there are issues or you prefer combined patches.
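
To illustrate reason 1, here is a reduced sketch of the two layouts. The
type and function names are invented for the example, and CONV_OLD() only
assumes that the existing CONV_NODE_INTERNAL() macro is a plain cast, which
is not shown in this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old style: public struct embedded at the start of an internal wrapper. */
struct pub_node_old {
	uint64_t guid;
};

struct int_node_old {
	struct pub_node_old node;	/* must remain the first member */
	unsigned char ch_found;		/* internal only */
};

/* Assumed shape of the conversion macro: a plain pointer cast. */
#define CONV_OLD(p) ((struct int_node_old *)(p))

/* New style: one struct, internal fields flagged by a comment. */
typedef struct node_new {
	uint64_t guid;
	/* internal use only */
	unsigned char ch_found;
} node_new_t;

void mark_found_old(struct pub_node_old *pub)
{
	/* The cast is only valid while `node` sits at offset 0. */
	assert(offsetof(struct int_node_old, node) == 0);
	CONV_OLD(pub)->ch_found = 1;
}

void mark_found_new(node_new_t *n)
{
	n->ch_found = 1;	/* no cast, no n->node.<field> indirection */
}
```

The old layout hides internal fields from users but makes every internal
access depend on a cast and on member ordering. The new one gives that up
in exchange for simpler code.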

Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From weiny2 at llnl.gov  Thu Aug 13 20:42:42 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:42:42 -0700
Subject: [ofa-general] [PATCH 1/5] libibnetdisc: make all fields of
	ibnd_node_t public
Message-ID: <20090813204242.b659d8f5.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Tue, 11 Aug 2009 15:15:21 -0700
Subject: [PATCH] libibnetdisc: make all fields of ibnd_node_t public


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   12 +-
 infiniband-diags/libibnetdisc/src/chassis.c        |  147 ++++++++---------
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |  173 ++++++++++----------
 infiniband-diags/libibnetdisc/src/internal.h       |   22 +--
 4 files changed, 166 insertions(+), 188 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 121709d..e7f5f6a 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -45,8 +45,8 @@ struct port;			/* forward declare */
 /** =========================================================================
  * Node
  */
-typedef struct node {
-	struct node *next;	/* all node list in fabric */
+typedef struct ibnd_node {
+	struct ibnd_node *next;	/* all node list in fabric */
 
 	ib_portid_t path_portid;	/* path from "from_node" */
 	int dist;		/* num of hops from "from_node" */
@@ -72,12 +72,18 @@ typedef struct node {
 				   items MAY BE NULL!  (ie 0 == switches only) */
 
 	/* chassis info */
-	struct node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
+	struct ibnd_node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
 	struct chassis *chassis;	/* if != NULL the chassis this node belongs to */
 	unsigned char ch_type;
 	unsigned char ch_anafanum;
 	unsigned char ch_slotnum;
 	unsigned char ch_slot;
+
+	/* internal use only */
+	unsigned char ch_found;
+	struct ibnd_node *htnext;	/* hash table list */
+	struct ibnd_node *dnext;	/* nodesdist next */
+	struct ibnd_node *type_next;	/* next based on type */
 } ibnd_node_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 120b4b6..0dd259a 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -239,68 +239,68 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
 		return 0;
 }
 
-static int is_router(struct ibnd_node *n)
+static int is_router(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_IB_FC_ROUTER ||
 		devid == VTR_DEVID_IB_IP_ROUTER);
 }
 
-static int is_spine_9096(struct ibnd_node *n)
+static int is_spine_9096(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SFB4 || devid == VTR_DEVID_SFB4_DDR);
 }
 
-static int is_spine_9288(struct ibnd_node *n)
+static int is_spine_9288(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SFB12 || devid == VTR_DEVID_SFB12_DDR);
 }
 
-static int is_spine_2004(struct ibnd_node *n)
+static int is_spine_2004(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SFB2004);
 }
 
-static int is_spine_2012(struct ibnd_node *n)
+static int is_spine_2012(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SFB2012);
 }
 
-static int is_spine(struct ibnd_node *n)
+static int is_spine(ibnd_node_t * n)
 {
 	return (is_spine_9096(n) || is_spine_9288(n) ||
 		is_spine_2004(n) || is_spine_2012(n));
 }
 
-static int is_line_24(struct ibnd_node *n)
+static int is_line_24(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
-	return (devid == VTR_DEVID_SLB24 || devid == VTR_DEVID_SLB24_DDR ||
-		devid == VTR_DEVID_SRB2004);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
+	return (devid == VTR_DEVID_SLB24 ||
+		devid == VTR_DEVID_SLB24_DDR || devid == VTR_DEVID_SRB2004);
 }
 
-static int is_line_8(struct ibnd_node *n)
+static int is_line_8(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SLB8);
 }
 
-static int is_line_2024(struct ibnd_node *n)
+static int is_line_2024(ibnd_node_t * n)
 {
-	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
+	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
 	return (devid == VTR_DEVID_SLB2024);
 }
 
-static int is_line(struct ibnd_node *n)
+static int is_line(ibnd_node_t * n)
 {
 	return (is_line_24(n) || is_line_8(n) || is_line_2024(n));
 }
 
-int is_chassis_switch(struct ibnd_node *n)
+int is_chassis_switch(ibnd_node_t * n)
 {
 	return (is_spine(n) || is_line(n));
 }
@@ -349,7 +349,7 @@ char anafa_spine4_slot_2_slb[25] = {
 
 /*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
 
-static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport)
+static int get_sfb_slot(ibnd_node_t * node, ibnd_port_t * lineport)
 {
 	ibnd_node_t *n = (ibnd_node_t *) node;
 
@@ -372,25 +372,24 @@ static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport)
 		n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
 	} else {
 		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
-			   node->node.guid);
+			   node->guid);
 		return (-1);
 	}
 	return (0);
 }
 
-static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
+static int get_router_slot(ibnd_node_t * n, ibnd_port_t * spineport)
 {
-	ibnd_node_t *n = (ibnd_node_t *) node;
 	uint64_t guessnum = 0;
 
-	node->ch_found = 1;
+	n->ch_found = 1;
 
 	n->ch_slot = SRBD_CS;
-	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
+	if (is_spine_9096(spineport->node)) {
 		n->ch_type = ISR9096_CT;
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
-	} else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_9288(spineport->node)) {
 		n->ch_type = ISR9288_CT;
 		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
 		/* this is a smart guess based on nodeguids order on sFB-12 module */
@@ -399,7 +398,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
 		/* module 2 <--> remote anafa 2 */
 		/* module 3 <--> remote anafa 1 */
 		n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
-	} else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_2012(spineport->node)) {
 		n->ch_type = ISR2012_CT;
 		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
 		/* this is a smart guess based on nodeguids order on sFB-12 module */
@@ -408,7 +407,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
 		// module 2 <--> remote anafa 2
 		// module 3 <--> remote anafa 1
 		n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
-	} else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_2004(spineport->node)) {
 		n->ch_type = ISR2004_CT;
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
@@ -423,19 +422,19 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
 static int get_slb_slot(ibnd_node_t * n, ibnd_port_t * spineport)
 {
 	n->ch_slot = LINE_CS;
-	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
+	if (is_spine_9096(spineport->node)) {
 		n->ch_type = ISR9096_CT;
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
-	} else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_9288(spineport->node)) {
 		n->ch_type = ISR9288_CT;
 		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
-	} else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_2012(spineport->node)) {
 		n->ch_type = ISR2012_CT;
 		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
-	} else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) {
+	} else if (is_spine_2004(spineport->node)) {
 		n->ch_type = ISR2004_CT;
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
@@ -454,12 +453,11 @@ static void voltaire_portmap(ibnd_port_t * port);
 	It could be optimized so, but time overhead is very small
 	and its only diag.util
 */
-static int fill_voltaire_chassis_record(struct ibnd_node *node)
+static int fill_voltaire_chassis_record(ibnd_node_t * node)
 {
-	ibnd_node_t *n = (ibnd_node_t *) node;
 	int p = 0;
 	ibnd_port_t *port;
-	struct ibnd_node *remnode = 0;
+	ibnd_node_t *remnode = 0;
 
 	if (node->ch_found)	/* somehow this node has already been passed */
 		return (0);
@@ -470,25 +468,23 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node)
 	/* in such case node->ports is actually a requested port... */
 	if (is_router(node)) {
 		/* find the remote node */
-		for (p = 1; p <= node->node.numports; p++) {
-			port = node->node.ports[p];
-			if (port &&
-			    is_spine(CONV_NODE_INTERNAL
-				     (port->remoteport->node)))
+		for (p = 1; p <= node->numports; p++) {
+			port = node->ports[p];
+			if (port && is_spine(port->remoteport->node))
 				get_router_slot(node, port->remoteport);
 		}
 	} else if (is_spine(node)) {
-		for (p = 1; p <= node->node.numports; p++) {
-			port = node->node.ports[p];
+		for (p = 1; p <= node->numports; p++) {
+			port = node->ports[p];
 			if (!port || !port->remoteport)
 				continue;
-			remnode = CONV_NODE_INTERNAL(port->remoteport->node);
-			if (remnode->node.type != IB_NODE_SWITCH) {
+			remnode = port->remoteport->node;
+			if (remnode->type != IB_NODE_SWITCH) {
 				if (!remnode->ch_found)
 					get_router_slot(remnode, port);
 				continue;
 			}
-			if (!n->ch_type)
+			if (!node->ch_type)
 				/* we assume here that remoteport belongs to line */
 				if (get_sfb_slot(node, port->remoteport))
 					return (-1);
@@ -497,20 +493,20 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node)
 		}
 
 	} else if (is_line(node)) {
-		for (p = 1; p <= node->node.numports; p++) {
-			port = node->node.ports[p];
+		for (p = 1; p <= node->numports; p++) {
+			port = node->ports[p];
 			if (!port || port->portnum > 12 || !port->remoteport)
 				continue;
 			/* we assume here that remoteport belongs to spine */
-			if (get_slb_slot(n, port->remoteport))
+			if (get_slb_slot(node, port->remoteport))
 				return (-1);
 			break;
 		}
 	}
 
 	/* for each port of this node, map external ports */
-	for (p = 1; p <= node->node.numports; p++) {
-		port = node->node.ports[p];
+	for (p = 1; p <= node->numports; p++) {
+		port = node->ports[p];
 		if (!port)
 			continue;
 		voltaire_portmap(port);
@@ -534,8 +530,7 @@ static int get_spine_index(ibnd_node_t * node)
 {
 	int retval;
 
-	if (is_spine_9288(CONV_NODE_INTERNAL(node))
-	    || is_spine_2012(CONV_NODE_INTERNAL(node)))
+	if (is_spine_9288(node) || is_spine_2012(node))
 		retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum;
 	else
 		retval = node->ch_slotnum;
@@ -586,7 +581,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis)
 	for (i = 1; i <= LINES_MAX_NUM; i++) {
 		node = chassis->linenode[i];
 
-		if (!(node && is_line(CONV_NODE_INTERNAL(node))))
+		if (!(node && is_line(node)))
 			continue;	/* empty slot or router */
 
 		for (p = 1; p <= node->numports; p++) {
@@ -596,7 +591,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis)
 
 			remnode = port->remoteport->node;
 
-			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
+			if (!remnode->ch_found)
 				continue;	/* some error - spine not initialized ? FIXME */
 			if (insert_spine(remnode, chassis))
 				return (-1);
@@ -621,7 +616,7 @@ static int pass_on_spines_catch_lines(ibnd_chassis_t * chassis)
 				continue;
 			remnode = port->remoteport->node;
 
-			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
+			if (!remnode->ch_found)
 				continue;	/* some error - line/router not initialized ? FIXME */
 			if (insert_line_router(remnode, chassis))
 				return (-1);
@@ -655,10 +650,10 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t * chassis)
 	in that chassis
 	chassis structure = structure of one standalone chassis
 */
-static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis)
+static int build_chassis(ibnd_node_t * node, ibnd_chassis_t * chassis)
 {
 	int p = 0;
-	struct ibnd_node *remnode = 0;
+	ibnd_node_t *remnode = 0;
 	ibnd_port_t *port = 0;
 
 	/* we get here with node = chassis_spine */
@@ -666,16 +661,16 @@ static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis)
 		return (-1);
 
 	/* loop: pass on all ports of node */
-	for (p = 1; p <= node->node.numports; p++) {
-		port = node->node.ports[p];
+	for (p = 1; p <= node->numports; p++) {
+		port = node->ports[p];
 		if (!port || !port->remoteport)
 			continue;
-		remnode = CONV_NODE_INTERNAL(port->remoteport->node);
+		remnode = port->remoteport->node;
 
 		if (!remnode->ch_found)
 			continue;	/* some error - line or router not initialized ? FIXME */
 
-		insert_line_router(&(remnode->node), chassis);
+		insert_line_router(remnode, chassis);
 	}
 
 	if (pass_on_lines_catch_spines(chassis))
@@ -764,13 +759,11 @@ int int2ext_map_slb2024[2][25] = {
 /* map internal ports to external ports if appropriate */
 static void voltaire_portmap(ibnd_port_t * port)
 {
-	struct ibnd_node *n = CONV_NODE_INTERNAL(port->node);
 	int portnum = port->portnum;
 	int chipnum = 0;
 	ibnd_node_t *node = port->node;
 
-	if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node))
-	    || (portnum < 13 || portnum > 24)) {
+	if (!node->ch_found || !is_line(node) || (portnum < 13 || portnum > 24)) {
 		port->ext_portnum = 0;
 		return;
 	}
@@ -782,9 +775,9 @@ static void voltaire_portmap(ibnd_port_t * port)
 
 	chipnum = port->node->ch_anafanum - 1;
 
-	if (is_line_24(CONV_NODE_INTERNAL(node)))
+	if (is_line_24(node))
 		port->ext_portnum = int2ext_map_slb24[chipnum][portnum];
-	else if (is_line_2024(CONV_NODE_INTERNAL(node)))
+	else if (is_line_2024(node))
 		port->ext_portnum = int2ext_map_slb2024[chipnum][portnum];
 	else
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
@@ -828,7 +821,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
 */
 int group_nodes(struct ibnd_fabric *fabric)
 {
-	struct ibnd_node *node;
+	ibnd_node_t *node;
 	int dist;
 	int chassisnum = 0;
 	ibnd_chassis_t *chassis;
@@ -842,7 +835,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* not very efficient but clear code so... */
 	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
-			if (mad_get_field(node->node.info, 0,
+			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				if (fill_voltaire_chassis_record(node))
 					return (-1);
@@ -853,13 +846,11 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* algorithm: catch spine and find all surrounding nodes */
 	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
-			if (mad_get_field(node->node.info, 0,
+			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
 				continue;
-			//if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node))
 			if (!node->ch_found
-			    || (node->node.chassis
-				&& node->node.chassis->chassisnum)
+			    || (node->chassis && node->chassis->chassisnum)
 			    || !is_spine(node))
 				continue;
 			if (add_chassis(fabric))
@@ -874,10 +865,10 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* grouped by common SystemImageGUID */
 	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
-			if (mad_get_field(node->node.info, 0,
+			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
-			if (mad_get_field64(node->node.info, 0,
+			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
 				    find_chassisguid(fabric,
@@ -901,10 +892,10 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* (defined as chassis->nodecount > 1) */
 	for (dist = 0; dist <= MAXHOPS;) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
-			if (mad_get_field(node->node.info, 0,
+			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
-			if (mad_get_field64(node->node.info, 0,
+			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
 				    find_chassisguid(fabric,
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index b33be8d..b883d4a 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -98,18 +98,17 @@ static int get_port_info(struct ibmad_port *ibmad_port,
  * Returns -1 if error.
  */
 static int query_node_info(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, struct ibnd_node *node,
+			   struct ibnd_fabric *fabric, ibnd_node_t * node,
 			   ib_portid_t * portid)
 {
-	if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, 0,
+	if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0,
 			   ibmad_port))
 		return -1;
 
 	/* decode just a couple of fields for quicker reference. */
-	mad_decode_field(node->node.info, IB_NODE_GUID_F, &(node->node.guid));
-	mad_decode_field(node->node.info, IB_NODE_TYPE_F, &(node->node.type));
-	mad_decode_field(node->node.info, IB_NODE_NPORTS_F,
-			 &(node->node.numports));
+	mad_decode_field(node->info, IB_NODE_GUID_F, &(node->guid));
+	mad_decode_field(node->info, IB_NODE_TYPE_F, &(node->type));
+	mad_decode_field(node->info, IB_NODE_NPORTS_F, &(node->numports));
 
 	return (0);
 }
@@ -118,15 +117,14 @@ static int query_node_info(struct ibmad_port *ibmad_port,
  * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
  */
 static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
-		      struct ibnd_node *inode, struct ibnd_port *iport,
+		      ibnd_node_t * node, struct ibnd_port *iport,
 		      ib_portid_t * portid)
 {
 	int rc = 0;
-	ibnd_node_t *node = &(inode->node);
 	ibnd_port_t *port = &(iport->port);
-	void *nd = inode->node.nodedesc;
+	void *nd = node->nodedesc;
 
-	if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0)
+	if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0)
 		return rc;
 
 	port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F);
@@ -204,30 +202,30 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
 }
 
 static void dump_endnode(ib_portid_t * path, char *prompt,
-			 struct ibnd_node *node, struct ibnd_port *port)
+			 ibnd_node_t * node, struct ibnd_port *port)
 {
 	char type[64];
 	if (!show_progress)
 		return;
 
-	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int));
-
-	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
-	       portid2str(path), prompt, type, node->node.guid,
-	       node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum,
-	       port->port.base_lid,
-	       port->port.base_lid + (1 << port->port.lmc) - 1,
-	       node->node.nodedesc);
+	mad_dump_node_type(type, 64, &(node->type), sizeof(int)),
+	    printf("%s -> %s %s {%016" PRIx64
+		   "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path),
+		   prompt, type, node->guid,
+		   node->type == IB_NODE_SWITCH ? 0 : port->port.portnum,
+		   port->port.base_lid,
+		   port->port.base_lid + (1 << port->port.lmc) - 1,
+		   node->nodedesc);
 }
 
-static struct ibnd_node *find_existing_node(struct ibnd_fabric *fabric,
-					    struct ibnd_node *new)
+static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
+				       ibnd_node_t * new)
 {
-	int hash = HASHGUID(new->node.guid) % HTSZ;
-	struct ibnd_node *node;
+	int hash = HASHGUID(new->guid) % HTSZ;
+	ibnd_node_t *node;
 
 	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
-		if (node->node.guid == new->node.guid)
+		if (node->guid == new->guid)
 			return node;
 
 	return NULL;
@@ -237,7 +235,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 {
 	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int hash = HASHGUID(guid) % HTSZ;
-	struct ibnd_node *node;
+	ibnd_node_t *node;
 
 	if (!fabric) {
 		IBND_DEBUG("fabric parameter NULL\n");
@@ -245,7 +243,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 	}
 
 	for (node = f->nodestbl[hash]; node; node = node->htnext)
-		if (node->node.guid == guid)
+		if (node->guid == guid)
 			return (ibnd_node_t *) node;
 
 	return NULL;
@@ -273,7 +271,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 	void *nd = node->nodedesc;
 	int p = 0;
 	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
-	struct ibnd_node *n = CONV_NODE_INTERNAL(node);
 
 	if (_check_ibmad_port(ibmad_port) < 0)
 		return (NULL);
@@ -288,36 +285,36 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 		return (NULL);
 	}
 
-	if (query_node_info(ibmad_port, f, n, &(n->node.path_portid)))
+	if (query_node_info(ibmad_port, f, node, &(node->path_portid)))
 		return (NULL);
 
-	if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, 0,
+	if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0,
 			   ibmad_port))
 		return (NULL);
 
 	/* update all the port info's */
-	for (p = 1; p >= n->node.numports; p++) {
-		get_port_info(ibmad_port, f,
-			      CONV_PORT_INTERNAL(n->node.ports[p]), p,
-			      &(n->node.path_portid));
+	for (p = 1; p >= node->numports; p++) {
+		get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]),
+			      p, &(node->path_portid));
 	}
 
-	if (n->node.type != IB_NODE_SWITCH)
+	if (node->type != IB_NODE_SWITCH)
 		goto done;
 
-	if (!smp_query_via(portinfo_port0, &(n->node.path_portid),
-			   IB_ATTR_PORT_INFO, 0, 0, ibmad_port))
+	if (!smp_query_via
+	    (portinfo_port0, &(node->path_portid), IB_ATTR_PORT_INFO, 0, 0,
+	     ibmad_port))
 		return (NULL);
 
-	n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F);
-	n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F);
+	node->smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F);
+	node->smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F);
 
-	if (!smp_query_via(node->switchinfo, &(n->node.path_portid),
+	if (!smp_query_via(node->switchinfo, &(node->path_portid),
 			   IB_ATTR_SWITCH_INFO, 0, 0, ibmad_port))
 		node->smaenhsp0 = 0;	/* assume base SP0 */
 	else
 		mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F,
-				 &n->node.smaenhsp0);
+				 &node->smaenhsp0);
 
 done:
 	return (node);
@@ -358,10 +355,9 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 	return (rc);
 }
 
-static void add_to_nodeguid_hash(struct ibnd_node *node,
-				 struct ibnd_node *hash[])
+static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[])
 {
-	int hash_idx = HASHGUID(node->node.guid) % HTSZ;
+	int hash_idx = HASHGUID(node->guid) % HTSZ;
 
 	node->htnext = hash[hash_idx];
 	hash[hash_idx] = node;
@@ -376,9 +372,9 @@ static void add_to_portguid_hash(struct ibnd_port *port,
 	hash[hash_idx] = port;
 }
 
-static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric)
+static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
 {
-	switch (node->node.type) {
+	switch (node->type) {
 	case IB_NODE_CA:
 		node->type_next = fabric->ch_adapters;
 		fabric->ch_adapters = node;
@@ -394,21 +390,21 @@ static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric)
 	}
 }
 
-static void add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric)
+static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
 {
-	int dist = node->node.dist;
-	if (node->node.type != IB_NODE_SWITCH)
+	int dist = node->dist;
+	if (node->type != IB_NODE_SWITCH)
 		dist = MAXHOPS;	/* special Ca list */
 
 	node->dnext = fabric->nodesdist[dist];
 	fabric->nodesdist[dist] = node;
 }
 
-static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
-				     struct ibnd_node *temp, ib_portid_t * path,
-				     int dist)
+static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
+				ibnd_node_t * temp, ib_portid_t * path,
+				int dist)
 {
-	struct ibnd_node *node;
+	ibnd_node_t *node;
 
 	node = malloc(sizeof(*node));
 	if (!node) {
@@ -417,13 +413,13 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
 	}
 
 	memcpy(node, temp, sizeof(*node));
-	node->node.dist = dist;
-	node->node.path_portid = *path;
+	node->dist = dist;
+	node->path_portid = *path;
 
 	add_to_nodeguid_hash(node, fabric->nodestbl);
 
 	/* add this to the all nodes list */
-	node->node.next = fabric->fabric.nodes;
+	node->next = fabric->fabric.nodes;
 	fabric->fabric.nodes = (ibnd_node_t *) node;
 
 	add_to_type_list(node, fabric);
@@ -432,26 +428,24 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
 	return node;
 }
 
-static struct ibnd_port *find_existing_port_node(struct ibnd_node *node,
+static struct ibnd_port *find_existing_port_node(ibnd_node_t * node,
 						 struct ibnd_port *port)
 {
-	if (port->port.portnum > node->node.numports
-	    || node->node.ports == NULL)
+	if (port->port.portnum > node->numports || node->ports == NULL)
 		return (NULL);
 
-	return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum]));
+	return (CONV_PORT_INTERNAL(node->ports[port->port.portnum]));
 }
 
 static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
-					  struct ibnd_node *node,
+					  ibnd_node_t * node,
 					  struct ibnd_port *temp)
 {
 	struct ibnd_port *port;
 
-	if (node->node.ports == NULL) {
-		node->node.ports =
-		    calloc(sizeof(*node->node.ports), node->node.numports + 1);
-		if (!node->node.ports) {
+	if (node->ports == NULL) {
+		node->ports = calloc(sizeof(*node->ports), node->numports + 1);
+		if (!node->ports) {
 			IBND_ERROR("Failed to allocate the ports array\n");
 			return (NULL);
 		}
@@ -467,20 +461,19 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
 	port->port.node = (ibnd_node_t *) node;
 	port->port.ext_portnum = 0;
 
-	node->node.ports[temp->port.portnum] = (ibnd_port_t *) port;
+	node->ports[temp->port.portnum] = (ibnd_port_t *) port;
 
 	add_to_portguid_hash(port, fabric->portstbl);
 	return port;
 }
 
-static void link_ports(struct ibnd_node *node, struct ibnd_port *port,
-		       struct ibnd_node *remotenode,
-		       struct ibnd_port *remoteport)
+static void link_ports(ibnd_node_t * node, struct ibnd_port *port,
+		       ibnd_node_t * remotenode, struct ibnd_port *remoteport)
 {
 	IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64
-		   " %p->%p:%u\n", node->node.guid, node, port,
-		   port->port.portnum, remotenode->node.guid, remotenode,
-		   remoteport, remoteport->port.portnum);
+		   " %p->%p:%u\n", node->guid, node, port, port->port.portnum,
+		   remotenode->guid, remotenode, remoteport,
+		   remoteport->port.portnum);
 	if (port->port.remoteport)
 		port->port.remoteport->remoteport = NULL;
 	if (remoteport->port.remoteport)
@@ -490,14 +483,14 @@ static void link_ports(struct ibnd_node *node, struct ibnd_port *port,
 }
 
 static int get_remote_node(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, struct ibnd_node *node,
+			   struct ibnd_fabric *fabric, ibnd_node_t * node,
 			   struct ibnd_port *port, ib_portid_t * path,
 			   int portnum, int dist)
 {
 	int rc = 0;
-	struct ibnd_node node_buf;
+	ibnd_node_t node_buf;
 	struct ibnd_port port_buf;
-	struct ibnd_node *remotenode, *oldnode;
+	ibnd_node_t *remotenode, *oldnode;
 	struct ibnd_port *remoteport, *oldport;
 
 	memset(&node_buf, 0, sizeof(node_buf));
@@ -554,9 +547,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	int rc = 0;
 	struct ibnd_fabric *fabric = NULL;
 	ib_portid_t my_portid = { 0 };
-	struct ibnd_node node_buf;
+	ibnd_node_t node_buf;
 	struct ibnd_port port_buf;
-	struct ibnd_node *node;
+	ibnd_node_t *node;
 	struct ibnd_port *port;
 	int i;
 	int dist = 0;
@@ -605,7 +598,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 		goto error;
 
 	rc = get_remote_node(ibmad_port, fabric, node, port, from,
-			     mad_get_field(node->node.info, 0,
+			     mad_get_field(node->info, 0,
 					   IB_NODE_LOCAL_PORT_F), 0);
 	if (rc < 0)
 		goto error;
@@ -616,13 +609,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 
-			path = &node->node.path_portid;
+			path = &node->path_portid;
 
 			IBND_DEBUG("dist %d node %p\n", dist, node);
 			dump_endnode(path, "processing", node, port);
 
-			for (i = 1; i <= node->node.numports; i++) {
-				if (i == mad_get_field(node->node.info, 0,
+			for (i = 1; i <= node->numports; i++) {
+				if (i == mad_get_field(node->info, 0,
 						       IB_NODE_LOCAL_PORT_F))
 					continue;
 
@@ -644,9 +637,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 					goto error;
 
 				/* If switch, set port GUID to node port GUID */
-				if (node->node.type == IB_NODE_SWITCH) {
+				if (node->type == IB_NODE_SWITCH) {
 					port->port.guid =
-					    mad_get_field64(node->node.info, 0,
+					    mad_get_field64(node->info, 0,
 							    IB_NODE_PORT_GUID_F);
 				}
 
@@ -666,14 +659,14 @@ error:
 	return (NULL);
 }
 
-static void destroy_node(struct ibnd_node *node)
+static void destroy_node(ibnd_node_t * node)
 {
 	int p = 0;
 
-	for (p = 0; p <= node->node.numports; p++) {
-		free(node->node.ports[p]);
+	for (p = 0; p <= node->numports; p++) {
+		free(node->ports[p]);
 	}
-	free(node->node.ports);
+	free(node->ports);
 	free(node);
 }
 
@@ -681,8 +674,8 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 {
 	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int dist = 0;
-	struct ibnd_node *node = NULL;
-	struct ibnd_node *next = NULL;
+	ibnd_node_t *node = NULL;
+	ibnd_node_t *next = NULL;
 	ibnd_chassis_t *ch, *ch_next;
 
 	if (!fabric)
@@ -747,8 +740,8 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 			  int node_type, void *user_data)
 {
 	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
-	struct ibnd_node *list = NULL;
-	struct ibnd_node *cur = NULL;
+	ibnd_node_t *list = NULL;
+	ibnd_node_t *cur = NULL;
 
 	if (!fabric) {
 		IBND_DEBUG("fabric parameter NULL\n");
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 38555a0..449bd70 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -49,18 +49,6 @@
 #define	IBND_ERROR(fmt, ...) \
 		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
 
-struct ibnd_node {
-	/* This member MUST BE FIRST */
-	ibnd_node_t node;
-
-	/* internal use only */
-	unsigned char ch_found;
-	struct ibnd_node *htnext;	/* hash table list */
-	struct ibnd_node *dnext;	/* nodesdist next */
-	struct ibnd_node *type_next;	/* next based on type */
-};
-#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node)
-
 struct ibnd_port {
 	/* This member MUST BE FIRST */
 	ibnd_port_t port;
@@ -79,15 +67,15 @@ struct ibnd_fabric {
 	ibnd_fabric_t fabric;
 
 	/* internal use only */
-	struct ibnd_node *nodestbl[HTSZ];
+	ibnd_node_t *nodestbl[HTSZ];
 	struct ibnd_port *portstbl[HTSZ];
-	struct ibnd_node *nodesdist[MAXHOPS + 1];
+	ibnd_node_t *nodesdist[MAXHOPS + 1];
 	ibnd_chassis_t *first_chassis;
 	ibnd_chassis_t *current_chassis;
 	ibnd_chassis_t *last_chassis;
-	struct ibnd_node *switches;
-	struct ibnd_node *ch_adapters;
-	struct ibnd_node *routers;
+	ibnd_node_t *switches;
+	ibnd_node_t *ch_adapters;
+	ibnd_node_t *routers;
 	ib_portid_t selfportid;
 };
 #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
-- 
1.5.4.5
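
Note for readers of the archive: the node lookups touched by this patch
(add_to_nodeguid_hash, find_existing_node, ibnd_find_node_guid) all walk
a chained hash table keyed on the node GUID, linked through htnext. Below
is a minimal standalone sketch of that pattern. The node_t type, the HTSZ
value and the HASHGUID definition are stand-ins assumed for illustration,
not copied from the library headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the real ones live in the library. */
#define HTSZ 137
#define HASHGUID(guid) \
	((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))

/* Hypothetical, cut-down node: just the key and the chain pointer. */
typedef struct node {
	uint64_t guid;
	struct node *htnext;	/* next node in the same bucket */
} node_t;

/* Push the node onto the head of its bucket's chain. */
static void add_to_hash(node_t *node, node_t *hash[])
{
	int idx;

	assert(node != NULL);
	idx = HASHGUID(node->guid) % HTSZ;
	node->htnext = hash[idx];
	hash[idx] = node;
}

/* Walk one bucket and compare full GUIDs; NULL if not present. */
static node_t *find_guid(node_t *hash[], uint64_t guid)
{
	int idx = HASHGUID(guid) % HTSZ;
	node_t *n;

	for (n = hash[idx]; n; n = n->htnext)
		if (n->guid == guid)
			return n;
	return NULL;
}
```

Since the chain pointer is stored in the node itself, moving htnext into
the public struct changes the field access but not the table logic.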


From weiny2 at llnl.gov  Thu Aug 13 20:42:46 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:42:46 -0700
Subject: [ofa-general] [PATCH 2/5] libibnetdisc: make all fields of
	ibnd_port_t public
Message-ID: <20090813204246.59efeb5e.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 13 Aug 2009 19:54:00 -0700
Subject: [PATCH] libibnetdisc: make all fields of ibnd_port_t public


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
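
For context, a rough sketch of the two layouts involved. Before this
series the internal struct embedded the public one as its first member
and a CONV_*_INTERNAL cast recovered the wrapper; afterwards the internal
field is simply appended to the public struct. All type and macro names
below are hypothetical, cut-down stand-ins, not the library definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old layout: public type wrapped by an internal type. */
typedef struct old_pub_port {
	uint64_t guid;
	int portnum;
} old_pub_port_t;

struct old_int_port {
	old_pub_port_t port;		/* must be first for the cast below */
	struct old_int_port *htnext;	/* internal hash chain */
};
#define OLD_CONV_PORT_INTERNAL(p) ((struct old_int_port *)(p))

/* New layout: one type, internal field appended at the end. */
typedef struct new_port {
	uint64_t guid;
	int portnum;
	struct new_port *htnext;	/* internal use only */
} new_port_t;

/* The old cast is only valid while the public part sits at offset 0. */
static void check_old_layout(void)
{
	assert(offsetof(struct old_int_port, port) == 0);
}

/* Old style: double member access and a cast to reach internal data. */
static int old_portnum(struct old_int_port *p)
{
	return p->port.portnum;
}

static struct old_int_port *old_next(old_pub_port_t *pub)
{
	return OLD_CONV_PORT_INTERNAL(pub)->htnext;
}

/* New style: direct access, no cast. */
static int new_portnum(new_port_t *p)
{
	return p->portnum;
}

static new_port_t *new_next(new_port_t *p)
{
	return p->htnext;
}
```

The flat layout drops the casts and the port->port.x accesses, at the
cost of exposing htnext in the public header.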
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   15 ++--
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   87 ++++++++++----------
 infiniband-diags/libibnetdisc/src/internal.h       |   11 +--
 3 files changed, 52 insertions(+), 61 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index e7f5f6a..4a57855 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -40,7 +40,7 @@
 
 struct ib_fabric;		/* forward declare */
 struct chassis;			/* forward declare */
-struct port;			/* forward declare */
+struct ibnd_port;		/* forward declare */
 
 /** =========================================================================
  * Node
@@ -67,7 +67,7 @@ typedef struct ibnd_node {
 
 	char nodedesc[IB_SMP_DATA_SIZE];
 
-	struct port **ports;	/* in order array of port pointers
+	struct ibnd_port **ports; /* in order array of port pointers
 				   the size of this array is info.numports + 1
 				   items MAY BE NULL!  (ie 0 == switches only) */
 
@@ -89,17 +89,20 @@ typedef struct ibnd_node {
 /** =========================================================================
  * Port
  */
-typedef struct port {
+typedef struct ibnd_port {
 	uint64_t guid;
 	int portnum;
-	int ext_portnum;	/* optional if != 0 external port num */
-	ibnd_node_t *node;	/* node this port belongs to */
-	struct port *remoteport;	/* null if SMA, or does not exist */
+	int ext_portnum; /* optional if != 0 external port num */
+	ibnd_node_t *node; /* node this port belongs to */
+	struct ibnd_port *remoteport; /* null if SMA, or does not exist */
 	/* quick cache of info below */
 	uint16_t base_lid;
 	uint8_t lmc;
 	/* use libibmad decoder functions for info */
 	uint8_t info[IB_SMP_DATA_SIZE];
+
+	/* internal use only */
+	struct ibnd_port *htnext;
 } ibnd_port_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index b883d4a..1fc964c 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -67,28 +67,28 @@ void decode_port_info(ibnd_port_t * port)
 }
 
 static int get_port_info(struct ibmad_port *ibmad_port,
-			 struct ibnd_fabric *fabric, struct ibnd_port *port,
+			 struct ibnd_fabric *fabric, ibnd_port_t * port,
 			 int portnum, ib_portid_t * portid)
 {
 	char width[64], speed[64];
 	int iwidth;
 	int ispeed;
 
-	port->port.portnum = portnum;
-	iwidth = mad_get_field(port->port.info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
-	ispeed = mad_get_field(port->port.info, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
+	port->portnum = portnum;
+	iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
+	ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
 
-	if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO,
+	if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO,
 			   portnum, 0, ibmad_port))
 		return -1;
 
-	decode_port_info(&(port->port));
+	decode_port_info(port);
 
 	IBND_DEBUG
 	    ("portid %s portnum %d: base lid %d state %d physstate %d %s %s\n",
-	     portid2str(portid), portnum, port->port.base_lid,
-	     mad_get_field(port->port.info, 0, IB_PORT_STATE_F),
-	     mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F),
+	     portid2str(portid), portnum, port->base_lid,
+	     mad_get_field(port->info, 0, IB_PORT_STATE_F),
+	     mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F),
 	     mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth),
 	     mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed));
 	return 0;
@@ -117,11 +117,10 @@ static int query_node_info(struct ibmad_port *ibmad_port,
  * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
  */
 static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
-		      ibnd_node_t * node, struct ibnd_port *iport,
+		      ibnd_node_t * node, ibnd_port_t * port,
 		      ib_portid_t * portid)
 {
 	int rc = 0;
-	ibnd_port_t *port = &(iport->port);
 	void *nd = node->nodedesc;
 
 	if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0)
@@ -202,7 +201,7 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
 }
 
 static void dump_endnode(ib_portid_t * path, char *prompt,
-			 ibnd_node_t * node, struct ibnd_port *port)
+			 ibnd_node_t * node, ibnd_port_t * port)
 {
 	char type[64];
 	if (!show_progress)
@@ -212,10 +211,9 @@ static void dump_endnode(ib_portid_t * path, char *prompt,
 	    printf("%s -> %s %s {%016" PRIx64
 		   "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path),
 		   prompt, type, node->guid,
-		   node->type == IB_NODE_SWITCH ? 0 : port->port.portnum,
-		   port->port.base_lid,
-		   port->port.base_lid + (1 << port->port.lmc) - 1,
-		   node->nodedesc);
+		   node->type == IB_NODE_SWITCH ? 0 : port->portnum,
+		   port->base_lid,
+		   port->base_lid + (1 << port->lmc) - 1, node->nodedesc);
 }
 
 static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
@@ -294,7 +292,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 
 	/* update all the port info's */
 	for (p = 1; p >= node->numports; p++) {
-		get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]),
+		get_port_info(ibmad_port, f, node->ports[p],
 			      p, &(node->path_portid));
 	}
 
@@ -363,10 +361,9 @@ static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[])
 	hash[hash_idx] = node;
 }
 
-static void add_to_portguid_hash(struct ibnd_port *port,
-				 struct ibnd_port *hash[])
+static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[])
 {
-	int hash_idx = HASHGUID(port->port.guid) % HTSZ;
+	int hash_idx = HASHGUID(port->guid) % HTSZ;
 
 	port->htnext = hash[hash_idx];
 	hash[hash_idx] = port;
@@ -429,19 +426,19 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
 }
 
 static struct ibnd_port *find_existing_port_node(ibnd_node_t * node,
-						 struct ibnd_port *port)
+						 ibnd_port_t * port)
 {
-	if (port->port.portnum > node->numports || node->ports == NULL)
+	if (port->portnum > node->numports || node->ports == NULL)
 		return (NULL);
 
-	return (CONV_PORT_INTERNAL(node->ports[port->port.portnum]));
+	return (node->ports[port->portnum]);
 }
 
 static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
 					  ibnd_node_t * node,
-					  struct ibnd_port *temp)
+					  ibnd_port_t * temp)
 {
-	struct ibnd_port *port;
+	ibnd_port_t *port;
 
 	if (node->ports == NULL) {
 		node->ports = calloc(sizeof(*node->ports), node->numports + 1);
@@ -458,40 +455,40 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
 	}
 
 	memcpy(port, temp, sizeof(*port));
-	port->port.node = (ibnd_node_t *) node;
-	port->port.ext_portnum = 0;
+	port->node = (ibnd_node_t *) node;
+	port->ext_portnum = 0;
 
-	node->ports[temp->port.portnum] = (ibnd_port_t *) port;
+	node->ports[temp->portnum] = (ibnd_port_t *) port;
 
 	add_to_portguid_hash(port, fabric->portstbl);
 	return port;
 }
 
-static void link_ports(ibnd_node_t * node, struct ibnd_port *port,
-		       ibnd_node_t * remotenode, struct ibnd_port *remoteport)
+static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
+		       ibnd_node_t * remotenode, ibnd_port_t * remoteport)
 {
 	IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64
-		   " %p->%p:%u\n", node->guid, node, port, port->port.portnum,
+		   " %p->%p:%u\n", node->guid, node, port, port->portnum,
 		   remotenode->guid, remotenode, remoteport,
-		   remoteport->port.portnum);
-	if (port->port.remoteport)
-		port->port.remoteport->remoteport = NULL;
-	if (remoteport->port.remoteport)
-		remoteport->port.remoteport->remoteport = NULL;
-	port->port.remoteport = (ibnd_port_t *) remoteport;
-	remoteport->port.remoteport = (ibnd_port_t *) port;
+		   remoteport->portnum);
+	if (port->remoteport)
+		port->remoteport->remoteport = NULL;
+	if (remoteport->remoteport)
+		remoteport->remoteport->remoteport = NULL;
+	port->remoteport = (ibnd_port_t *) remoteport;
+	remoteport->remoteport = (ibnd_port_t *) port;
 }
 
 static int get_remote_node(struct ibmad_port *ibmad_port,
 			   struct ibnd_fabric *fabric, ibnd_node_t * node,
-			   struct ibnd_port *port, ib_portid_t * path,
+			   ibnd_port_t * port, ib_portid_t * path,
 			   int portnum, int dist)
 {
 	int rc = 0;
 	ibnd_node_t node_buf;
-	struct ibnd_port port_buf;
+	ibnd_port_t port_buf;
 	ibnd_node_t *remotenode, *oldnode;
-	struct ibnd_port *remoteport, *oldport;
+	ibnd_port_t *remoteport, *oldport;
 
 	memset(&node_buf, 0, sizeof(node_buf));
 	memset(&port_buf, 0, sizeof(port_buf));
@@ -499,7 +496,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum,
 		   dist);
 
-	if (mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F)
+	if (mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
 	    != IB_PORT_PHYS_STATE_LINKUP)
 		return 1;	/* positive == non-fatal error */
 
@@ -548,9 +545,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	struct ibnd_fabric *fabric = NULL;
 	ib_portid_t my_portid = { 0 };
 	ibnd_node_t node_buf;
-	struct ibnd_port port_buf;
+	ibnd_port_t port_buf;
 	ibnd_node_t *node;
-	struct ibnd_port *port;
+	ibnd_port_t *port;
 	int i;
 	int dist = 0;
 	ib_portid_t *path;
@@ -638,7 +635,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 
 				/* If switch, set port GUID to node port GUID */
 				if (node->type == IB_NODE_SWITCH) {
-					port->port.guid =
+					port->guid =
 					    mad_get_field64(node->info, 0,
 							    IB_NODE_PORT_GUID_F);
 				}
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 449bd70..f06d2c3 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -49,15 +49,6 @@
 #define	IBND_ERROR(fmt, ...) \
 		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
 
-struct ibnd_port {
-	/* This member MUST BE FIRST */
-	ibnd_port_t port;
-
-	/* internal use only */
-	struct ibnd_port *htnext;
-};
-#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port)
-
 /* HASH table defines */
 #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
 #define HTSZ 137
@@ -68,7 +59,7 @@ struct ibnd_fabric {
 
 	/* internal use only */
 	ibnd_node_t *nodestbl[HTSZ];
-	struct ibnd_port *portstbl[HTSZ];
+	ibnd_port_t *portstbl[HTSZ];
 	ibnd_node_t *nodesdist[MAXHOPS + 1];
 	ibnd_chassis_t *first_chassis;
 	ibnd_chassis_t *current_chassis;
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Aug 13 20:42:51 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:42:51 -0700
Subject: [ofa-general] [PATCH 3/5] libibnetdisc: make all fields of
	ibnd_fabric_t public
Message-ID: <20090813204251.df6446c1.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 13 Aug 2009 20:08:51 -0700
Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public

	In addition, clean up the name of the chassis struct.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
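Note for reviewers: the sketch below illustrates the wrapper idiom that
this patch removes from internal.h, where the public struct had to be the
first member of a private struct so that one pointer could be cast to the
other. It is only a minimal sketch under assumed names: pub_fabric_t,
struct priv_fabric, to_priv and the demo_* functions are hypothetical and
are not part of libibnetdisc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the public type (not the real ibnd_fabric_t). */
typedef struct pub_fabric {
	int maxhops_discovered;
} pub_fabric_t;

/* Old idiom: the public struct MUST BE FIRST so the cast below is valid. */
struct priv_fabric {
	pub_fabric_t fabric;
	int selflid;		/* internal use only */
};

/* Equivalent in spirit to the removed CONV_FABRIC_INTERNAL() macro. */
static struct priv_fabric *to_priv(pub_fabric_t * fabric)
{
	return (struct priv_fabric *)fabric;
}

/* Allocate the private object but hand out only the public view. */
pub_fabric_t *demo_create(int selflid)
{
	struct priv_fabric *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->selflid = selflid;
	return &f->fabric;
}

/* Reach the private field through the public pointer. */
int demo_selflid(pub_fabric_t * fabric)
{
	assert(fabric != NULL);
	return to_priv(fabric)->selflid;
}

void demo_destroy(pub_fabric_t * fabric)
{
	free(fabric);		/* same address as the private object */
}
```

With all fields moved into the public struct, as this patch does, the
cast helper and the "first member" constraint are no longer needed.
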
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   41 +++++++++----
 infiniband-diags/libibnetdisc/src/chassis.c        |   23 ++++----
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   63 +++++++++-----------
 infiniband-diags/libibnetdisc/src/internal.h       |   24 --------
 4 files changed, 69 insertions(+), 82 deletions(-)
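
Since HASHGUID and HTSZ become visible in the public header with this
patch, here is a small sketch of how the GUID hash chains are used. The
two macros are copied from the patch; demo_port_t, demo_add and demo_find
are hypothetical names for illustration only, and the struct is reduced
to the two fields the chain needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the patch. */
#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
#define HTSZ 137

/* Hypothetical reduced port: only the fields the hash chain needs. */
typedef struct demo_port {
	uint64_t guid;
	struct demo_port *htnext;
} demo_port_t;

/* Push onto the head of the bucket chain, like add_to_portguid_hash(). */
void demo_add(demo_port_t * port, demo_port_t * hash[])
{
	int hash_idx;

	assert(port != NULL);
	hash_idx = HASHGUID(port->guid) % HTSZ;

	port->htnext = hash[hash_idx];
	hash[hash_idx] = port;
}

/* Walk one bucket chain; returns NULL if the GUID is not present. */
demo_port_t *demo_find(demo_port_t * hash[], uint64_t guid)
{
	demo_port_t *port;

	for (port = hash[HASHGUID(guid) % HTSZ]; port; port = port->htnext)
		if (port->guid == guid)
			return port;

	return NULL;
}
```

The table must start zeroed (the library gets this from calloc of the
fabric object) so that every chain ends in NULL.
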

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 4a57855..414e068 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -38,8 +38,7 @@
 #include <infiniband/mad.h>
 #include <iba/ib_types.h>
 
-struct ib_fabric;		/* forward declare */
-struct chassis;			/* forward declare */
+struct ibnd_chassis;		/* forward declare */
 struct ibnd_port;		/* forward declare */
 
 /** =========================================================================
@@ -67,13 +66,13 @@ typedef struct ibnd_node {
 
 	char nodedesc[IB_SMP_DATA_SIZE];
 
-	struct ibnd_port **ports; /* in order array of port pointers
-				   the size of this array is info.numports + 1
-				   items MAY BE NULL!  (ie 0 == switches only) */
+	struct ibnd_port **ports;	/* in order array of port pointers
+					   the size of this array is info.numports + 1
+					   items MAY BE NULL!  (ie 0 == switches only) */
 
 	/* chassis info */
 	struct ibnd_node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
-	struct chassis *chassis;	/* if != NULL the chassis this node belongs to */
+	struct ibnd_chassis *chassis;	/* if != NULL the chassis this node belongs to */
 	unsigned char ch_type;
 	unsigned char ch_anafanum;
 	unsigned char ch_slotnum;
@@ -92,9 +91,9 @@ typedef struct ibnd_node {
 typedef struct ibnd_port {
 	uint64_t guid;
 	int portnum;
-	int ext_portnum; /* optional if != 0 external port num */
-	ibnd_node_t *node; /* node this port belongs to */
-	struct ibnd_port *remoteport; /* null if SMA, or does not exist */
+	int ext_portnum;	/* optional if != 0 external port num */
+	ibnd_node_t *node;	/* node this port belongs to */
+	struct ibnd_port *remoteport;	/* null if SMA, or does not exist */
 	/* quick cache of info below */
 	uint16_t base_lid;
 	uint8_t lmc;
@@ -108,8 +107,8 @@ typedef struct ibnd_port {
 /** =========================================================================
  * Chassis
  */
-typedef struct chassis {
-	struct chassis *next;
+typedef struct ibnd_chassis {
+	struct ibnd_chassis *next;
 	uint64_t chassisguid;
 	unsigned char chassisnum;
 
@@ -124,11 +123,17 @@ typedef struct chassis {
 	ibnd_node_t *linenode[LINES_MAX_NUM + 1];
 } ibnd_chassis_t;
 
+/* HASH table defines */
+#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
+#define HTSZ 137
+
+#define MAXHOPS		63
+
 /** =========================================================================
  * Fabric
  * Main fabric object which is returned and represents the data discovered
  */
-typedef struct ib_fabric {
+typedef struct ibnd_fabric {
 	/* the node the discover was initiated from
 	 * "from" parameter in ibnd_discover_fabric
 	 * or by default the node you ar running on
@@ -139,6 +144,18 @@ typedef struct ib_fabric {
 	/* NULL terminated list of all chassis found in the fabric */
 	ibnd_chassis_t *chassis;
 	int maxhops_discovered;
+
+	/* internal use only */
+	ibnd_node_t *nodestbl[HTSZ];
+	ibnd_port_t *portstbl[HTSZ];
+	ibnd_node_t *nodesdist[MAXHOPS + 1];
+	ibnd_chassis_t *first_chassis;
+	ibnd_chassis_t *current_chassis;
+	ibnd_chassis_t *last_chassis;
+	ibnd_node_t *switches;
+	ibnd_node_t *ch_adapters;
+	ibnd_node_t *routers;
+	ib_portid_t selfportid;
 } ibnd_fabric_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 0dd259a..4886cfc 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -91,7 +91,7 @@ char *ibnd_get_chassis_slot_str(ibnd_node_t * node, char *str, size_t size)
 	return (str);
 }
 
-static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric,
+static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric,
 				       unsigned char chassisnum)
 {
 	ibnd_chassis_t *current;
@@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node)
 		return sysimgguid;
 }
 
-static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f,
+static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric,
 					ibnd_node_t * node)
 {
 	ibnd_chassis_t *current;
 	uint64_t chguid;
 
 	chguid = get_chassisguid(node);
-	for (current = f->first_chassis; current; current = current->next) {
+	for (current = fabric->first_chassis; current; current = current->next) {
 		if (current->chassisguid == chguid)
 			return current;
 	}
@@ -224,7 +224,6 @@ static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f,
 
 uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	ibnd_chassis_t *chassis;
 
 	if (!fabric) {
@@ -232,7 +231,7 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
 		return 0;
 	}
 
-	chassis = find_chassisnum(f, chassisnum);
+	chassis = find_chassisnum(fabric, chassisnum);
 	if (chassis)
 		return chassis->chassisguid;
 	else
@@ -783,7 +782,7 @@ static void voltaire_portmap(ibnd_port_t * port)
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }
 
-static int add_chassis(struct ibnd_fabric *fabric)
+static int add_chassis(ibnd_fabric_t * fabric)
 {
 	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
 		IBND_ERROR("OOM: failed to allocate chassis object\n");
@@ -819,7 +818,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
 	Returns:
 	0 on success, -1 on failure
 */
-int group_nodes(struct ibnd_fabric *fabric)
+int group_nodes(ibnd_fabric_t * fabric)
 {
 	ibnd_node_t *node;
 	int dist;
@@ -833,7 +832,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* an appropriate chassis record (slotnum and position) */
 	/* according to internal connectivity */
 	/* not very efficient but clear code so... */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
@@ -844,7 +843,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 
 	/* separate every Voltaire chassis from each other and build linked list of them */
 	/* algorithm: catch spine and find all surrounding nodes */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
@@ -863,7 +862,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 
 	/* now make pass on nodes for chassis which are not Voltaire */
 	/* grouped by common SystemImageGUID */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
@@ -913,12 +912,12 @@ int group_nodes(struct ibnd_fabric *fabric)
 				}
 			}
 		}
-		if (dist == fabric->fabric.maxhops_discovered)
+		if (dist == fabric->maxhops_discovered)
 			dist = MAXHOPS;	/* skip to CAs */
 		else
 			dist++;
 	}
 
-	fabric->fabric.chassis = fabric->first_chassis;
+	fabric->chassis = fabric->first_chassis;
 	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 1fc964c..2cd2c9b 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -67,7 +67,7 @@ void decode_port_info(ibnd_port_t * port)
 }
 
 static int get_port_info(struct ibmad_port *ibmad_port,
-			 struct ibnd_fabric *fabric, ibnd_port_t * port,
+			 ibnd_fabric_t * fabric, ibnd_port_t * port,
 			 int portnum, ib_portid_t * portid)
 {
 	char width[64], speed[64];
@@ -98,7 +98,7 @@ static int get_port_info(struct ibmad_port *ibmad_port,
  * Returns -1 if error.
  */
 static int query_node_info(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, ibnd_node_t * node,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
 			   ib_portid_t * portid)
 {
 	if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0,
@@ -116,7 +116,7 @@ static int query_node_info(struct ibmad_port *ibmad_port,
 /*
  * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
  */
-static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
+static int query_node(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 		      ibnd_node_t * node, ibnd_port_t * port,
 		      ib_portid_t * portid)
 {
@@ -175,28 +175,28 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
-static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
+static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
 	int rc = 0;
 
 	if (portid->lid) {
 		/* If we were LID routed we need to set up the drslid */
-		if (!f->selfportid.lid)
-			if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
+		if (!fabric->selfportid.lid)
+			if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL,
 						ibmad_port) < 0) {
 				IBND_ERROR("Failed to resolve self\n");
 				return -1;
 			}
 
-		portid->drpath.drslid = (uint16_t) f->selfportid.lid;
+		portid->drpath.drslid = (uint16_t) fabric->selfportid.lid;
 		portid->drpath.drdlid = 0xFFFF;
 	}
 
 	rc = add_port_to_dpath(&portid->drpath, nextport);
 
-	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
-		f->fabric.maxhops_discovered = portid->drpath.cnt;
+	if ((rc != -1) && (portid->drpath.cnt > fabric->maxhops_discovered))
+		fabric->maxhops_discovered = portid->drpath.cnt;
 	return (rc);
 }
 
@@ -216,7 +216,7 @@ static void dump_endnode(ib_portid_t * path, char *prompt,
 		   port->base_lid + (1 << port->lmc) - 1, node->nodedesc);
 }
 
-static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
+static ibnd_node_t *find_existing_node(ibnd_fabric_t * fabric,
 				       ibnd_node_t * new)
 {
 	int hash = HASHGUID(new->guid) % HTSZ;
@@ -231,7 +231,6 @@ static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
 
 ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int hash = HASHGUID(guid) % HTSZ;
 	ibnd_node_t *node;
 
@@ -240,7 +239,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 		return (NULL);
 	}
 
-	for (node = f->nodestbl[hash]; node; node = node->htnext)
+	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
 		if (node->guid == guid)
 			return (ibnd_node_t *) node;
 
@@ -268,7 +267,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 	char portinfo_port0[IB_SMP_DATA_SIZE];
 	void *nd = node->nodedesc;
 	int p = 0;
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 
 	if (_check_ibmad_port(ibmad_port) < 0)
 		return (NULL);
@@ -283,7 +281,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 		return (NULL);
 	}
 
-	if (query_node_info(ibmad_port, f, node, &(node->path_portid)))
+	if (query_node_info(ibmad_port, fabric, node, &(node->path_portid)))
 		return (NULL);
 
 	if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0,
@@ -292,7 +290,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 
 	/* update all the port info's */
 	for (p = 1; p >= node->numports; p++) {
-		get_port_info(ibmad_port, f, node->ports[p],
+		get_port_info(ibmad_port, fabric, node->ports[p],
 			      p, &(node->path_portid));
 	}
 
@@ -320,7 +318,6 @@ done:
 
 ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int i = 0;
 	ibnd_node_t *rc;
 	ib_dr_path_t path;
@@ -330,7 +327,7 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 		return (NULL);
 	}
 
-	rc = f->fabric.from_node;
+	rc = fabric->from_node;
 
 	if (str2drpath(&path, dr_str, 0, 0) == -1) {
 		return (NULL);
@@ -369,7 +366,7 @@ static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[])
 	hash[hash_idx] = port;
 }
 
-static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
+static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric)
 {
 	switch (node->type) {
 	case IB_NODE_CA:
@@ -387,7 +384,7 @@ static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
 	}
 }
 
-static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
+static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric)
 {
 	int dist = node->dist;
 	if (node->type != IB_NODE_SWITCH)
@@ -397,7 +394,7 @@ static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
 	fabric->nodesdist[dist] = node;
 }
 
-static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
+static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
 				ibnd_node_t * temp, ib_portid_t * path,
 				int dist)
 {
@@ -416,8 +413,8 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
 	add_to_nodeguid_hash(node, fabric->nodestbl);
 
 	/* add this to the all nodes list */
-	node->next = fabric->fabric.nodes;
-	fabric->fabric.nodes = (ibnd_node_t *) node;
+	node->next = fabric->nodes;
+	fabric->nodes = (ibnd_node_t *) node;
 
 	add_to_type_list(node, fabric);
 	add_to_nodedist(node, fabric);
@@ -434,7 +431,7 @@ static struct ibnd_port *find_existing_port_node(ibnd_node_t * node,
 	return (node->ports[port->portnum]);
 }
 
-static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
+static struct ibnd_port *add_port_to_node(ibnd_fabric_t * fabric,
 					  ibnd_node_t * node,
 					  ibnd_port_t * temp)
 {
@@ -480,7 +477,7 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 }
 
 static int get_remote_node(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, ibnd_node_t * node,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
 			   ibnd_port_t * port, ib_portid_t * path,
 			   int portnum, int dist)
 {
@@ -542,7 +539,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 				    ib_portid_t * from, int hops)
 {
 	int rc = 0;
-	struct ibnd_fabric *fabric = NULL;
+	ibnd_fabric_t *fabric = NULL;
 	ib_portid_t my_portid = { 0 };
 	ibnd_node_t node_buf;
 	ibnd_port_t port_buf;
@@ -588,7 +585,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	if (!node)
 		goto error;
 
-	fabric->fabric.from_node = (ibnd_node_t *) node;
+	fabric->from_node = (ibnd_node_t *) node;
 
 	port = add_port_to_node(fabric, node, &port_buf);
 	if (!port)
@@ -669,7 +666,6 @@ static void destroy_node(ibnd_node_t * node)
 
 void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int dist = 0;
 	ibnd_node_t *node = NULL;
 	ibnd_node_t *next = NULL;
@@ -678,21 +674,21 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 	if (!fabric)
 		return;
 
-	ch = f->first_chassis;
+	ch = fabric->first_chassis;
 	while (ch) {
 		ch_next = ch->next;
 		free(ch);
 		ch = ch_next;
 	}
 	for (dist = 0; dist <= MAXHOPS; dist++) {
-		node = f->nodesdist[dist];
+		node = fabric->nodesdist[dist];
 		while (node) {
 			next = node->dnext;
 			destroy_node(node);
 			node = next;
 		}
 	}
-	free(f);
+	free(fabric);
 }
 
 void ibnd_debug(int i)
@@ -736,7 +732,6 @@ void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 			  int node_type, void *user_data)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	ibnd_node_t *list = NULL;
 	ibnd_node_t *cur = NULL;
 
@@ -752,13 +747,13 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 
 	switch (node_type) {
 	case IB_NODE_SWITCH:
-		list = f->switches;
+		list = fabric->switches;
 		break;
 	case IB_NODE_CA:
-		list = f->ch_adapters;
+		list = fabric->ch_adapters;
 		break;
 	case IB_NODE_ROUTER:
-		list = f->routers;
+		list = fabric->routers;
 		break;
 	default:
 		IBND_DEBUG("Invalid node_type specified %d\n", node_type);
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index f06d2c3..ba32291 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -40,8 +40,6 @@
 
 #include <infiniband/ibnetdisc.h>
 
-#define MAXHOPS		63
-
 #define	IBND_DEBUG(fmt, ...) \
 	if (ibdebug) { \
 		printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \
@@ -49,26 +47,4 @@
 #define	IBND_ERROR(fmt, ...) \
 		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
 
-/* HASH table defines */
-#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
-#define HTSZ 137
-
-struct ibnd_fabric {
-	/* This member MUST BE FIRST */
-	ibnd_fabric_t fabric;
-
-	/* internal use only */
-	ibnd_node_t *nodestbl[HTSZ];
-	ibnd_port_t *portstbl[HTSZ];
-	ibnd_node_t *nodesdist[MAXHOPS + 1];
-	ibnd_chassis_t *first_chassis;
-	ibnd_chassis_t *current_chassis;
-	ibnd_chassis_t *last_chassis;
-	ibnd_node_t *switches;
-	ibnd_node_t *ch_adapters;
-	ibnd_node_t *routers;
-	ib_portid_t selfportid;
-};
-#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
-
 #endif				/* _INTERNAL_H_ */
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Aug 13 20:43:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:43:06 -0700
Subject: [ofa-general] [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a
 context object.
Message-ID: <20090813204306.dffc3237.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 13 Aug 2009 20:16:01 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.

	This object must be created before query functions can be used and is
	used to control the functionality of the queries.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
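Note for reviewers: a minimal, library-independent sketch of the context
pattern this patch introduces. It assumes that the progress flag moves
from a file-scope static into the context, which is what the new
ibnd_show_progress(ctx, i) signature suggests. demo_ctx_t and the demo_*
functions are hypothetical names; the real ibnd_ctx_t layout stays
private to the library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context; not the real ibnd_ctx_t. */
typedef struct demo_ctx {
	void *port;		/* caller-owned handle, not freed by the context */
	int show_progress;	/* per-context instead of a file-scope static */
} demo_ctx_t;

/* Returns NULL on allocation failure, otherwise a zeroed context. */
demo_ctx_t *demo_create_ctx(void *port)
{
	demo_ctx_t *ctx = calloc(1, sizeof *ctx);

	if (!ctx)
		return NULL;
	ctx->port = port;
	return ctx;
}

/* Set the flag and return the previous setting. */
int demo_show_progress(demo_ctx_t * ctx, int i)
{
	int old;

	assert(ctx != NULL);
	old = ctx->show_progress;
	ctx->show_progress = i;
	return old;
}

/* Frees only the context; the port handle remains the caller's. */
void demo_destroy_ctx(demo_ctx_t * ctx)
{
	free(ctx);
}
```

Keeping the setting in the context lets two callers in one process use
different settings, which a global flag could not do.
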
 infiniband-diags/libibnetdisc/Makefile.am          |    4 +-
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   23 ++++--
 .../libibnetdisc/man/ibnd_create_ctx.3             |    2 +
 .../libibnetdisc/man/ibnd_destroy_ctx.3            |    2 +
 .../libibnetdisc/man/ibnd_discover_fabric.3        |   41 ++++++++---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   74 ++++++++++++++------
 infiniband-diags/libibnetdisc/src/internal.h       |    5 ++
 infiniband-diags/libibnetdisc/src/libibnetdisc.map |    2 +
 infiniband-diags/libibnetdisc/test/testleaks.c     |    7 ++-
 infiniband-diags/src/iblinkinfo.c                  |    8 ++-
 infiniband-diags/src/ibnetdiscover.c               |   13 +++-
 infiniband-diags/src/ibqueryerrors.c               |    8 ++-
 12 files changed, 141 insertions(+), 48 deletions(-)
 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3

diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am
index 7085f14..5619aad 100644
--- a/infiniband-diags/libibnetdisc/Makefile.am
+++ b/infiniband-diags/libibnetdisc/Makefile.am
@@ -45,7 +45,9 @@ man_MANS = man/ibnd_debug.3 \
 	man/ibnd_iter_nodes.3 \
 	man/ibnd_iter_nodes_type.3 \
 	man/ibnd_show_progress.3 \
-	man/ibnd_update_node.3
+	man/ibnd_update_node.3 \
+	man/ibnd_create_ctx.3 \
+	man/ibnd_destroy_ctx.3
 
 EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver $(man_MANS)
 
diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 414e068..65ba74f 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -38,8 +38,11 @@
 #include <infiniband/mad.h>
 #include <iba/ib_types.h>
 
-struct ibnd_chassis;		/* forward declare */
-struct ibnd_port;		/* forward declare */
+typedef struct ibnd_ctx ibnd_ctx_t;
+
+/* forward declares */
+struct ibnd_chassis;
+struct ibnd_port;
 
 /** =========================================================================
  * Node
@@ -159,15 +162,21 @@ typedef struct ibnd_fabric {
 } ibnd_fabric_t;
 
 /** =========================================================================
- * Initialization (fabric operations)
+ * Initialization
  */
 MAD_EXPORT void ibnd_debug(int i);
-MAD_EXPORT void ibnd_show_progress(int i);
 
-MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
+MAD_EXPORT ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port);
+MAD_EXPORT void ibnd_destroy_ctx(ibnd_ctx_t * ctx);
+MAD_EXPORT int ibnd_show_progress(ibnd_ctx_t * ctx, int i);
+
+/** =========================================================================
+ * Fabric Operations
+ */
+MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 					       ib_portid_t * from, int hops);
 	/**
-	 * open: (required) ibmad_port object from libibmad
+	 * ctx : (required) context created by ibnd_create_ctx.
 	 * from: (optional) specify the node to start scanning from.
 	 *       If NULL start from the node we are running on.
 	 * hops: (optional) Specify how much of the fabric to traverse.
@@ -181,7 +190,7 @@ MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t * fabric);
 MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric,
 					    uint64_t guid);
 MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str);
-MAD_EXPORT ibnd_node_t *ibnd_update_node(struct ibmad_port *ibmad_port,
+MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx,
 					 ibnd_fabric_t * fabric,
 					 ibnd_node_t * node);
 
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
new file mode 100644
index 0000000..8b321b0
--- /dev/null
+++ b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
@@ -0,0 +1,2 @@
+.\".TH IBND_CREATE_CTX 3  "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3
new file mode 100644
index 0000000..bb9d96a
--- /dev/null
+++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3
@@ -0,0 +1,2 @@
+.\".TH IBND_DESTROY_CTX 3  "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
index dfeaf47..f014977 100644
--- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
+++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
@@ -1,46 +1,65 @@
 .TH IBND_DISCOVER_FABRIC 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
 .SH "NAME"
-ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- initialize ibnetdiscover library.
+ibnd_create_ctx, ibnd_destroy_ctx,
+ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \-
+initialize ibnetdiscover library and query the fabric.
 .SH "SYNOPSIS"
 .nf
 .B #include <infiniband/ibnetdisc.h>
 .sp
-.bi "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)"
+.BI "ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)"
+.BI "void ibnd_destroy_ctx(ibnd_ctx_t *ctx)"
+.BI "ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t *ctx, ib_portid_t *from, int hops)"
 .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)"
 .BI "void ibnd_debug(int i)"
-.BI "void ibnd_show_progress(int i)"
+.BI "int ibnd_show_progress(ibnd_ctx_t *ctx, int i)"
 .SH "DESCRIPTION"
-.B ibnd_discover_fabric()
-Discover the fabric connected to the port specified by ibmad_port, using a timeout specified.  The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".  This gives the user a "sub-fabric" which is "centered" anywhere they chose.
+.B ibnd_create_ctx()
+Create a context for the ibnetdiscover library to be used in query operations.
 
 ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS
-classes for ibnd_discover_fabric to work.
+classes for queries to work.
+
+.B ibnd_discover_fabric()
+Discover the fabric using the context specified.  The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".  This gives the user a "sub-fabric" which is "centered" anywhere they choose.
 
 .B ibnd_destroy_fabric()
 free all memory and resources associated with the fabric.
 
+.B ibnd_destroy_ctx()
+free all memory and resources associated with the context.
+
 .B ibnd_debug()
 Set the debug level to be printed as library operations take place.
 
-.B ibnd_debug()
-Indicate that the library should print debug output which shows it's progress
+.B ibnd_show_progress()
+Indicate that the library should print output which shows its progress
 through the fabric.
 
 .SH "RETURN VALUE"
+.B ibnd_create_ctx()
+return NULL on failure, otherwise a valid ibnd_ctx_t object.
+
 .B ibnd_discover_fabric()
 return NULL on failure, otherwise a valid ibnd_fabric_t object.
 
-.B ibnd_destory_fabric(), ibnd_debug()
+.B ibnd_show_progress()
+Returns the previous setting for this value, or -1 if the context is invalid.
+
+.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_destroy_ctx()
 NONE
+
 .SH "EXAMPLES"
 
 .B Discover the entire fabric connected to device "mthca0", port 1.
 
 	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
 	struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2);
-	ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0);
+	ibnd_ctx_t *ctx = ibnd_create_ctx(ibmad_port);
+	ibnd_fabric_t *fabric = ibnd_discover_fabric(ctx, NULL, 0);
 	...
 	ibnd_destroy_fabric(fabric);
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 
 .B Discover only a single node and those nodes connected to it.
@@ -48,7 +67,7 @@ NONE
 	...
 	str2drpath(&(port_id.drpath), from, 0, 0);
 	...
-	ibnd_discover_fabric(ibmad_port, 100, &port_id, 1);
+	ibnd_discover_fabric(ctx, &port_id, 1);
 	...
 .SH "SEE ALSO"
 	libibmad, mad_rpc_open_port
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 2cd2c9b..4b320cd 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -57,9 +57,23 @@
 #include "internal.h"
 #include "chassis.h"
 
-static int show_progress = 0;
 int ibdebug;
 
+ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)
+{
+	ibnd_ctx_t *rc = calloc(1, sizeof *rc);
+	if (!rc)
+		return (NULL);
+
+	rc->ibmad_port = ibmad_port;
+	return (rc);
+}
+
+void ibnd_destroy_ctx(ibnd_ctx_t * ctx)
+{
+	free(ctx);
+}
+
 void decode_port_info(ibnd_port_t * port)
 {
 	port->base_lid = (uint16_t) mad_get_field(port->info, 0, IB_PORT_LID_F);
@@ -204,8 +218,6 @@ static void dump_endnode(ib_portid_t * path, char *prompt,
 			 ibnd_node_t * node, ibnd_port_t * port)
 {
 	char type[64];
-	if (!show_progress)
-		return;
 
 	mad_dump_node_type(type, 64, &(node->type), sizeof(int)),
 	    printf("%s -> %s %s {%016" PRIx64
@@ -261,16 +273,29 @@ static int _check_ibmad_port(struct ibmad_port *ibmad_port)
 	return (0);
 }
 
-ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
-			      ibnd_fabric_t * fabric, ibnd_node_t * node)
+static int check_ctx(ibnd_ctx_t * ctx)
+{
+	if (!ctx) {
+		IBND_DEBUG("ctx must be specified\n");
+		return (-1);
+	}
+
+	return (_check_ibmad_port(ctx->ibmad_port));
+}
+
+ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
+			      ibnd_node_t * node)
 {
 	char portinfo_port0[IB_SMP_DATA_SIZE];
 	void *nd = node->nodedesc;
 	int p = 0;
+	struct ibmad_port *ibmad_port;
 
-	if (_check_ibmad_port(ibmad_port) < 0)
+	if (check_ctx(ctx) < 0)
 		return (NULL);
 
+	ibmad_port = ctx->ibmad_port;
+
 	if (!fabric) {
 		IBND_DEBUG("fabric parameter NULL\n");
 		return (NULL);
@@ -476,12 +501,12 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 	remoteport->remoteport = (ibnd_port_t *) port;
 }
 
-static int get_remote_node(struct ibmad_port *ibmad_port,
-			   ibnd_fabric_t * fabric, ibnd_node_t * node,
-			   ibnd_port_t * port, ib_portid_t * path,
-			   int portnum, int dist)
+static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
+			   ibnd_node_t * node, ibnd_port_t * port,
+			   ib_portid_t * path, int portnum, int dist)
 {
 	int rc = 0;
+	struct ibmad_port *ibmad_port = ctx->ibmad_port;
 	ibnd_node_t node_buf;
 	ibnd_port_t port_buf;
 	ibnd_node_t *remotenode, *oldnode;
@@ -525,8 +550,9 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 		goto error;
 	}
 
-	dump_endnode(path, oldnode ? "known remote" : "new remote",
-		     remotenode, remoteport);
+	if (ctx->show_progress)
+		dump_endnode(path, oldnode ? "known remote" : "new remote",
+			     remotenode, remoteport);
 
 	link_ports(node, port, remotenode, remoteport);
 
@@ -535,7 +561,7 @@ error:
 	return (rc);
 }
 
-ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
+ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 				    ib_portid_t * from, int hops)
 {
 	int rc = 0;
@@ -550,7 +576,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	ib_portid_t *path;
 	int max_hops = MAXHOPS - 1;	/* default find everything */
 
-	if (_check_ibmad_port(ibmad_port) < 0)
+	if (check_ctx(ctx) < 0)
 		return (NULL);
 
 	/* if not everything how much? */
@@ -576,7 +602,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	memset(&node_buf, 0, sizeof(node_buf));
 	memset(&port_buf, 0, sizeof(port_buf));
 
-	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, from)) {
+	if (query_node(ctx->ibmad_port, fabric, &node_buf, &port_buf, from)) {
 		IBND_DEBUG("can't reach node %s\n", portid2str(from));
 		goto error;
 	}
@@ -591,7 +617,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	if (!port)
 		goto error;
 
-	rc = get_remote_node(ibmad_port, fabric, node, port, from,
+	rc = get_remote_node(ctx, fabric, node, port, from,
 			     mad_get_field(node->info, 0,
 					   IB_NODE_LOCAL_PORT_F), 0);
 	if (rc < 0)
@@ -606,14 +632,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 			path = &node->path_portid;
 
 			IBND_DEBUG("dist %d node %p\n", dist, node);
-			dump_endnode(path, "processing", node, port);
+			if (ctx->show_progress)
+				dump_endnode(path, "processing", node, port);
 
 			for (i = 1; i <= node->numports; i++) {
 				if (i == mad_get_field(node->info, 0,
 						       IB_NODE_LOCAL_PORT_F))
 					continue;
 
-				if (get_port_info(ibmad_port, fabric,
+				if (get_port_info(ctx->ibmad_port, fabric,
 						  &port_buf, i, path)) {
 					IBND_ERROR
 					    ("can't reach node %s port %d",
@@ -637,7 +664,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 							    IB_NODE_PORT_GUID_F);
 				}
 
-				if (get_remote_node(ibmad_port, fabric, node,
+				if (get_remote_node(ctx, fabric, node,
 						    port, path, i, dist) < 0)
 					goto error;
 			}
@@ -704,9 +731,14 @@ void ibnd_debug(int i)
 	}
 }
 
-void ibnd_show_progress(int i)
+int ibnd_show_progress(ibnd_ctx_t * ctx, int i)
 {
-	show_progress = i;
+	int rc = 0;
+	if (check_ctx(ctx))
+		return (-1);
+	rc = ctx->show_progress;
+	ctx->show_progress = i;
+	return (rc);
 }
 
 void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index ba32291..8753eae 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -47,4 +47,9 @@
 #define	IBND_ERROR(fmt, ...) \
 		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
 
+struct ibnd_ctx {
+	struct ibmad_port *ibmad_port;
+	int show_progress;
+};
+
 #endif				/* _INTERNAL_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
index bd108ab..56560ec 100644
--- a/infiniband-diags/libibnetdisc/src/libibnetdisc.map
+++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
@@ -2,6 +2,8 @@ IBNETDISC_1.0 {
 	global:
 		ibnd_debug;
 		ibnd_show_progress;
+		ibnd_create_ctx;
+		ibnd_destroy_ctx;
 		ibnd_discover_fabric;
 		ibnd_destroy_fabric;
 		ibnd_find_node_guid;
diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c
index cb5651e..b121bdd 100644
--- a/infiniband-diags/libibnetdisc/test/testleaks.c
+++ b/infiniband-diags/libibnetdisc/test/testleaks.c
@@ -87,6 +87,7 @@ int main(int argc, char **argv)
 	int hops = 0;
 	ib_portid_t port_id;
 	int iters = -1;
+	ibnd_ctx_t *ctx = NULL;
 
 	struct ibmad_port *ibmad_port;
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
@@ -156,11 +157,12 @@ int main(int argc, char **argv)
 
 	mad_rpc_set_timeout(ibmad_port, timeout_ms);
 
+	ctx = ibnd_create_ctx(ibmad_port);
 	while (iters == -1 || iters-- > 0) {
 		if (from) {
 			/* only scan part of the fabric */
 			str2drpath(&(port_id.drpath), from, 0, 0);
-			if ((fabric = ibnd_discover_fabric(ibmad_port,
+			if ((fabric = ibnd_discover_fabric(ctx,
 							   &port_id,
 							   hops)) == NULL) {
 				fprintf(stderr, "discover failed\n");
@@ -170,7 +172,7 @@ int main(int argc, char **argv)
 			guid = 0;
 		} else {
 			if ((fabric =
-			     ibnd_discover_fabric(ibmad_port, NULL,
+			     ibnd_discover_fabric(ctx, NULL,
 						  -1)) == NULL) {
 				fprintf(stderr, "discover failed\n");
 				rc = 1;
@@ -182,6 +184,7 @@ int main(int argc, char **argv)
 	}
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 	exit(rc);
 }
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 5dfadee..af5be09 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -274,6 +274,7 @@ main(int argc, char **argv)
 	int rc = 0;
 	int resolved = -1;
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 	struct ibmad_port *ibmad_port;
 	ib_portid_t port_id = {0};
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
@@ -313,6 +314,8 @@ main(int argc, char **argv)
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
+	ctx = ibnd_create_ctx(ibmad_port);
+
 	if (dr_path) {
 		/* only scan part of the fabric */
 		if ((resolved = ib_resolve_portid_str_via(&port_id, dr_path, IB_DEST_DRPATH,
@@ -327,12 +330,12 @@ main(int argc, char **argv)
 	}
 
 	if (resolved >= 0)
-		if ((fabric = ibnd_discover_fabric(ibmad_port, &port_id,
+		if ((fabric = ibnd_discover_fabric(ctx, &port_id,
 				hops)) == NULL)
 			IBWARN("Single node discover failed; attempting full scan\n");
 
 	if (!fabric)
-		if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) {
+		if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
 			goto close_port;
@@ -364,6 +367,7 @@ main(int argc, char **argv)
 	ibnd_destroy_fabric(fabric);
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	close_node_name_map(node_name_map);
 	mad_rpc_close_port(ibmad_port);
 	exit(rc);
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index b04f2c6..ecb591e 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -65,6 +65,7 @@ static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
 
 static int report_max_hops = 0;
+static int show_progress = 0;
 
 /**
  * Define our own conversion functions to maintain compatibility with the old
@@ -610,7 +611,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		node_name_map_file = strdup(optarg);
 		break;
 	case 's':
-		ibnd_show_progress(1);
+		show_progress = 1;
 		break;
 	case 'l':
 		list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
@@ -643,6 +644,7 @@ static int process_opt(void *context, int ch, char *optarg)
 int main(int argc, char **argv)
 {
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 
 	struct ibmad_port *ibmad_port;
 	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
@@ -683,8 +685,14 @@ int main(int argc, char **argv)
 		IBERROR("can't open file %s for writing", argv[0]);
 
 	node_name_map = open_node_name_map(node_name_map_file);
+	ctx = ibnd_create_ctx(ibmad_port);
 
-	if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL)
+	if (!ctx)
+		IBERROR("failed to create libibnetdisc context\n");
+
+	ibnd_show_progress(ctx, show_progress);
+
+	if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL)
 		IBERROR("discover failed\n");
 
 	if (ports_report)
@@ -697,6 +705,7 @@ int main(int argc, char **argv)
 		dump_topology(group, fabric);
 
 	ibnd_destroy_fabric(fabric);
+	ibnd_destroy_ctx(ctx);
 	close_node_name_map(node_name_map);
 	mad_rpc_close_port(ibmad_port);
 	exit(0);
diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c
index 2c85423..0955415 100644
--- a/infiniband-diags/src/ibqueryerrors.c
+++ b/infiniband-diags/src/ibqueryerrors.c
@@ -392,6 +392,7 @@ main(int argc, char **argv)
 	ib_portid_t portid = {0};
 	int rc = 0;
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 
 	int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS};
 
@@ -427,6 +428,8 @@ main(int argc, char **argv)
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
+	ctx = ibnd_create_ctx(ibmad_port);
+
 	/* limit the scan the fabric around the target */
 	if (dr_path) {
 		if ((resolved = ib_resolve_portid_str_via(&portid, dr_path, IB_DEST_DRPATH,
@@ -440,12 +443,12 @@ main(int argc, char **argv)
 	}
 
 	if (resolved >= 0)
-		if ((fabric = ibnd_discover_fabric(ibmad_port, &portid,
+		if ((fabric = ibnd_discover_fabric(ctx, &portid,
 				0)) == NULL)
 			IBWARN("Single node discover failed; attempting full scan\n");
 
 	if (!fabric) /* do a full scan */
-		if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) {
+		if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
 			goto close_port;
@@ -479,6 +482,7 @@ main(int argc, char **argv)
 	ibnd_destroy_fabric(fabric);
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 	close_node_name_map(node_name_map);
 	exit(rc);
-- 
1.5.4.5
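
For readers following the API change in the patch above, here is a minimal stand-alone sketch of the pattern it introduces: per-caller state moves out of a file-scope static into a heap-allocated context, and the setter hands back the previous value. The names demo_ctx, demo_create_ctx, demo_destroy_ctx and demo_show_progress are hypothetical stand-ins, not libibnetdisc symbols, and the void pointer merely stands in for struct ibmad_port *. The library's own check_ctx() also validates the port, which this sketch skips.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ibnd_ctx: holds per-caller state. */
struct demo_ctx {
	void *port;		/* stands in for struct ibmad_port * */
	int show_progress;	/* formerly a file-scope static */
};

/* Allocate a zeroed context; returns NULL on allocation failure. */
struct demo_ctx *demo_create_ctx(void *port)
{
	struct demo_ctx *ctx = calloc(1, sizeof *ctx);

	if (!ctx)
		return NULL;
	ctx->port = port;
	return ctx;
}

/* Release the context; free(NULL) is a no-op, so NULL is accepted. */
void demo_destroy_ctx(struct demo_ctx *ctx)
{
	free(ctx);
}

/* Set the flag; return the previous setting, or -1 if ctx is NULL. */
int demo_show_progress(struct demo_ctx *ctx, int on)
{
	int prev;

	if (!ctx)
		return -1;
	prev = ctx->show_progress;
	ctx->show_progress = on;
	return prev;
}
```

Because the flag lives in the context, two contexts in the same process can be configured independently, which the old static variable did not allow.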


From weiny2 at llnl.gov  Thu Aug 13 20:43:16 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:43:16 -0700
Subject: [ofa-general] [PATCH 5/5] infiniband-diags/libibnetdisc: remove
 members of the fabric struct which are used in the scan only.
Message-ID: <20090813204316.c6ce0de3.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 13 Aug 2009 20:27:41 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only.


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |    7 --
 infiniband-diags/libibnetdisc/src/chassis.c        |   52 +++++++-------
 infiniband-diags/libibnetdisc/src/chassis.h        |    2 +-
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   80 ++++++++++++--------
 infiniband-diags/libibnetdisc/src/internal.h       |   13 +++
 5 files changed, 88 insertions(+), 66 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 65ba74f..da14942 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -130,8 +130,6 @@ typedef struct ibnd_chassis {
 #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
 #define HTSZ 137
 
-#define MAXHOPS		63
-
 /** =========================================================================
  * Fabric
  * Main fabric object which is returned and represents the data discovered
@@ -151,14 +149,9 @@ typedef struct ibnd_fabric {
 	/* internal use only */
 	ibnd_node_t *nodestbl[HTSZ];
 	ibnd_port_t *portstbl[HTSZ];
-	ibnd_node_t *nodesdist[MAXHOPS + 1];
-	ibnd_chassis_t *first_chassis;
-	ibnd_chassis_t *current_chassis;
-	ibnd_chassis_t *last_chassis;
 	ibnd_node_t *switches;
 	ibnd_node_t *ch_adapters;
 	ibnd_node_t *routers;
-	ib_portid_t selfportid;
 } ibnd_fabric_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 4886cfc..d11d7df 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -96,7 +96,7 @@ static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric,
 {
 	ibnd_chassis_t *current;
 
-	for (current = fabric->first_chassis; current; current = current->next) {
+	for (current = fabric->chassis; current; current = current->next) {
 		if (current->chassisnum == chassisnum)
 			return current;
 	}
@@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node)
 		return sysimgguid;
 }
 
-static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric,
+static ibnd_chassis_t *find_chassisguid(struct ibnd_chassis_ctx *ch_ctx,
 					ibnd_node_t * node)
 {
 	ibnd_chassis_t *current;
 	uint64_t chguid;
 
 	chguid = get_chassisguid(node);
-	for (current = fabric->first_chassis; current; current = current->next) {
+	for (current = ch_ctx->first_chassis; current; current = current->next) {
 		if (current->chassisguid == chguid)
 			return current;
 	}
@@ -782,19 +782,19 @@ static void voltaire_portmap(ibnd_port_t * port)
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }
 
-static int add_chassis(ibnd_fabric_t * fabric)
+static int add_chassis(struct ibnd_chassis_ctx *ch_ctx)
 {
-	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
+	if (!(ch_ctx->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
 		IBND_ERROR("OOM: failed to allocate chassis object\n");
 		return (-1);
 	}
 
-	if (fabric->first_chassis == NULL) {
-		fabric->first_chassis = fabric->current_chassis;
-		fabric->last_chassis = fabric->current_chassis;
+	if (ch_ctx->first_chassis == NULL) {
+		ch_ctx->first_chassis = ch_ctx->current_chassis;
+		ch_ctx->last_chassis = ch_ctx->current_chassis;
 	} else {
-		fabric->last_chassis->next = fabric->current_chassis;
-		fabric->last_chassis = fabric->current_chassis;
+		ch_ctx->last_chassis->next = ch_ctx->current_chassis;
+		ch_ctx->last_chassis = ch_ctx->current_chassis;
 	}
 	return (0);
 }
@@ -818,22 +818,22 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
 	Returns:
 	0 on success, -1 on failure
 */
-int group_nodes(ibnd_fabric_t * fabric)
+int group_nodes(struct ibnd_scan_ctx *scan_ctx, ibnd_fabric_t * fabric)
 {
 	ibnd_node_t *node;
 	int dist;
 	int chassisnum = 0;
 	ibnd_chassis_t *chassis;
+	struct ibnd_chassis_ctx ch_ctx;
 
-	fabric->first_chassis = NULL;
-	fabric->current_chassis = NULL;
+	memset(&ch_ctx, 0, sizeof ch_ctx);
 
 	/* first pass on switches and build for every Voltaire node */
 	/* an appropriate chassis record (slotnum and position) */
 	/* according to internal connectivity */
 	/* not very efficient but clear code so... */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				if (fill_voltaire_chassis_record(node))
@@ -844,7 +844,7 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* separate every Voltaire chassis from each other and build linked list of them */
 	/* algorithm: catch spine and find all surrounding nodes */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
 				continue;
@@ -852,10 +852,10 @@ int group_nodes(ibnd_fabric_t * fabric)
 			    || (node->chassis && node->chassis->chassisnum)
 			    || !is_spine(node))
 				continue;
-			if (add_chassis(fabric))
+			if (add_chassis(&ch_ctx))
 				return (-1);
-			fabric->current_chassis->chassisnum = ++chassisnum;
-			if (build_chassis(node, fabric->current_chassis))
+			ch_ctx.current_chassis->chassisnum = ++chassisnum;
+			if (build_chassis(node, ch_ctx.current_chassis))
 				return (-1);
 		}
 	}
@@ -863,25 +863,25 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* now make pass on nodes for chassis which are not Voltaire */
 	/* grouped by common SystemImageGUID */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
 			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
-				    find_chassisguid(fabric,
+				    find_chassisguid(&ch_ctx,
 						     (ibnd_node_t *) node);
 				if (chassis)
 					chassis->nodecount++;
 				else {
 					/* Possible new chassis */
-					if (add_chassis(fabric))
+					if (add_chassis(&ch_ctx))
 						return (-1);
-					fabric->current_chassis->chassisguid =
+					ch_ctx.current_chassis->chassisguid =
 					    get_chassisguid((ibnd_node_t *)
 							    node);
-					fabric->current_chassis->nodecount = 1;
+					ch_ctx.current_chassis->nodecount = 1;
 				}
 			}
 		}
@@ -890,14 +890,14 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* now, make another pass to see which nodes are part of chassis */
 	/* (defined as chassis->nodecount > 1) */
 	for (dist = 0; dist <= MAXHOPS;) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
 			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
-				    find_chassisguid(fabric,
+				    find_chassisguid(&ch_ctx,
 						     (ibnd_node_t *) node);
 				if (chassis && chassis->nodecount > 1) {
 					if (!chassis->chassisnum)
@@ -918,6 +918,6 @@ int group_nodes(ibnd_fabric_t * fabric)
 			dist++;
 	}
 
-	fabric->chassis = fabric->first_chassis;
+	fabric->chassis = ch_ctx.first_chassis;
 	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h
index 2191046..707140c 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.h
+++ b/infiniband-diags/libibnetdisc/src/chassis.h
@@ -82,6 +82,6 @@ enum ibnd_chassis_type {
 };
 enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };
 
-int group_nodes(struct ibnd_fabric *fabric);
+int group_nodes(struct ibnd_scan_ctx *scan_ctx, struct ibnd_fabric *fabric);
 
 #endif				/* _CHASSIS_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 4b320cd..14f6bf1 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -189,21 +189,27 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
-static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
+static int extend_dpath(struct ibnd_scan_ctx *scan_ctx,
+			struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
 	int rc = 0;
 
 	if (portid->lid) {
+		if (!scan_ctx) {
+			IBND_ERROR("Invalid internal scan state\n");
+			return (-1);
+		}
 		/* If we were LID routed we need to set up the drslid */
-		if (!fabric->selfportid.lid)
-			if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL,
-						ibmad_port) < 0) {
+		if (!scan_ctx->selfportid.lid)
+			if (ib_resolve_self_via
+			    (&scan_ctx->selfportid, NULL, NULL,
+			     ibmad_port) < 0) {
 				IBND_ERROR("Failed to resolve self\n");
 				return -1;
 			}
 
-		portid->drpath.drslid = (uint16_t) fabric->selfportid.lid;
+		portid->drpath.drslid = (uint16_t) scan_ctx->selfportid.lid;
 		portid->drpath.drdlid = 0xFFFF;
 	}
 
@@ -409,19 +415,25 @@ static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric)
 	}
 }
 
-static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric)
+static void add_to_nodedist(ibnd_node_t * node, struct ibnd_scan_ctx *scan_ctx)
 {
 	int dist = node->dist;
+
+	if (!scan_ctx) {
+		IBND_ERROR("Invalid internal scan state\n");
+		return;
+	}
+
 	if (node->type != IB_NODE_SWITCH)
 		dist = MAXHOPS;	/* special Ca list */
 
-	node->dnext = fabric->nodesdist[dist];
-	fabric->nodesdist[dist] = node;
+	node->dnext = scan_ctx->nodesdist[dist];
+	scan_ctx->nodesdist[dist] = node;
 }
 
-static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
-				ibnd_node_t * temp, ib_portid_t * path,
-				int dist)
+static ibnd_node_t *create_node(struct ibnd_scan_ctx *scan_ctx,
+				ibnd_fabric_t * fabric, ibnd_node_t * temp,
+				ib_portid_t * path, int dist)
 {
 	ibnd_node_t *node;
 
@@ -442,7 +454,7 @@ static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
 	fabric->nodes = (ibnd_node_t *) node;
 
 	add_to_type_list(node, fabric);
-	add_to_nodedist(node, fabric);
+	add_to_nodedist(node, scan_ctx);
 
 	return node;
 }
@@ -501,9 +513,10 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 	remoteport->remoteport = (ibnd_port_t *) port;
 }
 
-static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
-			   ibnd_node_t * node, ibnd_port_t * port,
-			   ib_portid_t * path, int portnum, int dist)
+static int get_remote_node(ibnd_ctx_t * ctx, struct ibnd_scan_ctx *scan_ctx,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
+			   ibnd_port_t * port, ib_portid_t * path,
+			   int portnum, int dist)
 {
 	int rc = 0;
 	struct ibmad_port *ibmad_port = ctx->ibmad_port;
@@ -522,7 +535,7 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
 	    != IB_PORT_PHYS_STATE_LINKUP)
 		return 1;	/* positive == non-fatal error */
 
-	if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
+	if (extend_dpath(scan_ctx, ibmad_port, fabric, path, portnum) < 0)
 		return -1;
 
 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
@@ -535,7 +548,9 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
 	oldnode = find_existing_node(fabric, &node_buf);
 	if (oldnode)
 		remotenode = oldnode;
-	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) {
+	else if (!
+		 (remotenode =
+		  create_node(scan_ctx, fabric, &node_buf, path, dist + 1))) {
 		rc = -1;
 		goto error;
 	}
@@ -575,10 +590,13 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 	int dist = 0;
 	ib_portid_t *path;
 	int max_hops = MAXHOPS - 1;	/* default find everything */
+	struct ibnd_scan_ctx scan_ctx;
 
 	if (check_ctx(ctx) < 0)
 		return (NULL);
 
+	memset(&scan_ctx, 0, sizeof scan_ctx);
+
 	/* if not everything how much? */
 	if (hops >= 0) {
 		max_hops = hops;
@@ -607,7 +625,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 		goto error;
 	}
 
-	node = create_node(fabric, &node_buf, from, 0);
+	node = create_node(&scan_ctx, fabric, &node_buf, from, 0);
 	if (!node)
 		goto error;
 
@@ -617,7 +635,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 	if (!port)
 		goto error;
 
-	rc = get_remote_node(ctx, fabric, node, port, from,
+	rc = get_remote_node(ctx, &scan_ctx, fabric, node, port, from,
 			     mad_get_field(node->info, 0,
 					   IB_NODE_LOCAL_PORT_F), 0);
 	if (rc < 0)
@@ -627,7 +645,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 
 	for (dist = 0; dist <= max_hops; dist++) {
 
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx.nodesdist[dist]; node; node = node->dnext) {
 
 			path = &node->path_portid;
 
@@ -664,14 +682,15 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 							    IB_NODE_PORT_GUID_F);
 				}
 
-				if (get_remote_node(ctx, fabric, node,
-						    port, path, i, dist) < 0)
+				if (get_remote_node
+				    (ctx, &scan_ctx, fabric, node, port, path,
+				     i, dist) < 0)
 					goto error;
 			}
 		}
 	}
 
-	if (group_nodes(fabric))
+	if (group_nodes(&scan_ctx, fabric))
 		goto error;
 
 	return ((ibnd_fabric_t *) fabric);
@@ -693,7 +712,6 @@ static void destroy_node(ibnd_node_t * node)
 
 void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 {
-	int dist = 0;
 	ibnd_node_t *node = NULL;
 	ibnd_node_t *next = NULL;
 	ibnd_chassis_t *ch, *ch_next;
@@ -701,19 +719,17 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 	if (!fabric)
 		return;
 
-	ch = fabric->first_chassis;
+	ch = fabric->chassis;
 	while (ch) {
 		ch_next = ch->next;
 		free(ch);
 		ch = ch_next;
 	}
-	for (dist = 0; dist <= MAXHOPS; dist++) {
-		node = fabric->nodesdist[dist];
-		while (node) {
-			next = node->dnext;
-			destroy_node(node);
-			node = next;
-		}
+	node = fabric->nodes;
+	while (node) {
+		next = node->next;
+		destroy_node(node);
+		node = next;
 	}
 	free(fabric);
 }
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 8753eae..cf0b4bc 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -47,6 +47,19 @@
 #define	IBND_ERROR(fmt, ...) \
 		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
 
+#define MAXHOPS		63
+
+struct ibnd_chassis_ctx {
+	ibnd_chassis_t *first_chassis;
+	ibnd_chassis_t *current_chassis;
+	ibnd_chassis_t *last_chassis;
+};
+
+struct ibnd_scan_ctx {
+	ibnd_node_t *nodesdist[MAXHOPS + 1];
+	ib_portid_t selfportid;
+};
+
 struct ibnd_ctx {
 	struct ibmad_port *ibmad_port;
 	int show_progress;
-- 
1.5.4.5
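
The patch above carries no description, so as a rough illustration of the idea named in its subject, here is a stand-alone sketch of keeping scan-only bookkeeping out of the object returned to the caller. All names (demo_node, demo_fabric, demo_scan_ctx, demo_scan_init, demo_add_node, demo_destroy_nodes, DEMO_MAXHOPS) are hypothetical and only mirror the shape of the change; this is not the library code.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_MAXHOPS 63

struct demo_node {
	struct demo_node *next;		/* all-nodes list, kept in the result */
	struct demo_node *dnext;	/* per-distance list, scan only */
	int dist;
};

/* Result object: only what callers need once the scan is done. */
struct demo_fabric {
	struct demo_node *nodes;
};

/* Scratch state: lives on the scanner's stack and is discarded afterwards. */
struct demo_scan_ctx {
	struct demo_node *nodesdist[DEMO_MAXHOPS + 1];
};

/* Clear all per-distance buckets before a scan starts. */
void demo_scan_init(struct demo_scan_ctx *scan)
{
	memset(scan, 0, sizeof *scan);
}

/* Create a node and link it into the result list and the scan buckets. */
struct demo_node *demo_add_node(struct demo_scan_ctx *scan,
				struct demo_fabric *fabric, int dist)
{
	struct demo_node *node;

	if (!scan || !fabric || dist < 0 || dist > DEMO_MAXHOPS)
		return NULL;
	node = calloc(1, sizeof *node);
	if (!node)
		return NULL;
	node->dist = dist;
	node->next = fabric->nodes;
	fabric->nodes = node;
	node->dnext = scan->nodesdist[dist];
	scan->nodesdist[dist] = node;
	return node;
}

/* Free every node through the result list; returns how many were freed. */
int demo_destroy_nodes(struct demo_fabric *fabric)
{
	struct demo_node *node = fabric->nodes;
	struct demo_node *next;
	int freed = 0;

	while (node) {
		next = node->next;
		free(node);
		node = next;
		freed++;
	}
	fabric->nodes = NULL;
	return freed;
}
```

Since teardown walks only the next list, the result can be freed long after the per-distance buckets have gone out of scope, which is the property the destroy path in the patch relies on.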


From vlad at lists.openfabrics.org  Fri Aug 14 03:02:01 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 14 Aug 2009 03:02:01 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090814-0200 daily build status
Message-ID: <20090814100201.890FB1020236@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hnrose at comcast.net  Fri Aug 14 04:31:43 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 14 Aug 2009 07:31:43 -0400
Subject: [ofa-general] [PATCHv4] IB/mad: Allow tuning of QP0 and QP1 sizes
Message-ID: <20090814113143.GA18401@comcast.net>


IB/mad: Allow tuning of QP0 and QP1 sizes

MADs are UD and can be dropped if there are no receives posted.
Send side tuning is done for symmetry with receive.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v3:
Reverted module parameter permissions to 0444

Changes since v2:
Removed roundup_pow_of_two of receive and send sizes
Changed module parameter permissions to 0644

Changes since v1:
Added changelog

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..d1127ec 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
 MODULE_AUTHOR("Sean Hefty");
 
+int mad_sendq_size = IB_MAD_QP_SEND_SIZE;
+int mad_recvq_size = IB_MAD_QP_RECV_SIZE;
+
+module_param_named(send_queue_size, mad_sendq_size, int, 0444);
+MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
+module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
+MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 	qp_init_attr.send_cq = qp_info->port_priv->cq;
 	qp_init_attr.recv_cq = qp_info->port_priv->cq;
 	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
-	qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE;
-	qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE;
+	qp_init_attr.cap.max_send_wr = mad_sendq_size;
+	qp_init_attr.cap.max_recv_wr = mad_recvq_size;
 	qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
 	qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
 	qp_init_attr.qp_type = qp_type;
@@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 		goto error;
 	}
 	/* Use minimum queue sizes unless the CQ is resized */
-	qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE;
-	qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE;
+	qp_info->send_queue.max_active = mad_sendq_size;
+	qp_info->recv_queue.max_active = mad_recvq_size;
 	return 0;
 
 error:
@@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
-	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	cq_size = (mad_sendq_size + mad_recvq_size) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2984,6 +2993,12 @@ static int __init ib_mad_init_module(void)
 {
 	int ret;
 
+	mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
+	mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
+
+	mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
+	mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
+
 	spin_lock_init(&ib_mad_port_list_lock);
 
 	ib_mad_cache = kmem_cache_create("ib_mad",
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 05ce331..9430ab4 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -49,6 +50,8 @@
 /* QP and CQ parameters */
 #define IB_MAD_QP_SEND_SIZE	128
 #define IB_MAD_QP_RECV_SIZE	512
+#define IB_MAD_QP_MIN_SIZE	64
+#define IB_MAD_QP_MAX_SIZE	8192
 #define IB_MAD_SEND_REQ_MAX_SG	2
 #define IB_MAD_RECV_REQ_MAX_SG	1
 

From hnrose at comcast.net  Fri Aug 14 04:56:07 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 14 Aug 2009 07:56:07 -0400
Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: Fix endian of status
	in error message
Message-ID: <20090814115607.GA18583@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c
index 791c848..c211bf8 100644
--- a/opensm/opensm/osm_sm_mad_ctrl.c
+++ b/opensm/opensm/osm_sm_mad_ctrl.c
@@ -637,7 +637,7 @@ static void sm_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw,
 
 	if (status != 0) {
 		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3111: "
-			"Error status = 0x%X\n", status);
+			"Error status = 0x%X\n", cl_ntoh16(status));
 		osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
 	}
 

From hnrose at comcast.net  Fri Aug 14 04:58:23 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 14 Aug 2009 07:58:23 -0400
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: In osm_dump_dr_smp,
	fix endian of status
Message-ID: <20090814115823.GB18583@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 1b16ad9..23392a4 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -1911,7 +1911,7 @@ void osm_dump_dr_smp(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
 				      "\t\t\t\tD bit...................0x%X\n"
 				      "\t\t\t\tstatus..................0x%X\n",
 				      ib_smp_is_d(p_smp),
-				      ib_smp_get_status(p_smp));
+				      cl_ntoh16(ib_smp_get_status(p_smp)));
 		} else {
 			n += snprintf(buf + n, sizeof(buf) - n,
 				      "\t\t\t\tstatus..................0x%X\n",


From hnrose at comcast.net  Fri Aug 14 07:11:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 14 Aug 2009 10:11:32 -0400
Subject: [ofa-general] [PATCH] opensm/libvendor/osm_vendor_ibumad.c: Handle
	umad_alloc failure in osm_vendor_get
Message-ID: <20090814141132.GA31087@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
index a551493..e5f0c54 100644
--- a/opensm/libvendor/osm_vendor_ibumad.c
+++ b/opensm/libvendor/osm_vendor_ibumad.c
@@ -973,7 +973,7 @@ ib_mad_t *osm_vendor_get(IN osm_bind_handle_t h_bind,
 		"Acquired UMAD %p, size = %u\n", p_vw->umad, p_vw->size);
 
 	OSM_LOG_EXIT(p_vend->p_log);
-	return umad_get_mad(p_vw->umad);
+	return (p_vw->umad ? umad_get_mad(p_vw->umad) : NULL);
 }
 
 /**********************************************************************


From hal.rosenstock at gmail.com  Fri Aug 14 07:24:10 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 14 Aug 2009 10:24:10 -0400
Subject: [ofa-general] will opensm respond to requests that do not 
	originate from qp1
In-Reply-To: <f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
Message-ID: <f0e08f230908140724o78977c77gb329da564605d34f@mail.gmail.com>

On Thu, Aug 13, 2009 at 3:41 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

>
>
>  On 8/13/09, Sean Hefty <sean.hefty at intel.com> wrote:
>>
>> Does anyone know off the top of their heads if opensm will respond
>> correctly to
>> SA MADs that are not sent from QP1?
>
>
> I don't have the code in front of me right now (I can validate tomorrow)
> but don't think that should be a problem as for responses it just takes the
> incoming source QP and uses that for the dest QP.
>

Based on a code audit, I've confirmed that this should work
(osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure
it's been tried for SA but it has been exercised for other GS classes
(sending to some QP other than QP1).

-- Hal


>  Are you suspecting some issue here ?
>
> -- Hal
>
> - Sean
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>
>

From hnrose at comcast.net  Fri Aug 14 07:52:49 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 14 Aug 2009 10:52:49 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Fix typo in option
	name
Message-ID: <20090814145249.GA8448@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 330c6aa..313f9a7 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1638,7 +1638,7 @@ int main(int argc, char **argv)
 		{"reversible", 'r', 1, NULL, "Reversible path (PathRecord)"},
 		{"numb_path", 'n', 1, NULL, "Number of paths (PathRecord)"},
 		{"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)"},
-		{"qos_calss", 'Q', 1, NULL, "QoS Class (PathRecord)"},
+		{"qos_class", 'Q', 1, NULL, "QoS Class (PathRecord)"},
 		{"sl", 19, 1, NULL,
 		 "Service level (PathRecord, MCMemberRecord)"},
 		{"mtu", 'M', 1, NULL,


From nmehrotra at riorey.com  Fri Aug 14 08:42:42 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Fri, 14 Aug 2009 10:42:42 -0500 (GMT-05:00)
Subject: [ofa-general] Help - RDMA event files remain open after
	acknowledging them
In-Reply-To: <940427960.2111250264532375.JavaMail.root@zmail.riorey.com>
Message-ID: <822888568.2131250264562447.JavaMail.root@zmail.riorey.com>

Folks,

This may be a newbie question but I can't seem to find the answer and I'm hoping someone can point me in the right direction.

I'm building an IB application where the two ends are required to robustly connect when present. Either of the ends may fail for extended periods of time and the other needs to handle this and reconnect when the peer recovers. The server is trivial since it passively listens for connections but the client is giving me some trouble.

I have used a model similar to the one described in http://linux.die.net/man/7/rdma_cm. The general connection flow on the client is rdma_create_id/rdma_resolve_addr/rdma_create_qp/rdma_resolve_route/rdma_connect, handling the events as appropriate. This works when the peer (server) is present. However, when the server is not present I have observed that rdma_resolve_addr and rdma_resolve_route succeed (since the local HCA and SM are present) and then I get an RDMA_CM_EVENT_REJECTED or an RDMA_CM_EVENT_UNREACHABLE event. At this point I delete the IB resources allocated between steps 1 & 2 (QP, CQE, CQ, etc) and restart from rdma_resolve_addr. As an aside, I found that I could not just restart from rdma_resolve_route: that returned error EINVAL, so I had to restart from rdma_resolve_addr.

The problem I am facing is that every RDMA event I receive (from uverbs, it appears) seems to create a special file that is linked to "infinibandevent:". See below. However, even though I am careful to acknowledge every RDMA event I receive (rdma_ack_cm_event for every rdma_get_cm_event), these files don't get closed or deleted, so that eventually the application fails with error EMFILE (too many open files) when trying to create the completion event queue (as part of creating the QP).

What am I doing wrong? Is there something more I need to do than calling rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be closed? As an fyi, I have even tried closing the rdma_id and destroying the event channel when the connection fails to force the event files to be closed, without success.

Btw, this is a user space application and I am using OFED 1.4.1 on Linux 2.6.27 (gentoo distro). It should be irrelevant but just in case, this is using a Mellanox HCA, both peers are on a local subnet with only one IB interface per peer.

Thanks,

Nitin

filter-1 ib # ls -l /proc/8072/fd
total 0
lrwx------ 1 root root 64 Aug 14 06:44 0 -> /dev/pts/0
lrwx------ 1 root root 64 Aug 14 06:44 1 -> /dev/pts/0
lr-x------ 1 root root 64 Aug 14 06:44 10 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 11 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 12 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 13 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 14 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 15 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 16 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 17 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 18 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 19 -> infinibandevent:
lrwx------ 1 root root 64 Aug 14 06:44 2 -> /dev/pts/0
lr-x------ 1 root root 64 Aug 14 06:44 20 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 21 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 22 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 23 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 24 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 25 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 26 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 27 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 28 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 29 -> infinibandevent:
lrwx------ 1 root root 64 Aug 14 06:44 3 -> socket:[223603]
lr-x------ 1 root root 64 Aug 14 06:44 30 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 31 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 32 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 33 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 34 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 35 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 36 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 37 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 38 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 39 -> infinibandevent:

These grow until 999 files and then the app fails.


From sean.hefty at intel.com  Fri Aug 14 09:34:15 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Aug 2009 09:34:15 -0700
Subject: [ofa-general] Help - RDMA event files remain open
	after	acknowledging them
In-Reply-To: <822888568.2131250264562447.JavaMail.root@zmail.riorey.com>
References: <940427960.2111250264532375.JavaMail.root@zmail.riorey.com>
	<822888568.2131250264562447.JavaMail.root@zmail.riorey.com>
Message-ID: <D10B9B28FE5A49C8A4C76B617B574FF5@amr.corp.intel.com>

>What am I doing wrong? Is there something more I need to do than calling
>rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be
>closed? As an fyi, I have even tried closing the rdma_id and destroying the
>event channel when the connection fails to force the event files to be closed
>without success.

The following calls result in opening files to the kernel:

ibv_create_comp_channel() - used to report cq events
rdma_create_event_channel() - used to report rdma cm events

Be sure that there are corresponding calls to:

ibv_destroy_comp_channel()
rdma_destroy_event_channel()

These are the calls that close the opened files.

- Sean


From sean.hefty at intel.com  Fri Aug 14 10:15:23 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Aug 2009 10:15:23 -0700
Subject: [ofa-general] will opensm respond to requests that do not
	originate from qp1
In-Reply-To: <f0e08f230908140724o78977c77gb329da564605d34f@mail.gmail.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
	<f0e08f230908140724o78977c77gb329da564605d34f@mail.gmail.com>
Message-ID: <F7841423EDFB4E62921E0A5F188D44B5@amr.corp.intel.com>

>Based on a code audit, I've confirmed that this should work
>(osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure
>it's been tried for SA but it has been exercised for other GS classes (sending
>to some QP other than QP1).

Thanks for checking and pointing me at the right source file.


From robertacummins at gmail.com  Fri Aug 14 10:16:14 2009
From: robertacummins at gmail.com (Robert Cummins)
Date: Fri, 14 Aug 2009 11:16:14 -0600
Subject: [ofa-general] RHEL 5.3 (2.6.18-128.1.1.el5 kernel) and connected
	mode
Message-ID: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com>

Hello,

I have a customer that is experiencing a problem with IB.  Specifically, when placing
the Infinihost III card in connected mode with 'echo connected
> /sys/class/net/ib0/mode' some nodes stop responding.  By 'stop
responding' I mean:

  - ping <ib ip address> doesn't work (no packets returned; 100% packet
loss)
  - ib_rdma_bw -b node never runs
  - ibping does work

Since the customer is mounting their NFS server over IB, NFS services
stop working when in connected mode.  What is interesting is that if I leave
the NFS server in datagram mode, then the affected nodes can still
interact with the NFS server, i.e., NFS service continues to work, but I
cannot communicate over IB with other nodes that are also in connected
mode.

At first I thought this was only a problem with IPoIB.  I note the
following difference between nodes that do not work in connected mode
and nodes that do.  The first output is from a node that stops working,
the second from a node that continues to work.

[root at ws3 ~]# modinfo ib_ipoib
filename:       /lib/modules/2.6.18-128.el5/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:        Dual BSD/GPL
description:    IP-over-InfiniBand net driver
author:         Roland Dreier
srcversion:     E3C28A100A995101E2AB934
depends:        ib_cm,ipv6,ib_core,ib_sa
vermagic:       2.6.18-128.el5 SMP mod_unload gcc-4.1
parm:           max_nonsrq_conn_qp:Max number of connected-mode QPs per
interface (applied only if shared receive queue is not available) (int)
parm:           set_nonsrq:set to dictate working in none SRQ mode,
otherwise act according to device capabilities (int)
parm:           mcast_debug_level:Enable multicast debug tracing if > 0
(int)
parm:           send_queue_size:Number of descriptors in send queue
(int)
parm:           recv_queue_size:Number of descriptors in receive queue
(int)
parm:           debug_level:Enable debug tracing if > 0 (int)
module_sig:
883f35049492f615cdc734e64d24fa112659309d1b9619270a5e84a97a46cbc6e4ac0908b21f20a0a75b803bc72eba1ce62d2a8eec53fd9c2d7288c
[root at ws3 ~]# 


[root at scyld ~]# modinfo ib_ipoib
filename:       /lib/modules/2.6.18-128.1.1.el5.530g0000/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:        Dual BSD/GPL
description:    IP-over-InfiniBand net driver
author:         Roland Dreier
srcversion:     8E47481E21B330BFE32B7CE
depends:        ib_cm,ipv6,ib_core,ib_sa
vermagic:       2.6.18-128.1.1.el5.530g0000 SMP mod_unload gcc-4.1
parm:           max_nonsrq_conn_qp:Max number of connected-mode QPs per
interface (applied only if shared receive queue is not available) (int)
parm:           set_nonsrq:set to dictate working in none SRQ mode,
otherwise act according to device capabilities (int)
parm:           mcast_debug_level:Enable multicast debug tracing if > 0
(int)
parm:           send_queue_size:Number of descriptors in send queue
(int)
parm:           recv_queue_size:Number of descriptors in receive queue
(int)
parm:           debug_level:Enable debug tracing if > 0 (int)
module_sig:
883f35049c0555e56ccec1c0ba19c3112535c09b5f5dbc8607465f947d60f2be7fa26132d43309f5dc241bebfe2f2f88fc7c93fbe5ea12cd721a59
[root at scyld ~]# 

However, after retesting with ib_rdma_bw I can see that even the verbs
layer is not working.

I have not tried using the ib_ipoib.ko from the 'working' configuration
in the non-working system since I assumed it would not load due to the
slight kernel difference.

It should be noted that I have four nodes that fail and nearly 20
that 'work'.   The failing nodes are running the same kernel
(2.6.18-128.el5) while the working nodes are running the
2.6.18-128.1.1.el5 kernel.  I am at a loss as to how to proceed with
debugging this short of getting the latest OFED distro and building it.

Has anyone else run into this problem and if so, how did you get around
it?  

TIA


R.


From nmehrotra at riorey.com  Fri Aug 14 12:02:21 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Fri, 14 Aug 2009 14:02:21 -0500 (GMT-05:00)
Subject: How to destroy IB resources (was Re: [ofa-general] Help - RDMA
	event files remain open after acknowledging them)
In-Reply-To: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
Message-ID: <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>

Sean,

Thanks for your reply. It turns out the problem is the file created by the ibv_create_comp_channel() call. I do make sure to call the destroy call for each create call; the problem is that the destroy is failing with error 16 (EBUSY, device or resource busy) and I missed that fact.

So this brings me to another newbie question which I haven't been able to completely solve and that is how to cleanly and successfully destroy all IB resources. Since this is a new subject I have changed the thread subject appropriately.

I init IB as follows:

- ibv_create_comp_channel()
- make ccc_fd non-blocking
- ibv_create_cq()
- ibv_req_notify_cq()
- ibv_alloc_pd()
- ibv_create_qp()

I shut down in the reverse order:
- drain_cq()
- ibv_destroy_qp()
- ibv_dealloc_pd()
- ibv_destroy_cq()
- ibv_destroy_comp_channel()

my drain_cq() function is:
loop:
    - ibv_get_cq_event()
    - ibv_ack_cq_events() all unacknowledged events pending, if any
    - ibv_req_notify_cq()
    - ibv_poll_cq()
until either ibv_get_cq_event() returns an error, ibv_poll_cq() returns 0 completions or I have looped the depth of the cq.

It works, in a fashion. Without the drain function ibv_destroy_cq() hangs. However now, ibv_get_cq_event() returns EAGAIN continuously, so I exit after I have looped the depth of the cq, and then ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return error EBUSY. This leaves the file open in the system.

I guess my question is, what's the best way to destroy IB resources? (Perhaps even, what's the best way to init them in the first place).

Thanks,

Nitin
----- Original Message -----
From: "Sean Hefty" <sean.hefty at intel.com>
To: "Nitin Mehrotra" <nmehrotra at riorey.com>, general at lists.openfabrics.org
Sent: Friday, August 14, 2009 12:34:15 PM GMT -05:00 US/Canada Eastern
Subject: RE: [ofa-general] Help - RDMA event files remain open after	acknowledging them

>What am I doing wrong? Is there something more I need to do than calling
>rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be
>closed? As an fyi, I have even tried closing the rdma_id and destroying the
>event channel when the connection fails to force the event files to be closed
>without success.

The following calls result in opening files to the kernel:

ibv_create_comp_channel() - used to report cq events
rdma_create_event_channel() - used to report rdma cm events

Be sure that there are corresponding calls to:

ibv_destroy_comp_channel()
rdma_destroy_event_channel()

These are the calls that close the opened files.

- Sean


From sean.hefty at intel.com  Fri Aug 14 12:45:36 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Aug 2009 12:45:36 -0700
Subject: How to destroy IB resources (was Re: [ofa-general] Help -
	RDMA	event files remain open after acknowledging them)
In-Reply-To: <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
	<1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
Message-ID: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>

>I guess my question is, what's the best way to destroy IB resources? (Perhaps
>even, what's the best way to init them in the first place).

If you're destroying the CQ, there's no need to call ibv_get_cq_event() or
ibv_poll_cq(), unless you need completion information (for example, from flushed
receives).

However, every successful call to ibv_get_cq_event() needs a corresponding call
to ibv_ack_cq_events().  You can call ack(1) for each cq event, or count the
number of times that get returns success and call ack(get_cnt) once before
calling destroy.  Note that the count refers to the number of cq events, and not
the number of completions returned through ibv_poll_cq.

For your drain_cq() function, you should be safe doing something like this:

while (ibv_poll_cq(...) > 0)
	/* optional processing of any left over completions */;

ibv_ack_cq_events(...this_cqs_total_event_cnt); /* or ack after get */
ibv_destroy_cq(...);

>ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return
>error EBUSY

This sounds like a QP isn't being destroyed.  I'm not sure that anything else
fails CQ destruction with EBUSY.

Btw, if you're using the rdma_cm interface, then it's simpler to use the
rdma_create_qp/rdma_destroy_qp calls, which allows the rdma_cm to perform the QP
state transitions for you.

- Sean


From rdreier at cisco.com  Fri Aug 14 15:15:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 14 Aug 2009 15:15:44 -0700
Subject: [ofa-general] [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adazla76rh4.fsf@cisco.com> (Roland Dreier's message of "Mon, 10
	Aug 2009 18:59:03 -0700")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com>
Message-ID: <adabpmi3uun.fsf_-_@cisco.com>

How about this approach?  Basically it just open-codes delayed work by
splitting the timer and the work struct, and switches to mod_timer()
instead of del_timer() + add_timer().  It passes very light testing here
(basically I started ipoib and nothing blew up).
---
 drivers/infiniband/core/mad.c      |   51 +++++++++++++++++------------------
 drivers/infiniband/core/mad_priv.h |    3 +-
 2 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5cef8f8..16ff496 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -174,6 +174,15 @@ int ib_response_mad(struct ib_mad *mad)
 }
 EXPORT_SYMBOL(ib_response_mad);
 
+static void timeout_callback(unsigned long data)
+{
+	struct ib_mad_agent_private *mad_agent_priv =
+		(struct ib_mad_agent_private *) data;
+
+	queue_work(mad_agent_priv->qp_info->port_priv->wq,
+		   &mad_agent_priv->timeout_work);
+}
+
 /*
  * ib_register_mad_agent - Register to send/receive MADs
  */
@@ -305,7 +314,9 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
 	INIT_LIST_HEAD(&mad_agent_priv->done_list);
 	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
-	INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends);
+	INIT_WORK(&mad_agent_priv->timeout_work, timeout_sends);
+	setup_timer(&mad_agent_priv->timeout_timer, timeout_callback,
+		    (unsigned long) mad_agent_priv);
 	INIT_LIST_HEAD(&mad_agent_priv->local_list);
 	INIT_WORK(&mad_agent_priv->local_work, local_completions);
 	atomic_set(&mad_agent_priv->refcount, 1);
@@ -512,7 +523,8 @@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv)
 	 */
 	cancel_mads(mad_agent_priv);
 	port_priv = mad_agent_priv->qp_info->port_priv;
-	cancel_delayed_work(&mad_agent_priv->timed_work);
+	del_timer_sync(&mad_agent_priv->timeout_timer);
+	cancel_work_sync(&mad_agent_priv->timeout_work);
 
 	spin_lock_irqsave(&port_priv->reg_lock, flags);
 	remove_mad_reg_req(mad_agent_priv);
@@ -1970,10 +1982,9 @@ out:
 static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 {
 	struct ib_mad_send_wr_private *mad_send_wr;
-	unsigned long delay;
 
 	if (list_empty(&mad_agent_priv->wait_list)) {
-		cancel_delayed_work(&mad_agent_priv->timed_work);
+		del_timer(&mad_agent_priv->timeout_timer);
 	} else {
 		mad_send_wr = list_entry(mad_agent_priv->wait_list.next,
 					 struct ib_mad_send_wr_private,
@@ -1982,13 +1993,8 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 		if (time_after(mad_agent_priv->timeout,
 			       mad_send_wr->timeout)) {
 			mad_agent_priv->timeout = mad_send_wr->timeout;
-			cancel_delayed_work(&mad_agent_priv->timed_work);
-			delay = mad_send_wr->timeout - jiffies;
-			if ((long)delay <= 0)
-				delay = 1;
-			queue_delayed_work(mad_agent_priv->qp_info->
-					   port_priv->wq,
-					   &mad_agent_priv->timed_work, delay);
+			mod_timer(&mad_agent_priv->timeout_timer,
+				  mad_send_wr->timeout);
 		}
 	}
 }
@@ -2015,17 +2021,14 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 				       temp_mad_send_wr->timeout))
 				break;
 		}
-	}
-	else
+	} else
 		list_item = &mad_agent_priv->wait_list;
 	list_add(&mad_send_wr->agent_list, list_item);
 
 	/* Reschedule a work item if we have a shorter timeout */
-	if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) {
-		cancel_delayed_work(&mad_agent_priv->timed_work);
-		queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq,
-				   &mad_agent_priv->timed_work, delay);
-	}
+	if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list)
+		mod_timer(&mad_agent_priv->timeout_timer,
+			  mad_send_wr->timeout);
 }
 
 void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr,
@@ -2469,10 +2472,10 @@ static void timeout_sends(struct work_struct *work)
 	struct ib_mad_agent_private *mad_agent_priv;
 	struct ib_mad_send_wr_private *mad_send_wr;
 	struct ib_mad_send_wc mad_send_wc;
-	unsigned long flags, delay;
+	unsigned long flags;
 
 	mad_agent_priv = container_of(work, struct ib_mad_agent_private,
-				      timed_work.work);
+				      timeout_work);
 	mad_send_wc.vendor_err = 0;
 
 	spin_lock_irqsave(&mad_agent_priv->lock, flags);
@@ -2482,12 +2485,8 @@ static void timeout_sends(struct work_struct *work)
 					 agent_list);
 
 		if (time_after(mad_send_wr->timeout, jiffies)) {
-			delay = mad_send_wr->timeout - jiffies;
-			if ((long)delay <= 0)
-				delay = 1;
-			queue_delayed_work(mad_agent_priv->qp_info->
-					   port_priv->wq,
-					   &mad_agent_priv->timed_work, delay);
+			mod_timer(&mad_agent_priv->timeout_timer,
+				  mad_send_wr->timeout);
 			break;
 		}
 
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 05ce331..1526fa2 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -99,7 +99,8 @@ struct ib_mad_agent_private {
 	struct list_head send_list;
 	struct list_head wait_list;
 	struct list_head done_list;
-	struct delayed_work timed_work;
+	struct work_struct timeout_work;
+	struct timer_list timeout_timer;
 	unsigned long timeout;
 	struct list_head local_list;
 	struct work_struct local_work;
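
For readers comparing the two versions: the removed hunks turned the
absolute deadline in mad_send_wr->timeout into a relative delay before
calling queue_delayed_work(), whereas mod_timer() takes the absolute
expiry directly. The arithmetic can be modelled in userspace roughly as
below. This is only a sketch: model_time_after() and
model_relative_delay() are hypothetical names, and plain unsigned long
arguments stand in for jiffies.

```c
/*
 * Userspace stand-in for the kernel's time_after(a, b): true when a is
 * later than b, even if the tick counter has wrapped in between.
 */
static int model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * What the removed hunks computed before queue_delayed_work(): a
 * relative delay in ticks, clamped to at least 1 when the deadline
 * has already passed.
 */
static unsigned long model_relative_delay(unsigned long deadline,
					  unsigned long now)
{
	unsigned long delay = deadline - now;

	if ((long)delay <= 0)
		delay = 1;
	return delay;
}
```

With a timer that accepts an absolute expiry, the second helper is no
longer needed, which is why the patch can drop the local "delay"
variables.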


From nmehrotra at riorey.com  Fri Aug 14 16:41:08 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Fri, 14 Aug 2009 19:41:08 -0400
Subject: How to destroy IB resources (was Re: [ofa-general] Help - RDMA
	event files remain open after acknowledging them)
In-Reply-To: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
	<1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
	<9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>
Message-ID: <4A85F614.7040106@riorey.com>


Hmm, I do amortize the cost of ibv_ack_cq_events() over multiple
ibv_get_cq_event() calls; however, when the shutdown is in progress I
don't have the last event that was "gotten", so I have to call
ibv_get_cq_event() one last time to get an event to acknowledge against.
I suppose it's probably better to keep the last event processed if it
hasn't been acknowledged and use it to issue the final acknowledge when
shutting down. Then I wouldn't have to make that ibv_get_cq_event() call.

One last question: when I create the completion event channel I set it to
non-blocking, but I find that during shutdown I have to do that again
before making the final call to ibv_get_cq_event(), otherwise it blocks.
I suppose being non-blocking is why it returns EAGAIN when there are no
pending events, but I don't understand why I have to set it to
non-blocking again.

Anyway, much thanks for all your help.

Nitin
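
The bookkeeping described above (batching acknowledgements, then
acknowledging whatever is still owed at shutdown without another
ibv_get_cq_event() call) could be sketched as a small counter. The
struct and function names here are hypothetical and not part of
libibverbs; the verbs calls themselves appear only in comments.

```c
/* Hypothetical per-CQ accounting; not a libibverbs type. */
struct cq_event_acct {
	unsigned int unacked;	/* successful gets not yet acknowledged */
	unsigned int batch;	/* acknowledge once this many accumulate */
};

/*
 * Call after each successful ibv_get_cq_event().  Returns the number of
 * events the caller should now pass to ibv_ack_cq_events(), or 0 if the
 * acknowledgement can be deferred.
 */
static unsigned int cq_event_got(struct cq_event_acct *a)
{
	unsigned int n;

	a->unacked++;
	if (a->unacked < a->batch)
		return 0;
	n = a->unacked;
	a->unacked = 0;
	return n;
}

/*
 * Call once at shutdown, before ibv_destroy_cq().  Returns the number
 * of events still owed, which may be 0.
 */
static unsigned int cq_event_flush(struct cq_event_acct *a)
{
	unsigned int n = a->unacked;

	a->unacked = 0;
	return n;
}
```

In the event loop the caller would do something like
"n = cq_event_got(&acct); if (n) ibv_ack_cq_events(cq, n);", and at
shutdown "n = cq_event_flush(&acct); if (n) ibv_ack_cq_events(cq, n);"
followed by ibv_destroy_cq(cq).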

Sean Hefty wrote:
>> I guess my question is, what's the best way to destroy IB resources? (Perhaps
>> even, what's the best way to init them in the first place).
>>     
>
> If you're destroying the CQ, there's no need to call ibv_get_cq_event() or
> ibv_poll_cq(), unless you need completion information (for example, from flushed
> receives).
>
> However, every successful call to ibv_get_cq_event() needs a corresponding call
> to ibv_ack_cq_events().  You can call ack(1) for each cq event, or count the
> number of times that get returns success and call ack(get_cnt) once before
> calling destroy.  Note that the count refers to the number of cq events, and not
> the number of completions returned through ibv_poll_cq.
>
> For your drain_cq() function, you should be safe doing something like this:
>
> while (ibv_poll_cq(...) > 0)
> 	/* optional processing of any left over completions */;
>
> ibv_ack_cq_events(...this_cqs_total_event_cnt); /* or ack after get */
> ibv_destroy_cq(...);
>
>   
>> ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return
>> error EBUSY
>>     
>
> This sounds like a QP isn't being destroyed.  I'm not sure that anything else
> fails CQ destruction with EBUSY.
>
> Btw, if you're using the rdma_cm interface, then it's simpler to use the
> rdma_create_qp/rdma_destroy_qp calls, which allows the rdma_cm to perform the QP
> state transitions for you.
>
> - Sean


From sean.hefty at intel.com  Fri Aug 14 21:36:32 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Aug 2009 21:36:32 -0700
Subject: [ofa-general] RE: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adabpmi3uun.fsf_-_@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>	<adavdm0weue.fsf@cisco.com>	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>	<adar5wow9r7.fsf@cisco.com>	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>	<adaws6bt8lf.fsf@cisco.com>	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>	<adatz0mi03d.fsf@cisco.com>	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>	<adaws5gg71x.fsf@cisco.com>	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>	<ada8whr8kf7.fsf@cisco.com>	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>	<adazla76rh4.fsf@cisco.com>
	<adabpmi3uun.fsf_-_@cisco.com>
Message-ID: <956908184AA146C5B1131524100A0EFE@amr.corp.intel.com>

>How about this approach?  Basically it just open-codes delayed work by
>splitting the timer and the work struct, and switches to mod_timer()
>instead of del_timer() + add_timer().  It passes very light testing here
>(basically I started ipoib and nothing blew up).

The approach looks okay to me. 

>@@ -512,7 +523,8 @@ static void unregister_mad_agent(struct
>ib_mad_agent_private *mad_agent_priv)
> 	 */
> 	cancel_mads(mad_agent_priv);
> 	port_priv = mad_agent_priv->qp_info->port_priv;
>-	cancel_delayed_work(&mad_agent_priv->timed_work);
>+	del_timer_sync(&mad_agent_priv->timeout_timer);
>+	cancel_work_sync(&mad_agent_priv->timeout_work);

I had to check if there was a race between del_timer_sync() and the worker
thread, but the call to cancel_mads() looks like it prevents any issues.

- Sean


From rdreier at cisco.com  Fri Aug 14 21:59:17 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 14 Aug 2009 21:59:17 -0700
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <956908184AA146C5B1131524100A0EFE@amr.corp.intel.com> (Sean
	Hefty's message of "Fri, 14 Aug 2009 21:36:32 -0700")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<adavdm0weue.fsf@cisco.com>
	<e2e108260907101229i2f81cd50w859563357a835cce@mail.gmail.com>
	<adar5wow9r7.fsf@cisco.com>
	<e2e108260907110343w9d0377sc5676cec4aa00398@mail.gmail.com>
	<adaws6bt8lf.fsf@cisco.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
	<956908184AA146C5B1131524100A0EFE@amr.corp.intel.com>
Message-ID: <ada3a7t4qqi.fsf@cisco.com>


 > > 	cancel_mads(mad_agent_priv);
 > > 	port_priv = mad_agent_priv->qp_info->port_priv;
 > >-	cancel_delayed_work(&mad_agent_priv->timed_work);
 > >+	del_timer_sync(&mad_agent_priv->timeout_timer);
 > >+	cancel_work_sync(&mad_agent_priv->timeout_work);
 > 
 > I had to check if there was a race between del_timer_sync() and the worker
 > thread, but the call to cancel_mads() looks like it prevents any issues.

Yeah, I think it's OK, and in any case I think any race is already there
(since cancel_delayed_work is essentially equivalent to del_timer_sync).

Thanks.


From bart.vanassche at gmail.com  Fri Aug 14 22:56:18 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sat, 15 Aug 2009 07:56:18 +0200
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adabpmi3uun.fsf_-_@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
Message-ID: <e2e108260908142256r162b9838na8aeb57a1fc19348@mail.gmail.com>

On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier<rdreier at cisco.com> wrote:
> How about this approach?  Basically it just open-codes delayed work by
> splitting the timer and the work struct, and switches to mod_timer()
> instead of del_timer() + add_timer().  It passes very light testing here
> (basically I started ipoib and nothing blew up).

The patch looks fine to me.

Bart.


From vlad at lists.openfabrics.org  Sat Aug 15 03:01:37 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 15 Aug 2009 03:01:37 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090815-0200 daily build status
Message-ID: <20090815100138.18BB8E61D07@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hnrose at comcast.net  Sat Aug 15 06:46:24 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sat, 15 Aug 2009 09:46:24 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Fix
	CHECK_AND_SET_VAL macro
Message-ID: <20090815134624.GA25048@comcast.net>


Change the check from > to !=, since the comparison is done on signed
integers and some masks can use the full range and hence be negative.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 330c6aa..e1e2cfc 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * Produced at Lawrence Livermore National Laboratory.
  * Written by Ira Weiny <weiny2 at llnl.gov>.
@@ -922,7 +923,7 @@ static int parse_lid_and_ports(bind_handle_t h,
 
 #define cl_hton8(x) (x)
 #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \
-	if ((int##size##_t) val > (int##size##_t) comp_with) { \
+	if ((int##size##_t) val != (int##size##_t) comp_with) { \
 		target = cl_hton##size((uint##size##_t) val); \
 		comp_mask |= IB_##name##_COMPMASK_##mask; \
 	}
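
A small illustration of the difference, assuming a 16-bit field whose
"not set" value is 0 and a user-supplied value with the top bit set such
as 0xC000. Both values are assumptions chosen for the example, as are
the macro names OLD_IS_SET and NEW_IS_SET; the real macro also assigns
the target and updates comp_mask.

```c
#include <stdint.h>

/* Old form of the test: signed greater-than against the sentinel. */
#define OLD_IS_SET(val, size, comp_with) \
	((int##size##_t) (val) > (int##size##_t) (comp_with))

/* New form from the patch: signed not-equal against the sentinel. */
#define NEW_IS_SET(val, size, comp_with) \
	((int##size##_t) (val) != (int##size##_t) (comp_with))
```

On the usual two's-complement targets 0xC000 becomes negative when cast
to int16_t, so the old test treats it as "not set" while the new test
accepts it.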


From hal.rosenstock at gmail.com  Sat Aug 15 07:06:31 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 15 Aug 2009 10:06:31 -0400
Subject: [ofa-general] will opensm respond to requests that do not 
	originate from qp1
In-Reply-To: <20090813210924.GQ16677@obsidianresearch.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
	<f0e08f230908131241i63578d69u920eb6be000d79d5@mail.gmail.com>
	<6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
	<20090813200023.GO16677@obsidianresearch.com>
	<F4A03316B00D4CBD9334B87E8C51A8B2@amr.corp.intel.com>
	<20090813210924.GQ16677@obsidianresearch.com>
Message-ID: <f0e08f230908150706gd4b47a2nb2c9931dbb4673f8@mail.gmail.com>

On 8/13/09, Jason Gunthorpe <jgunthorpe at obsidianresearch.com> wrote:
>
> On Thu, Aug 13, 2009 at 01:14:19PM -0700, Sean Hefty wrote:
> > >Speaking of which, do we have an API to get the node's SM_Key for SA
> > >packet construction?
> >
> > Not that I'm aware of.  The ib-diags take the smkey as a command line
> > option.
>
> Hmm, and the kernel wires it to zero.


What are you referring to as being wired to zero by the kernel? AFAIK
neither use of SM_Key (there are two) is wired to zero.


> That's uncool.
>
> So, any process that can create a QP can alter, say, the node's
> multicast group membership.
>
> That's a bit of a security problem.
>
> I admit though, I haven't been able to discern what the SM_Key should
> be set to from the spec..


It's a policy (SM admin) decision.

-- Hal


>
>
> --
> Jason Gunthorpe <jgunthorpe at obsidianresearch.com>        (780)4406067x832
> Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada
>

From bryan.d.green at nasa.gov  Sat Aug 15 15:55:38 2009
From: bryan.d.green at nasa.gov (Bryan Green)
Date: Sat, 15 Aug 2009 15:55:38 -0700
Subject: [ofa-general] librdmacm - okay to select on a cm channel's file
	descriptor?
Message-ID: <20090815225538.ABA412391C7@ece06.nas.nasa.gov>

Hi,

I'm using librdmacm for connection management (on Linux).

In an attempt to get unexpected DISCONNECT notifications during
ib communication, I'm trying to use 'select()' on the cm channel's file
descriptor, testing it for readability.  I've found that this works some of
the time, but not all of the time.

Is this a legitimate way to test for disconnections, or am I required to
either make the descriptor nonblocking and just poll, or use a background
thread for receiving cm events?  I'd rather not use the nonblocking
approach, because I'd like to simultaneously select on the cm channel
descriptor and an ibv_comp_channel descriptor.  I'm not sure if
selecting on the ibv_comp_channel descriptor is acceptable either, but it
appears to work.

I'd appreciate it if anyone can enlighten me on this.

Thanks,
-bryan
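
For reference, the kind of wait described above could be sketched as
follows. This is a minimal sketch using plain POSIX select() on two
descriptors; wait_readable() and the FD_*_READY flags are hypothetical
names, and the sketch does not by itself explain why the approach works
only some of the time.

```c
#include <stddef.h>
#include <sys/select.h>

#define FD_A_READY 0x1
#define FD_B_READY 0x2

/*
 * Block until fd_a and/or fd_b is readable.  Returns a bitmask of
 * FD_A_READY / FD_B_READY, or -1 if select() fails.
 */
static int wait_readable(int fd_a, int fd_b)
{
	fd_set rfds;
	int maxfd = fd_a > fd_b ? fd_a : fd_b;
	int ready = 0;

	FD_ZERO(&rfds);
	FD_SET(fd_a, &rfds);
	FD_SET(fd_b, &rfds);

	if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
		return -1;

	if (FD_ISSET(fd_a, &rfds))
		ready |= FD_A_READY;
	if (FD_ISSET(fd_b, &rfds))
		ready |= FD_B_READY;
	return ready;
}
```

Here fd_a would be the fd member of the struct rdma_event_channel and
fd_b the fd member of the struct ibv_comp_channel; when a flag is set,
the caller would go on to call rdma_get_cm_event() or
ibv_get_cq_event() respectively.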


From sashak at voltaire.com  Sun Aug 16 01:21:16 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 11:21:16 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: make accessors function for
 timeout values used in libibmad
In-Reply-To: <20090806160106.4725041e.weiny2@llnl.gov>
References: <20090806160106.4725041e.weiny2@llnl.gov>
Message-ID: <20090816082116.GE25501@me>

On 16:01 Thu 06 Aug     , Ira Weiny wrote:
> Sasha,
> 
> In using mad_send_via and mad_receive_via I have found getting the timeout and retry values from the mad layer to be beneficial.
> 
> This and the patch that follows export functions to get those values as well as standardize the use of them internally.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Mon, 27 Jul 2009 13:48:17 -0700
> Subject: [PATCH] libibmad: make accessors function for timeout values used in libibmad
> 
> 	In addition use this function to determine the timeout to be used throughout the library.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 01:30:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 11:30:53 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: make accessors function for
 retry values used in libibmad
In-Reply-To: <20090806160107.83193923.weiny2@llnl.gov>
References: <20090806160107.83193923.weiny2@llnl.gov>
Message-ID: <20090816083053.GF25501@me>

On 16:01 Thu 06 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 6 Aug 2009 15:27:30 -0700
> Subject: [PATCH] libibmad: make accessors function for retry values used in libibmad
> 
>         In addition use this function to determine the retries used throughout the library.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From tziporet at dev.mellanox.co.il  Sun Aug 16 02:23:18 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 16 Aug 2009 12:23:18 +0300
Subject: [ofa-general] RHEL 5.3 (2.6.18-128.1.1.el5 kernel) and connected
	mode
In-Reply-To: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com>
References: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com>
Message-ID: <4A87D006.8030308@mellanox.co.il>

Robert Cummins wrote:
> Hello,
>
> I have a customer that is experiencing a problem with IB.  Specifically,
> when placing the Infinihost III card in connected mode with
> 'echo connected > /sys/class/net/ib0/mode' some nodes stop responding.
> By 'stop responding' I mean:
>
>   - ping <ib ip address> doesn't work (no packets returned; 100% packet
>     loss)
>   - ib_rdma_bw -b node never runs
>   - ibping does work
>
...


> It should be noted that I have four nodes that fail and nearly 20
> that 'work'.   The failing nodes are running the same kernel
> (2.6.18-128.el5) while the working nodes are running the
> 2.6.18-128.1.1.el5 kernel.  I am at a loss as to how to proceed with
> debugging this short of getting the latest OFED distro and building it.
>
> Has anyone else run into this problem and if so, how did you get around
> it?
>
Which FW version are you using?
Can you check whether there are any interesting messages in
/var/log/messages, especially from the mthca driver?

Tziporet


From sashak at voltaire.com  Sun Aug 16 02:49:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 12:49:27 +0300
Subject: [ofa-general] [PATCH] libibmad: fix warnings
In-Reply-To: <20090806160107.83193923.weiny2@llnl.gov>
References: <20090806160107.83193923.weiny2@llnl.gov>
Message-ID: <20090816094927.GG25501@me>


Fix compilation warnings "passing argument 1 of ‘mad_get_retries’
discards qualifiers from pointer target type" for mad_get_timeout() and
mad_get_retries() functions.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 libibmad/include/infiniband/mad.h |    4 ++--
 libibmad/src/mad.c                |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 0d0dcf1..d8053b4 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -803,8 +803,8 @@ MAD_EXPORT void mad_rpc_set_retries(struct ibmad_port *port, int retries);
 MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout);
 MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls);
 
-MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms);
-MAD_EXPORT int mad_get_retries(struct ibmad_port *srcport);
+MAD_EXPORT int mad_get_timeout(const struct ibmad_port *srcport, int override_ms);
+MAD_EXPORT int mad_get_retries(const struct ibmad_port *srcport);
 
 
 /* register.c */
diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index 7192dd6..1361e2b 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -64,13 +64,13 @@ uint64_t mad_trid(void)
 	return next;
 }
 
-int mad_get_timeout(struct ibmad_port *srcport, int override_ms)
+int mad_get_timeout(const struct ibmad_port *srcport, int override_ms)
 {
 	return (override_ms ? override_ms :
 	    srcport->timeout ? srcport->timeout : madrpc_timeout);
 }
 
-int mad_get_retries(struct ibmad_port *srcport)
+int mad_get_retries(const struct ibmad_port *srcport)
 {
 	return (srcport->retries ? srcport->retries : madrpc_retries);
 }
-- 
1.6.4
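
The fallback order used by the two accessors (explicit override, then
the per-port setting, then the library default) can be modelled outside
libibmad as below. This is a sketch only: struct port_cfg is a
hypothetical stand-in for the two fields of struct ibmad_port that are
read here, and the default values are assumptions rather than the
library's actual numbers.

```c
/* Hypothetical stand-in for the fields of struct ibmad_port used here. */
struct port_cfg {
	int timeout;	/* per-port timeout in ms, 0 if unset */
	int retries;	/* per-port retry count, 0 if unset */
};

/* Assumed library-wide defaults; the real values live in libibmad. */
static int default_timeout = 1000;
static int default_retries = 3;

/* Override wins, then the port's own setting, then the default. */
static int get_timeout(const struct port_cfg *port, int override_ms)
{
	return override_ms ? override_ms :
	       port->timeout ? port->timeout : default_timeout;
}

/* The port's own setting wins over the default. */
static int get_retries(const struct port_cfg *port)
{
	return port->retries ? port->retries : default_retries;
}
```

Because both helpers take a const pointer, callers that only hold a
const pointer to the port can use them without a cast, which is the
warning the patch removes.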


From vlad at lists.openfabrics.org  Sun Aug 16 03:07:48 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 16 Aug 2009 03:07:48 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090816-0200 daily build status
Message-ID: <20090816100748.A55C5E28204@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Sun Aug 16 03:02:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:02:45 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix
	CHECK_AND_SET_VAL macro
In-Reply-To: <20090815134624.GA25048@comcast.net>
References: <20090815134624.GA25048@comcast.net>
Message-ID: <20090816100245.GJ25501@me>

On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> 
> Changed check from > to != since using integer comparison
> and some masks can use full range and hence be negative 

Any example?

Sasha


From sashak at voltaire.com  Sun Aug 16 03:03:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:03:03 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix typo in
	option name
In-Reply-To: <20090814145249.GA8448@comcast.net>
References: <20090814145249.GA8448@comcast.net>
Message-ID: <20090816100303.GK25501@me>

On 10:52 Fri 14 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 03:05:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:05:25 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_sm_mad_ctrl.c: Fix endian of
 status in error message
In-Reply-To: <20090814115607.GA18583@comcast.net>
References: <20090814115607.GA18583@comcast.net>
Message-ID: <20090816100525.GL25501@me>

On 07:56 Fri 14 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 03:06:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:06:03 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: In osm_dump_dr_smp,
 fix endian of status
In-Reply-To: <20090814115823.GB18583@comcast.net>
References: <20090814115823.GB18583@comcast.net>
Message-ID: <20090816100603.GM25501@me>

On 07:58 Fri 14 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 03:11:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:11:01 +0300
Subject: [ofa-general] Re: [PATCH] opensm/libvendor/osm_vendor_ibumad.c:
 Handle umad_alloc failure in osm_vendor_get
In-Reply-To: <20090814141132.GA31087@comcast.net>
References: <20090814141132.GA31087@comcast.net>
Message-ID: <20090816101101.GN25501@me>

On 10:11 Fri 14 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Sun Aug 16 03:20:39 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 16 Aug 2009 06:20:39 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix 
	CHECK_AND_SET_VAL macro
In-Reply-To: <20090816100245.GJ25501@me>
References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me>
Message-ID: <f0e08f230908160320s3326000frcf3bc42e714b7cc0@mail.gmail.com>

On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> >
> > Changed check from > to != since using integer comparison
> > and some masks can use full range and hence be negative
>
> Any example?


Pkey for one. I think there are others too but didn't do a full audit of all
the components which use CHECK_AND_SET_VAL.

-- Hal
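
A minimal sketch of the pkey case follows. It assumes the macro compares
both operands after casting them to a signed type of the field's width;
accepts_gt() and accepts_ne() are hypothetical stand-ins for the old and
new checks, not the saquery.c macro itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the two forms of the check.
 * "unset" is the value that means "option not given".
 * Note: converting 0xFFFF to int16_t is implementation-defined in C;
 * on the usual two's-complement compilers (gcc, clang) it yields -1.
 */
static int accepts_gt(uint16_t val, uint16_t unset)
{
	/* old form: signed "greater than" */
	return (int16_t)val > (int16_t)unset;
}

static int accepts_ne(uint16_t val, uint16_t unset)
{
	/* new form: "not equal" */
	return (int16_t)val != (int16_t)unset;
}

/* Returns 1 when the two forms disagree for a given value. */
static int forms_disagree(uint16_t val, uint16_t unset)
{
	int gt = accepts_gt(val, unset);
	int ne = accepts_ne(val, unset);

	assert(gt == 0 || ne == 1);	/* ">" never accepts what "!=" rejects */
	return gt != ne;
}
```

With an unset marker of 0, a full-range pkey such as 0xFFFF becomes
negative after the signed cast, so the ">" form skips it while the "!="
form accepts it. Values up to 0x7FFF behave the same under both forms.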


>
>
> Sasha

From sashak at voltaire.com  Sun Aug 16 03:29:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:29:04 +0300
Subject: [ofa-general] Re: [PATCH 1/5] libibnetdisc: make all fields of
	ibnd_node_t public
In-Reply-To: <20090813204242.b659d8f5.weiny2@llnl.gov>
References: <20090813204242.b659d8f5.weiny2@llnl.gov>
Message-ID: <20090816102904.GP25501@me>

On 20:42 Thu 13 Aug     , Ira Weiny wrote:
> 
>  static void dump_endnode(ib_portid_t * path, char *prompt,
> -			 struct ibnd_node *node, struct ibnd_port *port)
> +			 ibnd_node_t * node, struct ibnd_port *port)
>  {
>  	char type[64];
>  	if (!show_progress)
>  		return;
>  
> -	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int));
> -
> -	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
> -	       portid2str(path), prompt, type, node->node.guid,
> -	       node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum,
> -	       port->port.base_lid,
> -	       port->port.base_lid + (1 << port->port.lmc) - 1,
> -	       node->node.nodedesc);
> +	mad_dump_node_type(type, 64, &(node->type), sizeof(int)),

',' at end of the statement. I'm fixing this (again :))

Sasha

> +	    printf("%s -> %s %s {%016" PRIx64
> +		   "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path),
> +		   prompt, type, node->guid,
> +		   node->type == IB_NODE_SWITCH ? 0 : port->port.portnum,
> +		   port->port.base_lid,
> +		   port->port.base_lid + (1 << port->port.lmc) - 1,
> +		   node->nodedesc);
>  }
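
For reference, a stray ',' like the one above still compiles because C
treats it as the comma operator: both calls become a single expression
statement and run in order. A minimal sketch, with step_a() and step_b()
as hypothetical placeholders for the two calls:

```c
#include <assert.h>

static int calls;	/* counts how many steps ran */

static int step_a(void) { calls++; return 1; }
static int step_b(void) { calls++; return 2; }

/* Runs both steps joined by ',' and returns the expression's value. */
int run_with_comma(void)
{
	int last;

	calls = 0;
	/* comma operator: evaluate step_a(), discard it, then step_b() */
	last = (step_a(), step_b());
	assert(calls == 2);	/* both sides were evaluated */
	return last;
}

/* Same calls as two separate statements, for comparison. */
int run_with_semicolon(void)
{
	int last;

	calls = 0;
	step_a();
	last = step_b();
	return last;
}
```

Both functions do the same work here, which is why the compiler stays
quiet. The difference only shows up once the line sits under an unbraced
if or for, where ',' keeps the second call inside the guarded statement
and ';' does not.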


From sashak at voltaire.com  Sun Aug 16 03:39:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:39:02 +0300
Subject: [ofa-general] Re: [PATCH 1/5] libibnetdisc: make all fields of
	ibnd_node_t public
In-Reply-To: <20090813204242.b659d8f5.weiny2@llnl.gov>
References: <20090813204242.b659d8f5.weiny2@llnl.gov>
Message-ID: <20090816103902.GQ25501@me>

On 20:42 Thu 13 Aug     , Ira Weiny wrote:
> 

It would be really nice to have a commit message (in addition to the
subject) for each patch. The cover email ([PATCH 0/N]) is not saved in
the change history.

> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Tue, 11 Aug 2009 15:15:21 -0700
> Subject: [PATCH] libibnetdisc: make all fields of ibnd_node_t public
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha

> ---
>  .../libibnetdisc/include/infiniband/ibnetdisc.h    |   12 +-
>  infiniband-diags/libibnetdisc/src/chassis.c        |  147 ++++++++---------
>  infiniband-diags/libibnetdisc/src/ibnetdisc.c      |  173 ++++++++++----------
>  infiniband-diags/libibnetdisc/src/internal.h       |   22 +--
>  4 files changed, 166 insertions(+), 188 deletions(-)
> 
> diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> index 121709d..e7f5f6a 100644
> --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> @@ -45,8 +45,8 @@ struct port;			/* forward declare */
>  /** =========================================================================
>   * Node
>   */
> -typedef struct node {
> -	struct node *next;	/* all node list in fabric */
> +typedef struct ibnd_node {
> +	struct ibnd_node *next;	/* all node list in fabric */
>  
>  	ib_portid_t path_portid;	/* path from "from_node" */
>  	int dist;		/* num of hops from "from_node" */
> @@ -72,12 +72,18 @@ typedef struct node {
>  				   items MAY BE NULL!  (ie 0 == switches only) */
>  
>  	/* chassis info */
> -	struct node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
> +	struct ibnd_node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
>  	struct chassis *chassis;	/* if != NULL the chassis this node belongs to */
>  	unsigned char ch_type;
>  	unsigned char ch_anafanum;
>  	unsigned char ch_slotnum;
>  	unsigned char ch_slot;
> +
> +	/* internal use only */
> +	unsigned char ch_found;
> +	struct ibnd_node *htnext;	/* hash table list */
> +	struct ibnd_node *dnext;	/* nodesdist next */
> +	struct ibnd_node *type_next;	/* next based on type */
>  } ibnd_node_t;
>  
>  /** =========================================================================
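
The effect of this header change on callers can be sketched as below.
This is an illustration under assumptions: the old layout is inferred
from the node->node.guid accesses and CONV_NODE_INTERNAL() calls removed
elsewhere in the patch, and every name here (pub_node_t, priv_node,
TO_PRIV, flat_node_t) is hypothetical, not libibnetdisc code.

```c
#include <stdint.h>

/* Before (assumed): a private wrapper embeds the public struct first. */
typedef struct pub_node {
	uint64_t guid;
	int numports;
} pub_node_t;

struct priv_node {
	pub_node_t node;		/* must stay the first member */
	unsigned char ch_found;		/* internal-only field */
};

/* Hypothetical equivalent of a CONV_NODE_INTERNAL-style cast. */
#define TO_PRIV(n) ((struct priv_node *)(n))

static int found_before(pub_node_t *n)
{
	/* valid only if n really points into a struct priv_node */
	return TO_PRIV(n)->ch_found;
}

/* After: one flat struct, internal fields marked by comment only. */
typedef struct flat_node {
	uint64_t guid;
	int numports;
	/* internal use only */
	unsigned char ch_found;
} flat_node_t;

static int found_after(flat_node_t *n)
{
	return n->ch_found;	/* no cast, no nested member */
}
```

The flat layout drops the casts and the extra ".node" level at the cost
of exposing the internal fields in the public header.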
> diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
> index 120b4b6..0dd259a 100644
> --- a/infiniband-diags/libibnetdisc/src/chassis.c
> +++ b/infiniband-diags/libibnetdisc/src/chassis.c
> @@ -239,68 +239,68 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
>  		return 0;
>  }
>  
> -static int is_router(struct ibnd_node *n)
> +static int is_router(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_IB_FC_ROUTER ||
>  		devid == VTR_DEVID_IB_IP_ROUTER);
>  }
>  
> -static int is_spine_9096(struct ibnd_node *n)
> +static int is_spine_9096(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SFB4 || devid == VTR_DEVID_SFB4_DDR);
>  }
>  
> -static int is_spine_9288(struct ibnd_node *n)
> +static int is_spine_9288(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SFB12 || devid == VTR_DEVID_SFB12_DDR);
>  }
>  
> -static int is_spine_2004(struct ibnd_node *n)
> +static int is_spine_2004(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SFB2004);
>  }
>  
> -static int is_spine_2012(struct ibnd_node *n)
> +static int is_spine_2012(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SFB2012);
>  }
>  
> -static int is_spine(struct ibnd_node *n)
> +static int is_spine(ibnd_node_t * n)
>  {
>  	return (is_spine_9096(n) || is_spine_9288(n) ||
>  		is_spine_2004(n) || is_spine_2012(n));
>  }
>  
> -static int is_line_24(struct ibnd_node *n)
> +static int is_line_24(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> -	return (devid == VTR_DEVID_SLB24 || devid == VTR_DEVID_SLB24_DDR ||
> -		devid == VTR_DEVID_SRB2004);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
> +	return (devid == VTR_DEVID_SLB24 ||
> +		devid == VTR_DEVID_SLB24_DDR || devid == VTR_DEVID_SRB2004);
>  }
>  
> -static int is_line_8(struct ibnd_node *n)
> +static int is_line_8(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SLB8);
>  }
>  
> -static int is_line_2024(struct ibnd_node *n)
> +static int is_line_2024(ibnd_node_t * n)
>  {
> -	uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F);
> +	uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F);
>  	return (devid == VTR_DEVID_SLB2024);
>  }
>  
> -static int is_line(struct ibnd_node *n)
> +static int is_line(ibnd_node_t * n)
>  {
>  	return (is_line_24(n) || is_line_8(n) || is_line_2024(n));
>  }
>  
> -int is_chassis_switch(struct ibnd_node *n)
> +int is_chassis_switch(ibnd_node_t * n)
>  {
>  	return (is_spine(n) || is_line(n));
>  }
> @@ -349,7 +349,7 @@ char anafa_spine4_slot_2_slb[25] = {
>  
>  /*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
>  
> -static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport)
> +static int get_sfb_slot(ibnd_node_t * node, ibnd_port_t * lineport)
>  {
>  	ibnd_node_t *n = (ibnd_node_t *) node;
>  
> @@ -372,25 +372,24 @@ static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport)
>  		n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
>  	} else {
>  		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
> -			   node->node.guid);
> +			   node->guid);
>  		return (-1);
>  	}
>  	return (0);
>  }
>  
> -static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
> +static int get_router_slot(ibnd_node_t * n, ibnd_port_t * spineport)
>  {
> -	ibnd_node_t *n = (ibnd_node_t *) node;
>  	uint64_t guessnum = 0;
>  
> -	node->ch_found = 1;
> +	n->ch_found = 1;
>  
>  	n->ch_slot = SRBD_CS;
> -	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
> +	if (is_spine_9096(spineport->node)) {
>  		n->ch_type = ISR9096_CT;
>  		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
>  		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
> -	} else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_9288(spineport->node)) {
>  		n->ch_type = ISR9288_CT;
>  		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
>  		/* this is a smart guess based on nodeguids order on sFB-12 module */
> @@ -399,7 +398,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
>  		/* module 2 <--> remote anafa 2 */
>  		/* module 3 <--> remote anafa 1 */
>  		n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
> -	} else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_2012(spineport->node)) {
>  		n->ch_type = ISR2012_CT;
>  		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
>  		/* this is a smart guess based on nodeguids order on sFB-12 module */
> @@ -408,7 +407,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
>  		// module 2 <--> remote anafa 2
>  		// module 3 <--> remote anafa 1
>  		n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
> -	} else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_2004(spineport->node)) {
>  		n->ch_type = ISR2004_CT;
>  		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
>  		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
> @@ -423,19 +422,19 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport)
>  static int get_slb_slot(ibnd_node_t * n, ibnd_port_t * spineport)
>  {
>  	n->ch_slot = LINE_CS;
> -	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
> +	if (is_spine_9096(spineport->node)) {
>  		n->ch_type = ISR9096_CT;
>  		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
>  		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
> -	} else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_9288(spineport->node)) {
>  		n->ch_type = ISR9288_CT;
>  		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
>  		n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
> -	} else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_2012(spineport->node)) {
>  		n->ch_type = ISR2012_CT;
>  		n->ch_slotnum = line_slot_2_sfb12[spineport->portnum];
>  		n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
> -	} else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) {
> +	} else if (is_spine_2004(spineport->node)) {
>  		n->ch_type = ISR2004_CT;
>  		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
>  		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
> @@ -454,12 +453,11 @@ static void voltaire_portmap(ibnd_port_t * port);
>  	It could be optimized so, but time overhead is very small
>  	and its only diag.util
>  */
> -static int fill_voltaire_chassis_record(struct ibnd_node *node)
> +static int fill_voltaire_chassis_record(ibnd_node_t * node)
>  {
> -	ibnd_node_t *n = (ibnd_node_t *) node;
>  	int p = 0;
>  	ibnd_port_t *port;
> -	struct ibnd_node *remnode = 0;
> +	ibnd_node_t *remnode = 0;
>  
>  	if (node->ch_found)	/* somehow this node has already been passed */
>  		return (0);
> @@ -470,25 +468,23 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node)
>  	/* in such case node->ports is actually a requested port... */
>  	if (is_router(node)) {
>  		/* find the remote node */
> -		for (p = 1; p <= node->node.numports; p++) {
> -			port = node->node.ports[p];
> -			if (port &&
> -			    is_spine(CONV_NODE_INTERNAL
> -				     (port->remoteport->node)))
> +		for (p = 1; p <= node->numports; p++) {
> +			port = node->ports[p];
> +			if (port && is_spine(port->remoteport->node))
>  				get_router_slot(node, port->remoteport);
>  		}
>  	} else if (is_spine(node)) {
> -		for (p = 1; p <= node->node.numports; p++) {
> -			port = node->node.ports[p];
> +		for (p = 1; p <= node->numports; p++) {
> +			port = node->ports[p];
>  			if (!port || !port->remoteport)
>  				continue;
> -			remnode = CONV_NODE_INTERNAL(port->remoteport->node);
> -			if (remnode->node.type != IB_NODE_SWITCH) {
> +			remnode = port->remoteport->node;
> +			if (remnode->type != IB_NODE_SWITCH) {
>  				if (!remnode->ch_found)
>  					get_router_slot(remnode, port);
>  				continue;
>  			}
> -			if (!n->ch_type)
> +			if (!node->ch_type)
>  				/* we assume here that remoteport belongs to line */
>  				if (get_sfb_slot(node, port->remoteport))
>  					return (-1);
> @@ -497,20 +493,20 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node)
>  		}
>  
>  	} else if (is_line(node)) {
> -		for (p = 1; p <= node->node.numports; p++) {
> -			port = node->node.ports[p];
> +		for (p = 1; p <= node->numports; p++) {
> +			port = node->ports[p];
>  			if (!port || port->portnum > 12 || !port->remoteport)
>  				continue;
>  			/* we assume here that remoteport belongs to spine */
> -			if (get_slb_slot(n, port->remoteport))
> +			if (get_slb_slot(node, port->remoteport))
>  				return (-1);
>  			break;
>  		}
>  	}
>  
>  	/* for each port of this node, map external ports */
> -	for (p = 1; p <= node->node.numports; p++) {
> -		port = node->node.ports[p];
> +	for (p = 1; p <= node->numports; p++) {
> +		port = node->ports[p];
>  		if (!port)
>  			continue;
>  		voltaire_portmap(port);
> @@ -534,8 +530,7 @@ static int get_spine_index(ibnd_node_t * node)
>  {
>  	int retval;
>  
> -	if (is_spine_9288(CONV_NODE_INTERNAL(node))
> -	    || is_spine_2012(CONV_NODE_INTERNAL(node)))
> +	if (is_spine_9288(node) || is_spine_2012(node))
>  		retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum;
>  	else
>  		retval = node->ch_slotnum;
> @@ -586,7 +581,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis)
>  	for (i = 1; i <= LINES_MAX_NUM; i++) {
>  		node = chassis->linenode[i];
>  
> -		if (!(node && is_line(CONV_NODE_INTERNAL(node))))
> +		if (!(node && is_line(node)))
>  			continue;	/* empty slot or router */
>  
>  		for (p = 1; p <= node->numports; p++) {
> @@ -596,7 +591,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis)
>  
>  			remnode = port->remoteport->node;
>  
> -			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
> +			if (!remnode->ch_found)
>  				continue;	/* some error - spine not initialized ? FIXME */
>  			if (insert_spine(remnode, chassis))
>  				return (-1);
> @@ -621,7 +616,7 @@ static int pass_on_spines_catch_lines(ibnd_chassis_t * chassis)
>  				continue;
>  			remnode = port->remoteport->node;
>  
> -			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
> +			if (!remnode->ch_found)
>  				continue;	/* some error - line/router not initialized ? FIXME */
>  			if (insert_line_router(remnode, chassis))
>  				return (-1);
> @@ -655,10 +650,10 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t * chassis)
>  	in that chassis
>  	chassis structure = structure of one standalone chassis
>  */
> -static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis)
> +static int build_chassis(ibnd_node_t * node, ibnd_chassis_t * chassis)
>  {
>  	int p = 0;
> -	struct ibnd_node *remnode = 0;
> +	ibnd_node_t *remnode = 0;
>  	ibnd_port_t *port = 0;
>  
>  	/* we get here with node = chassis_spine */
> @@ -666,16 +661,16 @@ static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis)
>  		return (-1);
>  
>  	/* loop: pass on all ports of node */
> -	for (p = 1; p <= node->node.numports; p++) {
> -		port = node->node.ports[p];
> +	for (p = 1; p <= node->numports; p++) {
> +		port = node->ports[p];
>  		if (!port || !port->remoteport)
>  			continue;
> -		remnode = CONV_NODE_INTERNAL(port->remoteport->node);
> +		remnode = port->remoteport->node;
>  
>  		if (!remnode->ch_found)
>  			continue;	/* some error - line or router not initialized ? FIXME */
>  
> -		insert_line_router(&(remnode->node), chassis);
> +		insert_line_router(remnode, chassis);
>  	}
>  
>  	if (pass_on_lines_catch_spines(chassis))
> @@ -764,13 +759,11 @@ int int2ext_map_slb2024[2][25] = {
>  /* map internal ports to external ports if appropriate */
>  static void voltaire_portmap(ibnd_port_t * port)
>  {
> -	struct ibnd_node *n = CONV_NODE_INTERNAL(port->node);
>  	int portnum = port->portnum;
>  	int chipnum = 0;
>  	ibnd_node_t *node = port->node;
>  
> -	if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node))
> -	    || (portnum < 13 || portnum > 24)) {
> +	if (!node->ch_found || !is_line(node) || (portnum < 13 || portnum > 24)) {
>  		port->ext_portnum = 0;
>  		return;
>  	}
> @@ -782,9 +775,9 @@ static void voltaire_portmap(ibnd_port_t * port)
>  
>  	chipnum = port->node->ch_anafanum - 1;
>  
> -	if (is_line_24(CONV_NODE_INTERNAL(node)))
> +	if (is_line_24(node))
>  		port->ext_portnum = int2ext_map_slb24[chipnum][portnum];
> -	else if (is_line_2024(CONV_NODE_INTERNAL(node)))
> +	else if (is_line_2024(node))
>  		port->ext_portnum = int2ext_map_slb2024[chipnum][portnum];
>  	else
>  		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
> @@ -828,7 +821,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
>  */
>  int group_nodes(struct ibnd_fabric *fabric)
>  {
> -	struct ibnd_node *node;
> +	ibnd_node_t *node;
>  	int dist;
>  	int chassisnum = 0;
>  	ibnd_chassis_t *chassis;
> @@ -842,7 +835,7 @@ int group_nodes(struct ibnd_fabric *fabric)
>  	/* not very efficient but clear code so... */
>  	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
>  		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
> -			if (mad_get_field(node->node.info, 0,
> +			if (mad_get_field(node->info, 0,
>  					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
>  				if (fill_voltaire_chassis_record(node))
>  					return (-1);
> @@ -853,13 +846,11 @@ int group_nodes(struct ibnd_fabric *fabric)
>  	/* algorithm: catch spine and find all surrounding nodes */
>  	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
>  		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
> -			if (mad_get_field(node->node.info, 0,
> +			if (mad_get_field(node->info, 0,
>  					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
>  				continue;
> -			//if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node))
>  			if (!node->ch_found
> -			    || (node->node.chassis
> -				&& node->node.chassis->chassisnum)
> +			    || (node->chassis && node->chassis->chassisnum)
>  			    || !is_spine(node))
>  				continue;
>  			if (add_chassis(fabric))
> @@ -874,10 +865,10 @@ int group_nodes(struct ibnd_fabric *fabric)
>  	/* grouped by common SystemImageGUID */
>  	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
>  		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
> -			if (mad_get_field(node->node.info, 0,
> +			if (mad_get_field(node->info, 0,
>  					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
>  				continue;
> -			if (mad_get_field64(node->node.info, 0,
> +			if (mad_get_field64(node->info, 0,
>  					    IB_NODE_SYSTEM_GUID_F)) {
>  				chassis =
>  				    find_chassisguid(fabric,
> @@ -901,10 +892,10 @@ int group_nodes(struct ibnd_fabric *fabric)
>  	/* (defined as chassis->nodecount > 1) */
>  	for (dist = 0; dist <= MAXHOPS;) {
>  		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
> -			if (mad_get_field(node->node.info, 0,
> +			if (mad_get_field(node->info, 0,
>  					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
>  				continue;
> -			if (mad_get_field64(node->node.info, 0,
> +			if (mad_get_field64(node->info, 0,
>  					    IB_NODE_SYSTEM_GUID_F)) {
>  				chassis =
>  				    find_chassisguid(fabric,
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index b33be8d..b883d4a 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -98,18 +98,17 @@ static int get_port_info(struct ibmad_port *ibmad_port,
>   * Returns -1 if error.
>   */
>  static int query_node_info(struct ibmad_port *ibmad_port,
> -			   struct ibnd_fabric *fabric, struct ibnd_node *node,
> +			   struct ibnd_fabric *fabric, ibnd_node_t * node,
>  			   ib_portid_t * portid)
>  {
> -	if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, 0,
> +	if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0,
>  			   ibmad_port))
>  		return -1;
>  
>  	/* decode just a couple of fields for quicker reference. */
> -	mad_decode_field(node->node.info, IB_NODE_GUID_F, &(node->node.guid));
> -	mad_decode_field(node->node.info, IB_NODE_TYPE_F, &(node->node.type));
> -	mad_decode_field(node->node.info, IB_NODE_NPORTS_F,
> -			 &(node->node.numports));
> +	mad_decode_field(node->info, IB_NODE_GUID_F, &(node->guid));
> +	mad_decode_field(node->info, IB_NODE_TYPE_F, &(node->type));
> +	mad_decode_field(node->info, IB_NODE_NPORTS_F, &(node->numports));
>  
>  	return (0);
>  }
> @@ -118,15 +117,14 @@ static int query_node_info(struct ibmad_port *ibmad_port,
>   * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
>   */
>  static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
> -		      struct ibnd_node *inode, struct ibnd_port *iport,
> +		      ibnd_node_t * node, struct ibnd_port *iport,
>  		      ib_portid_t * portid)
>  {
>  	int rc = 0;
> -	ibnd_node_t *node = &(inode->node);
>  	ibnd_port_t *port = &(iport->port);
> -	void *nd = inode->node.nodedesc;
> +	void *nd = node->nodedesc;
>  
> -	if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0)
> +	if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0)
>  		return rc;
>  
>  	port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F);
> @@ -204,30 +202,30 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
>  }
>  
>  static void dump_endnode(ib_portid_t * path, char *prompt,
> -			 struct ibnd_node *node, struct ibnd_port *port)
> +			 ibnd_node_t * node, struct ibnd_port *port)
>  {
>  	char type[64];
>  	if (!show_progress)
>  		return;
>  
> -	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int));
> -
> -	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
> -	       portid2str(path), prompt, type, node->node.guid,
> -	       node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum,
> -	       port->port.base_lid,
> -	       port->port.base_lid + (1 << port->port.lmc) - 1,
> -	       node->node.nodedesc);
> +	mad_dump_node_type(type, 64, &(node->type), sizeof(int)),
> +	    printf("%s -> %s %s {%016" PRIx64
> +		   "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path),
> +		   prompt, type, node->guid,
> +		   node->type == IB_NODE_SWITCH ? 0 : port->port.portnum,
> +		   port->port.base_lid,
> +		   port->port.base_lid + (1 << port->port.lmc) - 1,
> +		   node->nodedesc);
>  }
>  
> -static struct ibnd_node *find_existing_node(struct ibnd_fabric *fabric,
> -					    struct ibnd_node *new)
> +static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
> +				       ibnd_node_t * new)
>  {
> -	int hash = HASHGUID(new->node.guid) % HTSZ;
> -	struct ibnd_node *node;
> +	int hash = HASHGUID(new->guid) % HTSZ;
> +	ibnd_node_t *node;
>  
>  	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
> -		if (node->node.guid == new->node.guid)
> +		if (node->guid == new->guid)
>  			return node;
>  
>  	return NULL;
> @@ -237,7 +235,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
>  {
>  	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
>  	int hash = HASHGUID(guid) % HTSZ;
> -	struct ibnd_node *node;
> +	ibnd_node_t *node;
>  
>  	if (!fabric) {
>  		IBND_DEBUG("fabric parameter NULL\n");
> @@ -245,7 +243,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
>  	}
>  
>  	for (node = f->nodestbl[hash]; node; node = node->htnext)
> -		if (node->node.guid == guid)
> +		if (node->guid == guid)
>  			return (ibnd_node_t *) node;
>  
>  	return NULL;
> @@ -273,7 +271,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
>  	void *nd = node->nodedesc;
>  	int p = 0;
>  	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
> -	struct ibnd_node *n = CONV_NODE_INTERNAL(node);
>  
>  	if (_check_ibmad_port(ibmad_port) < 0)
>  		return (NULL);
> @@ -288,36 +285,36 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
>  		return (NULL);
>  	}
>  
> -	if (query_node_info(ibmad_port, f, n, &(n->node.path_portid)))
> +	if (query_node_info(ibmad_port, f, node, &(node->path_portid)))
>  		return (NULL);
>  
> -	if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, 0,
> +	if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0,
>  			   ibmad_port))
>  		return (NULL);
>  
>  	/* update all the port info's */
> -	for (p = 1; p >= n->node.numports; p++) {
> -		get_port_info(ibmad_port, f,
> -			      CONV_PORT_INTERNAL(n->node.ports[p]), p,
> -			      &(n->node.path_portid));
> +	for (p = 1; p >= node->numports; p++) {
> +		get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]),
> +			      p, &(node->path_portid));
>  	}
>  
> -	if (n->node.type != IB_NODE_SWITCH)
> +	if (node->type != IB_NODE_SWITCH)
>  		goto done;
>  
> -	if (!smp_query_via(portinfo_port0, &(n->node.path_portid),
> -			   IB_ATTR_PORT_INFO, 0, 0, ibmad_port))
> +	if (!smp_query_via
> +	    (portinfo_port0, &(node->path_portid), IB_ATTR_PORT_INFO, 0, 0,
> +	     ibmad_port))
>  		return (NULL);
>  
> -	n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F);
> -	n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F);
> +	node->smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F);
> +	node->smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F);
>  
> -	if (!smp_query_via(node->switchinfo, &(n->node.path_portid),
> +	if (!smp_query_via(node->switchinfo, &(node->path_portid),
>  			   IB_ATTR_SWITCH_INFO, 0, 0, ibmad_port))
>  		node->smaenhsp0 = 0;	/* assume base SP0 */
>  	else
>  		mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F,
> -				 &n->node.smaenhsp0);
> +				 &node->smaenhsp0);
>  
>  done:
>  	return (node);
> @@ -358,10 +355,9 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
>  	return (rc);
>  }
>  
> -static void add_to_nodeguid_hash(struct ibnd_node *node,
> -				 struct ibnd_node *hash[])
> +static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[])
>  {
> -	int hash_idx = HASHGUID(node->node.guid) % HTSZ;
> +	int hash_idx = HASHGUID(node->guid) % HTSZ;
>  
>  	node->htnext = hash[hash_idx];
>  	hash[hash_idx] = node;
> @@ -376,9 +372,9 @@ static void add_to_portguid_hash(struct ibnd_port *port,
>  	hash[hash_idx] = port;
>  }
>  
> -static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric)
> +static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
>  {
> -	switch (node->node.type) {
> +	switch (node->type) {
>  	case IB_NODE_CA:
>  		node->type_next = fabric->ch_adapters;
>  		fabric->ch_adapters = node;
> @@ -394,21 +390,21 @@ static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric)
>  	}
>  }
>  
> -static void add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric)
> +static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
>  {
> -	int dist = node->node.dist;
> -	if (node->node.type != IB_NODE_SWITCH)
> +	int dist = node->dist;
> +	if (node->type != IB_NODE_SWITCH)
>  		dist = MAXHOPS;	/* special Ca list */
>  
>  	node->dnext = fabric->nodesdist[dist];
>  	fabric->nodesdist[dist] = node;
>  }
>  
> -static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
> -				     struct ibnd_node *temp, ib_portid_t * path,
> -				     int dist)
> +static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
> +				ibnd_node_t * temp, ib_portid_t * path,
> +				int dist)
>  {
> -	struct ibnd_node *node;
> +	ibnd_node_t *node;
>  
>  	node = malloc(sizeof(*node));
>  	if (!node) {
> @@ -417,13 +413,13 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
>  	}
>  
>  	memcpy(node, temp, sizeof(*node));
> -	node->node.dist = dist;
> -	node->node.path_portid = *path;
> +	node->dist = dist;
> +	node->path_portid = *path;
>  
>  	add_to_nodeguid_hash(node, fabric->nodestbl);
>  
>  	/* add this to the all nodes list */
> -	node->node.next = fabric->fabric.nodes;
> +	node->next = fabric->fabric.nodes;
>  	fabric->fabric.nodes = (ibnd_node_t *) node;
>  
>  	add_to_type_list(node, fabric);
> @@ -432,26 +428,24 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric,
>  	return node;
>  }
>  
> -static struct ibnd_port *find_existing_port_node(struct ibnd_node *node,
> +static struct ibnd_port *find_existing_port_node(ibnd_node_t * node,
>  						 struct ibnd_port *port)
>  {
> -	if (port->port.portnum > node->node.numports
> -	    || node->node.ports == NULL)
> +	if (port->port.portnum > node->numports || node->ports == NULL)
>  		return (NULL);
>  
> -	return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum]));
> +	return (CONV_PORT_INTERNAL(node->ports[port->port.portnum]));
>  }
>  
>  static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
> -					  struct ibnd_node *node,
> +					  ibnd_node_t * node,
>  					  struct ibnd_port *temp)
>  {
>  	struct ibnd_port *port;
>  
> -	if (node->node.ports == NULL) {
> -		node->node.ports =
> -		    calloc(sizeof(*node->node.ports), node->node.numports + 1);
> -		if (!node->node.ports) {
> +	if (node->ports == NULL) {
> +		node->ports = calloc(sizeof(*node->ports), node->numports + 1);
> +		if (!node->ports) {
>  			IBND_ERROR("Failed to allocate the ports array\n");
>  			return (NULL);
>  		}
> @@ -467,20 +461,19 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
>  	port->port.node = (ibnd_node_t *) node;
>  	port->port.ext_portnum = 0;
>  
> -	node->node.ports[temp->port.portnum] = (ibnd_port_t *) port;
> +	node->ports[temp->port.portnum] = (ibnd_port_t *) port;
>  
>  	add_to_portguid_hash(port, fabric->portstbl);
>  	return port;
>  }
>  
> -static void link_ports(struct ibnd_node *node, struct ibnd_port *port,
> -		       struct ibnd_node *remotenode,
> -		       struct ibnd_port *remoteport)
> +static void link_ports(ibnd_node_t * node, struct ibnd_port *port,
> +		       ibnd_node_t * remotenode, struct ibnd_port *remoteport)
>  {
>  	IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64
> -		   " %p->%p:%u\n", node->node.guid, node, port,
> -		   port->port.portnum, remotenode->node.guid, remotenode,
> -		   remoteport, remoteport->port.portnum);
> +		   " %p->%p:%u\n", node->guid, node, port, port->port.portnum,
> +		   remotenode->guid, remotenode, remoteport,
> +		   remoteport->port.portnum);
>  	if (port->port.remoteport)
>  		port->port.remoteport->remoteport = NULL;
>  	if (remoteport->port.remoteport)
> @@ -490,14 +483,14 @@ static void link_ports(struct ibnd_node *node, struct ibnd_port *port,
>  }
>  
>  static int get_remote_node(struct ibmad_port *ibmad_port,
> -			   struct ibnd_fabric *fabric, struct ibnd_node *node,
> +			   struct ibnd_fabric *fabric, ibnd_node_t * node,
>  			   struct ibnd_port *port, ib_portid_t * path,
>  			   int portnum, int dist)
>  {
>  	int rc = 0;
> -	struct ibnd_node node_buf;
> +	ibnd_node_t node_buf;
>  	struct ibnd_port port_buf;
> -	struct ibnd_node *remotenode, *oldnode;
> +	ibnd_node_t *remotenode, *oldnode;
>  	struct ibnd_port *remoteport, *oldport;
>  
>  	memset(&node_buf, 0, sizeof(node_buf));
> @@ -554,9 +547,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
>  	int rc = 0;
>  	struct ibnd_fabric *fabric = NULL;
>  	ib_portid_t my_portid = { 0 };
> -	struct ibnd_node node_buf;
> +	ibnd_node_t node_buf;
>  	struct ibnd_port port_buf;
> -	struct ibnd_node *node;
> +	ibnd_node_t *node;
>  	struct ibnd_port *port;
>  	int i;
>  	int dist = 0;
> @@ -605,7 +598,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
>  		goto error;
>  
>  	rc = get_remote_node(ibmad_port, fabric, node, port, from,
> -			     mad_get_field(node->node.info, 0,
> +			     mad_get_field(node->info, 0,
>  					   IB_NODE_LOCAL_PORT_F), 0);
>  	if (rc < 0)
>  		goto error;
> @@ -616,13 +609,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
>  
>  		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
>  
> -			path = &node->node.path_portid;
> +			path = &node->path_portid;
>  
>  			IBND_DEBUG("dist %d node %p\n", dist, node);
>  			dump_endnode(path, "processing", node, port);
>  
> -			for (i = 1; i <= node->node.numports; i++) {
> -				if (i == mad_get_field(node->node.info, 0,
> +			for (i = 1; i <= node->numports; i++) {
> +				if (i == mad_get_field(node->info, 0,
>  						       IB_NODE_LOCAL_PORT_F))
>  					continue;
>  
> @@ -644,9 +637,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
>  					goto error;
>  
>  				/* If switch, set port GUID to node port GUID */
> -				if (node->node.type == IB_NODE_SWITCH) {
> +				if (node->type == IB_NODE_SWITCH) {
>  					port->port.guid =
> -					    mad_get_field64(node->node.info, 0,
> +					    mad_get_field64(node->info, 0,
>  							    IB_NODE_PORT_GUID_F);
>  				}
>  
> @@ -666,14 +659,14 @@ error:
>  	return (NULL);
>  }
>  
> -static void destroy_node(struct ibnd_node *node)
> +static void destroy_node(ibnd_node_t * node)
>  {
>  	int p = 0;
>  
> -	for (p = 0; p <= node->node.numports; p++) {
> -		free(node->node.ports[p]);
> +	for (p = 0; p <= node->numports; p++) {
> +		free(node->ports[p]);
>  	}
> -	free(node->node.ports);
> +	free(node->ports);
>  	free(node);
>  }
>  
> @@ -681,8 +674,8 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
>  {
>  	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
>  	int dist = 0;
> -	struct ibnd_node *node = NULL;
> -	struct ibnd_node *next = NULL;
> +	ibnd_node_t *node = NULL;
> +	ibnd_node_t *next = NULL;
>  	ibnd_chassis_t *ch, *ch_next;
>  
>  	if (!fabric)
> @@ -747,8 +740,8 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
>  			  int node_type, void *user_data)
>  {
>  	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
> -	struct ibnd_node *list = NULL;
> -	struct ibnd_node *cur = NULL;
> +	ibnd_node_t *list = NULL;
> +	ibnd_node_t *cur = NULL;
>  
>  	if (!fabric) {
>  		IBND_DEBUG("fabric parameter NULL\n");
> diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
> index 38555a0..449bd70 100644
> --- a/infiniband-diags/libibnetdisc/src/internal.h
> +++ b/infiniband-diags/libibnetdisc/src/internal.h
> @@ -49,18 +49,6 @@
>  #define	IBND_ERROR(fmt, ...) \
>  		fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
>  
> -struct ibnd_node {
> -	/* This member MUST BE FIRST */
> -	ibnd_node_t node;
> -
> -	/* internal use only */
> -	unsigned char ch_found;
> -	struct ibnd_node *htnext;	/* hash table list */
> -	struct ibnd_node *dnext;	/* nodesdist next */
> -	struct ibnd_node *type_next;	/* next based on type */
> -};
> -#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node)
> -
>  struct ibnd_port {
>  	/* This member MUST BE FIRST */
>  	ibnd_port_t port;
> @@ -79,15 +67,15 @@ struct ibnd_fabric {
>  	ibnd_fabric_t fabric;
>  
>  	/* internal use only */
> -	struct ibnd_node *nodestbl[HTSZ];
> +	ibnd_node_t *nodestbl[HTSZ];
>  	struct ibnd_port *portstbl[HTSZ];
> -	struct ibnd_node *nodesdist[MAXHOPS + 1];
> +	ibnd_node_t *nodesdist[MAXHOPS + 1];
>  	ibnd_chassis_t *first_chassis;
>  	ibnd_chassis_t *current_chassis;
>  	ibnd_chassis_t *last_chassis;
> -	struct ibnd_node *switches;
> -	struct ibnd_node *ch_adapters;
> -	struct ibnd_node *routers;
> +	ibnd_node_t *switches;
> +	ibnd_node_t *ch_adapters;
> +	ibnd_node_t *routers;
>  	ib_portid_t selfportid;
>  };
>  #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
> -- 
> 1.5.4.5
> 


From sashak at voltaire.com  Sun Aug 16 03:41:14 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 13:41:14 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix
	CHECK_AND_SET_VAL macro
In-Reply-To: <f0e08f230908160320s3326000frcf3bc42e714b7cc0@mail.gmail.com>
References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me>
	<f0e08f230908160320s3326000frcf3bc42e714b7cc0@mail.gmail.com>
Message-ID: <20090816104114.GR25501@me>

On 06:20 Sun 16 Aug     , Hal Rosenstock wrote:
> On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:
> 
> > On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> > >
> > > Changed check from > to != since using integer comparison
> > > and some masks can use full range and hence be negative
> >
> > Any example?
> 
> 
> Pkey for one.

Will you pass a negative value for the PKey?

Sasha


From hal.rosenstock at gmail.com  Sun Aug 16 03:56:37 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 16 Aug 2009 06:56:37 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix 
	CHECK_AND_SET_VAL macro
In-Reply-To: <20090816104114.GR25501@me>
References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me>
	<f0e08f230908160320s3326000frcf3bc42e714b7cc0@mail.gmail.com>
	<20090816104114.GR25501@me>
Message-ID: <f0e08f230908160356i5174e25bm34f856a1664447d4@mail.gmail.com>

On Sun, Aug 16, 2009 at 6:41 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:

> On 06:20 Sun 16 Aug     , Hal Rosenstock wrote:
> > On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky <sashak at voltaire.com
> >wrote:
> >
> > > On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> > > >
> > > > Changed check from > to != since using integer comparison
> > > > and some masks can use full range and hence be negative
> > >
> > > Any example?
> >
> >
> > Pkey for one.
>
> Will you pass negative value of PKey?


Aren't Pkeys 0x8001 - 0xffff valid ?

-- Hal


>
>
> Sasha
>

From sashak at voltaire.com  Sun Aug 16 04:02:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:02:00 +0300
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc:
 Introduce a context object.
In-Reply-To: <20090813204306.dffc3237.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
Message-ID: <20090816110200.GS25501@me>

On 20:43 Thu 13 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 13 Aug 2009 20:16:01 -0700
> Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.
> 
> 	This object must be created before query functions can be used and is
> 	used to control the functionality of the queries.

Why is it needed? I see that it complicates the API, but what are the benefits?

Sasha


From hnrose at comcast.net  Sun Aug 16 04:02:24 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 16 Aug 2009 07:02:24 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Allow pkey and
	qkey to be hidden
Message-ID: <20090816110224.GA23535@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 313f9a7..3a35aa7 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1543,6 +1543,10 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->numb_path = strtoul(optarg, NULL, 0);
 		break;
 	case 18:
+		if (!isxdigit(*optarg) && !(optarg = getpass("P_Key: "))) {
+			fprintf(stderr, "cannot get P_Key\n");
+			ibdiag_show_usage();
+		}
 		p->pkey = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'Q':
@@ -1561,6 +1565,10 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'q':
+		if (!isxdigit(*optarg) && !(optarg = getpass("Q_Key: "))) {
+			fprintf(stderr, "cannot get Q_Key\n");
+			ibdiag_show_usage();
+		}
 		p->qkey = strtoul(optarg, NULL, 0);
 		break;
 	case 'T':
@@ -1637,7 +1645,9 @@ int main(int argc, char **argv)
 		{"mgid", 17, 1, "<gid>", "Multicast GID (MCMemberRecord)"},
 		{"reversible", 'r', 1, NULL, "Reversible path (PathRecord)"},
 		{"numb_path", 'n', 1, NULL, "Number of paths (PathRecord)"},
-		{"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)"},
+		{"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)."
+		 " If non-numeric value (like 'x') is specified then"
+		 " saquery will prompt for a value"},
 		{"qos_class", 'Q', 1, NULL, "QoS Class (PathRecord)"},
 		{"sl", 19, 1, NULL,
 		 "Service level (PathRecord, MCMemberRecord)"},
@@ -1647,7 +1657,9 @@ int main(int argc, char **argv)
 		 "Rate and selector (PathRecord, MCMemberRecord)"},
 		{"pkt_lifetime", 20, 1, NULL,
 		 "Packet lifetime and selector (PathRecord, MCMemberRecord)"},
-		{"qkey", 'q', 1, NULL, "Q_Key (MCMemberRecord)"},
+		{"qkey", 'q', 1, NULL, "Q_Key (MCMemberRecord)."
+		 " If non-numeric value (like 'x') is specified then"
+		 " saquery will prompt for a value"},
 		{"tclass", 'T', 1, NULL,
 		 "Traffic Class (PathRecord, MCMemberRecord)"},
 		{"flow_label", 'F', 1, NULL,


From sashak at voltaire.com  Sun Aug 16 04:21:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:21:25 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix
	CHECK_AND_SET_VAL macro
In-Reply-To: <f0e08f230908160356i5174e25bm34f856a1664447d4@mail.gmail.com>
References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me>
	<f0e08f230908160320s3326000frcf3bc42e714b7cc0@mail.gmail.com>
	<20090816104114.GR25501@me>
	<f0e08f230908160356i5174e25bm34f856a1664447d4@mail.gmail.com>
Message-ID: <20090816112125.GT25501@me>

On 06:56 Sun 16 Aug     , Hal Rosenstock wrote:
> On Sun, Aug 16, 2009 at 6:41 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:
> 
> > On 06:20 Sun 16 Aug     , Hal Rosenstock wrote:
> > > On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky <sashak at voltaire.com
> > >wrote:
> > >
> > > > On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> > > > >
> > > > > Changed check from > to != since using integer comparison
> > > > > and some masks can use full range and hence be negative
> > > >
> > > > Any example?
> > >
> > >
> > > Pkey for one.
> >
> > Will you pass negative value of PKey?
> 
> 
> Aren't Pkeys 0x8001 - 0xffff valid ?

Yes, they are valid. And now I'm starting to understand what you were
getting at: during the check the value is converted to int16_t, which
yields a negative number. A more detailed patch description would be helpful.
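
To make the sign issue concrete, here is a minimal sketch. It is not the
actual CHECK_AND_SET_VAL macro: the helper names are hypothetical, and it
assumes the usual two's-complement result for the unsigned-to-signed
conversion, which C leaves implementation-defined for out-of-range values.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the old and new checks; not the saquery macro. */

/* Old-style check: the value counts as "set" only if it is greater than 0. */
static int is_set_gt(uint16_t val)
{
	/* 0x8000..0xffff become negative here on two's-complement targets. */
	return (int16_t) val > 0;
}

/* New-style check: the value counts as "set" if it differs from 0. */
static int is_set_ne(uint16_t val)
{
	return (int16_t) val != 0;
}
```

Under those assumptions, a P_Key such as 0xffff is rejected by is_set_gt()
but accepted by is_set_ne(), which matches the reasoning above.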

Sasha


From sashak at voltaire.com  Sun Aug 16 04:22:14 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:22:14 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix
	CHECK_AND_SET_VAL macro
In-Reply-To: <20090815134624.GA25048@comcast.net>
References: <20090815134624.GA25048@comcast.net>
Message-ID: <20090816112214.GU25501@me>

On 09:46 Sat 15 Aug     , Hal Rosenstock wrote:
> 
> Changed check from > to != since using integer comparison
> and some masks can use full range and hence be negative 
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 04:28:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:28:50 +0300
Subject: [ofa-general] Re: [PATCH 2/5] libibnetdisc: make all fields of
	ibnd_port_t public
In-Reply-To: <20090813204246.59efeb5e.weiny2@llnl.gov>
References: <20090813204246.59efeb5e.weiny2@llnl.gov>
Message-ID: <20090816112850.GV25501@me>

On 20:42 Thu 13 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 13 Aug 2009 19:54:00 -0700
> Subject: [PATCH] libibnetdisc: make all fields of ibnd_port_t public
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 16 04:41:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:41:27 +0300
Subject: [ofa-general] Re: [PATCH 3/5] libibnetdisc: make all fields of
 ibnd_fabric_t public
In-Reply-To: <20090813204251.df6446c1.weiny2@llnl.gov>
References: <20090813204251.df6446c1.weiny2@llnl.gov>
Message-ID: <20090816114127.GW25501@me>

On 20:42 Thu 13 Aug     , Ira Weiny wrote:
> 
> @@ -108,8 +107,8 @@ typedef struct ibnd_port {
>  /** =========================================================================
>   * Chassis
>   */
> -typedef struct chassis {
> -	struct chassis *next;
> +typedef struct ibnd_chassis {
> +	struct ibnd_chassis *next;
>  	uint64_t chassisguid;
>  	unsigned char chassisnum;
>  
> @@ -124,11 +123,17 @@ typedef struct chassis {
>  	ibnd_node_t *linenode[LINES_MAX_NUM + 1];
>  } ibnd_chassis_t;
>  
> +/* HASH table defines */
> +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))

Why should this macro be published (by moving it from internal.h to
ibnetdisc.h)?

As far as I can see it is only used in ibnetdisc.c, so we can keep
it internal and move it into that file.
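
A rough sketch of what keeping it private to one .c file could look like.
The HTSZ value and the guid_bucket() helper are assumptions made for
illustration only; the macro body is the one quoted above.

```c
#include <stdint.h>

/* Assumed bucket count for illustration; the library defines its own HTSZ. */
#define HTSZ 137

/* Same expression as the quoted macro, visible only inside this file. */
#define HASHGUID(guid) \
	((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))

/* Hypothetical helper: map a 64-bit GUID to a hash bucket index. */
static unsigned int guid_bucket(uint64_t guid)
{
	return HASHGUID(guid) % HTSZ;
}
```

Nothing outside the file needs to see HASHGUID this way; callers only deal
with the node and port lookup functions.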

Sasha


From sashak at voltaire.com  Sun Aug 16 04:42:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Aug 2009 14:42:27 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Allow pkey and
 qkey to be hidden
In-Reply-To: <20090816110224.GA23535@comcast.net>
References: <20090816110224.GA23535@comcast.net>
Message-ID: <20090816114227.GX25501@me>

On 07:02 Sun 16 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From yevgenyp at mellanox.co.il  Sun Aug 16 06:12:54 2009
From: yevgenyp at mellanox.co.il (Yevgeny Petrilin)
Date: Sun, 16 Aug 2009 16:12:54 +0300
Subject: [ofa-general][PATCH] mlx4_core: Avoid double icms free
Message-ID: <4A8805D6.10803@mellanox.co.il>

In the cleanup flow of init_hca, the function calls close_hca(), followed by
free_icms() and UNMAP_FA(). Both of those functions are also called from
close_hca(), so each would run twice. Issue only the CLOSE_HCA command on
this path instead.

Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
---
 drivers/net/mlx4/main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index dac621b..a0a52f1 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -786,7 +786,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
 	return 0;
 
 err_close:
-	mlx4_close_hca(dev);
+	mlx4_CLOSE_HCA(dev, 0);
 
 err_free_icm:
 	mlx4_free_icms(dev);
-- 
1.6.0
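
For readers who want to see the double call in isolation, here is a toy
model of the two error paths. All names and the counter fields are
hypothetical; this is not the mlx4 code, only the call pattern described
in the patch description.

```c
/* Toy device that just counts how often each release step runs. */
struct toy_dev {
	int hca_closed;
	int icm_frees;
	int fa_unmaps;
};

static void toy_close_cmd(struct toy_dev *d) { d->hca_closed++; }
static void toy_free_icms(struct toy_dev *d) { d->icm_frees++; }
static void toy_unmap_fa(struct toy_dev *d)  { d->fa_unmaps++; }

/* Full teardown: the command plus the releases it is responsible for. */
static void toy_close_hca(struct toy_dev *d)
{
	toy_close_cmd(d);
	toy_free_icms(d);
	toy_unmap_fa(d);
}

/* Old error path: full teardown, then the same releases again. */
static void toy_cleanup_old(struct toy_dev *d)
{
	toy_close_hca(d);
	toy_free_icms(d);
	toy_unmap_fa(d);
}

/* New error path: only the close command, then each release once. */
static void toy_cleanup_new(struct toy_dev *d)
{
	toy_close_cmd(d);
	toy_free_icms(d);
	toy_unmap_fa(d);
}
```

In this model the old path leaves icm_frees and fa_unmaps at 2, while the
new path leaves every counter at 1.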


From bart.vanassche at gmail.com  Sun Aug 16 08:51:55 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sun, 16 Aug 2009 17:51:55 +0200
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adabpmi3uun.fsf_-_@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
Message-ID: <e2e108260908160851p327eae6fr872489d1899ddd0b@mail.gmail.com>

On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier <rdreier at cisco.com> wrote:
> How about this approach?  Basically it just open-codes delayed work by
> splitting the timer and the work struct, and switches to mod_timer()
> instead of del_timer() + add_timer().  It passes very light testing here
> (basically I started ipoib and nothing blew up).
[ ... ]

Update: after two days of stress testing, still no lockdep complaints.
So it seems like the posted patch solves this issue. Thanks !

Bart.


From sdake at redhat.com  Sun Aug 16 17:27:55 2009
From: sdake at redhat.com (Steven Dake)
Date: Sun, 16 Aug 2009 17:27:55 -0700
Subject: [ofa-general] librdmacm - okay to select on a cm channel's
	file descriptor?
In-Reply-To: <20090815225538.ABA412391C7@ece06.nas.nasa.gov>
References: <20090815225538.ABA412391C7@ece06.nas.nasa.gov>
Message-ID: <1250468875.19265.13.camel@localhost.localdomain>

On Sat, 2009-08-15 at 15:55 -0700, Bryan Green wrote:
> Hi,
> 
> I'm using librdmacm for connection management (on Linux).
> 
> In an attempt to get unexpected DISCONNECT notifications during
> ib communication, I'm trying to use 'select()' on the cm channel's file
> descriptor, testing it for readability.  I've found that this works some of
> the time, but not all of the time.
> 

What I have done is the following:
        struct rdma_event_channel *mcast_channel;

        mcast_channel = rdma_create_event_channel();

then select/poll on mcast_channel->fd and call my connection manager
event handler when there is a new event.

My event handler looks like:
        struct rdma_cm_event *event;

        res = rdma_get_cm_event (mcast_channel, &event);

        switch (event->event) {
        /* dispatch on the event type here */
        }
        rdma_ack_cm_event (event);


This ack_cm_event removes the event from the file descriptor.

It isn't clear from your message whether this is what you're doing, but this
works for me.  Note that I am using UD mode, so I can't rule out a bug in
disconnect events (UD doesn't generate these) that affects your
application.

> Is this a legitimate way to test for disconnections, or am I required to
> either make the descriptor nonblocking and just poll, or use a background
> thread for receiving cm events?  I'd rather not use the nonblocking
> approach, because I'd like to simultaneously select on the cm channel
> descriptor and an ibv_comp_channel descriptor.  I'm not sure if
> selecting on the ibv_comp_channel descriptor is acceptable either, but it
> appears to work.
> 

Selecting on descriptors that belong to an ibv_comp_channel works for me,
for the completion queue events only.
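
For what it's worth, waiting on both descriptors at once can be sketched
with plain select(). This is only an illustration under assumptions: the
helper names are made up, and ordinary pipes stand in for the cm event
channel fd and the completion channel fd so the sketch does not depend on
librdmacm or libibverbs.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until cm_fd or comp_fd becomes readable.
 * Returns a bitmask (bit 0: cm_fd, bit 1: comp_fd), 0 on timeout,
 * or -1 on error.
 */
static int wait_two_fds(int cm_fd, int comp_fd, int timeout_ms)
{
	fd_set rfds;
	struct timeval tv;
	int maxfd = cm_fd > comp_fd ? cm_fd : comp_fd;
	int ready = 0;

	FD_ZERO(&rfds);
	FD_SET(cm_fd, &rfds);
	FD_SET(comp_fd, &rfds);

	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	if (select(maxfd + 1, &rfds, NULL, NULL, &tv) < 0)
		return -1;

	if (FD_ISSET(cm_fd, &rfds))
		ready |= 1;
	if (FD_ISSET(comp_fd, &rfds))
		ready |= 2;
	return ready;
}

/* Demo: two pipes stand in for the two channel descriptors. */
static int demo_with_pipes(void)
{
	int cm_pipe[2], comp_pipe[2];
	int ready = -1;

	if (pipe(cm_pipe) < 0)
		return -1;
	if (pipe(comp_pipe) < 0) {
		close(cm_pipe[0]);
		close(cm_pipe[1]);
		return -1;
	}

	/* Make only the stand-in "completion channel" readable. */
	if (write(comp_pipe[1], "x", 1) == 1)
		ready = wait_two_fds(cm_pipe[0], comp_pipe[0], 100);

	close(cm_pipe[0]);
	close(cm_pipe[1]);
	close(comp_pipe[0]);
	close(comp_pipe[1]);
	return ready;
}
```

In a real application the bit for the cm descriptor would lead to
rdma_get_cm_event() and rdma_ack_cm_event(), as in the handler above.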

Regards
-steve

> 

> I'd appreciate it if anyone can enlighten me on this.
> 
> Thanks,
> -bryan
> 


From rdreier at cisco.com  Sun Aug 16 20:47:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 16 Aug 2009 20:47:02 -0700
Subject: [ofa-general] librdmacm - okay to select on a cm channel's file
	descriptor?
In-Reply-To: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> (Bryan Green's
	message of "Sat, 15 Aug 2009 15:55:38 -0700")
References: <20090815225538.ABA412391C7@ece06.nas.nasa.gov>
Message-ID: <adar5vb2jbd.fsf@cisco.com>


 > In an attempt to get unexpected DISCONNECT notifications during
 > ib communication, I'm trying to use 'select()' on the cm channel's file
 > descriptor, testing it for readability.  I've found that this works some of
 > the time, but not all of the time.

What happens when it doesn't work?  select() doesn't give you an event
but when you try to read there actually is an event there?

I took a quick look at the ucma kernel code and the implementation of
select() (the kernel uses poll() as the name but it all ends up in the
same code) looks straightforwardly correct -- there's only one place
where events are added to the queue for a file, and that place wakes up
the poll wait queue.  But maybe there is a funny bug somehow.

 - R.


From niftyompi at niftyegg.com  Sun Aug 16 21:44:37 2009
From: niftyompi at niftyegg.com (Nifty Tom Mitchell)
Date: Sun, 16 Aug 2009 21:44:37 -0700
Subject: [ofa-general] Manipulating Credits in Infiniband
In-Reply-To: <ed1288770908122341k33937aco5428ec7465acf90d@mail.gmail.com>
References: <ed1288770908100911h46524f4ch34cc6582bb1c03b@mail.gmail.com>
	<20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>
	<ed1288770908122341k33937aco5428ec7465acf90d@mail.gmail.com>
Message-ID: <20090817044437.GA3592@compegg>

On Thu, Aug 13, 2009 at 02:41:37AM -0400, Ashwath Narasimhan wrote:
> 
>    Dear Tom/all
> 
>      I understand the end to end credit based flow control at the link
>    layer where we have a 32 bit Flow control packet being sent for each VL
>    (with FCCL and FCTBS fields) but I fail to understand where this scheme
>    is implemented in the driver. (OFED linux- 1.4 stack, hw-mthca) . I can
>    see a file with a credit table mapped to different credits counts and
>    another that computes the AETH based on this credit table.
> 
>    1. Is this the place where the flow control packets are formulated?

If you do some back-of-the-envelope computations you will find that
much of the low-level flow control must be done in a firmware/hardware state
machine.   The maximum interrupt service rate and the maximum IB packet
rate are not even close.   Thus you will not find it in the driver.
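
One way to run that back-of-the-envelope computation is sketched below.
The helper name and the example figures are assumptions chosen for
illustration (they are not measurements), and the arithmetic ignores
headers and encoding overhead.

```c
#include <stdint.h>

/*
 * Rough number of packets per second needed to fill a link,
 * ignoring packet headers and encoding overhead.
 * Returns 0 if payload_bytes is 0.
 */
static uint64_t packets_per_second(uint64_t data_bits_per_sec,
				   uint32_t payload_bytes)
{
	if (payload_bytes == 0)
		return 0;
	return data_bits_per_sec / ((uint64_t) payload_bytes * 8);
}
```

Assuming a 16 Gbit/s data rate and 2048-byte payloads, this gives roughly
976,000 packets per second; with 256-byte payloads it is about 7.8 million.
Link-level flow control has to keep pace with traffic at that scale, which
is why it ends up below the driver.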

So as you scan the Mellanox driver you will discover
a handoff from the driver to the firmware.   In some cases the driver
will initialize the link layer and you will see this.   You might
see it in the error recovery/reset part of the driver, but for hw-mthca
I think it is well hidden in firmware.    Error recovery is one place to
look because it might need to restore the credit balance so data can flow.
Without credits, data does not flow.
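
As a toy illustration of "no credits, no data": the sketch below models a
sender that may transmit only while it holds credits granted by the
receiver. The struct and function names are hypothetical, and this is not
the IB link-layer mechanism itself (which exchanges FCCL/FCTBS values per
VL in hardware), only the general idea.

```c
/* Toy sender state: credits currently held and packets sent so far. */
struct toy_link {
	unsigned int credits;
	unsigned int sent;
};

/* Try to send one packet; returns 1 on success, 0 if out of credits. */
static int toy_send(struct toy_link *l)
{
	if (l->credits == 0)
		return 0;	/* blocked until the receiver grants more */
	l->credits--;
	l->sent++;
	return 1;
}

/* Receiver freed n buffers and hands the credits back to the sender. */
static void toy_grant(struct toy_link *l, unsigned int n)
{
	l->credits += n;
}
```

If the credit count is ever lost or left at zero, toy_send() never
succeeds again, which is why a reset path would need to restore it.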

>    2. If yes, I don't see them computing this for each VL. why? If no, is
>    it a mid layer flow control?

VLs are interesting: the IB specification is full of may, might, optional,
future, etc., and as such most hardware does the minimum with VLs.   This is
changing.

One valuable thing to research is the other credit-based link-level interfaces
on common modern hardware, e.g. AMD uses this on their HT links.  See also ATM
links, Fibre Channel, PCIe...

Also identify management packets, and reliable and unreliable transport.

 
>    3. And thats why I have this basic question--> is the link layer
>    implemented as part of OFED stack at all? or does it go into the
>    hardware HCA as firmware? As I understand the hardware vendor only
>    provides verbs to communicate with the HCA.

Link layer is 99 and 44/100% hardware.


>    Pardon me if i am bundling you all with a lot with questions. I am new
>    to all this and I am trying my best to understand the stack.

You might compare and contrast the QLogic drivers and the Mellanox
drivers.  The hardware design is very different.  To that point,
the older QLogic hardware (Pathscale) has no firmware in the way that
Mellanox does.   Comparing the two will show you informative differences.

In general there is no need to manipulate credits unless you are designing 
hardware or you are a hardware vendor.


>    Thank you,
> 
>    Ashwath
> 
>    On Tue, Aug 11, 2009 at 10:37 PM, Nifty Tom Mitchell
>    <[1]niftyompi at niftyegg.com> wrote:
> 
>    On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote:
>    >
>    >    I looked into the infiniband driver files. As I understand, in
>    order to
>    >    limit the data rate we manipulate the credits on either ends.
>    Since the
>    >    number of credits available depends on the receiver's work receive
>    >    queue size, I decided to limit the queue size to say 5 instead of
>    8192
>    >    (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my
>    higher
>    >    layer protocol is ipoib). I just want to confirm if I am doing the
>    >    right thing?
> 
>      Data rate is not manipulated by credits.
>      Credits and queue sizes are different and have different purposes.
>      Visit the Infiniband Trade Association web site and grab the IB
>      specifications to understand some of the hardware level parts.
>             [2]http://www.infinibandta.org/
>      InfiniBand offers credit based flow control and given the nature of
>      modern IB switches and processors a very small credit count can
>      still
>      result in full data rate.    Having said that flow control is the
>      lowest
>      level throttle in the system.   Reducing the credit count forces the
>      higher levels in the protocol stack to source or sink the data
>      through
>      the hardware before any more can be delivered.   Thus flow control
>      can
>      simplify the implementation of higher level protocols.   It can also
>      be used
>      to cost reduce or simplify hardware design (smaller hardware
>      buffers).
>      The IB specifications are way too long.  Start with this FAQ.
> 
>      [3]http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
>      The IB specification is way too full of optional features.  A vendor
>      may
>      have XYZ working fine and dandy on one card and since it is optional
>      not
>      at all on another.
>      The various queue sizes for the various protocols built on top of
>      IB establish transfer behavior in keeping with system interrupt,
>      system process time slice, system kernel activity loads and needs.
>      It is counter intuitive but in some cases small queues result in
>      more responsive and agile systems, especially in the presence of
>      errors.
>      Since there are often multiple protocols on the IB stack all
>      protocols
>      will be impacted by credit tinkering.  Most vendors know their
>      hardware
>      so most drivers will have credit related code optimum.
>      In the case of TCP/IP the interaction between IB bandwidths&MTU
>      (IPoIB),
>      ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU
>      can
>      be "interesting" depending on host names, subnets, routing etc.
>      TCP/IP
>      has lots of tuning flags well above the IB driver.   I see 500+
>      net.*
>      sysctl knobs on this system.
>      As you change things do make the changes on all the moving parts,
>      benchmark
>      and keep a log.   Since there are multiple IB hardware vendors
>      it is important to track hardware specifics.  "lspci" is a good tool
>      to gather chip info.   With some cards you also need specifics about
>      the active firmware.
>      So go forth (RPN forever) and conquer.
>      --
>             T o m  M i t c h e l l
>             Found me a new hat, now what?
> 
>    --
>    regards,
>    Ashwath
> 
> References
> 
>    1. mailto:niftyompi at niftyegg.com
>    2. http://www.infinibandta.org/
>    3. http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?


From vlad at lists.openfabrics.org  Mon Aug 17 03:01:21 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 17 Aug 2009 03:01:21 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090817-0200 daily build status
Message-ID: <20090817100122.21366E61C00@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From Robert at saq.co.uk  Mon Aug 17 03:41:13 2009
From: Robert at saq.co.uk (Robert Dunkley)
Date: Mon, 17 Aug 2009 11:41:13 +0100
Subject: [ofa-general] OFED on 2.6.18 Xen Kernel
Message-ID: <C1EAC9C5E752D24C968FF091D446D8234591C8@ALTERNATEREALIT>

Hi Everyone,

I haven't got as far as trying this yet, but does OFED work on the
Xen.org 2.6.18 kernel? Should all the InfiniBand options be left
disabled in menuconfig?

Thanks,

Rob



From mdidomenico4 at gmail.com  Mon Aug 17 05:16:59 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 08:16:59 -0400
Subject: [ofa-general] change mtu
Message-ID: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>

How do I change the MTU of an MT23108 card?  I have an AMD 8131
chipset server that needs this turned down below 1500, or at least
that's what's suspected.


From dotanba at gmail.com  Mon Aug 17 05:48:06 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 15:48:06 +0300
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
Message-ID: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>

Hi.

Which MTU are you trying to change?
(That of the IB link?)

Dotan

On Mon, Aug 17, 2009 at 3:16 PM, Michael Di
Domenico<mdidomenico4 at gmail.com> wrote:
> How do I change the MTU of an MT23108 card?  I have an AMD 8131
> chipset server that needs this turned down below 1500, or at least
> that's what's suspected.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From mdidomenico4 at gmail.com  Mon Aug 17 05:52:24 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 08:52:24 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID: <e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>

Yes, for the IB card itself, not IPoIB.

On Mon, Aug 17, 2009 at 8:48 AM, Dotan Barak<dotanba at gmail.com> wrote:
> Hi.
>
> Which MTU do you try to change?
> (of the IB link?)
>
> Dotan


From hal.rosenstock at gmail.com  Mon Aug 17 05:58:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 08:58:12 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
Message-ID: <f0e08f230908170558n59505adfoeb9b731bf5adb410@mail.gmail.com>

On Mon, Aug 17, 2009 at 8:16 AM, Michael Di Domenico <mdidomenico4 at gmail.com
> wrote:

> How do I change the MTU of an MT23108 card?


Is this for 23108 <-> 23108 communication (i.e., wanting a PathRecord with
a performance-optimized smaller MTU)? Are you using OpenSM?

-- Hal


>  I have an AMD 8131
> chipset server that needs this turned down below 1500, or at least
> that's what's suspected.

From hal.rosenstock at gmail.com  Mon Aug 17 06:00:13 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:00:13 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
Message-ID: <f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>

On Mon, Aug 17, 2009 at 8:52 AM, Michael Di Domenico <mdidomenico4 at gmail.com
> wrote:

> Yes, for the IB card itself, not IPoIB


The MTUCap of the card can affect the NeighborMTU which in turn can
affect IPoIB MTU (for UD mode).

-- Hal



From dotanba at gmail.com  Mon Aug 17 06:02:15 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 16:02:15 +0300
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
Message-ID: <2f3bf9a60908170602pa7ce57esbbdd1daf74cf373c@mail.gmail.com>

On Mon, Aug 17, 2009 at 3:52 PM, Michael Di
Domenico<mdidomenico4 at gmail.com> wrote:
> Yes, for the IB card itself, not IPoIB
>

Why do you want to do that?
(Did you know that even if the MTU of the link is 2K, you can connect
the QPs so that they use an MTU of 1K between them?)

Dotan


From mdidomenico4 at gmail.com  Mon Aug 17 06:02:25 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 09:02:25 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
Message-ID: <e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>

Yes, I have a group of machines with 8131 chipsets and MT23108 cards.
Supposedly, based on the AMD errata, I need to turn the cards' MTU down
from 2048 to below 1500.

No, I'm not running OpenSM; I'd prefer to do this at the machine level
if possible rather than at the fabric level.

On Mon, Aug 17, 2009 at 9:00 AM, Hal Rosenstock<hal.rosenstock at gmail.com> wrote:
>
>
> On Mon, Aug 17, 2009 at 8:52 AM, Michael Di Domenico
> <mdidomenico4 at gmail.com> wrote:
>>
>> Yes, for the IB card itself, not IPoIB
>
>
> The MTUCap of the card can affect the NeighborMTU which in turn can
> affect IPoIB MTU (for UD mode).
>
> -- Hal


From dotanba at gmail.com  Mon Aug 17 06:08:24 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 16:08:24 +0300
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
Message-ID: <2f3bf9a60908170608jeeee0dcx2ed693eb0744c31@mail.gmail.com>

Are you sure that AMD wants you to lower the MTU of the IB fabric?
(IB doesn't support an MTU of 1500 bytes; only Ethernet does.)

Dotan


From hal.rosenstock at gmail.com  Mon Aug 17 06:09:50 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:09:50 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
Message-ID: <f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>

On Mon, Aug 17, 2009 at 9:02 AM, Michael Di Domenico <mdidomenico4 at gmail.com
> wrote:

> Yes, I have a group of machines with 8131 and MT23108 cards.
> Supposedly, based on the AMD errata i need to turn the cards MTU down
> from 2048 to below 1500


Then you need to use 1024 or smaller (512 or 256).


>
>
> No i'm not running OpenSM,


And that SM doesn't support cranking the PR MTU down for Tavor communication?


> I'd prefer to do this at the machine level
> if possible rather than at the fabric level.


I think the MTUCap of the .ini file would need to be changed in order to get
the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is 1024.

-- Hal
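
The MTU values discussed in this thread are encoded as small integer codes
(the thread itself confirms that MTUCap 3 is 1024 and that ibv_devinfo prints
2048 as code 4). A minimal sketch of that mapping follows, assuming the usual
IB encoding of 1 through 5 for 256 through 4096 bytes. The function names are
hypothetical helpers for illustration and are not part of any OFED library.

```c
/* Sketch only: hypothetical helpers, not an OFED API. */

/* Convert an IB MTU code (1..5) to bytes; returns -1 for an invalid code.
 * Code 1 is 256 bytes and each step doubles, so bytes = 128 << code. */
int ib_mtu_code_to_bytes(int code)
{
	if (code < 1 || code > 5)
		return -1;
	return 128 << code;
}

/* Return the largest IB MTU code whose size does not exceed max_bytes,
 * or -1 if even 256 bytes is too large. */
int ib_mtu_code_at_most(int max_bytes)
{
	int code;

	for (code = 5; code >= 1; code--) {
		if ((128 << code) <= max_bytes)
			return code;
	}
	return -1;
}
```

With these helpers, a byte limit of 1500 resolves to code 3 (1024 bytes),
which matches the advice above that "below 1500" means 1024 or smaller.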



From mdidomenico4 at gmail.com  Mon Aug 17 06:40:19 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 09:40:19 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
	<f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
Message-ID: <e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>

On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock<hal.rosenstock at gmail.com> wrote:
>> No i'm not running OpenSM,
>
> And that SM doesn't support cranking the PR MTU down for Tavor communication

Dunno, didn't check... but I shall.  How do I do it in OpenSM?

>>
>> I'd prefer to do this at the machine level
>> if possible rather than at the fabric level.
>
> I think the MTUCap of the .ini file would need to be changed in order to get
> the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is 1024.


From hal.rosenstock at gmail.com  Mon Aug 17 06:53:27 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:53:27 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
	<f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
	<e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>
Message-ID: <f0e08f230908170653v691bbc92jc41386686f8b2b84@mail.gmail.com>

On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico <mdidomenico4 at gmail.com
> wrote:

> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock<hal.rosenstock at gmail.com>
> wrote:
> >> No i'm not running OpenSM,
> >
> > And that SM doesn't support cranking the PR MTU down for Tavor
> communication
>
> Dunno, didn't check... But i shall.  How do i do it in OpenSM?


enable_quirks option


>
>
> >>
> >> i'd prefer to do this at the machine level
> >> if possible rather then the fabric level.
> >
> > I think the MTUCap of the .ini file would need to be changed in order to
> get
> > the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is 1024.
>

From mdidomenico4 at gmail.com  Mon Aug 17 07:18:12 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 10:18:12 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <f0e08f230908170653v691bbc92jc41386686f8b2b84@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
	<f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
	<e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>
	<f0e08f230908170653v691bbc92jc41386686f8b2b84@mail.gmail.com>
Message-ID: <e75d22a90908170718n148101e9sa76d87dc995a9e66@mail.gmail.com>

On Mon, Aug 17, 2009 at 9:53 AM, Hal Rosenstock<hal.rosenstock at gmail.com> wrote:
>
>
> On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico
> <mdidomenico4 at gmail.com> wrote:
>>
>> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock<hal.rosenstock at gmail.com>
>> wrote:
>> >> No i'm not running OpenSM,
>> >
>> > And that SM doesn't support cranking the PR MTU down for Tavor
>> > communication
>>
>> Dunno, didn't check... But i shall.  How do i do it in OpenSM?
>
>
> enable_quirks option

Did the option really have to be case sensitive?  Come on...

I set the OpenSM config option and the rdma_cm module option tavor_quirk=1.
The log file shows that it picked up the cached options, but
ibv_devinfo is still showing an MTU of 2048.

Did I miss a step?


From hal.rosenstock at gmail.com  Mon Aug 17 07:59:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 10:59:19 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <e75d22a90908170718n148101e9sa76d87dc995a9e66@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
	<f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
	<e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>
	<f0e08f230908170653v691bbc92jc41386686f8b2b84@mail.gmail.com>
	<e75d22a90908170718n148101e9sa76d87dc995a9e66@mail.gmail.com>
Message-ID: <f0e08f230908170759idf795cdsa961354dc5e0750c@mail.gmail.com>

On Mon, Aug 17, 2009 at 10:18 AM, Michael Di Domenico <
mdidomenico4 at gmail.com> wrote:

> On Mon, Aug 17, 2009 at 9:53 AM, Hal Rosenstock<hal.rosenstock at gmail.com>
> wrote:
> >
> >
> > On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico
> > <mdidomenico4 at gmail.com> wrote:
> >>
> >> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock<
> hal.rosenstock at gmail.com>
> >> wrote:
> >> >> No i'm not running OpenSM,
> >> >
> >> > And that SM doesn't support cranking the PR MTU down for Tavor
> >> > communication
> >>
> >> Dunno, didn't check... But i shall.  How do i do it in OpenSM?
> >
> >
> > enable_quirks option
>
> Did the option really have to be case sensitive?  Come on...
>
> I set the opensm config and the rdma_cm tavor_quirk=1


I'm unfamiliar with this RDMA CM option.


>
> it's showing in the log file that it picked up the cached options, but
> its still showing 2048 MTU under ibv_devinfo


It wouldn't show there; only in SA PR responses.

-- Hal


>
>
> Did i miss a step?
>

From mdidomenico4 at gmail.com  Mon Aug 17 08:04:10 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 11:04:10 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <f0e08f230908170759idf795cdsa961354dc5e0750c@mail.gmail.com>
References: <e75d22a90908170516s19c62e67sf1d3263514a1187@mail.gmail.com>
	<2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
	<e75d22a90908170552o1817b9b0na0aab65a5c556bb9@mail.gmail.com>
	<f0e08f230908170600i30421db6h6c3f07a591e75909@mail.gmail.com>
	<e75d22a90908170602p539de982sa80616fe84b9b45e@mail.gmail.com>
	<f0e08f230908170609q5425c8eco4b75217f0c74eebf@mail.gmail.com>
	<e75d22a90908170640o4794a92dudf7c1a2bbc05ff18@mail.gmail.com>
	<f0e08f230908170653v691bbc92jc41386686f8b2b84@mail.gmail.com>
	<e75d22a90908170718n148101e9sa76d87dc995a9e66@mail.gmail.com>
	<f0e08f230908170759idf795cdsa961354dc5e0750c@mail.gmail.com>
Message-ID: <e75d22a90908170804o3bd4fc68sc94b9f17a92ea530@mail.gmail.com>

>> it's showing in the log file that it picked up the cached options, but
>> its still showing 2048 MTU under ibv_devinfo
>
> It wouldn't show there; only in SA PR responses.

Okay.  Here's what I've enabled so far, for those keeping track:

BIOS: MTRR changed from Continuous to Discrete
options ib_mthca msi_x=1 tune_pci=1
options rdma_cm tavor_quirk=1

I'm now pushing 550MB/sec, which is a remarkable improvement.


From weiny2 at llnl.gov  Mon Aug 17 08:30:23 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 08:30:23 -0700
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc:
 Introduce a context object.
In-Reply-To: <20090816110200.GS25501@me>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
Message-ID: <20090817083023.da17378b.weiny2@llnl.gov>

On Sun, 16 Aug 2009 14:02:00 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 20:43 Thu 13 Aug     , Ira Weiny wrote:
> > 
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Thu, 13 Aug 2009 20:16:01 -0700
> > Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.
> > 
> > 	This object must be created before query functions can be used and is
> > 	used to control the functionality of the queries.
> 
> Why is it needed? I see that it complicates the API, but what are the benefits?

The immediate benefit comes with the multi-threaded implementation, where
I plan to add the following function [*]:

ibnd_set_num_threads(ibnd_ctx_t *ctx, int num);

Set/get functions can be added for anything we need to pass to discover
without changing the discover (or other query) function signatures and
breaking the API.

This also allows us to keep some library state private.  For example,
I might persist threads across calls to discover and only destroy them on an
ibnd_destroy_ctx call.

Ira

[*] The reason behind this function is that I feel the proper number of
threads is going to vary depending on the size and layout of the fabric
being processed, as well as the number of CPUs available on the node.
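
The context-object pattern described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the patch itself: only
the names ibnd_ctx_t, ibnd_set_num_threads and ibnd_destroy_ctx appear in this
message, while the struct layout, the return types, ibnd_create_ctx and
ibnd_get_num_threads are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical layout; the real ibnd_ctx_t contents are not shown in
 * this thread. Callers only ever hold a pointer to it. */
typedef struct ibnd_ctx {
	int num_threads;	/* worker threads to use during discovery */
} ibnd_ctx_t;

/* Hypothetical constructor: allocate a context with default settings. */
ibnd_ctx_t *ibnd_create_ctx(void)
{
	ibnd_ctx_t *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		ctx->num_threads = 1;
	return ctx;
}

/* Setter named in the message above; the int return value is an
 * assumption. Returns 0 on success, -1 on bad input. */
int ibnd_set_num_threads(ibnd_ctx_t *ctx, int num)
{
	if (!ctx || num < 1)
		return -1;
	ctx->num_threads = num;
	return 0;
}

/* Hypothetical getter matching the setter. */
int ibnd_get_num_threads(const ibnd_ctx_t *ctx)
{
	return ctx ? ctx->num_threads : -1;
}

/* Release the context and any state kept across discover calls. */
void ibnd_destroy_ctx(ibnd_ctx_t *ctx)
{
	free(ctx);
}
```

New tunables then become new set/get pairs on the context, so the discover
call itself would not need extra parameters.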

> 
> Sasha


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From taylor at hpc.ufl.edu  Mon Aug 17 09:10:25 2009
From: taylor at hpc.ufl.edu (Charles A. Taylor)
Date: Mon, 17 Aug 2009 12:10:25 -0400
Subject: [ofa-general] IPoIB Transmit Timeouts
Message-ID: <1250525425.20238.37.camel@hotshot.phys.ufl.edu>

We upgraded our file servers to OFED 1.4.1 last Thursday and have since
been hit with a daily ration of the following across all eight of our
servers...

Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647

The difference between the head/tail is always 123.   The send queue
size is 128 according to...

cat /sys/module/ib_ipoib/parameters/send_queue_size 
128
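
For reference, the head/tail arithmetic above can be checked with a small
helper. This is a sketch that assumes tx_head and tx_tail are free-running
32-bit unsigned counters, so the subtraction stays correct across wraparound.
The function name is hypothetical.

```c
#include <stdint.h>

/* Number of sends posted but not yet completed, assuming both counters
 * are free-running 32-bit values. Unsigned subtraction wraps modulo 2^32,
 * so the result is still right after tx_head overflows. */
static uint32_t tx_outstanding(uint32_t tx_head, uint32_t tx_tail)
{
	return (uint32_t)(tx_head - tx_tail);
}
```

For the log lines above, tx_outstanding(868165770, 868165647) is 123, just
short of the send queue size of 128.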

From the post below, others seem to have encountered this, but we have
not seen any patches or workarounds.   Has anyone solved this problem?

The servers were very stable under OFED 1.2.   We are running the
Lustre-patched kernel, but we did that under OFED 1.2 + Lustre 1.6.4.2 as
well, and I'm pretty sure the Lustre patches don't touch the IB modules.

Relevant information:
=====================
CentOS 5.3
Lustre 1.8.0.1
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
X86_64 (Opteron 275s)

hca_id: mthca0
        fw_ver:                         4.8.200
        node_guid:                      0005:ad00:0004:668c
        sys_image_guid:                 0002:c900:0100:d050
        vendor_id:                      0x02c9
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       MT_00A0000001
        phys_port_cnt:                  2
                port:   1
                        state:                  active (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               49
                        port_lmc:               0x00

                port:   2
                        state:                  active (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               98
                        port_lmc:               0x00


Charlie Taylor
UF HPC Center

> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana <
> prade... at linux.vnet.ibm.com> wrote:
> 
> > Hal Rosenstock wrote:
> > > Hi,
> > >
> > > I'm seeing the following messages from IPoIB:
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > NETDEV WATCHDOG: ib0: transmit timed out
> > > ib0: transmit timeout: latency 1374 msecs
> > > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
> > >
> > > What are the possible (and most likely) causes of post_send failures ? I
> > > went through the code for all the errors (some at the driver level) but
> > > none popped out at me.
> > >
> >
> > Is it possible that the receiver is overwhelmed and hence the tx_ring is
> > full?
> 


From bryan.d.green at nasa.gov  Mon Aug 17 13:20:36 2009
From: bryan.d.green at nasa.gov (Bryan Green)
Date: Mon, 17 Aug 2009 13:20:36 -0700
Subject: [ofa-general] librdmacm - okay to select on a cm channel's file
	descriptor?
In-Reply-To: Your message of "Sun, 16 Aug 2009 22:47:02 CDT."
	<adar5vb2jbd.fsf@cisco.com>
Message-ID: <20090817202036.46FA42391C7@ece06.nas.nasa.gov>

Roland Dreier writes:
> 
>  > In an attempt to get unexpected DISCONNECT notifications during
>  > ib communication, I'm trying to use 'select()' on the cm channel's file
>  > descriptor, testing it for readability.  I've found that this works some of
>  > the time, but not all of the time.
> 
> What happens when it doesn't work?  select() doesn't give you an event
> but when you try to read there actually is an event there?

Yes.

> I took a quick look at the ucma kernel code and the implementation of
> select() (the kernel uses poll() as the name but it all ends up in the
> same code) looks straightforwardly correct -- there's only one place
> where events are added to the queue for a file, and that place wakes up
> the poll wait queue.  But maybe there is a funny bug somehow.
> 
>  - R.

Thanks for confirming that use of select() should work.  I wasn't sure because
I couldn't find any reference to select() in the docs or examples.
After further investigation, I found that I was neglecting
to reinitialize the fd_set structure after each call to select.
It's working great now.

Thanks,
-bryan


From weiny2 at llnl.gov  Mon Aug 17 14:03:38 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:38 -0700
Subject: [ofa-general] Re: [PATCH 3/5 V2] libibnetdisc: make all fields of
 ibnd_fabric_t public
In-Reply-To: <20090816114127.GW25501@me>
References: <20090813204251.df6446c1.weiny2@llnl.gov>
	<20090816114127.GW25501@me>
Message-ID: <20090817140338.edd83fe0.weiny2@llnl.gov>

On Sun, 16 Aug 2009 14:41:27 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 20:42 Thu 13 Aug     , Ira Weiny wrote:
> > 
> > @@ -108,8 +107,8 @@ typedef struct ibnd_port {
> >  /** =========================================================================
> >   * Chassis
> >   */
> > -typedef struct chassis {
> > -	struct chassis *next;
> > +typedef struct ibnd_chassis {
> > +	struct ibnd_chassis *next;
> >  	uint64_t chassisguid;
> >  	unsigned char chassisnum;
> >  
> > @@ -124,11 +123,17 @@ typedef struct chassis {
> >  	ibnd_node_t *linenode[LINES_MAX_NUM + 1];
> >  } ibnd_chassis_t;
> >  
> > +/* HASH table defines */
> > +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
> 
> Why should this macro be published (by moving from internal.h to
> ibnetdisc.h)?
> 
> As far as I can see it is only used in ibnetdisc.c, so we can keep it
> internal and move it to this file.
> 

You are right, good catch.  I just copied it blindly along with HTSZ, which
must be there.
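
To spell out the distinction, a small sketch under my own naming:
HTSZ sizes the nodestbl/portstbl arrays inside ibnd_fabric_t, so it has to be
visible wherever the struct is, while HASHGUID is only needed where a bucket
index is actually computed.  The two defines are copied from the patch;
guid_bucket() is a hypothetical helper, not something in the library.

```c
#include <stdint.h>

/* Copied from the patch. */
#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
#define HTSZ 137

/* Hypothetical helper: hash-table bucket for a 64-bit GUID.
 * Mixes the low and high 32-bit halves, then reduces modulo the
 * table size, the same way the lookup code does inline. */
static inline unsigned guid_bucket(uint64_t guid)
{
	return HASHGUID(guid) % HTSZ;
}
```

Only the result of the modulo is ever used as an index, so the macro itself
can stay private to the one file that does the lookups.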

git am is not working now on the last two patches [4/5 and 5/5] so I am
sending new versions of them so that they apply cleanly.

V2 below,
Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 13 Aug 2009 20:08:51 -0700
Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public

	In addition clean up the name of the chassis struct

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   38 ++++++++----
 infiniband-diags/libibnetdisc/src/chassis.c        |   23 ++++----
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   63 +++++++++-----------
 infiniband-diags/libibnetdisc/src/internal.h       |   21 -------
 4 files changed, 66 insertions(+), 79 deletions(-)
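
Note for reviewers: the layout this patch removes is the usual "public
struct as first member of a private struct" idiom.  A rough sketch of that
idiom follows; every demo_* name is invented for illustration and none of
this is library code.

```c
#include <stdlib.h>

/* Hypothetical public view, as a caller would see it. */
typedef struct demo_public {
	int maxhops_discovered;
} demo_public_t;

/* Hypothetical private wrapper.  The public part MUST be the first
 * member, otherwise the cast below is not valid. */
struct demo_private {
	demo_public_t pub;
	int internal_state;
};

#define DEMO_CONV_INTERNAL(p) ((struct demo_private *)(p))

/* Allocates the private struct but hands out only the public view. */
demo_public_t *demo_alloc(void)
{
	struct demo_private *f = calloc(1, sizeof(*f));
	return f ? &f->pub : NULL;
}

/* Every accessor has to convert back before touching internals. */
void demo_set_internal(demo_public_t *p, int v)
{
	DEMO_CONV_INTERNAL(p)->internal_state = v;
}

int demo_get_internal(demo_public_t *p)
{
	return DEMO_CONV_INTERNAL(p)->internal_state;
}
```

With all fields in one public struct the conversion macro and the
"MUST BE FIRST" constraint both go away, at the cost of exposing fields that
are marked internal use only.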

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 4a57855..c55ce00 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -38,8 +38,7 @@
 #include <infiniband/mad.h>
 #include <iba/ib_types.h>
 
-struct ib_fabric;		/* forward declare */
-struct chassis;			/* forward declare */
+struct ibnd_chassis;		/* forward declare */
 struct ibnd_port;		/* forward declare */
 
 /** =========================================================================
@@ -67,13 +66,13 @@ typedef struct ibnd_node {
 
 	char nodedesc[IB_SMP_DATA_SIZE];
 
-	struct ibnd_port **ports; /* in order array of port pointers
-				   the size of this array is info.numports + 1
-				   items MAY BE NULL!  (ie 0 == switches only) */
+	struct ibnd_port **ports;	/* in order array of port pointers
+					   the size of this array is info.numports + 1
+					   items MAY BE NULL!  (ie 0 == switches only) */
 
 	/* chassis info */
 	struct ibnd_node *next_chassis_node;	/* next node in ibnd_chassis_t->nodes */
-	struct chassis *chassis;	/* if != NULL the chassis this node belongs to */
+	struct ibnd_chassis *chassis;	/* if != NULL the chassis this node belongs to */
 	unsigned char ch_type;
 	unsigned char ch_anafanum;
 	unsigned char ch_slotnum;
@@ -92,9 +91,9 @@ typedef struct ibnd_node {
 typedef struct ibnd_port {
 	uint64_t guid;
 	int portnum;
-	int ext_portnum; /* optional if != 0 external port num */
-	ibnd_node_t *node; /* node this port belongs to */
-	struct ibnd_port *remoteport; /* null if SMA, or does not exist */
+	int ext_portnum;	/* optional if != 0 external port num */
+	ibnd_node_t *node;	/* node this port belongs to */
+	struct ibnd_port *remoteport;	/* null if SMA, or does not exist */
 	/* quick cache of info below */
 	uint16_t base_lid;
 	uint8_t lmc;
@@ -108,8 +107,8 @@ typedef struct ibnd_port {
 /** =========================================================================
  * Chassis
  */
-typedef struct chassis {
-	struct chassis *next;
+typedef struct ibnd_chassis {
+	struct ibnd_chassis *next;
 	uint64_t chassisguid;
 	unsigned char chassisnum;
 
@@ -124,11 +123,14 @@ typedef struct chassis {
 	ibnd_node_t *linenode[LINES_MAX_NUM + 1];
 } ibnd_chassis_t;
 
+#define HTSZ 137
+#define MAXHOPS		63
+
 /** =========================================================================
  * Fabric
  * Main fabric object which is returned and represents the data discovered
  */
-typedef struct ib_fabric {
+typedef struct ibnd_fabric {
 	/* the node the discover was initiated from
 	 * "from" parameter in ibnd_discover_fabric
 	 * or by default the node you ar running on
@@ -139,6 +141,18 @@ typedef struct ib_fabric {
 	/* NULL terminated list of all chassis found in the fabric */
 	ibnd_chassis_t *chassis;
 	int maxhops_discovered;
+
+	/* internal use only */
+	ibnd_node_t *nodestbl[HTSZ];
+	ibnd_port_t *portstbl[HTSZ];
+	ibnd_node_t *nodesdist[MAXHOPS + 1];
+	ibnd_chassis_t *first_chassis;
+	ibnd_chassis_t *current_chassis;
+	ibnd_chassis_t *last_chassis;
+	ibnd_node_t *switches;
+	ibnd_node_t *ch_adapters;
+	ibnd_node_t *routers;
+	ib_portid_t selfportid;
 } ibnd_fabric_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 0dd259a..4886cfc 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -91,7 +91,7 @@ char *ibnd_get_chassis_slot_str(ibnd_node_t * node, char *str, size_t size)
 	return (str);
 }
 
-static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric,
+static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric,
 				       unsigned char chassisnum)
 {
 	ibnd_chassis_t *current;
@@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node)
 		return sysimgguid;
 }
 
-static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f,
+static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric,
 					ibnd_node_t * node)
 {
 	ibnd_chassis_t *current;
 	uint64_t chguid;
 
 	chguid = get_chassisguid(node);
-	for (current = f->first_chassis; current; current = current->next) {
+	for (current = fabric->first_chassis; current; current = current->next) {
 		if (current->chassisguid == chguid)
 			return current;
 	}
@@ -224,7 +224,6 @@ static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f,
 
 uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	ibnd_chassis_t *chassis;
 
 	if (!fabric) {
@@ -232,7 +231,7 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum)
 		return 0;
 	}
 
-	chassis = find_chassisnum(f, chassisnum);
+	chassis = find_chassisnum(fabric, chassisnum);
 	if (chassis)
 		return chassis->chassisguid;
 	else
@@ -783,7 +782,7 @@ static void voltaire_portmap(ibnd_port_t * port)
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }
 
-static int add_chassis(struct ibnd_fabric *fabric)
+static int add_chassis(ibnd_fabric_t * fabric)
 {
 	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
 		IBND_ERROR("OOM: failed to allocate chassis object\n");
@@ -819,7 +818,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
 	Returns:
 	0 on success, -1 on failure
 */
-int group_nodes(struct ibnd_fabric *fabric)
+int group_nodes(ibnd_fabric_t * fabric)
 {
 	ibnd_node_t *node;
 	int dist;
@@ -833,7 +832,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 	/* an appropriate chassis record (slotnum and position) */
 	/* according to internal connectivity */
 	/* not very efficient but clear code so... */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
@@ -844,7 +843,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 
 	/* separate every Voltaire chassis from each other and build linked list of them */
 	/* algorithm: catch spine and find all surrounding nodes */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
@@ -863,7 +862,7 @@ int group_nodes(struct ibnd_fabric *fabric)
 
 	/* now make pass on nodes for chassis which are not Voltaire */
 	/* grouped by common SystemImageGUID */
-	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
@@ -913,12 +912,12 @@ int group_nodes(struct ibnd_fabric *fabric)
 				}
 			}
 		}
-		if (dist == fabric->fabric.maxhops_discovered)
+		if (dist == fabric->maxhops_discovered)
 			dist = MAXHOPS;	/* skip to CAs */
 		else
 			dist++;
 	}
 
-	fabric->fabric.chassis = fabric->first_chassis;
+	fabric->chassis = fabric->first_chassis;
 	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 5d506ee..c69467e 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -67,7 +67,7 @@ void decode_port_info(ibnd_port_t * port)
 }
 
 static int get_port_info(struct ibmad_port *ibmad_port,
-			 struct ibnd_fabric *fabric, ibnd_port_t * port,
+			 ibnd_fabric_t * fabric, ibnd_port_t * port,
 			 int portnum, ib_portid_t * portid)
 {
 	char width[64], speed[64];
@@ -98,7 +98,7 @@ static int get_port_info(struct ibmad_port *ibmad_port,
  * Returns -1 if error.
  */
 static int query_node_info(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, ibnd_node_t * node,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
 			   ib_portid_t * portid)
 {
 	if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0,
@@ -116,7 +116,7 @@ static int query_node_info(struct ibmad_port *ibmad_port,
 /*
  * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
  */
-static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
+static int query_node(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 		      ibnd_node_t * node, ibnd_port_t * port,
 		      ib_portid_t * portid)
 {
@@ -175,28 +175,28 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
-static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
+static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
 	int rc = 0;
 
 	if (portid->lid) {
 		/* If we were LID routed we need to set up the drslid */
-		if (!f->selfportid.lid)
-			if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
+		if (!fabric->selfportid.lid)
+			if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL,
 						ibmad_port) < 0) {
 				IBND_ERROR("Failed to resolve self\n");
 				return -1;
 			}
 
-		portid->drpath.drslid = (uint16_t) f->selfportid.lid;
+		portid->drpath.drslid = (uint16_t) fabric->selfportid.lid;
 		portid->drpath.drdlid = 0xFFFF;
 	}
 
 	rc = add_port_to_dpath(&portid->drpath, nextport);
 
-	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
-		f->fabric.maxhops_discovered = portid->drpath.cnt;
+	if ((rc != -1) && (portid->drpath.cnt > fabric->maxhops_discovered))
+		fabric->maxhops_discovered = portid->drpath.cnt;
 	return (rc);
 }
 
@@ -215,7 +215,7 @@ static void dump_endnode(ib_portid_t * path, char *prompt,
 	       port->base_lid + (1 << port->lmc) - 1, node->nodedesc);
 }
 
-static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
+static ibnd_node_t *find_existing_node(ibnd_fabric_t * fabric,
 				       ibnd_node_t * new)
 {
 	int hash = HASHGUID(new->guid) % HTSZ;
@@ -230,7 +230,6 @@ static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric,
 
 ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int hash = HASHGUID(guid) % HTSZ;
 	ibnd_node_t *node;
 
@@ -239,7 +238,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 		return (NULL);
 	}
 
-	for (node = f->nodestbl[hash]; node; node = node->htnext)
+	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
 		if (node->guid == guid)
 			return (ibnd_node_t *) node;
 
@@ -267,7 +266,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 	char portinfo_port0[IB_SMP_DATA_SIZE];
 	void *nd = node->nodedesc;
 	int p = 0;
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 
 	if (_check_ibmad_port(ibmad_port) < 0)
 		return (NULL);
@@ -282,7 +280,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 		return (NULL);
 	}
 
-	if (query_node_info(ibmad_port, f, node, &(node->path_portid)))
+	if (query_node_info(ibmad_port, fabric, node, &(node->path_portid)))
 		return (NULL);
 
 	if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0,
@@ -291,7 +289,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
 
 	/* update all the port info's */
 	for (p = 1; p >= node->numports; p++) {
-		get_port_info(ibmad_port, f, node->ports[p],
+		get_port_info(ibmad_port, fabric, node->ports[p],
 			      p, &(node->path_portid));
 	}
 
@@ -319,7 +317,6 @@ done:
 
 ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int i = 0;
 	ibnd_node_t *rc;
 	ib_dr_path_t path;
@@ -329,7 +326,7 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 		return (NULL);
 	}
 
-	rc = f->fabric.from_node;
+	rc = fabric->from_node;
 
 	if (str2drpath(&path, dr_str, 0, 0) == -1) {
 		return (NULL);
@@ -368,7 +365,7 @@ static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[])
 	hash[hash_idx] = port;
 }
 
-static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
+static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric)
 {
 	switch (node->type) {
 	case IB_NODE_CA:
@@ -386,7 +383,7 @@ static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric)
 	}
 }
 
-static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
+static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric)
 {
 	int dist = node->dist;
 	if (node->type != IB_NODE_SWITCH)
@@ -396,7 +393,7 @@ static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric)
 	fabric->nodesdist[dist] = node;
 }
 
-static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
+static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
 				ibnd_node_t * temp, ib_portid_t * path,
 				int dist)
 {
@@ -415,8 +412,8 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric,
 	add_to_nodeguid_hash(node, fabric->nodestbl);
 
 	/* add this to the all nodes list */
-	node->next = fabric->fabric.nodes;
-	fabric->fabric.nodes = (ibnd_node_t *) node;
+	node->next = fabric->nodes;
+	fabric->nodes = (ibnd_node_t *) node;
 
 	add_to_type_list(node, fabric);
 	add_to_nodedist(node, fabric);
@@ -433,7 +430,7 @@ static struct ibnd_port *find_existing_port_node(ibnd_node_t * node,
 	return (node->ports[port->portnum]);
 }
 
-static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric,
+static struct ibnd_port *add_port_to_node(ibnd_fabric_t * fabric,
 					  ibnd_node_t * node,
 					  ibnd_port_t * temp)
 {
@@ -479,7 +476,7 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 }
 
 static int get_remote_node(struct ibmad_port *ibmad_port,
-			   struct ibnd_fabric *fabric, ibnd_node_t * node,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
 			   ibnd_port_t * port, ib_portid_t * path,
 			   int portnum, int dist)
 {
@@ -541,7 +538,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 				    ib_portid_t * from, int hops)
 {
 	int rc = 0;
-	struct ibnd_fabric *fabric = NULL;
+	ibnd_fabric_t *fabric = NULL;
 	ib_portid_t my_portid = { 0 };
 	ibnd_node_t node_buf;
 	ibnd_port_t port_buf;
@@ -587,7 +584,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	if (!node)
 		goto error;
 
-	fabric->fabric.from_node = (ibnd_node_t *) node;
+	fabric->from_node = (ibnd_node_t *) node;
 
 	port = add_port_to_node(fabric, node, &port_buf);
 	if (!port)
@@ -668,7 +665,6 @@ static void destroy_node(ibnd_node_t * node)
 
 void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	int dist = 0;
 	ibnd_node_t *node = NULL;
 	ibnd_node_t *next = NULL;
@@ -677,21 +673,21 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 	if (!fabric)
 		return;
 
-	ch = f->first_chassis;
+	ch = fabric->first_chassis;
 	while (ch) {
 		ch_next = ch->next;
 		free(ch);
 		ch = ch_next;
 	}
 	for (dist = 0; dist <= MAXHOPS; dist++) {
-		node = f->nodesdist[dist];
+		node = fabric->nodesdist[dist];
 		while (node) {
 			next = node->dnext;
 			destroy_node(node);
 			node = next;
 		}
 	}
-	free(f);
+	free(fabric);
 }
 
 void ibnd_debug(int i)
@@ -735,7 +731,6 @@ void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 			  int node_type, void *user_data)
 {
-	struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric);
 	ibnd_node_t *list = NULL;
 	ibnd_node_t *cur = NULL;
 
@@ -751,13 +746,13 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
 
 	switch (node_type) {
 	case IB_NODE_SWITCH:
-		list = f->switches;
+		list = fabric->switches;
 		break;
 	case IB_NODE_CA:
-		list = f->ch_adapters;
+		list = fabric->ch_adapters;
 		break;
 	case IB_NODE_ROUTER:
-		list = f->routers;
+		list = fabric->routers;
 		break;
 	default:
 		IBND_DEBUG("Invalid node_type specified %d\n", node_type);
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index f06d2c3..21ff476 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -40,8 +40,6 @@
 
 #include <infiniband/ibnetdisc.h>
 
-#define MAXHOPS		63
-
 #define	IBND_DEBUG(fmt, ...) \
 	if (ibdebug) { \
 		printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \
@@ -51,24 +49,5 @@
 
 /* HASH table defines */
 #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
-#define HTSZ 137
-
-struct ibnd_fabric {
-	/* This member MUST BE FIRST */
-	ibnd_fabric_t fabric;
-
-	/* internal use only */
-	ibnd_node_t *nodestbl[HTSZ];
-	ibnd_port_t *portstbl[HTSZ];
-	ibnd_node_t *nodesdist[MAXHOPS + 1];
-	ibnd_chassis_t *first_chassis;
-	ibnd_chassis_t *current_chassis;
-	ibnd_chassis_t *last_chassis;
-	ibnd_node_t *switches;
-	ibnd_node_t *ch_adapters;
-	ibnd_node_t *routers;
-	ib_portid_t selfportid;
-};
-#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
 
 #endif				/* _INTERNAL_H_ */
-- 
1.5.4.5


From weiny2 at llnl.gov  Mon Aug 17 14:03:41 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:41 -0700
Subject: [ofa-general] Re: [PATCH 4/5 v2] infiniband-diags/libibnetdisc:
 Introduce a context object.
In-Reply-To: <20090817083023.da17378b.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
Message-ID: <20090817140341.3dcccc10.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 17 Aug 2009 13:10:45 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.

	This object must be created before "query" functions can be used.

	The purpose of this is to allow for future data to be passed to query
	functions (ie ibnd_discover_fabric) without having to change the API of
	those functions.

	Adjusted to apply to v2 of "libibnetdisc: make all fields of ibnd_fabric_t
	public"

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/Makefile.am          |    4 +-
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   23 ++++--
 .../libibnetdisc/man/ibnd_create_ctx.3             |    2 +
 .../libibnetdisc/man/ibnd_destroy_ctx.3            |    2 +
 .../libibnetdisc/man/ibnd_discover_fabric.3        |   41 ++++++++---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   74 ++++++++++++++------
 infiniband-diags/libibnetdisc/src/internal.h       |    5 ++
 infiniband-diags/libibnetdisc/src/libibnetdisc.map |    2 +
 infiniband-diags/libibnetdisc/test/testleaks.c     |    7 ++-
 infiniband-diags/src/iblinkinfo.c                  |   14 ++--
 infiniband-diags/src/ibnetdiscover.c               |   13 +++-
 infiniband-diags/src/ibqueryerrors.c               |   18 +++--
 12 files changed, 147 insertions(+), 58 deletions(-)
 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3
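
Note for reviewers: a stripped-down sketch of the opaque-context pattern this
patch introduces.  The demo_* names and the void * port are stand-ins so that
the example builds without libibmad; the real struct and functions are in the
diff below.

```c
#include <stdlib.h>

/* Callers only ever see this typedef; the layout stays private. */
typedef struct demo_ctx demo_ctx_t;

/* Hypothetical private definition.  New fields can be added here
 * later without changing any function signature. */
struct demo_ctx {
	void *port;		/* stands in for struct ibmad_port * */
	int show_progress;
};

demo_ctx_t *demo_create_ctx(void *port)
{
	demo_ctx_t *ctx = calloc(1, sizeof *ctx);
	if (!ctx)
		return NULL;
	ctx->port = port;
	return ctx;
}

void demo_destroy_ctx(demo_ctx_t *ctx)
{
	free(ctx);
}

/* Returns the previous setting, or -1 if no context was given. */
int demo_show_progress(demo_ctx_t *ctx, int i)
{
	int prev;

	if (!ctx)
		return -1;
	prev = ctx->show_progress;
	ctx->show_progress = i;
	return prev;
}
```

Because the flag lives in the context rather than in a file-scope static,
two contexts in one process can have different progress settings.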

diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am
index 7085f14..5619aad 100644
--- a/infiniband-diags/libibnetdisc/Makefile.am
+++ b/infiniband-diags/libibnetdisc/Makefile.am
@@ -45,7 +45,9 @@ man_MANS = man/ibnd_debug.3 \
 	man/ibnd_iter_nodes.3 \
 	man/ibnd_iter_nodes_type.3 \
 	man/ibnd_show_progress.3 \
-	man/ibnd_update_node.3
+	man/ibnd_update_node.3 \
+	man/ibnd_create_ctx.3 \
+	man/ibnd_destroy_ctx.3
 
 EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver $(man_MANS)
 
diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index c55ce00..ce1c74f 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -38,8 +38,11 @@
 #include <infiniband/mad.h>
 #include <iba/ib_types.h>
 
-struct ibnd_chassis;		/* forward declare */
-struct ibnd_port;		/* forward declare */
+typedef struct ibnd_ctx ibnd_ctx_t;
+
+/* forward declares */
+struct ibnd_chassis;
+struct ibnd_port;
 
 /** =========================================================================
  * Node
@@ -156,15 +159,21 @@ typedef struct ibnd_fabric {
 } ibnd_fabric_t;
 
 /** =========================================================================
- * Initialization (fabric operations)
+ * Initialization
  */
 MAD_EXPORT void ibnd_debug(int i);
-MAD_EXPORT void ibnd_show_progress(int i);
 
-MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
+MAD_EXPORT ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port);
+MAD_EXPORT void ibnd_destroy_ctx(ibnd_ctx_t * ctx);
+MAD_EXPORT int ibnd_show_progress(ibnd_ctx_t * ctx, int i);
+
+/** =========================================================================
+ * Fabric Operations
+ */
+MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 					       ib_portid_t * from, int hops);
 	/**
-	 * open: (required) ibmad_port object from libibmad
+	 * ctx : (required) context created by ibnd_create_ctx.
 	 * from: (optional) specify the node to start scanning from.
 	 *       If NULL start from the node we are running on.
 	 * hops: (optional) Specify how much of the fabric to traverse.
@@ -178,7 +187,7 @@ MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t * fabric);
 MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric,
 					    uint64_t guid);
 MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str);
-MAD_EXPORT ibnd_node_t *ibnd_update_node(struct ibmad_port *ibmad_port,
+MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx,
 					 ibnd_fabric_t * fabric,
 					 ibnd_node_t * node);
 
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
new file mode 100644
index 0000000..8b321b0
--- /dev/null
+++ b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3
@@ -0,0 +1,2 @@
+.\".TH IBND_CREATE_CTX 3  "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3
new file mode 100644
index 0000000..bb9d96a
--- /dev/null
+++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3
@@ -0,0 +1,2 @@
+.\".TH IBND_DESTROY_CTX 3  "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
index dfeaf47..f014977 100644
--- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
+++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3
@@ -1,46 +1,65 @@
 .TH IBND_DISCOVER_FABRIC 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
 .SH "NAME"
-ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- initialize ibnetdiscover library.
+ibnd_create_ctx, ibnd_destroy_ctx,
+ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \-
+initialize ibnetdiscover library and query the fabric.
 .SH "SYNOPSIS"
 .nf
 .B #include <infiniband/ibnetdisc.h>
 .sp
-.bi "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)"
+.BI "ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)"
+.BI "void ibnd_destroy_ctx(ibnd_ctx_t *ctx)"
+.BI "ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t *ctx, ib_portid_t *from, int hops)"
 .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)"
 .BI "void ibnd_debug(int i)"
-.BI "void ibnd_show_progress(int i)"
+.BI "int ibnd_show_progress(ibnd_ctx_t *ctx, int i)"
 .SH "DESCRIPTION"
-.B ibnd_discover_fabric()
-Discover the fabric connected to the port specified by ibmad_port, using a timeout specified.  The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".  This gives the user a "sub-fabric" which is "centered" anywhere they chose.
+.B ibnd_create_ctx()
+Create a context for the ibnetdiscover library to be used in query operations.
 
 ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS
-classes for ibnd_discover_fabric to work.
+classes for queries to work.
+
+.B ibnd_discover_fabric()
+Discover the fabric using the context specified.  The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".  This gives the user a "sub-fabric" which is "centered" anywhere they choose.
 
 .B ibnd_destroy_fabric()
 free all memory and resources associated with the fabric.
 
+.B ibnd_destroy_ctx()
+free all memory and resources associated with the context.
+
 .B ibnd_debug()
 Set the debug level to be printed as library operations take place.
 
-.B ibnd_debug()
-Indicate that the library should print debug output which shows it's progress
+.B ibnd_show_progress()
+Indicate that the library should print output which shows its progress
 through the fabric.
 
 .SH "RETURN VALUE"
+.B ibnd_create_ctx()
+return NULL on failure, otherwise a valid ibnd_ctx_t object.
+
 .B ibnd_discover_fabric()
 return NULL on failure, otherwise a valid ibnd_fabric_t object.
 
-.B ibnd_destory_fabric(), ibnd_debug()
+.B ibnd_show_progress()
+Returns the previous setting for this value.
+
+.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_destroy_ctx()
 NONE
+
 .SH "EXAMPLES"
 
 .B Discover the entire fabric connected to device "mthca0", port 1.
 
 	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
 	struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2);
-	ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0);
+	ibnd_ctx_t *ctx = ibnd_create_ctx(ibmad_port);
+	ibnd_fabric_t *fabric = ibnd_discover_fabric(ctx, NULL, 0);
 	...
 	ibnd_destroy_fabric(fabric);
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 
 .B Discover only a single node and those nodes connected to it.
@@ -48,7 +67,7 @@ NONE
 	...
 	str2drpath(&(port_id.drpath), from, 0, 0);
 	...
-	ibnd_discover_fabric(ibmad_port, 100, &port_id, 1);
+	ibnd_discover_fabric(ctx, &port_id, 1);
 	...
 .SH "SEE ALSO"
 	libibmad, mad_rpc_open_port
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index c69467e..7295189 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -57,9 +57,23 @@
 #include "internal.h"
 #include "chassis.h"
 
-static int show_progress = 0;
 int ibdebug;
 
+ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)
+{
+	ibnd_ctx_t *rc = calloc(1, sizeof *rc);
+	if (!rc)
+		return (NULL);
+
+	rc->ibmad_port = ibmad_port;
+	return (rc);
+}
+
+void ibnd_destroy_ctx(ibnd_ctx_t * ctx)
+{
+	free(ctx);
+}
+
 void decode_port_info(ibnd_port_t * port)
 {
 	port->base_lid = (uint16_t) mad_get_field(port->info, 0, IB_PORT_LID_F);
@@ -204,8 +218,6 @@ static void dump_endnode(ib_portid_t * path, char *prompt,
 			 ibnd_node_t * node, ibnd_port_t * port)
 {
 	char type[64];
-	if (!show_progress)
-		return;
 
 	mad_dump_node_type(type, 64, &(node->type), sizeof(int));
 	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
@@ -260,16 +272,29 @@ static int _check_ibmad_port(struct ibmad_port *ibmad_port)
 	return (0);
 }
 
-ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port,
-			      ibnd_fabric_t * fabric, ibnd_node_t * node)
+static int check_ctx(ibnd_ctx_t * ctx)
+{
+	if (!ctx) {
+		IBND_DEBUG("ctx must be specified\n");
+		return (-1);
+	}
+
+	return (_check_ibmad_port(ctx->ibmad_port));
+}
+
+ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
+			      ibnd_node_t * node)
 {
 	char portinfo_port0[IB_SMP_DATA_SIZE];
 	void *nd = node->nodedesc;
 	int p = 0;
+	struct ibmad_port *ibmad_port;
 
-	if (_check_ibmad_port(ibmad_port) < 0)
+	if (check_ctx(ctx) < 0)
 		return (NULL);
 
+	ibmad_port = ctx->ibmad_port;
+
 	if (!fabric) {
 		IBND_DEBUG("fabric parameter NULL\n");
 		return (NULL);
@@ -475,12 +500,12 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 	remoteport->remoteport = (ibnd_port_t *) port;
 }
 
-static int get_remote_node(struct ibmad_port *ibmad_port,
-			   ibnd_fabric_t * fabric, ibnd_node_t * node,
-			   ibnd_port_t * port, ib_portid_t * path,
-			   int portnum, int dist)
+static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
+			   ibnd_node_t * node, ibnd_port_t * port,
+			   ib_portid_t * path, int portnum, int dist)
 {
 	int rc = 0;
+	struct ibmad_port *ibmad_port = ctx->ibmad_port;
 	ibnd_node_t node_buf;
 	ibnd_port_t port_buf;
 	ibnd_node_t *remotenode, *oldnode;
@@ -524,8 +549,9 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 		goto error;
 	}
 
-	dump_endnode(path, oldnode ? "known remote" : "new remote",
-		     remotenode, remoteport);
+	if (ctx->show_progress)
+		dump_endnode(path, oldnode ? "known remote" : "new remote",
+			     remotenode, remoteport);
 
 	link_ports(node, port, remotenode, remoteport);
 
@@ -534,7 +560,7 @@ error:
 	return (rc);
 }
 
-ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
+ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 				    ib_portid_t * from, int hops)
 {
 	int rc = 0;
@@ -549,7 +575,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	ib_portid_t *path;
 	int max_hops = MAXHOPS - 1;	/* default find everything */
 
-	if (_check_ibmad_port(ibmad_port) < 0)
+	if (check_ctx(ctx) < 0)
 		return (NULL);
 
 	/* if not everything how much? */
@@ -575,7 +601,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	memset(&node_buf, 0, sizeof(node_buf));
 	memset(&port_buf, 0, sizeof(port_buf));
 
-	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, from)) {
+	if (query_node(ctx->ibmad_port, fabric, &node_buf, &port_buf, from)) {
 		IBND_DEBUG("can't reach node %s\n", portid2str(from));
 		goto error;
 	}
@@ -590,7 +616,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	if (!port)
 		goto error;
 
-	rc = get_remote_node(ibmad_port, fabric, node, port, from,
+	rc = get_remote_node(ctx, fabric, node, port, from,
 			     mad_get_field(node->info, 0,
 					   IB_NODE_LOCAL_PORT_F), 0);
 	if (rc < 0)
@@ -605,14 +631,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 			path = &node->path_portid;
 
 			IBND_DEBUG("dist %d node %p\n", dist, node);
-			dump_endnode(path, "processing", node, port);
+			if (ctx->show_progress)
+				dump_endnode(path, "processing", node, port);
 
 			for (i = 1; i <= node->numports; i++) {
 				if (i == mad_get_field(node->info, 0,
 						       IB_NODE_LOCAL_PORT_F))
 					continue;
 
-				if (get_port_info(ibmad_port, fabric,
+				if (get_port_info(ctx->ibmad_port, fabric,
 						  &port_buf, i, path)) {
 					IBND_ERROR
 					    ("can't reach node %s port %d",
@@ -636,7 +663,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 							    IB_NODE_PORT_GUID_F);
 				}
 
-				if (get_remote_node(ibmad_port, fabric, node,
+				if (get_remote_node(ctx, fabric, node,
 						    port, path, i, dist) < 0)
 					goto error;
 			}
@@ -703,9 +730,14 @@ void ibnd_debug(int i)
 	}
 }
 
-void ibnd_show_progress(int i)
+int ibnd_show_progress(ibnd_ctx_t * ctx, int i)
 {
-	show_progress = i;
+	int rc = 0;
+	if (check_ctx(ctx))
+		return (-1);
+	rc = ctx->show_progress;
+	ctx->show_progress = i;
+	return (rc);
 }
 
 void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func,
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 21ff476..b989b68 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -50,4 +50,9 @@
 /* HASH table defines */
 #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
 
+struct ibnd_ctx {
+	struct ibmad_port *ibmad_port;
+	int show_progress;
+};
+
 #endif				/* _INTERNAL_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
index bd108ab..56560ec 100644
--- a/infiniband-diags/libibnetdisc/src/libibnetdisc.map
+++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
@@ -2,6 +2,8 @@ IBNETDISC_1.0 {
 	global:
 		ibnd_debug;
 		ibnd_show_progress;
+		ibnd_create_ctx;
+		ibnd_destroy_ctx;
 		ibnd_discover_fabric;
 		ibnd_destroy_fabric;
 		ibnd_find_node_guid;
diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c
index cb5651e..b121bdd 100644
--- a/infiniband-diags/libibnetdisc/test/testleaks.c
+++ b/infiniband-diags/libibnetdisc/test/testleaks.c
@@ -87,6 +87,7 @@ int main(int argc, char **argv)
 	int hops = 0;
 	ib_portid_t port_id;
 	int iters = -1;
+	ibnd_ctx_t *ctx = NULL;
 
 	struct ibmad_port *ibmad_port;
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
@@ -156,11 +157,12 @@ int main(int argc, char **argv)
 
 	mad_rpc_set_timeout(ibmad_port, timeout_ms);
 
+	ctx = ibnd_create_ctx(ibmad_port);
 	while (iters == -1 || iters-- > 0) {
 		if (from) {
 			/* only scan part of the fabric */
 			str2drpath(&(port_id.drpath), from, 0, 0);
-			if ((fabric = ibnd_discover_fabric(ibmad_port,
+			if ((fabric = ibnd_discover_fabric(ctx,
 							   &port_id,
 							   hops)) == NULL) {
 				fprintf(stderr, "discover failed\n");
@@ -170,7 +172,7 @@ int main(int argc, char **argv)
 			guid = 0;
 		} else {
 			if ((fabric =
-			     ibnd_discover_fabric(ibmad_port, NULL,
+			     ibnd_discover_fabric(ctx, NULL,
 						  -1)) == NULL) {
 				fprintf(stderr, "discover failed\n");
 				rc = 1;
@@ -182,6 +184,7 @@ int main(int argc, char **argv)
 	}
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 	exit(rc);
 }
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 29c4352..f14c6c3 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -274,6 +274,7 @@ int main(int argc, char **argv)
 	int rc = 0;
 	int resolved = -1;
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 	struct ibmad_port *ibmad_port;
 	ib_portid_t port_id = { 0 };
 	int mgmt_classes[3] =
@@ -323,6 +324,8 @@ int main(int argc, char **argv)
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
+	ctx = ibnd_create_ctx(ibmad_port);
+
 	if (dr_path) {
 		/* only scan part of the fabric */
 		if ((resolved =
@@ -340,14 +343,12 @@ int main(int argc, char **argv)
 	}
 
 	if (resolved >= 0)
-		if ((fabric = ibnd_discover_fabric(ibmad_port, &port_id,
-						   hops)) == NULL)
-			IBWARN
-			    ("Single node discover failed; attempting full scan\n");
+		if ((fabric = ibnd_discover_fabric(ctx, &port_id,
+				hops)) == NULL)
+			IBWARN("Single node discover failed; attempting full scan\n");
 
 	if (!fabric)
-		if ((fabric =
-		     ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) {
+		if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
 			goto close_port;
@@ -381,6 +382,7 @@ int main(int argc, char **argv)
 	ibnd_destroy_fabric(fabric);
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	close_node_name_map(node_name_map);
 	mad_rpc_close_port(ibmad_port);
 	exit(rc);
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 2aa29c8..7811976 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -65,6 +65,7 @@ static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
 
 static int report_max_hops = 0;
+static int show_progress = 0;
 
 /**
  * Define our own conversion functions to maintain compatibility with the old
@@ -616,7 +617,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		node_name_map_file = strdup(optarg);
 		break;
 	case 's':
-		ibnd_show_progress(1);
+		show_progress = 1;
 		break;
 	case 'l':
 		list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
@@ -649,6 +650,7 @@ static int process_opt(void *context, int ch, char *optarg)
 int main(int argc, char **argv)
 {
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 
 	struct ibmad_port *ibmad_port;
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
@@ -690,8 +692,14 @@ int main(int argc, char **argv)
 		IBERROR("can't open file %s for writing", argv[0]);
 
 	node_name_map = open_node_name_map(node_name_map_file);
+	ctx = ibnd_create_ctx(ibmad_port);
 
-	if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL)
+	if (!ctx)
+		IBERROR("failed to create libibnetdisc context\n");
+
+	ibnd_show_progress(ctx, show_progress);
+
+	if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL)
 		IBERROR("discover failed\n");
 
 	if (ports_report)
@@ -702,6 +710,7 @@ int main(int argc, char **argv)
 		dump_topology(group, fabric);
 
 	ibnd_destroy_fabric(fabric);
+	ibnd_destroy_ctx(ctx);
 	close_node_name_map(node_name_map);
 	mad_rpc_close_port(ibmad_port);
 	exit(0);
diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c
index f73ca6f..0e4747c 100644
--- a/infiniband-diags/src/ibqueryerrors.c
+++ b/infiniband-diags/src/ibqueryerrors.c
@@ -388,6 +388,7 @@ int main(int argc, char **argv)
 	ib_portid_t portid = { 0 };
 	int rc = 0;
 	ibnd_fabric_t *fabric = NULL;
+	ibnd_ctx_t *ctx = NULL;
 
 	int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS,
 		IB_PERFORMANCE_CLASS
@@ -431,6 +432,8 @@ int main(int argc, char **argv)
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
+	ctx = ibnd_create_ctx(ibmad_port);
+
 	/* limit the scan the fabric around the target */
 	if (dr_path) {
 		if ((resolved =
@@ -448,14 +451,12 @@ int main(int argc, char **argv)
 	}
 
 	if (resolved >= 0)
-		if ((fabric = ibnd_discover_fabric(ibmad_port, &portid,
-						   0)) == NULL)
-			IBWARN
-			    ("Single node discover failed; attempting full scan\n");
-
-	if (!fabric)		/* do a full scan */
-		if ((fabric =
-		     ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) {
+		if ((fabric = ibnd_discover_fabric(ctx, &portid,
+				0)) == NULL)
+			IBWARN("Single node discover failed; attempting full scan\n");
+
+	if (!fabric) /* do a full scan */
+		if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
 			goto close_port;
@@ -490,6 +491,7 @@ int main(int argc, char **argv)
 	ibnd_destroy_fabric(fabric);
 
 close_port:
+	ibnd_destroy_ctx(ctx);
 	mad_rpc_close_port(ibmad_port);
 	close_node_name_map(node_name_map);
 	exit(rc);
-- 
1.5.4.5
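
For readers following the API change in the patch above, here is a minimal sketch of the per-context pattern it introduces: state that used to be a library-wide global (show_progress) moves into a context object, and the setter returns the previous value. The demo_* names are hypothetical stand-ins, not the libibnetdisc API; the real struct ibnd_ctx also carries the struct ibmad_port pointer, which is left out here so the sketch compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ibnd_ctx (ibmad_port member omitted). */
struct demo_ctx {
	int show_progress;
};

/* Allocate a zeroed context; returns NULL on allocation failure. */
static struct demo_ctx *demo_create_ctx(void)
{
	return calloc(1, sizeof(struct demo_ctx));
}

/* Release a context; passing NULL is harmless. */
static void demo_destroy_ctx(struct demo_ctx *ctx)
{
	free(ctx);
}

/*
 * Same shape as the patched ibnd_show_progress(): store the new setting
 * and return the previous one, or -1 if the context is invalid.
 */
static int demo_show_progress(struct demo_ctx *ctx, int i)
{
	int old;

	if (!ctx)
		return -1;
	old = ctx->show_progress;
	ctx->show_progress = i;
	return old;
}

/* Enable progress output for one operation, then restore the old value. */
static int demo_with_progress(struct demo_ctx *ctx)
{
	int old = demo_show_progress(ctx, 1);

	if (old < 0)
		return -1;
	assert(ctx->show_progress == 1);
	/* ... a discovery call would go here ... */
	demo_show_progress(ctx, old);
	return 0;
}
```

Returning the old value lets a caller switch the flag temporarily and put it back, which a plain void setter on a global could not do safely once two contexts exist.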


From weiny2 at llnl.gov  Mon Aug 17 14:03:44 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:44 -0700
Subject: [ofa-general] Re: [PATCH 5/5 v2] infiniband-diags/libibnetdisc:
 remove members of the fabric struct which are used in the scan only.
In-Reply-To: <20090813204316.c6ce0de3.weiny2@llnl.gov>
References: <20090813204316.c6ce0de3.weiny2@llnl.gov>
Message-ID: <20090817140344.3ce003c4.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 17 Aug 2009 13:16:51 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only.

	There is no need to have these be in the public interface.  They can
	cause confusion about which variables present the information to the user.

	Adjusted to apply to v2 of "libibnetdisc: make all fields of ibnd_fabric_t
	public"

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |    6 --
 infiniband-diags/libibnetdisc/src/chassis.c        |   52 +++++++-------
 infiniband-diags/libibnetdisc/src/chassis.h        |    2 +-
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   80 ++++++++++++--------
 infiniband-diags/libibnetdisc/src/internal.h       |   13 +++
 5 files changed, 88 insertions(+), 65 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index ce1c74f..51a35a3 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -127,7 +127,6 @@ typedef struct ibnd_chassis {
 } ibnd_chassis_t;
 
 #define HTSZ 137
-#define MAXHOPS		63
 
 /** =========================================================================
  * Fabric
@@ -148,14 +147,9 @@ typedef struct ibnd_fabric {
 	/* internal use only */
 	ibnd_node_t *nodestbl[HTSZ];
 	ibnd_port_t *portstbl[HTSZ];
-	ibnd_node_t *nodesdist[MAXHOPS + 1];
-	ibnd_chassis_t *first_chassis;
-	ibnd_chassis_t *current_chassis;
-	ibnd_chassis_t *last_chassis;
 	ibnd_node_t *switches;
 	ibnd_node_t *ch_adapters;
 	ibnd_node_t *routers;
-	ib_portid_t selfportid;
 } ibnd_fabric_t;
 
 /** =========================================================================
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 4886cfc..d11d7df 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -96,7 +96,7 @@ static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric,
 {
 	ibnd_chassis_t *current;
 
-	for (current = fabric->first_chassis; current; current = current->next) {
+	for (current = fabric->chassis; current; current = current->next) {
 		if (current->chassisnum == chassisnum)
 			return current;
 	}
@@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node)
 		return sysimgguid;
 }
 
-static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric,
+static ibnd_chassis_t *find_chassisguid(struct ibnd_chassis_ctx *ch_ctx,
 					ibnd_node_t * node)
 {
 	ibnd_chassis_t *current;
 	uint64_t chguid;
 
 	chguid = get_chassisguid(node);
-	for (current = fabric->first_chassis; current; current = current->next) {
+	for (current = ch_ctx->first_chassis; current; current = current->next) {
 		if (current->chassisguid == chguid)
 			return current;
 	}
@@ -782,19 +782,19 @@ static void voltaire_portmap(ibnd_port_t * port)
 		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }
 
-static int add_chassis(ibnd_fabric_t * fabric)
+static int add_chassis(struct ibnd_chassis_ctx *ch_ctx)
 {
-	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
+	if (!(ch_ctx->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
 		IBND_ERROR("OOM: failed to allocate chassis object\n");
 		return (-1);
 	}
 
-	if (fabric->first_chassis == NULL) {
-		fabric->first_chassis = fabric->current_chassis;
-		fabric->last_chassis = fabric->current_chassis;
+	if (ch_ctx->first_chassis == NULL) {
+		ch_ctx->first_chassis = ch_ctx->current_chassis;
+		ch_ctx->last_chassis = ch_ctx->current_chassis;
 	} else {
-		fabric->last_chassis->next = fabric->current_chassis;
-		fabric->last_chassis = fabric->current_chassis;
+		ch_ctx->last_chassis->next = ch_ctx->current_chassis;
+		ch_ctx->last_chassis = ch_ctx->current_chassis;
 	}
 	return (0);
 }
@@ -818,22 +818,22 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node)
 	Returns:
 	0 on success, -1 on failure
 */
-int group_nodes(ibnd_fabric_t * fabric)
+int group_nodes(struct ibnd_scan_ctx *scan_ctx, ibnd_fabric_t * fabric)
 {
 	ibnd_node_t *node;
 	int dist;
 	int chassisnum = 0;
 	ibnd_chassis_t *chassis;
+	struct ibnd_chassis_ctx ch_ctx;
 
-	fabric->first_chassis = NULL;
-	fabric->current_chassis = NULL;
+	memset(&ch_ctx, 0, sizeof ch_ctx);
 
 	/* first pass on switches and build for every Voltaire node */
 	/* an appropriate chassis record (slotnum and position) */
 	/* according to internal connectivity */
 	/* not very efficient but clear code so... */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				if (fill_voltaire_chassis_record(node))
@@ -844,7 +844,7 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* separate every Voltaire chassis from each other and build linked list of them */
 	/* algorithm: catch spine and find all surrounding nodes */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) != VTR_VENDOR_ID)
 				continue;
@@ -852,10 +852,10 @@ int group_nodes(ibnd_fabric_t * fabric)
 			    || (node->chassis && node->chassis->chassisnum)
 			    || !is_spine(node))
 				continue;
-			if (add_chassis(fabric))
+			if (add_chassis(&ch_ctx))
 				return (-1);
-			fabric->current_chassis->chassisnum = ++chassisnum;
-			if (build_chassis(node, fabric->current_chassis))
+			ch_ctx.current_chassis->chassisnum = ++chassisnum;
+			if (build_chassis(node, ch_ctx.current_chassis))
 				return (-1);
 		}
 	}
@@ -863,25 +863,25 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* now make pass on nodes for chassis which are not Voltaire */
 	/* grouped by common SystemImageGUID */
 	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
 			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
-				    find_chassisguid(fabric,
+				    find_chassisguid(&ch_ctx,
 						     (ibnd_node_t *) node);
 				if (chassis)
 					chassis->nodecount++;
 				else {
 					/* Possible new chassis */
-					if (add_chassis(fabric))
+					if (add_chassis(&ch_ctx))
 						return (-1);
-					fabric->current_chassis->chassisguid =
+					ch_ctx.current_chassis->chassisguid =
 					    get_chassisguid((ibnd_node_t *)
 							    node);
-					fabric->current_chassis->nodecount = 1;
+					ch_ctx.current_chassis->nodecount = 1;
 				}
 			}
 		}
@@ -890,14 +890,14 @@ int group_nodes(ibnd_fabric_t * fabric)
 	/* now, make another pass to see which nodes are part of chassis */
 	/* (defined as chassis->nodecount > 1) */
 	for (dist = 0; dist <= MAXHOPS;) {
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->info, 0,
 					  IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
 				continue;
 			if (mad_get_field64(node->info, 0,
 					    IB_NODE_SYSTEM_GUID_F)) {
 				chassis =
-				    find_chassisguid(fabric,
+				    find_chassisguid(&ch_ctx,
 						     (ibnd_node_t *) node);
 				if (chassis && chassis->nodecount > 1) {
 					if (!chassis->chassisnum)
@@ -918,6 +918,6 @@ int group_nodes(ibnd_fabric_t * fabric)
 			dist++;
 	}
 
-	fabric->chassis = fabric->first_chassis;
+	fabric->chassis = ch_ctx.first_chassis;
 	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h
index 2191046..707140c 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.h
+++ b/infiniband-diags/libibnetdisc/src/chassis.h
@@ -82,6 +82,6 @@ enum ibnd_chassis_type {
 };
 enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };
 
-int group_nodes(struct ibnd_fabric *fabric);
+int group_nodes(struct ibnd_scan_ctx *scan_ctx, struct ibnd_fabric *fabric);
 
 #endif				/* _CHASSIS_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 7295189..84aac0a 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -189,21 +189,27 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
-static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
+static int extend_dpath(struct ibnd_scan_ctx *scan_ctx,
+			struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
 	int rc = 0;
 
 	if (portid->lid) {
+		if (!scan_ctx) {
+			IBND_ERROR("Invalid internal scan state");
+			return (-1);
+		}
 		/* If we were LID routed we need to set up the drslid */
-		if (!fabric->selfportid.lid)
-			if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL,
-						ibmad_port) < 0) {
+		if (!scan_ctx->selfportid.lid)
+			if (ib_resolve_self_via
+			    (&scan_ctx->selfportid, NULL, NULL,
+			     ibmad_port) < 0) {
 				IBND_ERROR("Failed to resolve self\n");
 				return -1;
 			}
 
-		portid->drpath.drslid = (uint16_t) fabric->selfportid.lid;
+		portid->drpath.drslid = (uint16_t) scan_ctx->selfportid.lid;
 		portid->drpath.drdlid = 0xFFFF;
 	}
 
@@ -408,19 +414,25 @@ static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric)
 	}
 }
 
-static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric)
+static void add_to_nodedist(ibnd_node_t * node, struct ibnd_scan_ctx *scan_ctx)
 {
 	int dist = node->dist;
+
+	if (!scan_ctx) {
+		IBND_ERROR("Invalid internal scan state");
+		return;
+	}
+
 	if (node->type != IB_NODE_SWITCH)
 		dist = MAXHOPS;	/* special Ca list */
 
-	node->dnext = fabric->nodesdist[dist];
-	fabric->nodesdist[dist] = node;
+	node->dnext = scan_ctx->nodesdist[dist];
+	scan_ctx->nodesdist[dist] = node;
 }
 
-static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
-				ibnd_node_t * temp, ib_portid_t * path,
-				int dist)
+static ibnd_node_t *create_node(struct ibnd_scan_ctx *scan_ctx,
+				ibnd_fabric_t * fabric, ibnd_node_t * temp,
+				ib_portid_t * path, int dist)
 {
 	ibnd_node_t *node;
 
@@ -441,7 +453,7 @@ static ibnd_node_t *create_node(ibnd_fabric_t * fabric,
 	fabric->nodes = (ibnd_node_t *) node;
 
 	add_to_type_list(node, fabric);
-	add_to_nodedist(node, fabric);
+	add_to_nodedist(node, scan_ctx);
 
 	return node;
 }
@@ -500,9 +512,10 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port,
 	remoteport->remoteport = (ibnd_port_t *) port;
 }
 
-static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
-			   ibnd_node_t * node, ibnd_port_t * port,
-			   ib_portid_t * path, int portnum, int dist)
+static int get_remote_node(ibnd_ctx_t * ctx, struct ibnd_scan_ctx *scan_ctx,
+			   ibnd_fabric_t * fabric, ibnd_node_t * node,
+			   ibnd_port_t * port, ib_portid_t * path,
+			   int portnum, int dist)
 {
 	int rc = 0;
 	struct ibmad_port *ibmad_port = ctx->ibmad_port;
@@ -521,7 +534,7 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
 	    != IB_PORT_PHYS_STATE_LINKUP)
 		return 1;	/* positive == non-fatal error */
 
-	if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
+	if (extend_dpath(scan_ctx, ibmad_port, fabric, path, portnum) < 0)
 		return -1;
 
 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
@@ -534,7 +547,9 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric,
 	oldnode = find_existing_node(fabric, &node_buf);
 	if (oldnode)
 		remotenode = oldnode;
-	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) {
+	else if (!
+		 (remotenode =
+		  create_node(scan_ctx, fabric, &node_buf, path, dist + 1))) {
 		rc = -1;
 		goto error;
 	}
@@ -574,10 +589,13 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 	int dist = 0;
 	ib_portid_t *path;
 	int max_hops = MAXHOPS - 1;	/* default find everything */
+	struct ibnd_scan_ctx scan_ctx;
 
 	if (check_ctx(ctx) < 0)
 		return (NULL);
 
+	memset(&scan_ctx, 0, sizeof scan_ctx);
+
 	/* if not everything how much? */
 	if (hops >= 0) {
 		max_hops = hops;
@@ -606,7 +624,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 		goto error;
 	}
 
-	node = create_node(fabric, &node_buf, from, 0);
+	node = create_node(&scan_ctx, fabric, &node_buf, from, 0);
 	if (!node)
 		goto error;
 
@@ -616,7 +634,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 	if (!port)
 		goto error;
 
-	rc = get_remote_node(ctx, fabric, node, port, from,
+	rc = get_remote_node(ctx, &scan_ctx, fabric, node, port, from,
 			     mad_get_field(node->info, 0,
 					   IB_NODE_LOCAL_PORT_F), 0);
 	if (rc < 0)
@@ -626,7 +644,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 
 	for (dist = 0; dist <= max_hops; dist++) {
 
-		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+		for (node = scan_ctx.nodesdist[dist]; node; node = node->dnext) {
 
 			path = &node->path_portid;
 
@@ -663,14 +681,15 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx,
 							    IB_NODE_PORT_GUID_F);
 				}
 
-				if (get_remote_node(ctx, fabric, node,
-						    port, path, i, dist) < 0)
+				if (get_remote_node
+				    (ctx, &scan_ctx, fabric, node, port, path,
+				     i, dist) < 0)
 					goto error;
 			}
 		}
 	}
 
-	if (group_nodes(fabric))
+	if (group_nodes(&scan_ctx, fabric))
 		goto error;
 
 	return ((ibnd_fabric_t *) fabric);
@@ -692,7 +711,6 @@ static void destroy_node(ibnd_node_t * node)
 
 void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 {
-	int dist = 0;
 	ibnd_node_t *node = NULL;
 	ibnd_node_t *next = NULL;
 	ibnd_chassis_t *ch, *ch_next;
@@ -700,19 +718,17 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric)
 	if (!fabric)
 		return;
 
-	ch = fabric->first_chassis;
+	ch = fabric->chassis;
 	while (ch) {
 		ch_next = ch->next;
 		free(ch);
 		ch = ch_next;
 	}
-	for (dist = 0; dist <= MAXHOPS; dist++) {
-		node = fabric->nodesdist[dist];
-		while (node) {
-			next = node->dnext;
-			destroy_node(node);
-			node = next;
-		}
+	node = fabric->nodes;
+	while (node) {
+		next = node->next;
+		destroy_node(node);
+		node = next;
 	}
 	free(fabric);
 }
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index b989b68..c866b12 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -50,6 +50,19 @@
 /* HASH table defines */
 #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
 
+#define MAXHOPS		63
+
+struct ibnd_chassis_ctx {
+	ibnd_chassis_t *first_chassis;
+	ibnd_chassis_t *current_chassis;
+	ibnd_chassis_t *last_chassis;
+};
+
+struct ibnd_scan_ctx {
+	ibnd_node_t *nodesdist[MAXHOPS + 1];
+	ib_portid_t selfportid;
+};
+
 struct ibnd_ctx {
 	struct ibmad_port *ibmad_port;
 	int show_progress;
-- 
1.5.4.5
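
As an illustration of the scan-local bookkeeping described in the patch above, the following sketch keeps the first/current/last pointers in a stack-allocated context and publishes only the list head on the long-lived object. All demo_* names are hypothetical; only the overall shape follows add_chassis() and group_nodes(). One deliberate difference, stated as an assumption rather than a claim about the library: the sketch publishes the partial list before returning an error, so that the destroy routine can still free it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down chassis record. */
struct demo_chassis {
	struct demo_chassis *next;
	int chassisnum;
};

/* Scan-only state: lives on the stack of the grouping function. */
struct demo_chassis_ctx {
	struct demo_chassis *first;
	struct demo_chassis *current;
	struct demo_chassis *last;
};

/* Public object: exposes just the head of the finished list. */
struct demo_fabric {
	struct demo_chassis *chassis;
};

/* Append a zeroed chassis at the tail; 0 on success, -1 on OOM. */
static int demo_add_chassis(struct demo_chassis_ctx *ch_ctx)
{
	struct demo_chassis *ch = calloc(1, sizeof(*ch));

	if (!ch)
		return -1;
	ch_ctx->current = ch;
	if (!ch_ctx->first)
		ch_ctx->first = ch;
	else
		ch_ctx->last->next = ch;
	ch_ctx->last = ch;
	return 0;
}

/* Build 'count' numbered chassis, then hand the list to the fabric. */
static int demo_group(struct demo_fabric *fabric, int count)
{
	struct demo_chassis_ctx ch_ctx;
	int i, rc = 0;

	memset(&ch_ctx, 0, sizeof ch_ctx);
	for (i = 0; i < count; i++) {
		if (demo_add_chassis(&ch_ctx)) {
			rc = -1;
			break;
		}
		ch_ctx.current->chassisnum = i + 1;
	}
	/* Publish even on failure so demo_destroy() can free the list. */
	fabric->chassis = ch_ctx.first;
	return rc;
}

/* Free every chassis reachable from the fabric. */
static void demo_destroy(struct demo_fabric *fabric)
{
	struct demo_chassis *ch = fabric->chassis, *next;

	while (ch) {
		next = ch->next;
		free(ch);
		ch = next;
	}
	fabric->chassis = NULL;
}
```

Keeping the tail pointer in the scan context gives constant-time appends during grouping without leaving scan-only fields in the public struct.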


From nmehrotra at riorey.com  Mon Aug 17 15:44:23 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Mon, 17 Aug 2009 18:44:23 -0400
Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send
	indicate?
In-Reply-To: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
	<1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
	<9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>
Message-ID: <4A89DD47.5030902@riorey.com>


Folks,

I am getting this error on a verb send operation and I can't figure out 
what could be the cause; I searched for all instances of this error in 
the IB code and while I found 4, none was illuminating.

As background, we are developing an IB application that uses RDMA for
connection setup and the verbs interface for data transfer. We have
tested the two ends as user space applications and they work - they can
connect and exchange data. We are now converting the server end into a
kernel module and this error is being encountered on the client when it
posts a send to the RDMA connected QP. I have verified that the
connection is set up and that recv WRs with buffers are posted on the QP.
Could it be a protection domain problem? Because we have multiple
clients that connect to the one server, we create the PD on the rdma_id
that is used to connect to the server, not the one that the connection
event gives us. Could that be the problem? We assume that the PD is tied
to the IB device, and there is only one physical IB port in the system.
If that is the problem, then why does this work in the userspace version
and fail in the module version?

Appreciate any pointers on this.

Thanks,

Nitin


From vlad at lists.openfabrics.org  Tue Aug 18 03:00:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 18 Aug 2009 03:00:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090818-0200 daily build status
Message-ID: <20090818100015.79298E61B8A@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From juliav at voltaire.com  Tue Aug 18 03:28:17 2009
From: juliav at voltaire.com (Julia Volynsky)
Date: Tue, 18 Aug 2009 13:28:17 +0300
Subject: [ofa-general] question about
	management/libibmad/include/infiniband/mad.h
Message-ID: <39C75744D164D948A170E9792AF8E7CA027F628C@exil.voltaire.com>

Hello, all.

 
I have a question about management/libibmad/include/infiniband/mad.h
file.

 
Why is the variable ibdebug defined without extern in a .h file?
(MAD_EXPORT int ibdebug;)

In doc/libibmad.txt ibdebug is mentioned as extern int ibdebug;

 
Thank you.

Julia.

 

From jon at opengridcomputing.com  Tue Aug 18 12:12:03 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Tue, 18 Aug 2009 14:12:03 -0500
Subject: [ofa-general] [PATCH] krping: Add support for fast_reg_mr with
	dma_local_lkey
Message-ID: <20090818191203.GB20947@opengridcomputing.com>

For devices that do not support reg_phys_mr (like mlx4), an alternative
is needed to run krping over fast_reg_mr.  In place of reg_phys_mr, use
dma_local_lkey (previously called stag0).  This patch renames the
relevant pieces, adds support for dma_local_lkey in the fastreg case,
and adds debug output in the completion queue handler for unexpected
errors.
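
The lkey selection that results can be sketched in plain C as below. This
is only a userspace model under stated assumptions: struct lkey_sources,
enum mem_mode and pick_lkey() are hypothetical stand-ins, and the final
fallback to a per-buffer MR lkey is my assumption about the branch the
hunks do not show.

```c
#include <stdint.h>

/* Hypothetical stand-in for the memory modes krping supports. */
enum mem_mode { MODE_DMA, MODE_OTHER };

/* Hypothetical model of the fields the real code consults. */
struct lkey_sources {
	int use_local_dma_lkey;		/* set by the local_dma_lkey option */
	enum mem_mode mem;		/* models cb->mem */
	uint32_t device_local_dma_lkey;	/* models device->local_dma_lkey */
	uint32_t dma_mr_lkey;		/* models cb->dma_mr->lkey */
	uint32_t buf_mr_lkey;		/* assumed per-buffer MR lkey */
};

/* Pick the lkey for an SGE in the same priority order as the patch. */
uint32_t pick_lkey(const struct lkey_sources *s)
{
	if (s->use_local_dma_lkey)
		return s->device_local_dma_lkey;
	if (s->mem == MODE_DMA)
		return s->dma_mr_lkey;
	return s->buf_mr_lkey;
}
```

The point of the ordering is that the option overrides the memory mode,
so no per-buffer registration is needed when it is set.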

Signed-Off-By: Jon Mason <jon at opengridcomputing.com>

diff --git a/README b/README
index cfdd771..b5f251f 100644
--- a/README
+++ b/README
@@ -1,5 +1,5 @@
 		Kernel Mode RDMA Ping Module
-		   Steve Wise - 6/2008
+		   Steve Wise - 8/2009
 
 ============
 Introduction
@@ -137,8 +137,8 @@ server_inv 	none		Valid only in fastreg mode, this
 				client's fastreg mr via 
 				SEND_WITH_INVALIDATE messages from
 				the server.
-stag0		none		Use lkey 0 for source of writes and
-				sends, and in recvs
+local_dma_lkey	none		Use the local dma lkey for the source 
+				of writes and sends, and in recvs
 read_inv	none		Server will use READ_WITH_INV. Only
 				valid in fastreg mem_mode.
 				
diff --git a/krping.c b/krping.c
index 7f50cf5..5f6e893 100644
--- a/krping.c
+++ b/krping.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
- * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ * Copyright (c) 2006-2009 Open Grid Computing, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -89,7 +89,7 @@ static const struct krping_option krping_opts[] = {
  	{"duplex", OPT_NOPARAM, 'd'},
  	{"txdepth", OPT_INT, 'T'},
  	{"poll", OPT_NOPARAM, 'P'},
- 	{"stag0", OPT_NOPARAM, 'Z'},
+ 	{"local_dma_lkey", OPT_NOPARAM, 'Z'},
  	{"read_inv", OPT_NOPARAM, 'R'},
 	{NULL, 0, 0}
 };
@@ -239,11 +239,11 @@ struct krping_cb {
 	int duplex;			/* run bw full duplex test */
 	int poll;			/* poll or block for rlat test */
 	int txdepth;			/* SQ depth */
-	int stag0;			/* use 0 for lkey */
+	int local_dma_lkey;		/* use the device's local dma lkey */
 
 	/* CM stuff */
 	struct rdma_cm_id *cm_id;	/* connection on client side,*/
-					/* listener on service side. */
+					/* listener on server side. */
 	struct rdma_cm_id *child_cm_id;	/* connection on server side */
 	struct list_head list;	
 };
@@ -376,9 +376,9 @@ static void krping_cq_event_handler(struct ib_cq *cq, void *ctx)
 				DEBUG_LOG("cq flushed\n");
 				continue;
 			} else {
-				printk(KERN_ERR PFX 
-					"cq completion failed status %d\n",
-					wc.status);
+				printk(KERN_ERR PFX "cq completion failed with "
+				       "wr_id %llx status %d opcode %d vendor_err %x\n",
+					(unsigned long long)wc.wr_id, wc.status, wc.opcode, wc.vendor_err);
 				goto error;
 			}
 		}
@@ -429,8 +429,16 @@ static void krping_cq_event_handler(struct ib_cq *cq, void *ctx)
 			wake_up_interruptible(&cb->sem);
 			break;
 
+		case IB_WC_LOCAL_INV:
+		case IB_WC_FAST_REG_MR:
+			printk(KERN_ERR PFX
+			       "%s:%d Unexpected opcode %d, most likely unsignalled\n",
+			       __func__, __LINE__, wc.opcode);
+			break;
 		default:
-			DEBUG_LOG("unknown!!!!! completion\n");
+			printk(KERN_ERR PFX
+			       "%s:%d Unexpected opcode %d, Shutting down\n",
+			       __func__, __LINE__, wc.opcode);
 			goto error;
 		}
 	}
@@ -476,8 +484,8 @@ static void krping_setup_wr(struct krping_cb *cb)
 {
 	cb->recv_sgl.addr = cb->recv_dma_addr;
 	cb->recv_sgl.length = sizeof cb->recv_buf;
-	if (cb->stag0)
-		cb->recv_sgl.lkey = 0;
+	if (cb->local_dma_lkey)
+		cb->recv_sgl.lkey = cb->qp->device->local_dma_lkey;
 	else if (cb->mem == DMA)
 		cb->recv_sgl.lkey = cb->dma_mr->lkey;
 	else
@@ -487,8 +495,8 @@ static void krping_setup_wr(struct krping_cb *cb)
 
 	cb->send_sgl.addr = cb->send_dma_addr;
 	cb->send_sgl.length = sizeof cb->send_buf;
-	if (cb->stag0)
-		cb->send_sgl.lkey = 0;
+	if (cb->local_dma_lkey)
+		cb->send_sgl.lkey = cb->qp->device->local_dma_lkey;
 	else if (cb->mem == DMA)
 		cb->send_sgl.lkey = cb->dma_mr->lkey;
 	else
@@ -560,34 +568,35 @@ static int krping_setup_buffers(struct krping_cb *cb)
 			goto bail;
 		}
 	} else {
+		if (!cb->local_dma_lkey) {
+			buf.addr = cb->recv_dma_addr;
+			buf.size = sizeof cb->recv_buf;
+			DEBUG_LOG(PFX "recv buf dma_addr %llx size %d\n", buf.addr, 
+				(int)buf.size);
+			iovbase = cb->recv_dma_addr;
+			cb->recv_mr = ib_reg_phys_mr(cb->pd, &buf, 1, 
+						     IB_ACCESS_LOCAL_WRITE, 
+						     &iovbase);
+
+			if (IS_ERR(cb->recv_mr)) {
+				DEBUG_LOG(PFX "recv_buf reg_mr failed\n");
+				ret = PTR_ERR(cb->recv_mr);
+				goto bail;
+			}
 
-		buf.addr = cb->recv_dma_addr;
-		buf.size = sizeof cb->recv_buf;
-		DEBUG_LOG(PFX "recv buf dma_addr %llx size %d\n", buf.addr, 
-			(int)buf.size);
-		iovbase = cb->recv_dma_addr;
-		cb->recv_mr = ib_reg_phys_mr(cb->pd, &buf, 1, 
-					     IB_ACCESS_LOCAL_WRITE, 
-					     &iovbase);
-
-		if (IS_ERR(cb->recv_mr)) {
-			DEBUG_LOG(PFX "recv_buf reg_mr failed\n");
-			ret = PTR_ERR(cb->recv_mr);
-			goto bail;
-		}
-
-		buf.addr = cb->send_dma_addr;
-		buf.size = sizeof cb->send_buf;
-		DEBUG_LOG(PFX "send buf dma_addr %llx size %d\n", buf.addr, 
-			(int)buf.size);
-		iovbase = cb->send_dma_addr;
-		cb->send_mr = ib_reg_phys_mr(cb->pd, &buf, 1, 
-					     0, &iovbase);
-
-		if (IS_ERR(cb->send_mr)) {
-			DEBUG_LOG(PFX "send_buf reg_mr failed\n");
-			ret = PTR_ERR(cb->send_mr);
-			goto bail;
+			buf.addr = cb->send_dma_addr;
+			buf.size = sizeof cb->send_buf;
+			DEBUG_LOG(PFX "send buf dma_addr %llx size %d\n", buf.addr, 
+				(int)buf.size);
+			iovbase = cb->send_dma_addr;
+			cb->send_mr = ib_reg_phys_mr(cb->pd, &buf, 1, 
+						     0, &iovbase);
+
+			if (IS_ERR(cb->send_mr)) {
+				DEBUG_LOG(PFX "send_buf reg_mr failed\n");
+				ret = PTR_ERR(cb->send_mr);
+				goto bail;
+			}
 		}
 	}
 
@@ -921,6 +930,7 @@ static u32 krping_rdma_rkey(struct krping_cb *cb, u64 buf, int post_inv)
 		rkey = cb->dma_mr->rkey;
 		break;
 	default:
+		printk(KERN_ERR PFX "%s:%d case ERROR\n", __func__, __LINE__);
 		cb->state = ERROR;
 		break;
 	}
@@ -1040,8 +1050,8 @@ static void krping_test_server(struct krping_cb *cb)
 		cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey;
 		cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr;
 		cb->rdma_sq_wr.sg_list->length = strlen(cb->rdma_buf) + 1;
-		if (cb->stag0)
-			cb->rdma_sgl.lkey = 0;
+		if (cb->local_dma_lkey)
+			cb->rdma_sgl.lkey = cb->qp->device->local_dma_lkey;
 		else 
 			cb->rdma_sgl.lkey = krping_rdma_rkey(cb, cb->rdma_dma_addr, 0);
 			
@@ -2087,8 +2097,8 @@ int krping_doit(char *cmd)
 			DEBUG_LOG("txdepth %d\n", (int) cb->txdepth);
 			break;
 		case 'Z':
-			cb->stag0 = 1;
-			DEBUG_LOG("using stag 0 for lkeys\n");
+			cb->local_dma_lkey = 1;
+			DEBUG_LOG("using local dma lkey\n");
 			break;
 		case 'R':
 			cb->read_inv = 1;


From arlin.r.davis at intel.com  Tue Aug 18 12:33:59 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Tue, 18 Aug 2009 12:33:59 -0700
Subject: [ofa-general] [PATCH] uDAPL v2 ucm: add new provider using a DAPL
	based IB-UD cm mechanism for MPI implementations.
Message-ID: <A51750533D484EA895D4E05522FE7CC8@amr.corp.intel.com>


The new provider uses its own CM protocol on top of IB-UD queue pairs.
During device open, this provider creates a UD queue pair and
returns local address information via dat_ia_query. This 24 byte
opaque address must be exchanged out-of-band before connecting to a
server via dat_ep_connect. This provider is targeted at MPI
implementations that already exchange address information
during the boot/init phase.
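
As a rough illustration of the 24 byte address, the sketch below packs
the same fields as the ib member of union dcm_addr in this patch. The
struct and helper names (ud_addr_example, ud_addr_pack, ud_addr_lid,
ud_addr_qpn) are hypothetical, and storing lid and qpn in network order
is an assumption based on the ntohs/ntohl calls in qp.c.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical mirror of the ib member of union dcm_addr (24 bytes). */
struct ud_addr_example {
	uint8_t  qp_type;
	uint8_t  port_num;
	uint16_t lid;		/* kept in network byte order */
	uint32_t qpn;		/* kept in network byte order */
	uint8_t  gid[16];	/* raw 128-bit GID */
};

/* Fill the opaque address from host-order values. */
void ud_addr_pack(struct ud_addr_example *a, uint8_t qp_type,
		  uint8_t port_num, uint16_t lid, uint32_t qpn,
		  const uint8_t gid[16])
{
	memset(a, 0, sizeof(*a));
	a->qp_type = qp_type;
	a->port_num = port_num;
	a->lid = htons(lid);
	a->qpn = htonl(qpn);
	memcpy(a->gid, gid, 16);
}

/* Read the lid back in host order. */
uint16_t ud_addr_lid(const struct ud_addr_example *a)
{
	return ntohs(a->lid);
}

/* Read the qpn back in host order. */
uint32_t ud_addr_qpn(const struct ud_addr_example *a)
{
	return ntohl(a->qpn);
}
```

An application would ship these 24 bytes to its peers with whatever
out-of-band channel it already uses at init time, then hand the received
copy to dat_ep_connect as the remote address.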

dtest, dtestx, and dtestcm were modified to report the lid and qpn
information on the server side so you can provide appropriate
destination address information for the client test suite.
Dapltest will not work with this provider.

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 Makefile.am                          |  148 +++-
 dapl/openib_cma/dapl_ib_util.h       |    7 +-
 dapl/openib_common/dapl_ib_common.h  |  136 ++-
 dapl/openib_common/qp.c              |  336 ++++---
 dapl/openib_scm/cm.c                 |  553 +++++------
 dapl/openib_scm/dapl_ib_util.h       |   17 +-
 dapl/openib_ucm/README               |   40 +
 dapl/openib_ucm/SOURCES              |   53 +
 dapl/openib_ucm/cm.c                 | 1837 ++++++++++++++++++++++++++++++++++
 dapl/openib_ucm/dapl_ib_util.h       |  119 +++
 dapl/openib_ucm/device.c             |  603 +++++++++++
 dapl/openib_ucm/linux/openib_osd.h   |   21 +
 dapl/openib_ucm/udapl.rc             |   48 +
 dapl/openib_ucm/windows/openib_osd.h |   35 +
 test/dtest/dtest.c                   |  177 +++-
 test/dtest/dtestcm.c                 |  146 ++-
 test/dtest/dtestx.c                  |   88 ++-
 17 files changed, 3782 insertions(+), 582 deletions(-)
 create mode 100644 dapl/openib_ucm/README
 create mode 100644 dapl/openib_ucm/SOURCES
 create mode 100644 dapl/openib_ucm/cm.c
 create mode 100644 dapl/openib_ucm/dapl_ib_util.h
 create mode 100644 dapl/openib_ucm/device.c
 create mode 100644 dapl/openib_ucm/linux/openib_osd.h
 create mode 100644 dapl/openib_ucm/udapl.rc
 create mode 100644 dapl/openib_ucm/windows/openib_osd.h

diff --git a/Makefile.am b/Makefile.am
index 6842c05..1fe71d9 100755
--- a/Makefile.am
+++ b/Makefile.am
@@ -17,12 +17,10 @@ endif
 
 if EXT_TYPE_IB
 XFLAGS = -DDAT_EXTENSIONS
-XPROGRAMS_CMA = dapl/openib_common/ib_extensions.c
-XPROGRAMS_SCM = dapl/openib_common/ib_extensions.c
+XPROGRAMS = dapl/openib_common/ib_extensions.c
 else
 XFLAGS =
-XPROGRAMS_CMA =
-XPROGRAMS_SCM =
+XPROGRAMS =
 endif
 
 if DEBUG
@@ -34,10 +32,12 @@ endif
 datlibdir = $(libdir)
 dapllibofadir = $(libdir)
 daplliboscmdir = $(libdir)
+daplliboucmdir = $(libdir)
 
 datlib_LTLIBRARIES = dat/udat/libdat2.la
 dapllibofa_LTLIBRARIES = dapl/udapl/libdaplofa.la
 daplliboscm_LTLIBRARIES = dapl/udapl/libdaploscm.la
+daplliboucm_LTLIBRARIES = dapl/udapl/libdaploucm.la
 
 dat_udat_libdat2_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \
 				-I$(srcdir)/dat/include/ -I$(srcdir)/dat/udat/ \
@@ -59,14 +59,24 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG
                                 -I$(srcdir)/dapl/openib_scm \
 				-I$(srcdir)/dapl/openib_scm/linux
 
+dapl_udapl_libdaploucm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \
+                                -DOPENIB -DCQ_WAIT_OBJECT \
+                                -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \
+                                -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \
+				-I$(srcdir)/dapl/openib_common \
+                                -I$(srcdir)/dapl/openib_ucm \
+				-I$(srcdir)/dapl/openib_ucm/linux				
+
 if HAVE_LD_VERSION_SCRIPT
     dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map
     daplofa_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaplofa.map
     daploscm_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaploscm.map
+    daploucm_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaploucm.map
 else
     dat_version_script = 
     daplofa_version_script = 
     daploscm_version_script =
+    daploucm_version_script =
 endif
 
 #
@@ -192,14 +202,14 @@ dapl_udapl_libdaplofa_la_SOURCES = dapl/udapl/dapl_init.c \
         dapl/openib_common/qp.c                     \
         dapl/openib_common/util.c                   \
         dapl/openib_cma/cm.c                        \
-        dapl/openib_cma/device.c $(XPROGRAMS_CMA)
+        dapl/openib_cma/device.c $(XPROGRAMS)
 
 dapl_udapl_libdaplofa_la_LDFLAGS = -version-info 2:0:0 $(daplofa_version_script) \
 				   -Wl,-init,dapl_init -Wl,-fini,dapl_fini \
 				   -lpthread -libverbs -lrdmacm 
 				
 #
-# uDAPL OpenFabrics Socket CM version: libdaplscm.so
+# uDAPL OpenFabrics Socket CM version for IB: libdaploscm.so
 #
 dapl_udapl_libdaploscm_la_SOURCES = dapl/udapl/dapl_init.c \
         dapl/udapl/dapl_evd_create.c               \
@@ -306,11 +316,125 @@ dapl_udapl_libdaploscm_la_SOURCES = dapl/udapl/dapl_init.c \
         dapl/openib_common/qp.c                     \
         dapl/openib_common/util.c                   \
         dapl/openib_scm/cm.c                        \
-        dapl/openib_scm/device.c $(XPROGRAMS_SCM)
+        dapl/openib_scm/device.c $(XPROGRAMS)
 
 dapl_udapl_libdaploscm_la_LDFLAGS = -version-info 2:0:0 $(daploscm_version_script) \
                                    -Wl,-init,dapl_init -Wl,-fini,dapl_fini \
                                    -lpthread -libverbs
+                                   
+#
+# uDAPL OpenFabrics UD CM version for IB: libdaploucm.so
+#
+dapl_udapl_libdaploucm_la_SOURCES = dapl/udapl/dapl_init.c \
+        dapl/udapl/dapl_evd_create.c               \
+        dapl/udapl/dapl_evd_query.c                \
+        dapl/udapl/dapl_cno_create.c               \
+        dapl/udapl/dapl_cno_modify_agent.c         \
+        dapl/udapl/dapl_cno_free.c                 \
+        dapl/udapl/dapl_cno_wait.c                 \
+        dapl/udapl/dapl_cno_query.c                \
+        dapl/udapl/dapl_lmr_create.c               \
+        dapl/udapl/dapl_evd_wait.c                 \
+        dapl/udapl/dapl_evd_disable.c              \
+        dapl/udapl/dapl_evd_enable.c               \
+        dapl/udapl/dapl_evd_modify_cno.c           \
+        dapl/udapl/dapl_evd_set_unwaitable.c       \
+        dapl/udapl/dapl_evd_clear_unwaitable.c     \
+        dapl/udapl/linux/dapl_osd.c                \
+        dapl/common/dapl_cookie.c                   \
+        dapl/common/dapl_cr_accept.c                \
+        dapl/common/dapl_cr_query.c                 \
+        dapl/common/dapl_cr_reject.c                \
+        dapl/common/dapl_cr_util.c                  \
+        dapl/common/dapl_cr_callback.c              \
+        dapl/common/dapl_cr_handoff.c               \
+        dapl/common/dapl_ep_connect.c               \
+        dapl/common/dapl_ep_create.c                \
+        dapl/common/dapl_ep_disconnect.c            \
+        dapl/common/dapl_ep_dup_connect.c           \
+        dapl/common/dapl_ep_free.c                  \
+        dapl/common/dapl_ep_reset.c                 \
+        dapl/common/dapl_ep_get_status.c            \
+        dapl/common/dapl_ep_modify.c                \
+        dapl/common/dapl_ep_post_rdma_read.c        \
+        dapl/common/dapl_ep_post_rdma_write.c       \
+        dapl/common/dapl_ep_post_recv.c             \
+        dapl/common/dapl_ep_post_send.c             \
+        dapl/common/dapl_ep_query.c                 \
+        dapl/common/dapl_ep_util.c                  \
+        dapl/common/dapl_evd_dequeue.c              \
+        dapl/common/dapl_evd_free.c                 \
+        dapl/common/dapl_evd_post_se.c              \
+        dapl/common/dapl_evd_resize.c               \
+        dapl/common/dapl_evd_util.c                 \
+        dapl/common/dapl_evd_cq_async_error_callb.c \
+        dapl/common/dapl_evd_qp_async_error_callb.c \
+        dapl/common/dapl_evd_un_async_error_callb.c \
+        dapl/common/dapl_evd_connection_callb.c     \
+        dapl/common/dapl_evd_dto_callb.c            \
+        dapl/common/dapl_get_consumer_context.c     \
+        dapl/common/dapl_get_handle_type.c          \
+        dapl/common/dapl_hash.c                     \
+        dapl/common/dapl_hca_util.c                 \
+        dapl/common/dapl_ia_close.c                 \
+        dapl/common/dapl_ia_open.c                  \
+        dapl/common/dapl_ia_query.c                 \
+        dapl/common/dapl_ia_util.c                  \
+        dapl/common/dapl_llist.c                    \
+        dapl/common/dapl_lmr_free.c                 \
+        dapl/common/dapl_lmr_query.c                \
+        dapl/common/dapl_lmr_util.c                 \
+        dapl/common/dapl_lmr_sync_rdma_read.c       \
+        dapl/common/dapl_lmr_sync_rdma_write.c      \
+        dapl/common/dapl_mr_util.c                  \
+        dapl/common/dapl_provider.c                 \
+        dapl/common/dapl_sp_util.c                  \
+        dapl/common/dapl_psp_create.c               \
+        dapl/common/dapl_psp_create_any.c           \
+        dapl/common/dapl_psp_free.c                 \
+        dapl/common/dapl_psp_query.c                \
+        dapl/common/dapl_pz_create.c                \
+        dapl/common/dapl_pz_free.c                  \
+        dapl/common/dapl_pz_query.c                 \
+        dapl/common/dapl_pz_util.c                  \
+        dapl/common/dapl_rmr_create.c               \
+        dapl/common/dapl_rmr_free.c                 \
+        dapl/common/dapl_rmr_bind.c                 \
+        dapl/common/dapl_rmr_query.c                \
+        dapl/common/dapl_rmr_util.c                 \
+        dapl/common/dapl_rsp_create.c               \
+        dapl/common/dapl_rsp_free.c                 \
+        dapl/common/dapl_rsp_query.c                \
+        dapl/common/dapl_cno_util.c                 \
+        dapl/common/dapl_set_consumer_context.c     \
+        dapl/common/dapl_ring_buffer_util.c         \
+        dapl/common/dapl_name_service.c             \
+        dapl/common/dapl_timer_util.c               \
+        dapl/common/dapl_ep_create_with_srq.c       \
+        dapl/common/dapl_ep_recv_query.c            \
+        dapl/common/dapl_ep_set_watermark.c         \
+        dapl/common/dapl_srq_create.c               \
+        dapl/common/dapl_srq_free.c                 \
+        dapl/common/dapl_srq_query.c                \
+        dapl/common/dapl_srq_resize.c               \
+        dapl/common/dapl_srq_post_recv.c            \
+        dapl/common/dapl_srq_set_lw.c               \
+        dapl/common/dapl_srq_util.c                 \
+        dapl/common/dapl_debug.c                    \
+        dapl/common/dapl_ia_ha.c                    \
+        dapl/common/dapl_csp.c                      \
+        dapl/common/dapl_ep_post_send_invalidate.c  \
+        dapl/common/dapl_ep_post_rdma_read_to_rmr.c \
+        dapl/openib_common/mem.c                    \
+        dapl/openib_common/cq.c                     \
+        dapl/openib_common/qp.c                     \
+        dapl/openib_common/util.c                   \
+        dapl/openib_ucm/cm.c                        \
+        dapl/openib_ucm/device.c $(XPROGRAMS)
+
+dapl_udapl_libdaploucm_la_LDFLAGS = -version-info 2:0:0 $(daploucm_version_script) \
+                                   -Wl,-init,dapl_init -Wl,-fini,dapl_fini \
+                                   -lpthread -libverbs
 
 libdatincludedir = $(includedir)/dat2
 
@@ -375,9 +499,12 @@ EXTRA_DIST = dat/common/dat_dictionary.h \
 	     dapl/openib_cma/linux/openib_osd.h \
 	     dapl/openib_scm/dapl_ib_util.h \
 	     dapl/openib_scm/linux/openib_osd.h \
+     	     dapl/openib_ucm/dapl_ib_util.h \
+	     dapl/openib_ucm/linux/openib_osd.h \
 	     dat/udat/libdat2.map \
 	     dapl/udapl/libdaplofa.map \
 	     dapl/udapl/libdaploscm.map \
+	     dapl/udapl/libdaploucm.map \
 	     dapl.spec.in \
 	     $(man_MANS) \
 	     test/dapltest/include/dapl_bpool.h \
@@ -419,12 +546,14 @@ install-exec-hook:
 		sed -e '/ofa-v2-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \
 		cp /tmp/$$$$ofadapl $(DESTDIR)$(sysconfdir)/dat.conf; \
 	fi; \
-	echo ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
-	echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"ib0 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"ib1 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+	echo ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+	echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+	echo ucm-mlx4-1 u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+	echo ucm-mlx4-2 u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
 	echo ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ehca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
@@ -433,6 +562,7 @@ install-exec-hook:
 uninstall-hook:
 	if test -e $(DESTDIR)$(sysconfdir)/dat.conf; then \
 		sed -e '/ofa-v2-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \
+		sed -e '/ofa-v2-.* u2/d' -e '/ucm-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \
 		cp /tmp/$$$$ofadapl $(DESTDIR)$(sysconfdir)/dat.conf; \
 	fi;
 
diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index c9ab4d6..35900e7 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -72,8 +72,8 @@ struct dapl_cm_id {
 	DAT_SOCK_ADDR6			r_addr;
 	int				p_len;
 	unsigned char			p_data[256]; /* dapl max private data size */
-	ib_qp_cm_t			dst; /* dapls_modify_qp_state */
-	struct ibv_ah			*ah; /* dapls_modify_qp_state */
+	ib_cm_msg_t			dst;
+	struct ibv_ah			*ah;
 };
 
 typedef struct dapl_cm_id	*dp_ib_cm_handle_t;
@@ -123,9 +123,6 @@ void dapli_async_event_cb(struct _ib_hca_transport *tp);
 void dapli_cq_event_cb(struct _ib_hca_transport *tp);
 dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep);
 void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep);
-DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
-				 IN ib_qp_state_t qp_state,
-				 IN dp_ib_cm_handle_t cm);
 
 STATIC _INLINE_ void dapls_print_cm_list(IN DAPL_IA * ia_ptr)
 {
diff --git a/dapl/openib_common/dapl_ib_common.h b/dapl/openib_common/dapl_ib_common.h
index 2195767..3cd8885 100644
--- a/dapl/openib_common/dapl_ib_common.h
+++ b/dapl/openib_common/dapl_ib_common.h
@@ -50,25 +50,56 @@ typedef	struct ibv_pd		*ib_pd_handle_t;
 typedef	struct ibv_mr		*ib_mr_handle_t;
 typedef	struct ibv_mw		*ib_mw_handle_t;
 typedef	struct ibv_wc		ib_work_completion_t;
+typedef struct ibv_ah		*ib_ah_handle_t;
+typedef union  ibv_gid		*ib_gid_handle_t;
 
 /* HCA context type maps to IB verbs  */
 typedef	struct ibv_context	*ib_hca_handle_t;
 typedef ib_hca_handle_t		dapl_ibal_ca_t;
 
 /* QP info to exchange, wire protocol version for these CM's */
-#define DCM_VER 4
-typedef struct _ib_qp_cm
-{ 
+#define DCM_VER 5
+
+/* CM private data areas, same for all operations */
+#define        DCM_MAX_PDATA_SIZE      128
+
+/*
+ * DAPL IB/QP address (type, port, lid, qp_num, gid) mapping to
+ * DAT_IA_ADDRESS_PTR, DAT_SOCK_ADDR2 (24 bytes)
+ * For applications, like MPI, that exchange IA_ADDRESS
+ * across the fabric before connecting, it eliminates the
+ * overhead of name and address resolution to the destination's
+ * CM services. UCM provider uses this for DAT_IA_ADDRESS.
+ */
+union dcm_addr {
+       DAT_SOCK_ADDR6          so;
+       struct {
+               uint8_t         qp_type;
+               uint8_t         port_num;
+               uint16_t        lid;
+               uint32_t        qpn;
+               union ibv_gid   gid;
+       } ib;
+};
+
+/* 256 bytes total; default max_inline_send, min IB MTU size */
+typedef struct _ib_cm_msg
+{
 	uint16_t		ver;
-	uint16_t		rej;
-	uint16_t		lid;
-	uint16_t		port;
-	uint32_t		qpn;
-	uint32_t		p_size;
-	union ibv_gid		gid;
-	DAT_SOCK_ADDR6		ia_address;
-	uint16_t		qp_type; 
-} ib_qp_cm_t;
+	uint16_t		op;
+	uint16_t		sport; /* src cm port */
+	uint16_t		dport; /* dst cm port */
+	uint32_t		sqpn;  /* src cm qpn */
+	uint32_t		dqpn;  /* dst cm qpn */
+	uint16_t		p_size;
+	uint8_t			resv[14];
+	union dcm_addr		saddr;
+	union dcm_addr		daddr;
+	union dcm_addr		saddr_alt;
+	union dcm_addr		daddr_alt;
+	uint8_t			p_data[DCM_MAX_PDATA_SIZE];
+
+} ib_cm_msg_t;
 
 /* CM events */
 typedef enum {
@@ -113,11 +144,27 @@ typedef uint16_t		ib_hca_port_t;
 
 /* inline send rdma threshold */
 #define	INLINE_SEND_IWARP_DEFAULT	64
-#define	INLINE_SEND_IB_DEFAULT		200
+#define	INLINE_SEND_IB_DEFAULT		256
 
 /* qkey for UD QP's */
 #define DAT_UD_QKEY	0x78654321
 
+/* RC timer - retry count defaults */
+#define DCM_ACK_TIMER	16 /* 5 bits, 4.096us*2^ack_timer. 16== 268ms */
+#define DCM_ACK_RETRY	7  /* 3 bits, 7 * 268ms = 1.8 seconds */
+#define DCM_RNR_TIMER	12 /* 5 bits, 12 =.64ms, 28 =163ms, 31 =491ms */
+#define DCM_RNR_RETRY	7  /* 3 bits, 7 == infinite */
+#define DCM_IB_MTU	2048
+
+/* Global routing defaults */
+#define DCM_GLOBAL	0       /* global routing is disabled */
+#define DCM_HOP_LIMIT	0xff
+#define DCM_TCLASS	0
+
+/* DAPL uCM timers */
+#define DCM_RETRY_CNT		7
+#define DCM_RETRY_TIME_MS	1000
+
 /* DTO OPs, ordered for DAPL ENUM definitions */
 #define OP_RDMA_WRITE           IBV_WR_RDMA_WRITE
 #define OP_RDMA_WRITE_IMM       IBV_WR_RDMA_WRITE_WITH_IMM
@@ -201,6 +248,36 @@ typedef enum
 
 } ib_thread_state_t;
 
+typedef enum dapl_cm_op
+{
+	DCM_REQ,
+	DCM_REP,
+	DCM_REJ_USER, /* user reject */
+	DCM_REJ_CM,   /* cm reject, no SID */
+	DCM_RTU,
+	DCM_DREQ,
+	DCM_DREP
+
+} DAPL_CM_OP;
+
+typedef enum dapl_cm_state
+{
+	DCM_INIT,
+	DCM_LISTEN,
+	DCM_CONN_PENDING,
+	DCM_RTU_PENDING,
+	DCM_ACCEPTING,
+	DCM_ACCEPTING_DATA,
+	DCM_ACCEPTED,
+	DCM_REJECTING,
+	DCM_REJECTED,
+	DCM_CONNECTED,
+	DCM_RELEASED,
+	DCM_DISC_PENDING,
+	DCM_DISCONNECTED,
+	DCM_DESTROY
+
+} DAPL_CM_STATE;
 
 /* provider specfic fields for shared memory support */
 typedef uint32_t ib_shm_transport_t;
@@ -214,6 +291,19 @@ enum ibv_mtu dapl_ib_mtu(int mtu);
 char *dapl_ib_mtu_str(enum ibv_mtu mtu);
 DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len);
 
+/* qp.c */
+DAT_RETURN dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp);
+DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t	qp_handle,
+                                IN ib_qp_state_t	qp_state,
+                                IN uint32_t		qpn,
+                                IN uint16_t		lid,
+                                IN ib_gid_handle_t	gid);
+ib_ah_handle_t dapls_create_ah( IN DAPL_HCA		*hca,
+				IN ib_pd_handle_t	pd,
+				IN ib_qp_handle_t	qp,
+				IN uint16_t		lid,
+				IN ib_gid_handle_t	gid);
+
 /* inline functions */
 STATIC _INLINE_ IB_HCA_NAME dapl_ib_convert_name (IN char *name)
 {
@@ -260,22 +350,6 @@ dapl_convert_errno( IN int err, IN const char *str )
     }
  }
 
-typedef enum dapl_cm_state 
-{
-	DCM_INIT,
-	DCM_LISTEN,
-	DCM_CONN_PENDING,
-	DCM_RTU_PENDING,
-	DCM_ACCEPTING,
-	DCM_ACCEPTING_DATA,
-	DCM_ACCEPTED,
-	DCM_REJECTED,
-	DCM_CONNECTED,
-	DCM_RELEASED,
-	DCM_DISCONNECTED,
-	DCM_DESTROY
-} DAPL_CM_STATE;
-
 STATIC _INLINE_ char * dapl_cm_state_str(IN int st)
 {
 	static char *state[] = {
@@ -286,13 +360,15 @@ STATIC _INLINE_ char * dapl_cm_state_str(IN int st)
 		"CM_ACCEPTING",
 		"CM_ACCEPTING_DATA",
 		"CM_ACCEPTED",
+		"CM_REJECTING",
 		"CM_REJECTED",
 		"CM_CONNECTED",
 		"CM_RELEASED",
+		"CM_DISC_PENDING",
 		"CM_DISCONNECTED",
 		"CM_DESTROY"
         };
-        return ((st < 0 || st > 11) ? "Invalid CM state?" : state[st]);
+        return ((st < 0 || st > 13) ? "Invalid CM state?" : state[st]);
 }
 
 #endif /*  _DAPL_IB_COMMON_H_ */
diff --git a/dapl/openib_common/qp.c b/dapl/openib_common/qp.c
index 9aa0594..73d2c3f 100644
--- a/dapl/openib_common/qp.c
+++ b/dapl/openib_common/qp.c
@@ -176,7 +176,7 @@ dapls_ib_qp_alloc(IN DAPL_IA * ia_ptr,
 		
 	/* Setup QP attributes for INIT state on the way out */
 	if (dapls_modify_qp_state(ep_ptr->qp_handle,
-				  IBV_QPS_INIT, NULL) != DAT_SUCCESS) {
+				  IBV_QPS_INIT, 0, 0, 0) != DAT_SUCCESS) {
 		ibv_destroy_qp(ep_ptr->qp_handle);
 		ep_ptr->qp_handle = IB_INVALID_HANDLE;
 		return DAT_INTERNAL_ERROR;
@@ -219,7 +219,7 @@ DAT_RETURN dapls_ib_qp_free(IN DAPL_IA * ia_ptr, IN DAPL_EP * ep_ptr)
 	
 	if (ep_ptr->qp_handle != NULL) {
 		/* force error state to flush queue, then destroy */
-		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, NULL);
+		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0,0,0);
 
 		if (ibv_destroy_qp(ep_ptr->qp_handle))
 			return (dapl_convert_errno(errno, "destroy_qp"));
@@ -280,8 +280,8 @@ dapls_ib_qp_modify(IN DAPL_IA * ia_ptr,
 	/* move to error state if necessary */
 	if ((ep_ptr->qp_state == IB_QP_STATE_ERROR) &&
 	    (ep_ptr->qp_handle->state != IBV_QPS_ERR)) {
-		return (dapls_modify_qp_state(ep_ptr->qp_handle,
-					      IBV_QPS_ERR, NULL));
+		return (dapls_modify_qp_state(ep_ptr->qp_handle, 
+					      IBV_QPS_ERR, 0, 0, 0));
 	}
 
 	/*
@@ -345,8 +345,8 @@ void dapls_ib_reinit_ep(IN DAPL_EP * ep_ptr)
 	if (ep_ptr->qp_handle != IB_INVALID_HANDLE &&
 	    ep_ptr->qp_handle->qp_type != IBV_QPT_UD) {
 		/* move to RESET state and then to INIT */
-		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET, 0);
-		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT, 0);
+		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET,0,0,0);
+		dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT,0,0,0);
 	}
 }
 #endif				// _WIN32 || _WIN64
@@ -354,152 +354,137 @@ void dapls_ib_reinit_ep(IN DAPL_EP * ep_ptr)
 /* 
  * Generic QP modify for init, reset, error, RTS, RTR
  * For UD, create_ah on RTR, qkey on INIT
+ * CM msg provides QP attributes, info in network order
  */
 DAT_RETURN
-dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
-		      IN ib_qp_state_t qp_state, 
-		      IN dp_ib_cm_handle_t cm_ptr)
+dapls_modify_qp_state(IN ib_qp_handle_t		qp_handle,
+		      IN ib_qp_state_t		qp_state, 
+		      IN uint32_t		qpn,
+		      IN uint16_t		lid,
+		      IN ib_gid_handle_t	gid)
 {
 	struct ibv_qp_attr qp_attr;
 	enum ibv_qp_attr_mask mask = IBV_QP_STATE;
 	DAPL_EP *ep_ptr = (DAPL_EP *) qp_handle->qp_context;
 	DAPL_IA *ia_ptr = ep_ptr->header.owner_ia;
-	ib_qp_cm_t *qp_cm = &cm_ptr->dst;
 	int ret;
 
 	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
 	qp_attr.qp_state = qp_state;
+	
 	switch (qp_state) {
-		/* additional attributes with RTR and RTS */
 	case IBV_QPS_RTR:
-		{
-			dapl_dbg_log(DAPL_DBG_TYPE_EP,
-				     " QPS_RTR: type %d state %d qpn %x lid %x"
-				     " port %x ep %p qp_state %d\n",
-				     qp_handle->qp_type, qp_handle->qp_type,
-				     qp_cm->qpn, qp_cm->lid, qp_cm->port,
-				     ep_ptr, ep_ptr->qp_state);
-
-			mask |= IBV_QP_AV |
-			    IBV_QP_PATH_MTU |
-			    IBV_QP_DEST_QPN |
-			    IBV_QP_RQ_PSN |
-			    IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
-
-			qp_attr.dest_qp_num = qp_cm->qpn;
-			qp_attr.rq_psn = 1;
-			qp_attr.path_mtu = ia_ptr->hca_ptr->ib_trans.mtu;
-			qp_attr.max_dest_rd_atomic =
-			    ep_ptr->param.ep_attr.max_rdma_read_out;
-			qp_attr.min_rnr_timer =
-			    ia_ptr->hca_ptr->ib_trans.rnr_timer;
-
-			/* address handle. RC and UD */
-			qp_attr.ah_attr.dlid = qp_cm->lid;
-			if (ia_ptr->hca_ptr->ib_trans.global) {
-				qp_attr.ah_attr.is_global = 1;
-				qp_attr.ah_attr.grh.dgid = qp_cm->gid;
-				qp_attr.ah_attr.grh.hop_limit =
-				    ia_ptr->hca_ptr->ib_trans.hop_limit;
-				qp_attr.ah_attr.grh.traffic_class =
-				    ia_ptr->hca_ptr->ib_trans.tclass;
-			}
-			qp_attr.ah_attr.sl = 0;
-			qp_attr.ah_attr.src_path_bits = 0;
-			qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num;
-#ifdef DAT_EXTENSIONS
-			/* UD: create AH for remote side */
-			if (qp_handle->qp_type == IBV_QPT_UD) {
-				ib_pd_handle_t pz;
-				pz = ((DAPL_PZ *)
-				      ep_ptr->param.pz_handle)->pd_handle;
-				mask = IBV_QP_STATE;
-				cm_ptr->ah = ibv_create_ah(pz,
-							   &qp_attr.ah_attr);
-				if (!cm_ptr->ah)
-					return (dapl_convert_errno(errno,
-								   "ibv_ah"));
-
-				/* already RTR, multi remote AH's on QP */
-				if (ep_ptr->qp_state == IBV_QPS_RTR ||
-				    ep_ptr->qp_state == IBV_QPS_RTS)
-					return DAT_SUCCESS;
-			}
-#endif
-			break;
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+				" QPS_RTR: type %d qpn 0x%x lid 0x%x"
+				" port %d ep %p qp_state %d \n",
+				qp_handle->qp_type, 
+				ntohl(qpn), ntohs(lid), 
+				ia_ptr->hca_ptr->port_num,
+				ep_ptr, ep_ptr->qp_state);
+
+		mask |= IBV_QP_AV |
+			IBV_QP_PATH_MTU |
+			IBV_QP_DEST_QPN |
+			IBV_QP_RQ_PSN |
+			IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
+
+		qp_attr.dest_qp_num = ntohl(qpn);
+		qp_attr.rq_psn = 1;
+		qp_attr.path_mtu = ia_ptr->hca_ptr->ib_trans.mtu;
+		qp_attr.max_dest_rd_atomic =
+			ep_ptr->param.ep_attr.max_rdma_read_out;
+		qp_attr.min_rnr_timer =
+			ia_ptr->hca_ptr->ib_trans.rnr_timer;
+
+		/* address handle. RC and UD */
+		qp_attr.ah_attr.dlid = ntohs(lid);
+		if (ia_ptr->hca_ptr->ib_trans.global && gid != NULL) {
+			qp_attr.ah_attr.is_global = 1;
+			qp_attr.ah_attr.grh.dgid.global.subnet_prefix = 
+				ntohll(gid->global.subnet_prefix);
+			qp_attr.ah_attr.grh.dgid.global.interface_id = 
+				ntohll(gid->global.interface_id);
+			qp_attr.ah_attr.grh.hop_limit =
+				ia_ptr->hca_ptr->ib_trans.hop_limit;
+			qp_attr.ah_attr.grh.traffic_class =
+				ia_ptr->hca_ptr->ib_trans.tclass;
 		}
+		qp_attr.ah_attr.sl = 0;
+		qp_attr.ah_attr.src_path_bits = 0;
+		qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num;
+
+		/* UD: already in RTR, RTS state */
+		if (qp_handle->qp_type == IBV_QPT_UD) {
+			if (ep_ptr->qp_state == IBV_QPS_RTR ||
+				ep_ptr->qp_state == IBV_QPS_RTS)
+				return DAT_SUCCESS;
+		}
+		break;
 	case IBV_QPS_RTS:
-		{
-			/* RC only */
-			if (qp_handle->qp_type == IBV_QPT_RC) {
-				mask |= IBV_QP_SQ_PSN |
-				    IBV_QP_TIMEOUT |
-				    IBV_QP_RETRY_CNT |
-				    IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC;
-				qp_attr.timeout =
-				    ia_ptr->hca_ptr->ib_trans.ack_timer;
-				qp_attr.retry_cnt =
-				    ia_ptr->hca_ptr->ib_trans.ack_retry;
-				qp_attr.rnr_retry =
-				    ia_ptr->hca_ptr->ib_trans.rnr_retry;
-				qp_attr.max_rd_atomic =
-				    ep_ptr->param.ep_attr.max_rdma_read_out;
-			}
-			/* RC and UD */
-			qp_attr.qp_state = IBV_QPS_RTS;
-			qp_attr.sq_psn = 1;
-
-			dapl_dbg_log(DAPL_DBG_TYPE_EP,
-				     " QPS_RTS: psn %x rd_atomic %d ack %d "
-				     " retry %d rnr_retry %d ep %p qp_state %d\n",
-				     qp_attr.sq_psn, qp_attr.max_rd_atomic,
-				     qp_attr.timeout, qp_attr.retry_cnt,
-				     qp_attr.rnr_retry, ep_ptr,
-				     ep_ptr->qp_state);
-#ifdef DAT_EXTENSIONS
-			if (qp_handle->qp_type == IBV_QPT_UD) {
-				/* already RTS, multi remote AH's on QP */
-				if (ep_ptr->qp_state == IBV_QPS_RTS)
-					return DAT_SUCCESS;
-				else
-					mask = IBV_QP_STATE | IBV_QP_SQ_PSN;
-			}
-#endif
-			break;
+		if (qp_handle->qp_type == IBV_QPT_RC) {
+			mask |= IBV_QP_SQ_PSN |
+				IBV_QP_TIMEOUT |
+				IBV_QP_RETRY_CNT |
+				IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC;
+			qp_attr.timeout =
+				ia_ptr->hca_ptr->ib_trans.ack_timer;
+			qp_attr.retry_cnt =
+				ia_ptr->hca_ptr->ib_trans.ack_retry;
+			qp_attr.rnr_retry =
+				ia_ptr->hca_ptr->ib_trans.rnr_retry;
+			qp_attr.max_rd_atomic =
+				ep_ptr->param.ep_attr.max_rdma_read_out;
+		}
+		/* RC and UD */
+		qp_attr.qp_state = IBV_QPS_RTS;
+		qp_attr.sq_psn = 1;
+
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+				" QPS_RTS: psn %x rd_atomic %d ack %d "
+				" retry %d rnr_retry %d ep %p qp_state %d\n",
+				qp_attr.sq_psn, qp_attr.max_rd_atomic,
+				qp_attr.timeout, qp_attr.retry_cnt,
+				qp_attr.rnr_retry, ep_ptr,
+				ep_ptr->qp_state);
+
+		if (qp_handle->qp_type == IBV_QPT_UD) {
+			/* already RTS, multi remote AH's on QP */
+			if (ep_ptr->qp_state == IBV_QPS_RTS)
+				return DAT_SUCCESS;
+			else
+				mask = IBV_QP_STATE | IBV_QP_SQ_PSN;
 		}
+		break;
 	case IBV_QPS_INIT:
-		{
-			mask |= IBV_QP_PKEY_INDEX | IBV_QP_PORT;
-			if (qp_handle->qp_type == IBV_QPT_RC) {
-				mask |= IBV_QP_ACCESS_FLAGS;
-				qp_attr.qp_access_flags =
-				    IBV_ACCESS_LOCAL_WRITE |
-				    IBV_ACCESS_REMOTE_WRITE |
-				    IBV_ACCESS_REMOTE_READ |
-				    IBV_ACCESS_REMOTE_ATOMIC |
-				    IBV_ACCESS_MW_BIND;
-			}
-#ifdef DAT_EXTENSIONS
-			if (qp_handle->qp_type == IBV_QPT_UD) {
-				/* already INIT, multi remote AH's on QP */
-				if (ep_ptr->qp_state == IBV_QPS_INIT)
-					return DAT_SUCCESS;
-				mask |= IBV_QP_QKEY;
-				qp_attr.qkey = DAT_UD_QKEY;
-			}
-#endif
-			qp_attr.pkey_index = 0;
-			qp_attr.port_num = ia_ptr->hca_ptr->port_num;
-
-			dapl_dbg_log(DAPL_DBG_TYPE_EP,
-				     " QPS_INIT: pi %x port %x acc %x qkey 0x%x\n",
-				     qp_attr.pkey_index, qp_attr.port_num,
-				     qp_attr.qp_access_flags, qp_attr.qkey);
-			break;
+		mask |= IBV_QP_PKEY_INDEX | IBV_QP_PORT;
+		if (qp_handle->qp_type == IBV_QPT_RC) {
+			mask |= IBV_QP_ACCESS_FLAGS;
+			qp_attr.qp_access_flags =
+				IBV_ACCESS_LOCAL_WRITE |
+				IBV_ACCESS_REMOTE_WRITE |
+				IBV_ACCESS_REMOTE_READ |
+				IBV_ACCESS_REMOTE_ATOMIC |
+				IBV_ACCESS_MW_BIND;
+		}
+
+		if (qp_handle->qp_type == IBV_QPT_UD) {
+			/* already INIT, multi remote AH's on QP */
+			if (ep_ptr->qp_state == IBV_QPS_INIT)
+				return DAT_SUCCESS;
+			mask |= IBV_QP_QKEY;
+			qp_attr.qkey = DAT_UD_QKEY;
 		}
+
+		qp_attr.pkey_index = 0;
+		qp_attr.port_num = ia_ptr->hca_ptr->port_num;
+
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+				" QPS_INIT: pi %x port %x acc %x qkey 0x%x\n",
+				qp_attr.pkey_index, qp_attr.port_num,
+				qp_attr.qp_access_flags, qp_attr.qkey);
+		break;
 	default:
 		break;
-
 	}
 
 	ret = ibv_modify_qp(qp_handle, &qp_attr, mask);
@@ -511,6 +496,93 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
 	}
 }
 
+/* Transition a UD type QP through INIT, RTR, and RTS */
+DAT_RETURN 
+dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp)
+{
+	struct ibv_qp_attr qp_attr;
+
+	/* modify QP, setup and prepost buffers */
+	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
+	qp_attr.qp_state = IBV_QPS_INIT;
+	qp_attr.pkey_index = 0;
+	qp_attr.port_num = hca->port_num;
+	qp_attr.qkey = DAT_UD_QKEY;
+	if (ibv_modify_qp(qp, &qp_attr, 
+			  IBV_QP_STATE		|
+			  IBV_QP_PKEY_INDEX	|
+			  IBV_QP_PORT		|
+			  IBV_QP_QKEY)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			" modify_ud_qp INIT: ERR %s\n", strerror(errno));
+		return (dapl_convert_errno(errno, "modify_qp"));
+	}
+	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
+	qp_attr.qp_state = IBV_QPS_RTR;
+	if (ibv_modify_qp(qp, &qp_attr,IBV_QP_STATE)) {
+		dapl_log(DAPL_DBG_TYPE_ERR, 
+			" modify_ud_qp RTR: ERR %s\n", strerror(errno));
+		return (dapl_convert_errno(errno, "modify_qp"));
+	}
+	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
+	qp_attr.qp_state = IBV_QPS_RTS;
+	qp_attr.sq_psn = 1;
+	if (ibv_modify_qp(qp, &qp_attr, 
+			  IBV_QP_STATE | IBV_QP_SQ_PSN)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			" modify_ud_qp RTS: ERR %s\n", strerror(errno));
+		return (dapl_convert_errno(errno, "modify_qp"));
+	}
+	return DAT_SUCCESS;
+}
+
+/* Create address handle for remote QP, info in network order */
+ib_ah_handle_t 
+dapls_create_ah(IN DAPL_HCA		*hca,
+		IN ib_pd_handle_t	pd,
+		IN ib_qp_handle_t	qp,
+		IN uint16_t		lid,
+		IN ib_gid_handle_t	gid)
+{
+	struct ibv_qp_attr qp_attr;
+	ib_ah_handle_t	ah;
+
+	if (qp->qp_type != IBV_QPT_UD)
+		return NULL;
+
+	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
+	/* only qp_attr.ah_attr is used below, so no qp_state is set */
+
+	/* address handle attributes, UD only */
+	qp_attr.ah_attr.dlid = ntohs(lid);
+	if (gid != NULL) {
+		qp_attr.ah_attr.is_global = 1;
+		qp_attr.ah_attr.grh.dgid.global.subnet_prefix = 
+				ntohll(gid->global.subnet_prefix);
+		qp_attr.ah_attr.grh.dgid.global.interface_id = 
+				ntohll(gid->global.interface_id);
+		qp_attr.ah_attr.grh.hop_limit =	hca->ib_trans.hop_limit;
+		qp_attr.ah_attr.grh.traffic_class = hca->ib_trans.tclass;
+	}
+	qp_attr.ah_attr.sl = 0;
+	qp_attr.ah_attr.src_path_bits = 0;
+	qp_attr.ah_attr.port_num = hca->port_num;
+
+	/* UD: create AH for remote side */
+	ah = ibv_create_ah(pd, &qp_attr.ah_attr);
+	if (!ah) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			" create_ah: ERR %s\n", strerror(errno));
+		return NULL;
+	}
+
+	dapl_log(DAPL_DBG_TYPE_CM, 
+			" dapls_create_ah: AH %p for lid %x\n", 
+			ah, qp_attr.ah_attr.dlid);
+
+	return ah;
+}
+
 /*
  * Local variables:
  *  c-indent-level: 4
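The hunks in this patch keep the CM message fields in network byte order
and convert only where a value is consumed (for example ntohl(qpn) when
filling qp_attr.dest_qp_num). They also size the fixed part of the message
as sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE. A minimal standalone sketch of
that convention follows. The struct sketch_msg, SKETCH_MAX_PDATA, and the
sketch_* helpers are hypothetical stand-ins invented for illustration; they
are not the real ib_cm_msg_t layout or any dapl API.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_PDATA 64	/* hypothetical limit, not DCM_MAX_PDATA_SIZE */

/* Hypothetical message: multi-byte fields are stored in network order. */
struct sketch_msg {
	uint16_t ver;
	uint16_t op;
	uint16_t p_size;
	uint16_t lid;
	uint32_t qpn;
	uint8_t  p_data[SKETCH_MAX_PDATA];	/* must stay the last member */
};

/* Fixed part that is always sent: the message minus the pdata area. */
static size_t sketch_hdr_len(void)
{
	return sizeof(struct sketch_msg) - SKETCH_MAX_PDATA;
}

/* Fill a message from host-order values; returns -1 if pdata is too large. */
static int sketch_fill(struct sketch_msg *msg, uint32_t qpn, uint16_t lid,
		       const void *pdata, uint16_t len)
{
	assert(msg != NULL);
	if (len > SKETCH_MAX_PDATA)
		return -1;
	memset(msg, 0, sizeof(*msg));
	msg->qpn = htonl(qpn);
	msg->lid = htons(lid);
	msg->p_size = htons(len);
	if (len)
		memcpy(msg->p_data, pdata, len);
	return 0;
}

/* Total bytes to write: fixed part plus the pdata actually present. */
static size_t sketch_wire_len(const struct sketch_msg *msg)
{
	return sketch_hdr_len() + ntohs(msg->p_size);
}

/* Receiver side check: convert p_size before comparing to the limit. */
static int sketch_pdata_ok(const struct sketch_msg *msg)
{
	return ntohs(msg->p_size) <= SKETCH_MAX_PDATA;
}
```

On the receive side the same idea applies: read sketch_hdr_len() bytes
first, check sketch_pdata_ok(), and only then read ntohs(p_size) more bytes
into p_data.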
diff --git a/dapl/openib_scm/cm.c b/dapl/openib_scm/cm.c
index 416ee71..e779d41 100644
--- a/dapl/openib_scm/cm.c
+++ b/dapl/openib_scm/cm.c
@@ -46,11 +46,6 @@
  *
  **************************************************************************/
 
-#if defined(_WIN32)
-#define FD_SETSIZE 1024
-#define DAPL_FD_SETSIZE FD_SETSIZE
-#endif
-
 #include "dapl.h"
 #include "dapl_adapter_util.h"
 #include "dapl_evd_util.h"
@@ -252,7 +247,7 @@ dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep)
 	if (dapl_os_lock_init(&cm_ptr->lock))
 		goto bail;
 
-	cm_ptr->dst.ver = htons(DCM_VER);
+	cm_ptr->msg.ver = htons(DCM_VER);
 	cm_ptr->socket = DAPL_INVALID_SOCKET;
 	cm_ptr->ep = ep;
 	return cm_ptr;
@@ -437,7 +432,7 @@ DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr)
  */
 static void dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
 {
-	int ret, len, opt = 1;
+	int ret, len, exp, opt = 1;
 	struct iovec iov[2];
 	struct dapl_ep *ep_ptr = cm_ptr->ep;
 
@@ -450,56 +445,60 @@ static void dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
 				   ep_ptr->param.
 				   remote_ia_address_ptr)->sin_addr), 
 			 ntohs(((struct sockaddr_in *)
-				&cm_ptr->dst.ia_address)->sin_port));
+				&cm_ptr->msg.daddr.so)->sin_port));
 		goto bail;
 	}
-	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " socket connected, write QP and private data\n");
 
 	/* no delay for small packets */
 	ret = setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY,
 			 (char *)&opt, sizeof(opt));
 	if (ret)
 		dapl_log(DAPL_DBG_TYPE_WARN,
-			 " connected: NODELAY setsockopt: %s\n",
+			 " CONN_PENDING: NODELAY setsockopt: %s\n",
 			 strerror(errno));
 
 	/* send qp info and pdata to remote peer */
-	iov[0].iov_base = (void *)&cm_ptr->dst;
-	iov[0].iov_len = sizeof(ib_qp_cm_t);
-	if (cm_ptr->dst.p_size) {
-		iov[1].iov_base = cm_ptr->p_data;
-		iov[1].iov_len = ntohl(cm_ptr->dst.p_size);
+	exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE;
+	iov[0].iov_base = (void *)&cm_ptr->msg;
+	iov[0].iov_len = exp;
+	if (cm_ptr->msg.p_size) {
+		iov[1].iov_base = cm_ptr->msg.p_data;
+		iov[1].iov_len = ntohs(cm_ptr->msg.p_size);
 		len = writev(cm_ptr->socket, iov, 2);
 	} else {
 		len = writev(cm_ptr->socket, iov, 1);
 	}
 
-	if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) {
+	if (len != (exp + ntohs(cm_ptr->msg.p_size))) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n",
-			 strerror(errno), len, inet_ntoa(((struct sockaddr_in *)
-							  ep_ptr->param.
-							  remote_ia_address_ptr)->
-							 sin_addr));
+			 " CONN_PENDING len ERR %s, wcnt=%d(%d) -> %s\n",
+			 strerror(errno), len, 
+			 exp + ntohs(cm_ptr->msg.p_size), 
+			 inet_ntoa(((struct sockaddr_in *)
+				   ep_ptr->param.
+				   remote_ia_address_ptr)->sin_addr));
 		goto bail;
 	}
-	dapl_dbg_log(DAPL_DBG_TYPE_CM,
-		     " connected: sending SRC port=0x%x lid=0x%x,"
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		     " CONN_PENDING: sending SRC port=%d lid=0x%x,"
 		     " qpn=0x%x, psize=%d\n",
-		     ntohs(cm_ptr->dst.port), ntohs(cm_ptr->dst.lid),
-		     ntohl(cm_ptr->dst.qpn), ntohl(cm_ptr->dst.p_size));
+		     cm_ptr->msg.saddr.ib.port_num, 
+		     ntohs(cm_ptr->msg.saddr.ib.lid),
+		     ntohl(cm_ptr->msg.saddr.ib.qpn), 
+		     ntohs(cm_ptr->msg.p_size));
 	dapl_dbg_log(DAPL_DBG_TYPE_CM,
-		     " connected: sending SRC GID subnet %016llx id %016llx\n",
+		     " CONN_PENDING: SRC GID subnet %016llx id %016llx\n",
 		     (unsigned long long)
-		     htonll(cm_ptr->dst.gid.global.subnet_prefix),
+		     htonll(cm_ptr->msg.saddr.ib.gid.global.subnet_prefix),
 		     (unsigned long long)
-		     htonll(cm_ptr->dst.gid.global.interface_id));
+		     htonll(cm_ptr->msg.saddr.ib.gid.global.interface_id));
 
 	/* queue up to work thread to avoid blocking consumer */
 	cm_ptr->state = DCM_RTU_PENDING;
 	return;
-      bail:
+
+bail:
 	/* close socket, free cm structure and post error event */
 	dapls_ib_cm_free(cm_ptr, cm_ptr->ep);
 	dapl_evd_connection_callback(NULL, IB_CME_LOCAL_FAILURE, NULL, ep_ptr);
@@ -554,25 +553,24 @@ dapli_socket_connect(DAPL_EP * ep_ptr,
 		return DAT_INVALID_ADDRESS;
 	}
 
-	/* Send QP info, IA address, and private data */
-	cm_ptr->dst.qpn = htonl(ep_ptr->qp_handle->qp_num);
-#ifdef DAT_EXTENSIONS
-	cm_ptr->dst.qp_type = htons(ep_ptr->qp_handle->qp_type);
-#endif
-	cm_ptr->dst.port = htons(ia_ptr->hca_ptr->port_num);
-	cm_ptr->dst.lid = ia_ptr->hca_ptr->ib_trans.lid;
-	cm_ptr->dst.gid = ia_ptr->hca_ptr->ib_trans.gid;
+	/* REQ: QP info in msg.saddr, IA address in msg.daddr, and pdata */
+	cm_ptr->msg.op = htons(DCM_REQ);
+	cm_ptr->msg.saddr.ib.qpn = htonl(ep_ptr->qp_handle->qp_num);
+	cm_ptr->msg.saddr.ib.qp_type = ep_ptr->qp_handle->qp_type;
+	cm_ptr->msg.saddr.ib.port_num = ia_ptr->hca_ptr->port_num;
+	cm_ptr->msg.saddr.ib.lid = ia_ptr->hca_ptr->ib_trans.lid;
+	cm_ptr->msg.saddr.ib.gid = ia_ptr->hca_ptr->ib_trans.gid;
 
 	/* save references */
 	cm_ptr->hca = ia_ptr->hca_ptr;
 	cm_ptr->ep = ep_ptr;
-	cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address;
+	cm_ptr->msg.daddr.so = ia_ptr->hca_ptr->hca_address;
 	((struct sockaddr_in *)
-		&cm_ptr->dst.ia_address)->sin_port = ntohs(r_qual);
+		&cm_ptr->msg.daddr.so)->sin_port = htons((uint16_t)r_qual);
 
 	if (p_size) {
-		cm_ptr->dst.p_size = htonl(p_size);
-		dapl_os_memcpy(cm_ptr->p_data, p_data, p_size);
+		cm_ptr->msg.p_size = htons(p_size);
+		dapl_os_memcpy(cm_ptr->msg.p_data, p_data, p_size);
 	}
 
 	/* connected or pending, either way results via async event */
@@ -581,18 +579,22 @@ dapli_socket_connect(DAPL_EP * ep_ptr,
 	else
 		cm_ptr->state = DCM_CONN_PENDING;
 
+	dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: p_data=%p %p\n",
+		     cm_ptr->msg.p_data, cm_ptr->msg.p_data);
+
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " connect: socket %d to %s r_qual %d pending\n",
-		     cm_ptr->socket,
-		     inet_ntoa(addr.sin_addr), (unsigned int)r_qual);
+		     " connect: %s r_qual %d pending, p_sz=%d, %d %d ...\n",
+		     inet_ntoa(addr.sin_addr), (unsigned int)r_qual, 
+		     ntohs(cm_ptr->msg.p_size),
+		     cm_ptr->msg.p_data[0], cm_ptr->msg.p_data[1]);
 
 	dapli_cm_queue(cm_ptr);
 	return DAT_SUCCESS;
-      bail:
+
+bail:
 	dapl_log(DAPL_DBG_TYPE_ERR,
-		 " socket connect ERROR: %s query lid(0x%x)/gid"
-		 " -> %s r_qual %d\n",
-		 strerror(errno), ntohs(cm_ptr->dst.lid),
+		 " connect ERROR: %s -> %s r_qual %d\n",
+		 strerror(errno), 
 		 inet_ntoa(((struct sockaddr_in *)r_addr)->sin_addr),
 		 (unsigned int)r_qual);
 
@@ -607,64 +609,60 @@ dapli_socket_connect(DAPL_EP * ep_ptr,
 static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
 {
 	DAPL_EP *ep_ptr = cm_ptr->ep;
-	int len;
-	short rtu_data = htons(0x0E0F);
-	ib_cm_events_t event = IB_CME_DESTINATION_REJECT;
+	int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE;
+	ib_cm_events_t event = IB_CME_LOCAL_FAILURE;
 
 	/* read DST information into cm_ptr, overwrite SRC info */
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect_rtu: recv peer QP data\n");
 
-	len = recv(cm_ptr->socket, (char *)&cm_ptr->dst, sizeof(ib_qp_cm_t), 0);
-	if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver) != DCM_VER) {
+	len = recv(cm_ptr->socket, (char *)&cm_ptr->msg, exp, 0);
+	if (len != exp || ntohs(cm_ptr->msg.ver) != DCM_VER) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n",
-			 strerror(errno), len, cm_ptr->dst.ver,
+			 strerror(errno), len, ntohs(cm_ptr->msg.ver),
 			 inet_ntoa(((struct sockaddr_in *)
 				    ep_ptr->param.remote_ia_address_ptr)->
 				   sin_addr));
 		goto bail;
 	}
 
-	/* convert peer response values to host order */
-	cm_ptr->dst.port = ntohs(cm_ptr->dst.port);
-	cm_ptr->dst.lid = ntohs(cm_ptr->dst.lid);
-	cm_ptr->dst.qpn = ntohl(cm_ptr->dst.qpn);
-#ifdef DAT_EXTENSIONS
-	cm_ptr->dst.qp_type = ntohs(cm_ptr->dst.qp_type);
-#endif
-	cm_ptr->dst.p_size = ntohl(cm_ptr->dst.p_size);
-
-	/* save remote address information */
+	/* keep the QP, address info in network order */
+	
+	/* save remote address information, in msg.daddr */
 	dapl_os_memcpy(&ep_ptr->remote_ia_address,
-		       &cm_ptr->dst.ia_address,
-		       sizeof(ep_ptr->remote_ia_address));
+		       &cm_ptr->msg.daddr.so,
+		       sizeof(union dcm_addr));
 
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " CONN_RTU: DST %s port=0x%x lid=0x%x,"
+		     " CONN_RTU: DST %s %d port=0x%x lid=0x%x,"
 		     " qpn=0x%x, qp_type=%d, psize=%d\n",
 		     inet_ntoa(((struct sockaddr_in *)
-				&cm_ptr->dst.ia_address)->sin_addr),
-		     cm_ptr->dst.port, cm_ptr->dst.lid,
-		     cm_ptr->dst.qpn, cm_ptr->dst.qp_type, cm_ptr->dst.p_size);
+				&cm_ptr->msg.daddr.so)->sin_addr),
+		     ntohs(((struct sockaddr_in *)
+				&cm_ptr->msg.daddr.so)->sin_port),
+		     cm_ptr->msg.saddr.ib.port_num, 
+		     ntohs(cm_ptr->msg.saddr.ib.lid),
+		     ntohl(cm_ptr->msg.saddr.ib.qpn), 
+		     cm_ptr->msg.saddr.ib.qp_type, 
+		     ntohs(cm_ptr->msg.p_size));
 
 	/* validate private data size before reading */
-	if (cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE) {
+	if (ntohs(cm_ptr->msg.p_size) > DCM_MAX_PDATA_SIZE) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " CONN_RTU read: psize (%d) wrong -> %s\n",
-			 cm_ptr->dst.p_size, inet_ntoa(((struct sockaddr_in *)
-							ep_ptr->param.
-							remote_ia_address_ptr)->
-						       sin_addr));
+			 ntohs(cm_ptr->msg.p_size), 
+			 inet_ntoa(((struct sockaddr_in *)
+				   ep_ptr->param.
+				   remote_ia_address_ptr)->sin_addr));
 		goto bail;
 	}
 
 	/* read private data into cm_handle if any present */
-	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " socket connected, read private data\n");
-	if (cm_ptr->dst.p_size) {
-		len =
-		    recv(cm_ptr->socket, cm_ptr->p_data, cm_ptr->dst.p_size, 0);
-		if (len != cm_ptr->dst.p_size) {
+	dapl_dbg_log(DAPL_DBG_TYPE_EP," CONN_RTU: read private data\n");
+	exp = ntohs(cm_ptr->msg.p_size);
+	if (exp) {
+		len = recv(cm_ptr->socket, cm_ptr->msg.p_data, exp, 0);
+		if (len != exp) {
 			dapl_log(DAPL_DBG_TYPE_ERR,
 				 " CONN_RTU read pdata: ERR %s, rcnt=%d -> %s\n",
 				 strerror(errno), len,
@@ -675,17 +673,22 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
 		}
 	}
 
-	/* check for consumer reject */
-	if (cm_ptr->dst.rej) {
+	/* check for consumer or protocol stack reject */
+	if (ntohs(cm_ptr->msg.op) == DCM_REP)
+		event = IB_CME_CONNECTED;
+	else if (ntohs(cm_ptr->msg.op) == DCM_REJ_USER) 
+		event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA;
+	else  
+		event = IB_CME_DESTINATION_REJECT;
+	
+	if (event != IB_CME_CONNECTED) {
 		dapl_log(DAPL_DBG_TYPE_CM,
-			 " CONN_RTU read: PEER REJ reason=0x%x -> %s\n",
-			 ntohs(cm_ptr->dst.rej),
+			 " CONN_RTU: reject from %s\n",
 			 inet_ntoa(((struct sockaddr_in *)
-				    ep_ptr->param.remote_ia_address_ptr)->
-				   sin_addr));
-		event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA;
+				    ep_ptr->param.
+				    remote_ia_address_ptr)->sin_addr));
 #ifdef DAT_EXTENSIONS
-		if (cm_ptr->dst.qp_type == IBV_QPT_UD) 
+		if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) 
 			goto ud_bail;
 		else
 #endif
@@ -695,32 +698,39 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
 	/* modify QP to RTR and then to RTS with remote info */
 	dapl_os_lock(&ep_ptr->header.lock);
 	if (dapls_modify_qp_state(ep_ptr->qp_handle,
-				  IBV_QPS_RTR, cm_ptr) != DAT_SUCCESS) {
+				  IBV_QPS_RTR, 
+				  cm_ptr->msg.saddr.ib.qpn,
+				  cm_ptr->msg.saddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " CONN_RTU: QPS_RTR ERR %s -> %s\n",
-			 strerror(errno), inet_ntoa(((struct sockaddr_in *)
-						     ep_ptr->param.
-						     remote_ia_address_ptr)->
-						    sin_addr));
+			 strerror(errno), 
+			 inet_ntoa(((struct sockaddr_in *)
+				   ep_ptr->param.
+				   remote_ia_address_ptr)->sin_addr));
 		dapl_os_unlock(&ep_ptr->header.lock);
 		goto bail;
 	}
 	if (dapls_modify_qp_state(ep_ptr->qp_handle,
-				  IBV_QPS_RTS, cm_ptr) != DAT_SUCCESS) {
+				  IBV_QPS_RTS, 
+				  cm_ptr->msg.saddr.ib.qpn,
+				  cm_ptr->msg.saddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " CONN_RTU: QPS_RTS ERR %s -> %s\n",
-			 strerror(errno), inet_ntoa(((struct sockaddr_in *)
-						     ep_ptr->param.
-						     remote_ia_address_ptr)->
-						    sin_addr));
+			 strerror(errno), 
+			 inet_ntoa(((struct sockaddr_in *)
+				   ep_ptr->param.
+				   remote_ia_address_ptr)->sin_addr));
 		dapl_os_unlock(&ep_ptr->header.lock);
 		goto bail;
 	}
 	dapl_os_unlock(&ep_ptr->header.lock);
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect_rtu: send RTU\n");
 
-	/* complete handshake after final QP state change */
-	if (send(cm_ptr->socket, (char *)&rtu_data, sizeof(rtu_data), 0) == -1) {
+	/* complete handshake after final QP state change, Just ver+op */
+	cm_ptr->msg.op = htons(DCM_RTU);
+	if (send(cm_ptr->socket, (char *)&cm_ptr->msg, 4, 0) == -1) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " CONN_RTU: write error = %s\n", strerror(errno));
 		goto bail;
@@ -732,30 +742,41 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
 
 #ifdef DAT_EXTENSIONS
 ud_bail:
-	if (cm_ptr->dst.qp_type == IBV_QPT_UD) {
+	if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) {
 		DAT_IB_EXTENSION_EVENT_DATA xevent;
+		ib_pd_handle_t pd_handle = 
+			((DAPL_PZ *)ep_ptr->param.pz_handle)->pd_handle;
+
+		cm_ptr->ah = dapls_create_ah(cm_ptr->hca, pd_handle,
+					     ep_ptr->qp_handle,
+					     cm_ptr->msg.saddr.ib.lid, 
+					     NULL);
+		if (!cm_ptr->ah) {
+			event = IB_CME_LOCAL_FAILURE;
+			goto bail;
+		}
 
 		/* post EVENT, modify_qp created ah */
 		xevent.status = 0;
 		xevent.type = DAT_IB_UD_REMOTE_AH;
 		xevent.remote_ah.ah = cm_ptr->ah;
-		xevent.remote_ah.qpn = cm_ptr->dst.qpn;
+		xevent.remote_ah.qpn = ntohl(cm_ptr->msg.saddr.ib.qpn);
 		dapl_os_memcpy(&xevent.remote_ah.ia_addr,
-			       &cm_ptr->dst.ia_address,
-			       sizeof(cm_ptr->dst.ia_address));
+			       &ep_ptr->remote_ia_address,
+			       sizeof(union dcm_addr));
 
 		if (event == IB_CME_CONNECTED)
 			event = DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED;
 		else
 			event = DAT_IB_UD_CONNECTION_REJECT_EVENT;
 
-		dapls_evd_post_connection_event_ext((DAPL_EVD *) ep_ptr->param.
-						    connect_evd_handle,
-						    event,
-						    (DAT_EP_HANDLE) ep_ptr,
-						    (DAT_COUNT) cm_ptr->dst.p_size,
-						    (DAT_PVOID *) cm_ptr->p_data,
-						    (DAT_PVOID *) &xevent);
+		dapls_evd_post_connection_event_ext(
+				(DAPL_EVD *) ep_ptr->param.connect_evd_handle,
+				event,
+				(DAT_EP_HANDLE) ep_ptr,
+				(DAT_COUNT) ntohs(cm_ptr->msg.p_size),
+				(DAT_PVOID *) cm_ptr->msg.p_data,
+				(DAT_PVOID *) &xevent);
 
 		/* done with socket, don't destroy cm_ptr, need pdata */
 		closesocket(cm_ptr->socket);
@@ -766,17 +787,17 @@ ud_bail:
 	{
 		ep_ptr->cm_handle = cm_ptr; /* only RC, multi CR's on UD */
 		dapl_evd_connection_callback(cm_ptr,
-					     IB_CME_CONNECTED,
-					     cm_ptr->p_data, ep_ptr);
+					     event,
+					     cm_ptr->msg.p_data, ep_ptr);
 	}
 	return;
 
 bail:
 	/* close socket, and post error event */
-	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
+	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0);
 	closesocket(cm_ptr->socket);
 	cm_ptr->socket = DAPL_INVALID_SOCKET;
-	dapl_evd_connection_callback(NULL, event, cm_ptr->p_data, ep_ptr);
+	dapl_evd_connection_callback(NULL, event, cm_ptr->msg.p_data, ep_ptr);
 }
 
 /*
@@ -856,8 +877,6 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
 	dp_ib_cm_handle_t acm_ptr;
 	int ret, len, opt = 1;
 
-	dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket_accept\n");
-	
 	/* 
 	 * Accept all CR's on this port to avoid half-connection (SYN_RCV)
 	 * stalls with many to one connection storms
@@ -870,25 +889,28 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
 		acm_ptr->sp = cm_ptr->sp;
 		acm_ptr->hca = cm_ptr->hca;
 
-		len = sizeof(acm_ptr->dst.ia_address);
+		len = sizeof(union dcm_addr);
 		acm_ptr->socket = accept(cm_ptr->socket,
 					(struct sockaddr *)
-					&acm_ptr->dst.ia_address,
-					(socklen_t *) & len);
+					&acm_ptr->msg.daddr.so,
+					(socklen_t *) &len);
 		if (acm_ptr->socket == DAPL_INVALID_SOCKET) {
 			dapl_log(DAPL_DBG_TYPE_ERR,
-				" accept: ERR %s on FD %d l_cr %p\n",
+				" ACCEPT: ERR %s on FD %d l_cr %p\n",
 				strerror(errno), cm_ptr->socket, cm_ptr);
 			dapls_ib_cm_free(acm_ptr, acm_ptr->ep);
 			return;
 		}
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " accepting from %s\n",
+			     inet_ntoa(((struct sockaddr_in *)
+					&acm_ptr->msg.daddr.so)->sin_addr));
 
 		/* no delay for small packets */
 		ret = setsockopt(acm_ptr->socket, IPPROTO_TCP, TCP_NODELAY,
 			   (char *)&opt, sizeof(opt));
 		if (ret)
 			dapl_log(DAPL_DBG_TYPE_WARN,
-				 " accept: NODELAY setsockopt: %s\n",
+				 " ACCEPT: NODELAY setsockopt: %s\n",
 				 strerror(errno));
 
 		acm_ptr->state = DCM_ACCEPTING;
@@ -902,65 +924,57 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
  */
 static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
 {
-	int len;
+	int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE;
 	void *p_data = NULL;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket accepted, read QP data\n");
 
 	/* read in DST QP info, IA address. check for private data */
-	len =
-	    recv(acm_ptr->socket, (char *)&acm_ptr->dst, sizeof(ib_qp_cm_t), 0);
-	if (len != sizeof(ib_qp_cm_t) || ntohs(acm_ptr->dst.ver) != DCM_VER) {
+	len = recv(acm_ptr->socket, (char *)&acm_ptr->msg, exp, 0);
+	if (len != exp || ntohs(acm_ptr->msg.ver) != DCM_VER) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " accept read: ERR %s, rcnt=%d, ver=%d\n",
-			 strerror(errno), len, ntohs(acm_ptr->dst.ver));
+			 " ACCEPT read: ERR %s, rcnt=%d, ver=%d\n",
+			 strerror(errno), len, ntohs(acm_ptr->msg.ver));
 		goto bail;
 	}
 
-	/* convert accepted values to host order */
-	acm_ptr->dst.port = ntohs(acm_ptr->dst.port);
-	acm_ptr->dst.lid = ntohs(acm_ptr->dst.lid);
-	acm_ptr->dst.qpn = ntohl(acm_ptr->dst.qpn);
-#ifdef DAT_EXTENSIONS
-	acm_ptr->dst.qp_type = ntohs(acm_ptr->dst.qp_type);
-#endif
-	acm_ptr->dst.p_size = ntohl(acm_ptr->dst.p_size);
-
-	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " accept: DST %s port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n",
-		     inet_ntoa(((struct sockaddr_in *)&acm_ptr->dst.
-				ia_address)->sin_addr), acm_ptr->dst.port,
-		     acm_ptr->dst.lid, acm_ptr->dst.qpn, acm_ptr->dst.p_size);
+	/* keep the QP, address info in network order */
 
 	/* validate private data size before reading */
-	if (acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE) {
+	exp = ntohs(acm_ptr->msg.p_size);
+	if (exp > DCM_MAX_PDATA_SIZE) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
 			     " accept read: psize (%d) wrong\n",
-			     acm_ptr->dst.p_size);
+			     exp);
 		goto bail;
 	}
 
-	dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket accepted, read private data\n");
-
 	/* read private data into cm_handle if any present */
-	if (acm_ptr->dst.p_size) {
-		len =
-		    recv(acm_ptr->socket, acm_ptr->p_data, acm_ptr->dst.p_size,
-			 0);
-		if (len != acm_ptr->dst.p_size) {
+	if (exp) {
+		len = recv(acm_ptr->socket, acm_ptr->msg.p_data, exp, 0);
+		if (len != exp) {
 			dapl_log(DAPL_DBG_TYPE_ERR,
 				 " accept read pdata: ERR %s, rcnt=%d\n",
 				 strerror(errno), len);
 			goto bail;
 		}
-		dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: psize=%d read\n", len);
-		p_data = acm_ptr->p_data;
+		p_data = acm_ptr->msg.p_data;
 	}
 
 	acm_ptr->state = DCM_ACCEPTING_DATA;
 
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " ACCEPT: DST %s %d port=%d lid=0x%x, qpn=0x%x, psz=%d\n",
+		     inet_ntoa(((struct sockaddr_in *)
+				&acm_ptr->msg.daddr.so)->sin_addr), 
+		     ntohs(((struct sockaddr_in *)
+			     &acm_ptr->msg.daddr.so)->sin_port),
+		     acm_ptr->msg.saddr.ib.port_num, 
+		     ntohs(acm_ptr->msg.saddr.ib.lid), 
+		     ntohl(acm_ptr->msg.saddr.ib.qpn), exp);
+
 #ifdef DAT_EXTENSIONS
-	if (acm_ptr->dst.qp_type == IBV_QPT_UD) {
+	if (acm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) {
 		DAT_IB_EXTENSION_EVENT_DATA xevent;
 
 		/* post EVENT, modify_qp created ah */
@@ -970,9 +984,9 @@ static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
 		dapls_evd_post_cr_event_ext(acm_ptr->sp,
 					    DAT_IB_UD_CONNECTION_REQUEST_EVENT,
 					    acm_ptr,
-					    (DAT_COUNT) acm_ptr->dst.p_size,
-					    (DAT_PVOID *) acm_ptr->p_data,
-					    (DAT_PVOID *) & xevent);
+					    (DAT_COUNT) exp,
+					    (DAT_PVOID *) acm_ptr->msg.p_data,
+					    (DAT_PVOID *) &xevent);
 	} else
 #endif
 		/* trigger CR event and return SUCCESS */
@@ -980,8 +994,8 @@ static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
 				  IB_CME_CONNECTION_REQUEST_PENDING,
 				  p_data, acm_ptr->sp);
 	return;
-      bail:
-	/* close socket, free cm structure, active will see socket close as reject */
+bail:
+	/* close socket, free cm structure, active will see close as rej */
 	dapls_ib_cm_free(acm_ptr, acm_ptr->ep);
 	return;
 }
@@ -997,11 +1011,11 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 {
 	DAPL_IA *ia_ptr = ep_ptr->header.owner_ia;
 	dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle;
-	ib_qp_cm_t local;
+	ib_cm_msg_t local;
 	struct iovec iov[2];
-	int len;
+	int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE;
 
-	if (p_size > IB_MAX_REP_PDATA_SIZE)
+	if (p_size > DCM_MAX_PDATA_SIZE)
 		return DAT_LENGTH_ERROR;
 
 	/* must have a accepted socket */
@@ -1009,13 +1023,16 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 		return DAT_INTERNAL_ERROR;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " ACCEPT_USR: remote port=0x%x lid=0x%x"
+		     " ACCEPT_USR: remote port=%d lid=0x%x"
 		     " qpn=0x%x qp_type %d, psize=%d\n",
-		     cm_ptr->dst.port, cm_ptr->dst.lid,
-		     cm_ptr->dst.qpn, cm_ptr->dst.qp_type, cm_ptr->dst.p_size);
+		     cm_ptr->msg.saddr.ib.port_num, 
+		     ntohs(cm_ptr->msg.saddr.ib.lid),
+		     ntohl(cm_ptr->msg.saddr.ib.qpn), 
+		     cm_ptr->msg.saddr.ib.qp_type, 
+		     ntohs(cm_ptr->msg.p_size));
 
 #ifdef DAT_EXTENSIONS
-	if (cm_ptr->dst.qp_type == IBV_QPT_UD &&
+	if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD &&
 	    ep_ptr->qp_handle->qp_type != IBV_QPT_UD) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
 			     " ACCEPT_USR: ERR remote QP is UD,"
@@ -1027,22 +1044,28 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 	/* modify QP to RTR and then to RTS with remote info already read */
 	dapl_os_lock(&ep_ptr->header.lock);
 	if (dapls_modify_qp_state(ep_ptr->qp_handle,
-				  IBV_QPS_RTR, cm_ptr) != DAT_SUCCESS) {
+				  IBV_QPS_RTR, 
+				  cm_ptr->msg.saddr.ib.qpn,
+				  cm_ptr->msg.saddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " ACCEPT_USR: QPS_RTR ERR %s -> %s\n",
-			 strerror(errno), inet_ntoa(((struct sockaddr_in *)
-						     &cm_ptr->dst.ia_address)->
-						    sin_addr));
+			 strerror(errno), 
+			 inet_ntoa(((struct sockaddr_in *)
+				     &cm_ptr->msg.daddr.so)->sin_addr));
 		dapl_os_unlock(&ep_ptr->header.lock);
 		goto bail;
 	}
 	if (dapls_modify_qp_state(ep_ptr->qp_handle,
-				  IBV_QPS_RTS, cm_ptr) != DAT_SUCCESS) {
+				  IBV_QPS_RTS, 
+				  cm_ptr->msg.saddr.ib.qpn,
+				  cm_ptr->msg.saddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " ACCEPT_USR: QPS_RTS ERR %s -> %s\n",
-			 strerror(errno), inet_ntoa(((struct sockaddr_in *)
-						     &cm_ptr->dst.ia_address)->
-						    sin_addr));
+			 strerror(errno), 
+			 inet_ntoa(((struct sockaddr_in *)
+				     &cm_ptr->msg.daddr.so)->sin_addr));
 		dapl_os_unlock(&ep_ptr->header.lock);
 		goto bail;
 	}
@@ -1050,53 +1073,50 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 
 	/* save remote address information */
 	dapl_os_memcpy(&ep_ptr->remote_ia_address,
-		       &cm_ptr->dst.ia_address,
-		       sizeof(ep_ptr->remote_ia_address));
+		       &cm_ptr->msg.daddr.so,
+		       sizeof(union dcm_addr));
 
 	/* send our QP info, IA address, pdata. Don't overwrite dst data */
 	local.ver = htons(DCM_VER);
-	local.rej = 0;
-	local.qpn = htonl(ep_ptr->qp_handle->qp_num);
-	local.qp_type = htons(ep_ptr->qp_handle->qp_type);
-	local.port = htons(ia_ptr->hca_ptr->port_num);
-	local.lid = ia_ptr->hca_ptr->ib_trans.lid;
-	local.gid = ia_ptr->hca_ptr->ib_trans.gid;
-	local.ia_address = ia_ptr->hca_ptr->hca_address;
-	((struct sockaddr_in *)&local.ia_address)->sin_port = 
-		ntohs(cm_ptr->sp->conn_qual);
-
-	local.p_size = htonl(p_size);
+	local.op = htons(DCM_REP);
+	local.saddr.ib.qpn = htonl(ep_ptr->qp_handle->qp_num);
+	local.saddr.ib.qp_type = ep_ptr->qp_handle->qp_type;
+	local.saddr.ib.port_num = ia_ptr->hca_ptr->port_num;
+	local.saddr.ib.lid = ia_ptr->hca_ptr->ib_trans.lid;
+	local.saddr.ib.gid = ia_ptr->hca_ptr->ib_trans.gid;
+	local.daddr.so = ia_ptr->hca_ptr->hca_address;
+	((struct sockaddr_in *)&local.daddr.so)->sin_port = 
+				htons((uint16_t)cm_ptr->sp->conn_qual);
+
+	local.p_size = htons(p_size);
 	iov[0].iov_base = (void *)&local;
-	iov[0].iov_len = sizeof(ib_qp_cm_t);
+	iov[0].iov_len = exp;
 	if (p_size) {
 		iov[1].iov_base = p_data;
 		iov[1].iov_len = p_size;
 		len = writev(cm_ptr->socket, iov, 2);
-	} else {
+	} else 
 		len = writev(cm_ptr->socket, iov, 1);
-	}
-
-	if (len != (p_size + sizeof(ib_qp_cm_t))) {
+	
+	if (len != (p_size + exp)) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n",
-			 strerror(errno), len, inet_ntoa(((struct sockaddr_in *)
-							  &cm_ptr->dst.
-							  ia_address)->
-							 sin_addr));
+			 strerror(errno), len, 
+			 inet_ntoa(((struct sockaddr_in *)
+				   &cm_ptr->msg.daddr.so)->sin_addr));
 		goto bail;
 	}
 
 	dapl_dbg_log(DAPL_DBG_TYPE_CM,
-		     " ACCEPT_USR: local port=0x%x lid=0x%x"
-		     " qpn=0x%x psize=%d\n",
-		     ntohs(local.port), ntohs(local.lid),
-		     ntohl(local.qpn), ntohl(local.p_size));
+		     " ACCEPT_USR: local port=%d lid=0x%x qpn=0x%x psz=%d\n",
+		     local.saddr.ib.port_num, ntohs(local.saddr.ib.lid),
+		     ntohl(local.saddr.ib.qpn), ntohs(local.p_size));
 	dapl_dbg_log(DAPL_DBG_TYPE_CM,
-		     " ACCEPT_USR SRC GID subnet %016llx id %016llx\n",
+		     " ACCEPT_USR: SRC GID subnet %016llx id %016llx\n",
 		     (unsigned long long)
-		     htonll(local.gid.global.subnet_prefix),
+		     htonll(local.saddr.ib.gid.global.subnet_prefix),
 		     (unsigned long long)
-		     htonll(local.gid.global.interface_id));
+		     htonll(local.saddr.ib.gid.global.interface_id));
 
 	/* save state and reference to EP, queue for RTU data */
 	cm_ptr->ep = ep_ptr;
@@ -1107,7 +1127,7 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 	return DAT_SUCCESS;
       bail:
 	dapls_ib_cm_free(cm_ptr, cm_ptr->ep);
-	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
+	dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0);
 	return DAT_INTERNAL_ERROR;
 }
 
@@ -1117,16 +1137,15 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr,
 void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr)
 {
 	int len;
-	short rtu_data = 0;
 
-	/* complete handshake after final QP state change */
-	len = recv(cm_ptr->socket, (char *)&rtu_data, sizeof(rtu_data), 0);
-	if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) {
+	/* complete handshake after final QP state change, VER and OP */
+	len = recv(cm_ptr->socket, (char *)&cm_ptr->msg, 4, 0);
+	if (len != 4 || ntohs(cm_ptr->msg.op) != DCM_RTU) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
-			 " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n",
-			 strerror(errno), len, ntohs(rtu_data),
+			 " ACCEPT_RTU: rcv ERR, rcnt=%d op=%x -> %s\n",
+			 len, ntohs(cm_ptr->msg.op),
 			 inet_ntoa(((struct sockaddr_in *)
-				    &cm_ptr->dst.ia_address)->sin_addr));
+				    &cm_ptr->msg.daddr.so)->sin_addr));
 		goto bail;
 	}
 
@@ -1137,25 +1156,26 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, " PASSIVE: connected!\n");
 
 #ifdef DAT_EXTENSIONS
-	if (cm_ptr->dst.qp_type == IBV_QPT_UD) {
+	if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) {
 		DAT_IB_EXTENSION_EVENT_DATA xevent;
 
 		/* post EVENT, modify_qp created ah */
 		xevent.status = 0;
 		xevent.type = DAT_IB_UD_PASSIVE_REMOTE_AH;
 		xevent.remote_ah.ah = cm_ptr->ah;
-		xevent.remote_ah.qpn = cm_ptr->dst.qpn;
+		xevent.remote_ah.qpn = cm_ptr->msg.saddr.ib.qpn;
 		dapl_os_memcpy(&xevent.remote_ah.ia_addr,
-			       &cm_ptr->dst.ia_address,
-			       sizeof(cm_ptr->dst.ia_address));
-
-		dapls_evd_post_connection_event_ext((DAPL_EVD *) cm_ptr->ep->
-						    param.connect_evd_handle,
-						    DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED,
-						    (DAT_EP_HANDLE) cm_ptr->ep,
-						    (DAT_COUNT) cm_ptr->dst.p_size,
-						    (DAT_PVOID *) cm_ptr->p_data,
-						    (DAT_PVOID *) &xevent);
+			       &cm_ptr->msg.daddr.so,
+			       sizeof(union dcm_addr));
+
+		dapls_evd_post_connection_event_ext(
+				(DAPL_EVD *) 
+				cm_ptr->ep->param.connect_evd_handle,
+				DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED,
+				(DAT_EP_HANDLE) cm_ptr->ep,
+				(DAT_COUNT) cm_ptr->msg.p_size,
+				(DAT_PVOID *) cm_ptr->msg.p_data,
+				(DAT_PVOID *) &xevent);
 
                 /* done with socket, don't destroy cm_ptr, need pdata */
                 closesocket(cm_ptr->socket);
@@ -1169,7 +1189,7 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr)
 	return;
       
 bail:
-	dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0);
+	dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0, 0, 0);
 	dapls_ib_cm_free(cm_ptr, cm_ptr->ep);
 	dapls_cr_callback(cm_ptr, IB_CME_DESTINATION_REJECT, NULL, cm_ptr->sp);
 }
@@ -1237,7 +1257,7 @@ dapls_ib_disconnect(IN DAPL_EP * ep_ptr, IN DAT_CLOSE_FLAGS close_flags)
 		     "dapls_ib_disconnect(ep_handle %p ....)\n", ep_ptr);
 
 	/* Transition to error state to flush queue */
-        dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0);
+        dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0);
 	
 	if (ep_ptr->cm_handle == NULL ||
 	    ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED)
@@ -1429,19 +1449,16 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr,
 		     " reject(cm %p reason %x, pdata %p, psize %d)\n",
 		     cm_ptr, reason, pdata, psize);
 
-        if (psize > IB_MAX_REJ_PDATA_SIZE)
+        if (psize > DCM_MAX_PDATA_SIZE)
                 return DAT_LENGTH_ERROR;
 
 	/* write reject data to indicate reject */
 	if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
-		cm_ptr->dst.rej = (uint16_t) reason;
-		cm_ptr->dst.rej = htons(cm_ptr->dst.rej);
-		cm_ptr->dst.p_size = htonl(psize);
-		/* get qp_type from request */
-		cm_ptr->dst.qp_type = ntohs(cm_ptr->dst.qp_type);
-
-		iov[0].iov_base = (void *)&cm_ptr->dst;
-		iov[0].iov_len = sizeof(ib_qp_cm_t);
+		cm_ptr->msg.op = htons(DCM_REJ_USER);
+		cm_ptr->msg.p_size = htons(psize);
+		
+		iov[0].iov_base = (void *)&cm_ptr->msg;
+		iov[0].iov_len = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE;
 		if (psize) {
 			iov[1].iov_base = pdata;
 			iov[1].iov_len = psize;
@@ -1457,10 +1474,7 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr,
 
 	/* cr_thread will destroy CR */
 	cm_ptr->state = DCM_DESTROY;
-	if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
-		dapl_log(DAPL_DBG_TYPE_CM,
-			 " cm_destroy: thread wakeup error = %s\n",
-			 strerror(errno));
+	send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0);
 	return DAT_SUCCESS;
 }
 
@@ -1501,7 +1515,7 @@ dapls_ib_cm_remote_addr(IN DAT_HANDLE dat_handle,
 		return DAT_INVALID_HANDLE;
 
 	dapl_os_memcpy(remote_ia_address,
-		       &ib_cm_handle->dst.ia_address, sizeof(DAT_SOCK_ADDR6));
+		       &ib_cm_handle->msg.daddr.so, sizeof(DAT_SOCK_ADDR6));
 
 	return DAT_SUCCESS;
 }
@@ -1533,38 +1547,16 @@ int dapls_ib_private_data_size(IN DAPL_PRIVATE * prd_ptr,
 	int size;
 
 	switch (conn_op) {
-	case DAPL_PDATA_CONN_REQ:
-		{
-			size = IB_MAX_REQ_PDATA_SIZE;
-			break;
-		}
-	case DAPL_PDATA_CONN_REP:
-		{
-			size = IB_MAX_REP_PDATA_SIZE;
-			break;
-		}
-	case DAPL_PDATA_CONN_REJ:
-		{
-			size = IB_MAX_REJ_PDATA_SIZE;
+		case DAPL_PDATA_CONN_REQ:
+		case DAPL_PDATA_CONN_REP:
+		case DAPL_PDATA_CONN_REJ:
+		case DAPL_PDATA_CONN_DREQ:
+		case DAPL_PDATA_CONN_DREP:
+			size = DCM_MAX_PDATA_SIZE;
 			break;
-		}
-	case DAPL_PDATA_CONN_DREQ:
-		{
-			size = IB_MAX_DREQ_PDATA_SIZE;
-			break;
-		}
-	case DAPL_PDATA_CONN_DREP:
-		{
-			size = IB_MAX_DREP_PDATA_SIZE;
-			break;
-		}
-	default:
-		{
+		default:
 			size = 0;
-		}
-
-	}			/* end case */
-
+	}			
 	return size;
 }
 
@@ -1717,27 +1709,26 @@ void cr_thread(void *arg)
 				continue;
 
 			event = (cr->state == DCM_CONN_PENDING) ?
-			    DAPL_FD_WRITE : DAPL_FD_READ;
+						DAPL_FD_WRITE : DAPL_FD_READ;
+
 			if (dapl_fd_set(cr->socket, set, event)) {
 				dapl_log(DAPL_DBG_TYPE_ERR,
 					 " cr_thread: DESTROY CR st=%d fd %d"
 					 " -> %s\n", cr->state, cr->socket,
 					 inet_ntoa(((struct sockaddr_in *)
-						    &cr->dst.ia_address)->
-						   sin_addr));
+						&cr->msg.daddr.so)->sin_addr));
 				dapls_ib_cm_free(cr, cr->ep);
 				continue;
 			}
 
 			dapl_dbg_log(DAPL_DBG_TYPE_CM,
-				     " poll cr=%p, socket=%d\n", cr,
-				     cr->socket);
+				     " poll cr=%p, sck=%d\n", cr, cr->socket);
 			dapl_os_unlock(&hca_ptr->ib_trans.lock);
 
 			ret = dapl_poll(cr->socket, event);
 
 			dapl_dbg_log(DAPL_DBG_TYPE_CM,
-				     " poll ret=0x%x cr->state=%d socket=%d\n",
+				     " poll ret=0x%x cr->state=%d sck=%d\n",
 				     ret, cr->state, cr->socket);
 
 			/* data on listen, qp exchange, and on disc req */
@@ -1783,7 +1774,7 @@ void cr_thread(void *arg)
 				     " poll=%d cr->st=%s sk=%d ep %p, %d\n",
 				     ret, dapl_cm_state_str(cr->state), 
 				     cr->socket, cr->ep,
-				     cr->ep ? cr->ep->param.ep_state:0);
+				     cr->ep ? cr->ep->param.ep_state : 0);
 				dapli_socket_disconnect(cr);
 			}
 			dapl_os_lock(&hca_ptr->ib_trans.lock);
@@ -1846,17 +1837,17 @@ void dapls_print_cm_list(IN DAPL_IA *ia_ptr)
 
 		printf( "  CONN[%d]: sp %p ep %p sock %d %s %s %s %s %d\n",
 			i, cr->sp, cr->ep, cr->socket,
-			cr->dst.qp_type == IBV_QPT_RC ? "RC" : "UD",
+			cr->msg.saddr.ib.qp_type == IBV_QPT_RC ? "RC" : "UD",
 			dapl_cm_state_str(cr->state),
 			cr->sp ? "<-" : "->",
 			cr->state == DCM_LISTEN ? 
 			inet_ntoa(((struct sockaddr_in *)
 				&ia_ptr->hca_ptr->hca_address)->sin_addr) :
 			inet_ntoa(((struct sockaddr_in *)
-				&cr->dst.ia_address)->sin_addr),
+				&cr->msg.daddr.so)->sin_addr),
 			cr->sp ? (int)cr->sp->conn_qual : 
 			ntohs(((struct sockaddr_in *)
-				&cr->dst.ia_address)->sin_port));
+				&cr->msg.daddr.so)->sin_port));
 		i++;
 	}
 	printf("\n");
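
Note on the RTU handshake change above: dapli_socket_accept_rtu() now reads
4 bytes (VER and OP, both 16-bit, network order) and compares the op code
with DCM_RTU. A minimal sketch of that check follows, assuming a
hypothetical two-field header and made-up constant values (EX_DCM_VER,
EX_DCM_RTU); the real definitions live in the dapl headers and are not
shown in this patch. Unlike the patch, the sketch also checks the version.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* ntohs, htons */

/* Hypothetical values for illustration only; not the real DCM_VER/DCM_RTU. */
#define EX_DCM_VER 4
#define EX_DCM_RTU 3

/* Hypothetical 4-byte header: both fields are in network order on the wire. */
struct ex_hdr {
	uint16_t ver;
	uint16_t op;
};

/*
 * Return 0 if buf holds a complete header with the expected op,
 * -1 on a short (or long) read, -2 on a version or op mismatch.
 */
static int ex_check_hdr(const void *buf, int len, uint16_t want_op)
{
	struct ex_hdr h;

	if (len != (int)sizeof(h))
		return -1;
	memcpy(&h, buf, sizeof(h));	/* copy out, avoids unaligned access */
	if (ntohs(h.ver) != EX_DCM_VER || ntohs(h.op) != want_op)
		return -2;
	return 0;
}
```

A caller would pass the buffer and byte count returned by recv(), for
example ex_check_hdr(buf, len, EX_DCM_RTU), and bail out on any nonzero
result.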
diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h
index 933364c..d6950fa 100644
--- a/dapl/openib_scm/dapl_ib_util.h
+++ b/dapl/openib_scm/dapl_ib_util.h
@@ -40,8 +40,7 @@ struct ib_cm_handle
 	struct dapl_hca		*hca;
 	struct dapl_sp		*sp;	
 	struct dapl_ep 		*ep;
-	ib_qp_cm_t		dst;
-	unsigned char		p_data[256];	/* must follow ib_qp_cm_t */
+	ib_cm_msg_t		msg;
 	struct ibv_ah		*ah;
 };
 
@@ -66,15 +65,6 @@ typedef dp_ib_cm_handle_t	ib_cm_srvc_handle_t;
 #define SCM_HOP_LIMIT	0xff
 #define SCM_TCLASS	0
 
-/* CM private data areas */
-#define	IB_MAX_REQ_PDATA_SIZE	92
-#define	IB_MAX_REP_PDATA_SIZE	196
-#define	IB_MAX_REJ_PDATA_SIZE	148
-#define	IB_MAX_DREQ_PDATA_SIZE	220
-#define	IB_MAX_DREP_PDATA_SIZE	224
-#define	IB_MAX_RTU_PDATA_SIZE	224
-
-
 /* ib_hca_transport_t, specific to this implementation */
 typedef struct _ib_hca_transport
 { 
@@ -120,11 +110,8 @@ void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr);
 void dapli_async_event_cb(struct _ib_hca_transport *tp);
 void dapli_cq_event_cb(struct _ib_hca_transport *tp);
 DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr);
-void dapls_print_cm_list(IN DAPL_IA *ia_ptr);
 dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep);
 void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep);
-DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
-				 IN ib_qp_state_t qp_state,
-				 IN dp_ib_cm_handle_t cm);
+void dapls_print_cm_list(IN DAPL_IA *ia_ptr);
 
 #endif /*  _DAPL_IB_UTIL_H_ */
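
Note on the private data sizes removed above: with a single
DCM_MAX_PDATA_SIZE, the number of bytes sent is the fixed part of the
message plus the private data actually in use, as in
"sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE" plus p_size. A minimal sketch
of that arithmetic follows, assuming a hypothetical message layout
(struct ex_msg) and a made-up limit (EX_MAX_PDATA); neither is the real
ib_cm_msg_t or DCM_MAX_PDATA_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_PDATA 118	/* hypothetical stand-in for DCM_MAX_PDATA_SIZE */

/*
 * Hypothetical layout: fixed fields followed by a maximum-size private
 * data area at the end of the structure.
 */
struct ex_msg {
	uint16_t ver;
	uint16_t op;
	uint16_t p_size;		/* bytes of p_data in use */
	uint8_t  p_data[EX_MAX_PDATA];
};

/* Bytes to put on the wire, or 0 if p_size exceeds the limit. */
static size_t ex_wire_len(uint16_t p_size)
{
	if (p_size > EX_MAX_PDATA)
		return 0;
	return (sizeof(struct ex_msg) - EX_MAX_PDATA) + p_size;
}
```

The subtraction only gives the true header length when the private data
array is the last member and the structure has no trailing padding, which
is why the sketch keeps p_data at the end.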
diff --git a/dapl/openib_ucm/README b/dapl/openib_ucm/README
new file mode 100644
index 0000000..239dfe6
--- /dev/null
+++ b/dapl/openib_ucm/README
@@ -0,0 +1,40 @@
+
+OpenIB uDAPL provider using socket-based CM, in lieu of uCM/uAT, to set up QP/channels.
+
+to build:
+
+cd dapl/udapl
+make VERBS=openib_scm clean
+make VERBS=openib_scm
+
+
+Modifications to common code:
+
+- added dapl/openib_scm directory 
+
+	dapl/udapl/Makefile
+
+New files for openib_scm provider
+
+	dapl/openib/dapl_ib_cq.c
+	dapl/openib/dapl_ib_dto.h
+	dapl/openib/dapl_ib_mem.c
+	dapl/openib/dapl_ib_qp.c
+	dapl/openib/dapl_ib_util.c
+	dapl/openib/dapl_ib_util.h
+	dapl/openib/dapl_ib_cm.c
+
+A simple dapl test just for openib_scm testing...
+
+	test/dtest/dtest.c
+	test/dtest/makefile
+
+	server:	dtest -s 
+	client:	dtest -h hostname
+
+known issues:
+
+	no memory windows support in ibverbs, dat_create_rmr fails.
+	
+
+
diff --git a/dapl/openib_ucm/SOURCES b/dapl/openib_ucm/SOURCES
new file mode 100644
index 0000000..dfe956f
--- /dev/null
+++ b/dapl/openib_ucm/SOURCES
@@ -0,0 +1,53 @@
+!if $(FREEBUILD)
+TARGETNAME=dapl2-ofa-ucm
+!else
+TARGETNAME=dapl2-ofa-ucmd
+!endif
+
+TARGETPATH = ..\..\..\..\bin\user\obj$(BUILD_ALT_DIR)
+TARGETTYPE = DYNLINK
+DLLENTRY = _DllMainCRTStartup
+
+!if $(_NT_TOOLS_VERSION) == 0x700
+DLLDEF=$O\udapl_ofa_ucm_exports.def
+!else
+DLLDEF=$(OBJ_PATH)\$O\udapl_ofa_ucm_exports.def
+!endif
+
+USE_MSVCRT = 1
+
+SOURCES = \
+	udapl.rc \
+	..\dapl_common_src.c	\
+	..\dapl_udapl_src.c		\
+	dapl_ib_cq.c			\
+	dapl_ib_extensions.c	\
+	dapl_ib_mem.c			\
+	dapl_ib_qp.c			\
+	dapl_ib_util.c			\
+	dapl_ib_cm.c
+
+INCLUDES = ..\include;..\common;windows;..\..\dat\include;\
+		   ..\..\dat\udat\windows;..\udapl\windows;\
+		   ..\..\..\..\inc;..\..\..\..\inc\user;..\..\..\libibverbs\include
+
+DAPL_OPTS = -DEXPORT_DAPL_SYMBOLS -DDAT_EXTENSIONS -DSOCK_CM -DOPENIB -DCQ_WAIT_OBJECT
+
+USER_C_FLAGS = $(USER_C_FLAGS) $(DAPL_OPTS)
+
+!if !$(FREEBUILD)
+USER_C_FLAGS = $(USER_C_FLAGS) -DDAPL_DBG
+!endif
+
+TARGETLIBS= \
+	$(SDK_LIB_PATH)\kernel32.lib \
+	$(SDK_LIB_PATH)\ws2_32.lib \
+!if $(FREEBUILD)
+	$(TARGETPATH)\*\dat2.lib \
+	$(TARGETPATH)\*\libibverbs.lib
+!else
+	$(TARGETPATH)\*\dat2d.lib \
+	$(TARGETPATH)\*\libibverbsd.lib
+!endif
+
+MSC_WARNING_LEVEL = /W1 /wd4113
diff --git a/dapl/openib_ucm/cm.c b/dapl/openib_ucm/cm.c
new file mode 100644
index 0000000..ab3823e
--- /dev/null
+++ b/dapl/openib_ucm/cm.c
@@ -0,0 +1,1837 @@
+/*
+ * Copyright (c) 2009 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+
+#include "dapl.h"
+#include "dapl_adapter_util.h"
+#include "dapl_evd_util.h"
+#include "dapl_cr_util.h"
+#include "dapl_name_service.h"
+#include "dapl_ib_util.h"
+#include "dapl_osd.h"
+
+
+#if defined(_WIN32) || defined(_WIN64)
+enum DAPL_FD_EVENTS {
+	DAPL_FD_READ = 0x1,
+	DAPL_FD_WRITE = 0x2,
+	DAPL_FD_ERROR = 0x4
+};
+
+struct dapl_fd_set {
+	struct fd_set set[3];
+};
+
+static struct dapl_fd_set *dapl_alloc_fd_set(void)
+{
+	return dapl_os_alloc(sizeof(struct dapl_fd_set));
+}
+
+static void dapl_fd_zero(struct dapl_fd_set *set)
+{
+	FD_ZERO(&set->set[0]);
+	FD_ZERO(&set->set[1]);
+	FD_ZERO(&set->set[2]);
+}
+
+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
+		       enum DAPL_FD_EVENTS event)
+{
+	FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 0 : 1]);
+	FD_SET(s, &set->set[2]);
+	return 0;
+}
+
+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event)
+{
+	struct fd_set rw_fds;
+	struct fd_set err_fds;
+	struct timeval tv;
+	int ret;
+
+	FD_ZERO(&rw_fds);
+	FD_ZERO(&err_fds);
+	FD_SET(s, &rw_fds);
+	FD_SET(s, &err_fds);
+
+	tv.tv_sec = 0;
+	tv.tv_usec = 0;
+
+	if (event == DAPL_FD_READ)
+		ret = select(1, &rw_fds, NULL, &err_fds, &tv);
+	else
+		ret = select(1, NULL, &rw_fds, &err_fds, &tv);
+
+	if (ret == 0)
+		return 0;
+	else if (ret == SOCKET_ERROR)
+		return WSAGetLastError();
+	else if (FD_ISSET(s, &rw_fds))
+		return event;
+	else
+		return DAPL_FD_ERROR;
+}
+
+static int dapl_select(struct dapl_fd_set *set)
+{
+	int ret;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: sleep\n");
+	ret = select(0, &set->set[0], &set->set[1], &set->set[2], NULL);
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: wakeup\n");
+
+	if (ret == SOCKET_ERROR)
+		dapl_dbg_log(DAPL_DBG_TYPE_CM,
+			     " dapl_select: error 0x%x\n", WSAGetLastError());
+
+	return ret;
+}
+#else				// _WIN32 || _WIN64
+enum DAPL_FD_EVENTS {
+	DAPL_FD_READ = POLLIN,
+	DAPL_FD_WRITE = POLLOUT,
+	DAPL_FD_ERROR = POLLERR
+};
+
+struct dapl_fd_set {
+	int index;
+	struct pollfd set[DAPL_FD_SETSIZE];
+};
+
+static struct dapl_fd_set *dapl_alloc_fd_set(void)
+{
+	return dapl_os_alloc(sizeof(struct dapl_fd_set));
+}
+
+static void dapl_fd_zero(struct dapl_fd_set *set)
+{
+	set->index = 0;
+}
+
+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
+		       enum DAPL_FD_EVENTS event)
+{
+	if (set->index == DAPL_FD_SETSIZE - 1) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n",
+			 set->index + 1);
+		return -1;
+	}
+
+	set->set[set->index].fd = s;
+	set->set[set->index].revents = 0;
+	set->set[set->index++].events = event;
+	return 0;
+}
+
+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event)
+{
+	struct pollfd fds;
+	int ret;
+
+	fds.fd = s;
+	fds.events = event;
+	fds.revents = 0;
+	ret = poll(&fds, 1, 0);
+	dapl_log(DAPL_DBG_TYPE_CM, " dapl_poll: fd=%d ret=%d, evnts=0x%x\n",
+		 s, ret, fds.revents);
+	if (ret == 0)
+		return 0;
+	else if (fds.revents & (POLLERR | POLLHUP | POLLNVAL)) 
+		return DAPL_FD_ERROR;
+	else 
+		return fds.revents;
+}
+
+static int dapl_select(struct dapl_fd_set *set)
+{
+	int ret;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: sleep, fds=%d\n",
+		     set->index);
+	ret = poll(set->set, set->index, -1);
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: wakeup, ret=0x%x\n", ret);
+	return ret;
+}
+#endif
+
+/* forward declarations */
+static void ucm_accept(ib_cm_srvc_handle_t cm, ib_cm_msg_t *msg);
+static void ucm_connect_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg);
+static void ucm_accept_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg);
+static int ucm_send(ib_hca_transport_t *tp, ib_cm_msg_t *msg);
+DAT_RETURN dapli_cm_disconnect(dp_ib_cm_handle_t cm);
+
+#define UCM_SND_BURST	100
+
+/* Service ids - port space */
+static uint16_t ucm_get_port(ib_hca_transport_t *tp, uint16_t port)
+{
+	int i = 0;
+	
+	dapl_os_lock(&tp->plock);
+	/* get specific ID */
+	if (port) {
+		if (tp->sid[port] == 0) {
+			tp->sid[port] = 1;
+			i = port;
+		}
+		goto done;
+	} 
+	
+	/* get any free ID */
+	for (i = 0xffff; i > 0; i--) {
+		if (tp->sid[i] == 0) {
+			tp->sid[i] = 1;
+			break;
+		}
+	}
+done:
+	dapl_os_unlock(&tp->plock);
+	return i;
+}
+
+static void ucm_free_port(ib_hca_transport_t *tp, uint16_t port)
+{
+	dapl_os_lock(&tp->plock);
+	tp->sid[port] = 0;
+	dapl_os_unlock(&tp->plock);
+}
+
+/* SEND CM MESSAGE PROCESSING */
+
+/* Get CM UD message from send queue, called with s_lock held */
+static ib_cm_msg_t *ucm_get_smsg(ib_hca_transport_t *tp)
+{
+	ib_cm_msg_t *msg = NULL; 
+	int ret, polled = 0, hd = tp->s_hd;
+
+	hd++;
+retry:
+	if (hd == tp->qpe)
+		hd = 0;
+
+	if (hd == tp->s_tl)
+		msg = NULL;
+	else {
+		msg = &tp->sbuf[hd];
+		tp->s_hd = hd; /* new hd */
+	}
+
+	/* if empty, process some completions */
+	if ((msg == NULL) && (!polled)) {
+		struct ibv_wc wc;
+
+		/* process completions, based on UCM_SND_BURST */
+		ret = ibv_poll_cq(tp->scq, 1, &wc);
+		if (ret < 0) {
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				" get_smsg: cq %p %s\n", 
+				tp->scq, strerror(errno));
+		}
+		/* free up completed sends, update tail */
+		if (ret > 0) {
+			tp->s_tl = (int)wc.wr_id;
+			dapl_log(DAPL_DBG_TYPE_CM,
+				" get_smsg: wr_cmp (%d) s_tl=%d\n", 
+				wc.status, tp->s_tl);
+		}
+		polled++;
+		goto retry;
+	}
+	return msg;
+}
+
+/* RECEIVE CM MESSAGE PROCESSING */
+
+static int ucm_post_rmsg(ib_hca_transport_t *tp, ib_cm_msg_t *msg)
+{	
+	struct ibv_recv_wr recv_wr, *recv_err;
+	struct ibv_sge sge;
+        
+	recv_wr.next = NULL;
+	recv_wr.sg_list = &sge;
+	recv_wr.num_sge = 1;
+	recv_wr.wr_id = (uint64_t)(uintptr_t) msg;
+	sge.length = sizeof(ib_cm_msg_t) + sizeof(struct ibv_grh);
+	sge.lkey = tp->mr_rbuf->lkey;
+	sge.addr = (uintptr_t)((char *)msg - sizeof(struct ibv_grh));
+	
+	return (ibv_post_recv(tp->qp, &recv_wr, &recv_err));
+}
+
+static int ucm_reject(ib_hca_transport_t *tp, ib_cm_msg_t *msg)
+{
+	ib_cm_msg_t	smsg;
+
+	/* setup op, rearrange the src, dst cm and addr info */
+	(void)dapl_os_memzero(&smsg, sizeof(smsg));
+	smsg.ver = htons(DCM_VER);
+	smsg.op = htons(DCM_REJ_CM);
+	smsg.dport = msg->sport;
+	smsg.dqpn = msg->sqpn;
+	smsg.sport = msg->dport; 
+	smsg.sqpn = msg->dqpn;
+
+	dapl_os_memcpy(&smsg.daddr, &msg->saddr, sizeof(union dcm_addr));
+	dapl_os_memcpy(&smsg.saddr, &msg->daddr, sizeof(union dcm_addr));
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+		     " CM reject -> LID %x, QPN %x PORT %d\n", 
+		     ntohs(smsg.daddr.ib.lid),
+		     ntohl(smsg.dqpn), ntohs(smsg.dport));
+
+	return (ucm_send(tp, &smsg));
+}
+
+static void ucm_process_recv(ib_hca_transport_t *tp, 
+			     ib_cm_msg_t *msg, 
+			     dp_ib_cm_handle_t cm)
+{
+	dapl_os_lock(&cm->lock);
+	switch (cm->state) {
+	case DCM_LISTEN:
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: LISTEN\n");
+		dapl_os_unlock(&cm->lock);
+		ucm_accept(cm, msg);
+		break;
+	case DCM_ACCEPTED:
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: ACCEPT_RTU\n");
+		dapl_os_unlock(&cm->lock);
+		ucm_accept_rtu(cm, msg);
+		break;
+	case DCM_CONN_PENDING:
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: CONN_RTU\n");
+		dapl_os_unlock(&cm->lock);
+		ucm_connect_rtu(cm, msg);
+		break;
+	case DCM_CONNECTED:
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: DREQ connect\n");
+		dapl_os_unlock(&cm->lock);
+		if (ntohs(msg->op) == DCM_DREQ)
+			dapli_cm_disconnect(cm);
+		break;
+	case DCM_DISC_PENDING:
+	case DCM_DESTROY:
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: DREQ toss\n");
+		break;
+	default:
+		dapl_log(DAPL_DBG_TYPE_WARN,
+				" process_recv: UNKNOWN state"
+				" <- op %d, st %d spsp %d sqpn %d\n", 
+				ntohs(msg->op), cm->state, 
+				ntohs(msg->sport), ntohl(msg->sqpn));
+		dapl_os_unlock(&cm->lock);
+		break;
+	}
+}
+
+/* Find matching CM object for this receive message, return CM reference */
+dp_ib_cm_handle_t ucm_cm_find(ib_hca_transport_t *tp, ib_cm_msg_t *msg)
+{
+	dp_ib_cm_handle_t cm, next, found = NULL;
+	struct dapl_llist_entry	*list;
+	DAPL_OS_LOCK lock;
+
+	/* connect request - listen list, otherwise conn list */
+	if (ntohs(msg->op) == DCM_REQ) {
+		dapl_dbg_log(DAPL_DBG_TYPE_CM," search - listenQ\n");
+		list = tp->llist;
+		lock = tp->llock;
+	} else {
+		dapl_dbg_log(DAPL_DBG_TYPE_CM," search - connectQ\n");
+		list = tp->list;
+		lock = tp->lock;
+	}
+
+	dapl_os_lock(&lock);
+        if (!dapl_llist_is_empty(&list))
+		next = dapl_llist_peek_head(&list);
+	else
+		next = NULL;
+
+	while (next) {
+		cm = next;
+		next = dapl_llist_next_entry(&list,
+					     (DAPL_LLIST_ENTRY *)&cm->entry);
+		if (cm->state == DCM_DESTROY)
+			continue;
+		
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+			     " MATCH? cm %p st %s sport %x sqpn %x lid %x\n", 
+			     cm, dapl_cm_state_str(cm->state),
+			     ntohs(cm->msg.sport), ntohl(cm->msg.sqpn),
+			     ntohs(cm->msg.saddr.ib.lid));
+
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+			     "  src port %d=%d, sqp %x=%x slid %x=%x, iqp %x=%x\n",
+			     ntohs(cm->msg.sport), ntohs(msg->dport), 
+			     ntohl(cm->msg.sqpn), ntohl(msg->dqpn),
+			     ntohs(cm->msg.saddr.ib.lid), 
+			     ntohs(msg->daddr.ib.lid),
+			     ntohl(cm->msg.saddr.ib.qpn),  
+			     ntohl(msg->daddr.ib.qpn));
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+			     "  dst port %d=%d, sqp %x=%x slid %x=%x, iqp %x=%x\n",
+			     ntohs(cm->msg.dport), ntohs(msg->sport), 
+			     ntohl(cm->msg.dqpn), ntohl(msg->sqpn),
+			     ntohs(cm->msg.daddr.ib.lid), 
+			     ntohs(msg->saddr.ib.lid),
+			     ntohl(cm->msg.daddr.ib.qpn),  
+			     ntohl(msg->saddr.ib.qpn));
+
+		/* REQ: CM sPORT + QPN, match is good enough */
+		if ((cm->msg.sport == msg->dport) && 
+		    (cm->msg.sqpn == msg->dqpn)) {
+			if (ntohs(msg->op) == DCM_REQ) {
+				found = cm;
+				break;
+			/* NOT REQ: add remote CM sPORT, QPN, LID match */
+			} else if ((cm->msg.dport == msg->sport) &&
+				   (cm->msg.dqpn == msg->sqpn)  &&
+				   (cm->msg.daddr.ib.lid == 
+				    msg->saddr.ib.lid)) { 
+				found = cm;
+				break;
+			}
+		}
+	}
+	dapl_os_unlock(&lock);
+	return found;
+}
+
+/* Get rmsgs from CM completion queue, 10 at a time */
+static void ucm_recv(ib_hca_transport_t *tp)
+{
+	struct ibv_wc wc[10];
+	ib_cm_msg_t *msg;
+	dp_ib_cm_handle_t cm;
+	int i, ret, notify = 0;
+	struct ibv_cq *ibv_cq = NULL;
+	DAPL_HCA *hca;
+
+	/* POLLIN on channel FD */
+	ret = ibv_get_cq_event(tp->rch, &ibv_cq, (void *)&hca);
+	if (ret == 0) {
+		ibv_ack_cq_events(ibv_cq, 1);
+	}
+retry:	
+	ret = ibv_poll_cq(tp->rcq, 10, wc);
+	if (ret <= 0) {
+		if (!ret && !notify) {
+			ibv_req_notify_cq(tp->rcq, 0);
+			notify = 1;
+			goto retry;
+		}
+		return;
+	} else 
+		notify = 0;
+	
+	for (i = 0; i < ret; i++) {
+		msg = (ib_cm_msg_t*)wc[i].wr_id;
+
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+			     " ucm_recv: wc status=%d, ln=%d id=%p sqp=%x\n", 
+			     wc[i].status, wc[i].byte_len, 
+			     (void*)wc[i].wr_id, wc[i].src_qp);
+
+		/* validate CM message, version */
+		if (ntohs(msg->ver) != DCM_VER) {
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				 " ucm_recv: UNKNOWN msg %p, ver %d\n", 
+				 msg, msg->ver);
+			ucm_post_rmsg(tp, msg);
+			continue;
+		}
+		if (!(cm = ucm_cm_find(tp, msg))) {
+			dapl_log(DAPL_DBG_TYPE_CM,
+				 " ucm_recv: NO MATCH op %d port %d cqp %x\n", 
+				 ntohs(msg->op), ntohs(msg->dport), 
+				 ntohl(msg->dqpn));
+			if (ntohs(msg->op) == DCM_REQ)
+				ucm_reject(tp, msg);
+			ucm_post_rmsg(tp, msg);
+			continue;
+		}
+		dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: match %p\n",cm);
+
+		/* match, process it */
+		ucm_process_recv(tp, msg, cm);
+		ucm_post_rmsg(tp, msg);
+	}
+	
+	/* finished this batch of WC's, poll and rearm */
+	goto retry;
+	
+}
+
+/* ACTIVE/PASSIVE: build and send CM message out of CM object */
+static int ucm_send(ib_hca_transport_t *tp, ib_cm_msg_t *msg)
+{
+	ib_cm_msg_t *smsg = NULL;
+	struct ibv_send_wr wr, *bad_wr;
+	struct ibv_sge sge;
+	int len, ret = -1;
+	uint16_t dlid = ntohs(msg->daddr.ib.lid);
+
+	/* Get message from send queue, copy data, and send */
+	dapl_os_lock(&tp->slock);
+	if ((smsg = ucm_get_smsg(tp)) == NULL)
+		goto bail;
+
+	len = ((sizeof(*msg) - DCM_MAX_PDATA_SIZE) + ntohs(msg->p_size));
+	dapl_os_memcpy(smsg, msg, len);
+
+	wr.next = NULL;
+        wr.sg_list = &sge;
+        wr.num_sge = 1;
+        wr.opcode = IBV_WR_SEND;
+        wr.wr_id = (unsigned long)tp->s_hd;
+	wr.send_flags = (wr.wr_id % UCM_SND_BURST) ? 0 : IBV_SEND_SIGNALED;
+	if (len <= tp->max_inline_send)
+		wr.send_flags |= IBV_SEND_INLINE; 
+
+        sge.length = len;
+        sge.lkey = tp->mr_sbuf->lkey;
+        sge.addr = (uintptr_t)smsg;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, 
+		" ucm_send: op %d ln %d lid %x c_qpn %x rport %d\n", 
+		ntohs(smsg->op), len, htons(smsg->daddr.ib.lid), 
+		htonl(smsg->dqpn), htons(smsg->dport));
+
+	/* empty slot, then create AH */
+	if (!tp->ah[dlid]) {
+		tp->ah[dlid] = 	
+			dapls_create_ah(tp->hca, tp->pd, tp->qp, 
+					htons(dlid), NULL);
+		if (!tp->ah[dlid])
+			goto bail;
+	}
+		
+	wr.wr.ud.ah = tp->ah[dlid];
+	wr.wr.ud.remote_qpn = ntohl(smsg->dqpn);
+	wr.wr.ud.remote_qkey = DAT_UD_QKEY;
+
+	ret = ibv_post_send(tp->qp, &wr, &bad_wr);
+bail:
+	dapl_os_unlock(&tp->slock);	
+	return ret;
+}
+
+/* ACTIVE/PASSIVE: CM objects */
+dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep)
+{
+	dp_ib_cm_handle_t cm;
+
+	/* Allocate CM, init lock, and initialize */
+	if ((cm = dapl_os_alloc(sizeof(*cm))) == NULL)
+		return NULL;
+
+	(void)dapl_os_memzero(cm, sizeof(*cm));
+	if (dapl_os_lock_init(&cm->lock))
+		goto bail;
+
+	cm->msg.ver = htons(DCM_VER);
+	
+	/* ACTIVE: init source address QP info from local EP */
+	if (ep) {
+		DAPL_HCA *hca = ep->header.owner_ia->hca_ptr;
+
+		cm->msg.sport = htons(ucm_get_port(&hca->ib_trans, 0));
+		if (!cm->msg.sport) 
+			goto bail;
+
+		/* IB info in network order */
+		cm->ep = ep;
+		cm->hca = hca;
+		cm->msg.sqpn = htonl(hca->ib_trans.qp->qp_num); /* ucm */
+		cm->msg.saddr.ib.qpn = htonl(ep->qp_handle->qp_num); /* ep */
+		cm->msg.saddr.ib.qp_type = ep->qp_handle->qp_type;
+		cm->msg.saddr.ib.port_num = hca->port_num;
+                cm->msg.saddr.ib.lid = hca->ib_trans.addr.ib.lid; 
+		cm->msg.saddr.ib.gid = hca->ib_trans.addr.ib.gid; 
+        }
+	return cm;
+bail:
+	dapl_os_free(cm, sizeof(*cm));
+	return NULL;
+}
+
+/* 
+ * UD CR objects are kept active because of direct private data references
+ * from CONN events. The cr->socket is closed and marked inactive but the 
+ * object remains allocated and queued on the CR resource list. There can
+ * be multiple CR's associated with a given EP. There is no way to determine 
+ * when consumer is finished with event until the dat_ep_free.
+ *
+ * Schedule destruction for all CR's associated with this EP, cr_thread will
+ * complete the cleanup with state == DCM_DESTROY. 
+ */ 
+static void ucm_ud_free(DAPL_EP *ep)
+{
+	DAPL_IA *ia = ep->header.owner_ia;
+	DAPL_HCA *hca = NULL;
+	ib_hca_transport_t *tp = &ia->hca_ptr->ib_trans;
+	dp_ib_cm_handle_t cr, next;
+
+	dapl_os_lock(&tp->lock);
+	if (!dapl_llist_is_empty((DAPL_LLIST_HEAD*)&tp->list))
+            next = dapl_llist_peek_head((DAPL_LLIST_HEAD*)&tp->list);
+	else
+	    next = NULL;
+
+	while (next) {
+		cr = next;
+		next = dapl_llist_next_entry((DAPL_LLIST_HEAD*)&tp->list,
+					     (DAPL_LLIST_ENTRY*)&cr->entry);
+		if (cr->ep == ep)  {
+			dapl_dbg_log(DAPL_DBG_TYPE_EP,
+				     " qp_free CR: ep %p cr %p\n", ep, cr);
+			dapl_os_lock(&cr->lock);
+			hca = cr->hca;
+			cr->ep = NULL;
+			cr->state = DCM_DESTROY;
+			dapl_os_unlock(&cr->lock);
+		}
+	}
+	dapl_os_unlock(&tp->lock);
+
+	/* wakeup work thread if necessary */
+	if (hca)
+		send(tp->scm[1], "w", sizeof "w", 0);
+}
+
+/* mark for destroy, remove all references, schedule cleanup */
+/* cm_ptr == NULL (UD), then multi CR's, kill all associated with EP */
+void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep)
+{
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		     " cm_destroy: cm %p ep %p\n", cm, ep);
+	if (!cm && ep) {
+		ucm_ud_free(ep);	/* void function: no value to return */
+		return;
+	}
+	dapl_os_lock(&cm->lock);
+
+	/* client, release local conn id port */
+	if (!cm->sp && cm->msg.sport)
+		ucm_free_port(&cm->hca->ib_trans, ntohs(cm->msg.sport));
+
+	/* cleanup, never made it to work queue */
+	if (cm->state == DCM_INIT) {
+		dapl_os_unlock(&cm->lock);
+		dapl_os_free(cm, sizeof(*cm));
+		return;
+	}
+
+	/* free could be called before disconnect, disc_clean will destroy */
+	if (cm->state == DCM_CONNECTED) {
+		dapl_os_unlock(&cm->lock);
+		dapli_cm_disconnect(cm);
+		return;
+	}
+
+	cm->state = DCM_DESTROY;
+	if ((cm->ep) && (cm->ep->cm_handle == cm)) {
+		cm->ep->cm_handle = IB_INVALID_HANDLE;
+		cm->ep = NULL;
+	}
+
+	dapl_os_unlock(&cm->lock);
+
+	/* wakeup work thread */
+	send(cm->hca->ib_trans.scm[1], "w", sizeof "w", 0);
+}
+
+/* ACTIVE/PASSIVE: queue up connection object on CM list */
+static void ucm_queue_conn(dp_ib_cm_handle_t cm)
+{
+	/* add to work queue, list, for cm thread processing */
+	dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&cm->entry);
+	dapl_os_lock(&cm->hca->ib_trans.lock);
+	dapl_llist_add_tail(&cm->hca->ib_trans.list,
+			    (DAPL_LLIST_ENTRY *)&cm->entry, cm);
+	dapl_os_unlock(&cm->hca->ib_trans.lock);
+}
+
+/* PASSIVE: queue up listen object on listen list */
+static void ucm_queue_listen(dp_ib_cm_handle_t cm)
+{
+	/* add to work queue, llist, for cm thread processing */
+	dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&cm->entry);
+	dapl_os_lock(&cm->hca->ib_trans.llock);
+	dapl_llist_add_tail(&cm->hca->ib_trans.llist,
+			    (DAPL_LLIST_ENTRY *)&cm->entry, cm);
+	dapl_os_unlock(&cm->hca->ib_trans.llock);
+}
+
+static void ucm_dequeue_listen(dp_ib_cm_handle_t cm) {
+	dapl_os_lock(&cm->hca->ib_trans.llock);
+	dapl_llist_remove_entry(&cm->hca->ib_trans.llist, 
+				(DAPL_LLIST_ENTRY *)&cm->entry);
+	dapl_os_unlock(&cm->hca->ib_trans.llock);
+}
+
+/*
+ * ACTIVE/PASSIVE: called from CR thread or consumer via ep_disconnect
+ *                 or from ep_free
+ */
+DAT_RETURN dapli_cm_disconnect(dp_ib_cm_handle_t cm)
+{
+	DAPL_EP *ep = cm->ep;
+
+	if (ep == NULL)
+		return DAT_SUCCESS;
+
+	dapl_os_lock(&cm->lock);
+	if ((cm->state == DCM_INIT) ||
+	    (cm->state == DCM_DISC_PENDING) ||
+	    (cm->state == DCM_DISCONNECTED) ||
+	    (cm->state == DCM_DESTROY)) {
+		dapl_os_unlock(&cm->lock);
+		return DAT_SUCCESS;
+	} else {
+		/* send disc, schedule destroy */
+		cm->msg.op = htons(DCM_DREQ);
+		if (ucm_send(&cm->hca->ib_trans, &cm->msg)) {
+			dapl_log(DAPL_DBG_TYPE_WARN, 
+				 " disc_req: ERR-> %s lid %d qpn %d"
+				 " r_psp %d \n", strerror(errno), 
+				 ntohs(cm->msg.saddr.ib.lid),
+				 ntohl(cm->msg.saddr.ib.qpn),
+				 ntohs(cm->msg.sport));
+		}
+		cm->state = DCM_DISC_PENDING;
+	}
+	dapl_os_unlock(&cm->lock);
+
+	/* disconnect events for RC's only */
+	if (ep->param.ep_attr.service_type == DAT_SERVICE_TYPE_RC) {
+		if (ep->cr_ptr) {
+			dapls_cr_callback(cm,
+					  IB_CME_DISCONNECTED,
+					  NULL,
+					  ((DAPL_CR *)ep->cr_ptr)->sp_ptr);
+		} else {
+			dapl_evd_connection_callback(ep->cm_handle,
+						     IB_CME_DISCONNECTED,
+						     NULL, ep);
+		}
+	}
+
+	/* scheduled destroy via disconnect clean in callback */
+	return DAT_SUCCESS;
+}
+
+/*
+ * ACTIVE: get remote CM SID server info from r_addr. 
+ *         send, or resend CM msg via UD CM QP 
+ */
+DAT_RETURN
+dapli_cm_connect(DAPL_EP *ep, dp_ib_cm_handle_t cm)
+{
+	dapl_log(DAPL_DBG_TYPE_EP, 
+		 " connect: lid %x qpn %x lport %d p_sz=%d -> "
+		 " lid %x c_qpn %x rport %d\n",
+		 ntohs(cm->msg.saddr.ib.lid), ntohl(cm->msg.saddr.ib.qpn),
+		 ntohs(cm->msg.sport), ntohs(cm->msg.p_size),
+		 ntohs(cm->msg.daddr.ib.lid), ntohl(cm->msg.dqpn),
+		 ntohs(cm->msg.dport));
+
+	dapl_os_lock(&cm->lock);
+	if (cm->state == DCM_INIT) 
+		cm->state = DCM_CONN_PENDING;
+	else if (++cm->retries == DCM_RETRY_CNT) {
+		dapl_log(DAPL_DBG_TYPE_WARN, 
+			 " connect: RETRIES EXHAUSTED -> lid %d qpn %d r_psp"
+			 " %d p_sz=%d\n",
+			 ntohs(cm->msg.daddr.ib.lid),
+			 ntohl(cm->msg.dqpn), ntohs(cm->msg.dport),
+			 ntohs(cm->msg.p_size));
+
+		/* update ep->cm reference so we get cleaned up on callback */
+		if (cm->msg.saddr.ib.qp_type == IBV_QPT_RC)
+			ep->cm_handle = cm;
+
+		dapl_os_unlock(&cm->lock);
+		dapl_evd_connection_callback(cm, 
+					     IB_CME_DESTINATION_UNREACHABLE,
+					     NULL, ep);
+
+		return DAT_ERROR(DAT_INVALID_ADDRESS, 
+				 DAT_INVALID_ADDRESS_UNREACHABLE);
+	}
+	dapl_os_unlock(&cm->lock);
+
+	cm->msg.op = htons(DCM_REQ);
+	if (ucm_send(&cm->hca->ib_trans, &cm->msg)) 		
+		goto bail;
+
+	/* first time through, put on work queue */
+	if (!cm->retries)
+		ucm_queue_conn(cm);
+
+	return DAT_SUCCESS;
+
+bail:
+	dapl_log(DAPL_DBG_TYPE_ERR, 
+		 " connect: ERR %s -> cm_lid %d cm_qpn %d r_psp %d p_sz=%d\n",
+		 strerror(errno), ntohs(cm->msg.daddr.ib.lid),
+		 ntohl(cm->msg.dqpn), ntohs(cm->msg.dport),
+		 ntohs(cm->msg.p_size));
+
+	/* close socket, free cm structure */
+	dapls_ib_cm_free(cm, cm->ep);
+	return DAT_INSUFFICIENT_RESOURCES;
+}
+
+/*
+ * ACTIVE: exchange QP information, called from CR thread
+ */
+static void ucm_connect_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg)
+{
+	DAPL_EP *ep = cm->ep;
+	ib_cm_events_t event = IB_CME_LOCAL_FAILURE; /* until reply is validated */
+
+	dapl_os_lock(&cm->lock);
+	if (cm->state != DCM_CONN_PENDING) {
+		dapl_log(DAPL_DBG_TYPE_WARN, 
+			 " CONN_RTU: UNEXPECTED state:"
+			 " op %d, st %s <- lid %d sqpn %d sport %d\n", 
+			 ntohs(msg->op), dapl_cm_state_str(cm->state), 
+			 ntohs(msg->saddr.ib.lid), ntohl(msg->saddr.ib.qpn), 
+			 ntohs(msg->sport));
+		dapl_os_unlock(&cm->lock);
+		return;
+	}
+	dapl_os_unlock(&cm->lock);
+
+	/* save remote address information to EP and CM */
+	dapl_os_memcpy(&ep->remote_ia_address,
+		       &msg->saddr, sizeof(union dcm_addr));
+	dapl_os_memcpy(&cm->msg.daddr, 
+		       &msg->saddr, sizeof(union dcm_addr));
+
+	/* validate private data size, and copy if necessary */
+	if (msg->p_size) {
+		if (ntohs(msg->p_size) > DCM_MAX_PDATA_SIZE) {
+			dapl_log(DAPL_DBG_TYPE_WARN, 
+				 " CONN_RTU: invalid p_size %d:"
+				 " st %s <- lid %d sqpn %d spsp %d\n", 
+				 ntohs(msg->p_size), 
+				 dapl_cm_state_str(cm->state), 
+				 ntohs(msg->saddr.ib.lid), 
+				 ntohl(msg->saddr.ib.qpn), 
+				 ntohs(msg->sport));
+			goto bail;
+		}
+		dapl_os_memcpy(cm->msg.p_data, 
+			       msg->p_data, ntohs(msg->p_size));
+	}
+		
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		     " CONN_RTU: DST port=%d lid=%x,"
+		     " iqp=%x, qp_type=%d, port=%d psize=%d\n",
+		     cm->msg.daddr.ib.port_num, ntohs(cm->msg.daddr.ib.lid),
+		     ntohl(cm->msg.daddr.ib.qpn), cm->msg.daddr.ib.qp_type,
+		     ntohs(msg->sport), ntohs(msg->p_size));
+
+	if (ntohs(msg->op) == DCM_REP)
+		event = IB_CME_CONNECTED;
+	else if (ntohs(msg->op) == DCM_REJ_USER) 
+		event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA;
+	else  
+		event = IB_CME_DESTINATION_REJECT;
+	
+	if (event != IB_CME_CONNECTED) {
+		dapl_log(DAPL_DBG_TYPE_CM,
+			 " CONN_RTU: REJ op=%d <- lid %x, iqp %x, psp %d\n",
+			 ntohs(msg->op), ntohs(msg->saddr.ib.lid), 
+			 ntohl(msg->saddr.ib.qpn), ntohs(msg->sport));
+#ifdef DAT_EXTENSIONS
+		if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD) 
+			goto ud_bail;
+		else
+#endif
+		goto bail;
+	}
+
+	/* modify QP to RTR and then to RTS with remote info */
+	dapl_os_lock(&cm->ep->header.lock);
+	if (dapls_modify_qp_state(cm->ep->qp_handle,
+				  IBV_QPS_RTR, 
+				  cm->msg.daddr.ib.qpn,
+				  cm->msg.daddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " CONN_RTU: QPS_RTR ERR %s <- lid %x iqp %x\n",
+			 strerror(errno), ntohs(cm->msg.daddr.ib.lid),
+			 ntohl(cm->msg.daddr.ib.qpn));
+		dapl_os_unlock(&cm->ep->header.lock);
+		event = IB_CME_LOCAL_FAILURE;
+		goto bail;
+	}
+	if (dapls_modify_qp_state(cm->ep->qp_handle,
+				  IBV_QPS_RTS, 
+				  cm->msg.daddr.ib.qpn,
+				  cm->msg.daddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " CONN_RTU: QPS_RTS ERR %s <- lid %x iqp %x\n",
+			 strerror(errno), ntohs(cm->msg.daddr.ib.lid),
+			 ntohl(cm->msg.daddr.ib.qpn));
+		dapl_os_unlock(&cm->ep->header.lock);
+		event = IB_CME_LOCAL_FAILURE;
+		goto bail;
+	}
+	dapl_os_unlock(&cm->ep->header.lock);
+	/* Send RTU; report a send failure as a local failure */
+	cm->msg.op = htons(DCM_RTU);
+	if (ucm_send(&cm->hca->ib_trans, &cm->msg)) {
+		event = IB_CME_LOCAL_FAILURE;
+		goto bail;
+	}
+
+	/* init cm_handle and post the event with private data */
+	cm->state = DCM_CONNECTED;
+	dapl_dbg_log(DAPL_DBG_TYPE_EP, " ACTIVE: connected!\n");
+
+#ifdef DAT_EXTENSIONS
+ud_bail:
+	if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD) {
+		DAT_IB_EXTENSION_EVENT_DATA xevent;
+		uint16_t lid = ntohs(cm->msg.daddr.ib.lid);
+		
+		/* post EVENT, modify_qp, AH already created, ucm msg */
+		xevent.status = 0;
+		xevent.type = DAT_IB_UD_REMOTE_AH;
+		xevent.remote_ah.ah = cm->hca->ib_trans.ah[lid];
+		xevent.remote_ah.qpn = cm->msg.daddr.ib.qpn;
+		dapl_os_memcpy(&xevent.remote_ah.ia_addr,
+			       &cm->msg.daddr,
+			       sizeof(union dcm_addr));
+
+		if (event == IB_CME_CONNECTED)
+			event = DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED;
+		else
+			event = DAT_IB_UD_CONNECTION_REJECT_EVENT;
+
+		dapls_evd_post_connection_event_ext(
+				(DAPL_EVD *)cm->ep->param.connect_evd_handle,
+				event,
+				(DAT_EP_HANDLE)ep,
+				(DAT_COUNT)ntohs(msg->p_size),
+				(DAT_PVOID *)cm->msg.p_data,
+				(DAT_PVOID *)&xevent);
+
+		/* we are done, don't destroy cm_ptr, need pdata */
+		cm->state = DCM_RELEASED;
+	} else
+#endif
+	{
+		cm->ep->cm_handle = cm; /* only RC, multi CR's on UD */
+		dapl_evd_connection_callback(cm,
+					     IB_CME_CONNECTED,
+					     cm->msg.p_data, cm->ep);
+	}
+	return;
+
+bail:
+	if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD) 
+		dapls_ib_reinit_ep(cm->ep); /* reset QP state */
+	dapl_evd_connection_callback(NULL, event, cm->msg.p_data, cm->ep);
+}
+
+/*
+ * PASSIVE: Accept on listen CM PSP.
+ *          create new CM object for this CR, 
+ *	    receive peer QP information, private data, 
+ *	    and post cr_event 
+ */
+static void ucm_accept(ib_cm_srvc_handle_t cm, ib_cm_msg_t *msg)
+{
+	dp_ib_cm_handle_t acm;
+
+	/* Allocate accept CM and setup passive references */
+	if ((acm = dapls_ib_cm_create(NULL)) == NULL) {
+		dapl_log(DAPL_DBG_TYPE_WARN, " accept: ERR cm_create\n");
+		return;
+	}
+
+	/* dest CM info from CR msg, source CM info from listen */
+	acm->sp = cm->sp;
+	acm->hca = cm->hca;
+	acm->state = DCM_ACCEPTING;
+	acm->msg.dport = msg->sport;
+	acm->msg.dqpn = msg->sqpn;
+	acm->msg.sport = cm->msg.sport; 
+	acm->msg.sqpn = cm->msg.sqpn;
+	acm->msg.p_size = msg->p_size;
+
+	/* CR saddr is CM daddr info, need EP for local saddr */
+	dapl_os_memcpy(&acm->msg.daddr, &msg->saddr, sizeof(union dcm_addr));
+	
+	dapl_log(DAPL_DBG_TYPE_CM,
+		 " accept: DST port=%d lid=%x, iqp=%x, psize=%d\n",
+		 ntohs(acm->msg.dport), ntohs(acm->msg.daddr.ib.lid), 
+		 ntohl(acm->msg.daddr.ib.qpn), ntohs(acm->msg.p_size));
+
+	/* validate private data size before reading */
+	if (ntohs(msg->p_size) > DCM_MAX_PDATA_SIZE) {
+		dapl_log(DAPL_DBG_TYPE_WARN, " accept: psize (%d) wrong\n",
+			 ntohs(msg->p_size));
+		goto bail;
+	}
+
+	/* read private data into cm_handle if any present */
+	if (msg->p_size) 
+		dapl_os_memcpy(acm->msg.p_data, 
+			       msg->p_data, ntohs(msg->p_size));
+		
+	acm->state = DCM_ACCEPTING_DATA;
+	ucm_queue_conn(acm);
+
+#ifdef DAT_EXTENSIONS
+	if (acm->msg.daddr.ib.qp_type == IBV_QPT_UD) {
+		DAT_IB_EXTENSION_EVENT_DATA xevent;
+
+		/* post EVENT, modify_qp created ah */
+		xevent.status = 0;
+		xevent.type = DAT_IB_UD_CONNECT_REQUEST;
+
+		dapls_evd_post_cr_event_ext(acm->sp,
+					    DAT_IB_UD_CONNECTION_REQUEST_EVENT,
+					    acm,
+					    (DAT_COUNT)ntohs(acm->msg.p_size),
+					    (DAT_PVOID *)acm->msg.p_data,
+					    (DAT_PVOID *)&xevent);
+	} else
+#endif
+		/* trigger CR event and return SUCCESS */
+		dapls_cr_callback(acm,
+				  IB_CME_CONNECTION_REQUEST_PENDING,
+				  acm->msg.p_data, acm->sp);
+	return;
+
+bail:
+	/* free cm object */
+	dapls_ib_cm_free(acm, NULL);
+	return;
+}
+
+/*
+ * PASSIVE: read RTU from active peer, post CONN event
+ */
+static void ucm_accept_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg)
+{
+	dapl_os_lock(&cm->lock);
+	if ((ntohs(msg->op) != DCM_RTU) || (cm->state != DCM_ACCEPTED)) {
+		dapl_log(DAPL_DBG_TYPE_WARN, 
+			 " accept_rtu: UNEXPECTED op, state:"
+			 " op %d, st %s <- lid %x iqp %x sport %d\n", 
+			 ntohs(msg->op), dapl_cm_state_str(cm->state), 
+			 ntohs(msg->saddr.ib.lid), ntohl(msg->saddr.ib.qpn), 
+			 ntohs(msg->sport));
+		dapl_os_unlock(&cm->lock);
+		goto bail;
+	}
+	cm->state = DCM_CONNECTED;
+	dapl_os_unlock(&cm->lock);
+	
+	if (msg->p_size && ntohs(msg->p_size) <= DCM_MAX_PDATA_SIZE)
+		dapl_os_memcpy(cm->msg.p_data, 
+			       msg->p_data, ntohs(msg->p_size));
+
+	/* final data exchange if remote QP state is good to go */
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " PASSIVE: connected!\n");
+
+#ifdef DAT_EXTENSIONS
+	if (cm->msg.saddr.ib.qp_type == IBV_QPT_UD) {
+		DAT_IB_EXTENSION_EVENT_DATA xevent;
+		uint16_t lid = ntohs(cm->msg.daddr.ib.lid);
+		
+		/* post EVENT, modify_qp, AH already created, ucm msg */
+		xevent.status = 0;
+		xevent.type = DAT_IB_UD_PASSIVE_REMOTE_AH;
+		xevent.remote_ah.ah = cm->hca->ib_trans.ah[lid];
+		xevent.remote_ah.qpn = cm->msg.daddr.ib.qpn;
+		dapl_os_memcpy(&xevent.remote_ah.ia_addr,
+			       &cm->msg.daddr,
+			       sizeof(cm->msg.daddr));
+
+		dapls_evd_post_connection_event_ext(
+				(DAPL_EVD *)cm->ep->param.connect_evd_handle,
+				DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED,
+				(DAT_EP_HANDLE)cm->ep,
+				(DAT_COUNT)ntohs(msg->p_size),
+				(DAT_PVOID *)cm->msg.p_data,
+				(DAT_PVOID *)&xevent);
+		/* done with CM object, don't destroy cm, need pdata */
+		cm->state = DCM_RELEASED;
+	} else
+#endif
+	{
+		cm->ep->cm_handle = cm; /* only RC, multi CR's on UD */
+		dapls_cr_callback(cm, IB_CME_CONNECTED, NULL, cm->sp);
+	}
+	return;
+bail:
+	if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD) 
+		dapls_ib_reinit_ep(cm->ep);	/* reset QP state */
+	dapls_ib_cm_free(cm, cm->ep);
+	dapls_cr_callback(cm, IB_CME_LOCAL_FAILURE, NULL, cm->sp);
+}
+
+/*
+ * PASSIVE: consumer accept, send local QP information, private data, 
+ * queue on work thread to receive RTU information to avoid blocking
+ * user thread. 
+ */
+DAT_RETURN
+dapli_accept_usr(DAPL_EP *ep, DAPL_CR *cr, DAT_COUNT p_size, DAT_PVOID p_data)
+{
+	DAPL_IA *ia = ep->header.owner_ia;
+	dp_ib_cm_handle_t cm = cr->ib_cm_handle;
+
+	if (p_size > DCM_MAX_PDATA_SIZE)
+		return DAT_LENGTH_ERROR;
+
+	dapl_os_lock(&cm->lock);
+	if (cm->state != DCM_ACCEPTING_DATA) {
+		dapl_os_unlock(&cm->lock);
+		return DAT_INVALID_STATE;
+	}
+	dapl_os_unlock(&cm->lock);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		     " ACCEPT_USR: remote port_num=%d lid=%x"
+		     " iqp=%x qp_type %d, psize=%d\n",
+		     cm->msg.daddr.ib.port_num, ntohs(cm->msg.daddr.ib.lid),
+		     ntohl(cm->msg.daddr.ib.qpn), cm->msg.daddr.ib.qp_type,
+		     ntohs(cm->msg.p_size));
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		     " ACCEPT_USR: remote GID subnet %016llx id %016llx\n",
+		     (unsigned long long)
+		     htonll(cm->msg.daddr.ib.gid.global.subnet_prefix),
+		     (unsigned long long)
+		     htonll(cm->msg.daddr.ib.gid.global.interface_id));
+
+#ifdef DAT_EXTENSIONS
+	if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD &&
+	    ep->qp_handle->qp_type != IBV_QPT_UD) {
+		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
+			     " ACCEPT_USR: ERR remote QP is UD,"
+			     " but local QP is not\n");
+		return (DAT_INVALID_HANDLE | DAT_INVALID_HANDLE_EP);
+	}
+#endif
+
+	/* modify QP to RTR and then to RTS with remote info already read */
+	dapl_os_lock(&ep->header.lock);
+	if (dapls_modify_qp_state(ep->qp_handle,
+				  IBV_QPS_RTR, 
+				  cm->msg.daddr.ib.qpn,
+				  cm->msg.daddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " ACCEPT_USR: QPS_RTR ERR %s -> lid %x qpn %x\n",
+			 strerror(errno), ntohs(cm->msg.daddr.ib.lid),
+			 ntohl(cm->msg.daddr.ib.qpn));
+		dapl_os_unlock(&ep->header.lock);
+		goto bail;
+	}
+	if (dapls_modify_qp_state(ep->qp_handle,
+				  IBV_QPS_RTS, 
+				  cm->msg.daddr.ib.qpn,
+				  cm->msg.daddr.ib.lid,
+				  NULL) != DAT_SUCCESS) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " ACCEPT_USR: QPS_RTS ERR %s -> lid %x qpn %x\n",
+			 strerror(errno), ntohs(cm->msg.daddr.ib.lid),
+			 ntohl(cm->msg.daddr.ib.qpn));
+		dapl_os_unlock(&ep->header.lock);
+		goto bail;
+	}
+	dapl_os_unlock(&ep->header.lock);
+
+	/* save remote address information */
+	dapl_os_memcpy(&ep->remote_ia_address,
+		       &cm->msg.daddr, sizeof(union dcm_addr));
+
+	/* setup local QP info and type from EP, copy pdata, for reply */
+	cm->msg.op = htons(DCM_REP);
+	cm->msg.saddr.ib.qpn = htonl(ep->qp_handle->qp_num);
+	cm->msg.saddr.ib.qp_type = ep->qp_handle->qp_type;
+	cm->msg.saddr.ib.port_num = cm->hca->port_num;
+	cm->msg.saddr.ib.lid = cm->hca->ib_trans.addr.ib.lid; 
+	cm->msg.saddr.ib.gid = cm->hca->ib_trans.addr.ib.gid; 
+	dapl_os_memcpy(&cm->msg.p_data, p_data, p_size);
+	cm->msg.p_size = htons(p_size);	/* reply carries its own pdata size */
+	if (ucm_send(&cm->hca->ib_trans, &cm->msg)) 		
+		goto bail;
+
+	/* save state and setup valid reference to EP, HCA */
+	dapl_os_lock(&cm->lock);
+	cm->ep = ep;
+	cm->hca = ia->hca_ptr;
+	cm->state = DCM_ACCEPTED;
+	dapl_os_unlock(&cm->lock);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM, " PASSIVE: accepted!\n");
+	return DAT_SUCCESS;
+
+bail:
+	if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD)
+		dapls_ib_reinit_ep(ep);
+	dapls_ib_cm_free(cm, ep);
+	return DAT_INTERNAL_ERROR;
+}
+
+
+/*
+ * dapls_ib_connect
+ *
+ * Initiate a connection with the passive listener on another node
+ *
+ * Input:
+ *	ep_handle		endpoint to connect
+ *	r_addr			remote IA address
+ *	r_psp			remote connection qualifier
+ *	p_size			size of private data
+ *	p_data			pointer to private data
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INSUFFICIENT_RESOURCES
+ *	DAT_INVALID_PARAMETER
+ *
+ */
+DAT_RETURN
+dapls_ib_connect(IN DAT_EP_HANDLE ep_handle,
+		 IN DAT_IA_ADDRESS_PTR r_addr,
+		 IN DAT_CONN_QUAL r_psp,
+		 IN DAT_COUNT p_size, IN void *p_data)
+{
+	DAPL_EP *ep = (DAPL_EP *)ep_handle;
+	dp_ib_cm_handle_t cm;
+	
+	/* create CM object, initialize SRC info from EP */
+	cm = dapls_ib_cm_create(ep);
+	if (cm == NULL)
+		return DAT_INSUFFICIENT_RESOURCES;
+
+	/* remote hca and port: lid, gid, port_num, network order */
+	dapl_os_memcpy(&cm->msg.daddr, r_addr, sizeof(union dcm_addr));
+
+	/* remote uCM information, comes from consumer provider r_addr */
+	cm->msg.dport = htons((uint16_t)r_psp);
+	cm->msg.dqpn = cm->msg.daddr.ib.qpn;
+	
+	if (p_size) {
+		cm->msg.p_size = htons(p_size);
+		dapl_os_memcpy(&cm->msg.p_data, p_data, p_size);
+	}
+
+	/* build connect request, send to remote CM based on r_addr info */
+	return(dapli_cm_connect(ep, cm));
+}
+
+/*
+ * dapls_ib_disconnect
+ *
+ * Disconnect an EP
+ *
+ * Input:
+ *	ep_handle,
+ *	disconnect_flags
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ */
+DAT_RETURN
+dapls_ib_disconnect(IN DAPL_EP *ep, IN DAT_CLOSE_FLAGS close_flags)
+{
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     "dapls_ib_disconnect(ep_handle %p ....)\n", ep);
+
+	/* reinit to modify QP state, if not UD */
+	if (ep->qp_handle->qp_type != IBV_QPT_UD)
+		dapls_ib_reinit_ep(ep);
+
+	if (ep->cm_handle == NULL ||
+	    ep->param.ep_state == DAT_EP_STATE_DISCONNECTED)
+		return DAT_SUCCESS;
+	else
+		return (dapli_cm_disconnect(ep->cm_handle));
+}
+
+/*
+ * dapls_ib_disconnect_clean
+ *
+ * Clean up outstanding connection data. This routine is invoked
+ * after the final disconnect callback has occurred. Only on the
+ * ACTIVE side of a connection. It is also called if dat_ep_connect
+ * times out using the consumer supplied timeout value.
+ *
+ * Input:
+ *	ep_ptr		DAPL_EP
+ *	active		Indicates active side of connection
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	void
+ *
+ */
+void
+dapls_ib_disconnect_clean(IN DAPL_EP *ep,
+			  IN DAT_BOOLEAN active,
+			  IN const ib_cm_events_t ib_cm_event)
+{
+	/* NOTE: uCM will only initialize cm_handle with RC type
+	 *
+	 * For UD there can be many in-flight CR's so you
+	 * cannot cleanup timed out CR's with EP reference
+	 * alone since they share the same EP. The common
+	 * code that handles connection timeout logic needs
+	 * to be updated for UD support.
+	 */
+	if (ep->cm_handle)
+		dapls_ib_cm_free(ep->cm_handle, ep);
+
+	return;
+}
+
+/*
+ * dapls_ib_setup_conn_listener
+ *
+ * Have the CM set up a connection listener.
+ *
+ * Input:
+ *	ia			IA handle
+ *	sid, sp			service ID (conn qual) and service point
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INSUFFICIENT_RESOURCES
+ *	DAT_INTERNAL_ERROR
+ *	DAT_CONN_QUAL_UNAVAILABLE
+ *	DAT_CONN_QUAL_IN_USE
+ *
+ */
+DAT_RETURN
+dapls_ib_setup_conn_listener(IN DAPL_IA *ia, 
+			     IN DAT_UINT64 sid, 
+			     IN DAPL_SP *sp)
+{
+	ib_cm_srvc_handle_t cm = NULL;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " listen(ia %p ServiceID 0x%llx sp %p)\n",
+		     ia, (unsigned long long)sid, sp);
+
+	/* reserve local port, then allocate CM object */
+	if (!ucm_get_port(&ia->hca_ptr->ib_trans, (uint16_t)sid)) {
+		dapl_dbg_log(DAPL_DBG_TYPE_CM,
+			     " listen: ERROR %s on conn_qual 0x%llx\n",
+			     strerror(errno), (unsigned long long)sid);
+		return DAT_CONN_QUAL_IN_USE;
+	}
+
+	/* cm_create will setup saddr for listen server */
+	if ((cm = dapls_ib_cm_create(NULL)) == NULL)
+		return DAT_INSUFFICIENT_RESOURCES;
+
+	/* LISTEN: init SRC address and QP info to local CM server info */
+	cm->sp = sp;
+	cm->hca = ia->hca_ptr;
+	cm->msg.sport = htons((uint16_t)sid);
+	cm->msg.sqpn = htonl(ia->hca_ptr->ib_trans.qp->qp_num);
+	cm->msg.saddr.ib.qp_type = IBV_QPT_UD;
+	cm->msg.saddr.ib.port_num = ia->hca_ptr->port_num;
+	cm->msg.saddr.ib.lid = ia->hca_ptr->ib_trans.addr.ib.lid;
+	cm->msg.saddr.ib.gid = ia->hca_ptr->ib_trans.addr.ib.gid;
+	
+	/* save cm_handle reference in service point */
+	sp->cm_srvc_handle = cm;
+
+	/* queue up listen socket to process inbound CR's */
+	cm->state = DCM_LISTEN;
+	ucm_queue_listen(cm);
+
+	return DAT_SUCCESS;
+}
+
+
+/*
+ * dapls_ib_remove_conn_listener
+ *
+ * Have the CM remove a connection listener.
+ *
+ * Input:
+ *	ia			IA handle
+ *	sp			service point holding the listen CM handle
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INVALID_STATE
+ *
+ */
+DAT_RETURN
+dapls_ib_remove_conn_listener(IN DAPL_IA *ia, IN DAPL_SP *sp)
+{
+	ib_cm_srvc_handle_t cm = sp->cm_srvc_handle;
+	ib_hca_transport_t *tp = &ia->hca_ptr->ib_trans;
+
+	/* free cm_srvc_handle and port, and mark CM for cleanup */
+	if (cm) {
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " remove_listener(ia %p sp %p cm %p psp=%d)\n",
+		     ia, sp, cm, ntohs(cm->msg.sport));
+
+		sp->cm_srvc_handle = NULL;
+		dapl_os_lock(&cm->lock);
+		ucm_free_port(tp, ntohs(cm->msg.sport));
+		cm->msg.sport = 0;
+		cm->state = DCM_DESTROY;
+		dapl_os_unlock(&cm->lock);
+		ucm_dequeue_listen(cm);
+		dapl_os_free(cm, sizeof(*cm));
+	}
+	return DAT_SUCCESS;
+}
+
+/*
+ * dapls_ib_accept_connection
+ *
+ * Perform necessary steps to accept a connection
+ *
+ * Input:
+ *	cr_handle
+ *	ep_handle
+ *	private_data_size
+ *	private_data
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INSUFFICIENT_RESOURCES
+ *	DAT_INTERNAL_ERROR
+ *
+ */
+DAT_RETURN
+dapls_ib_accept_connection(IN DAT_CR_HANDLE cr_handle,
+			   IN DAT_EP_HANDLE ep_handle,
+			   IN DAT_COUNT p_size, 
+			   IN const DAT_PVOID p_data)
+{
+	DAPL_CR *cr = (DAPL_CR *)cr_handle;
+	DAPL_EP *ep = (DAPL_EP *)ep_handle;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " accept_connection(cr %p ep %p prd %p,%d)\n",
+		     cr, ep, p_data, p_size);
+
+	/* allocate and attach a QP if necessary */
+	if (ep->qp_state == DAPL_QP_STATE_UNATTACHED) {
+		DAT_RETURN status;
+		status = dapls_ib_qp_alloc(ep->header.owner_ia,
+					   ep, ep);
+		if (status != DAT_SUCCESS)
+			return status;
+	}
+	return (dapli_accept_usr(ep, cr, p_size, p_data));
+}
+
+/*
+ * dapls_ib_reject_connection
+ *
+ * Reject a connection
+ *
+ * Input:
+ *	cr_handle
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INTERNAL_ERROR
+ *
+ */
+DAT_RETURN
+dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm,
+			   IN int reason,
+			   IN DAT_COUNT psize, IN const DAT_PVOID pdata)
+{
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " reject(cm %p reason %x, pdata %p, psize %d)\n",
+		     cm, reason, pdata, psize);
+
+	if (psize > DCM_MAX_PDATA_SIZE)
+		return DAT_LENGTH_ERROR;
+
+	cm->msg.op = htons(DCM_REJ_USER);
+	cm->msg.p_size = htons(psize);	/* reject carries its own pdata size */
+	if (psize)
+		dapl_os_memcpy(&cm->msg.p_data, pdata, psize);
+	if (ucm_send(&cm->hca->ib_trans, &cm->msg)) {
+		dapl_log(DAPL_DBG_TYPE_WARN,
+			 " cm_reject: ERR: %s\n", strerror(errno));
+		return DAT_INTERNAL_ERROR;
+	}
+		
+	/* cr_thread will destroy CR */
+	cm->state = DCM_REJECTING;
+	send(cm->hca->ib_trans.scm[1], "w", sizeof "w", 0);
+	return DAT_SUCCESS;
+}
+
+/*
+ * dapls_ib_cm_remote_addr
+ *
+ * Obtain the remote IP address given a connection
+ *
+ * Input:
+ *	cr_handle
+ *
+ * Output:
+ *	remote_ia_address: where to place the remote address
+ *
+ * Returns:
+ * 	DAT_SUCCESS
+ *	DAT_INVALID_HANDLE
+ *
+ */
+DAT_RETURN
+dapls_ib_cm_remote_addr(IN DAT_HANDLE dat_handle,
+			OUT DAT_SOCK_ADDR6 * remote_ia_address)
+{
+	DAPL_HEADER *header;
+	dp_ib_cm_handle_t ib_cm_handle;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     "dapls_ib_cm_remote_addr(dat_handle %p, ....)\n",
+		     dat_handle);
+
+	header = (DAPL_HEADER *) dat_handle;
+
+	if (header->magic == DAPL_MAGIC_EP)
+		ib_cm_handle = ((DAPL_EP *) dat_handle)->cm_handle;
+	else if (header->magic == DAPL_MAGIC_CR)
+		ib_cm_handle = ((DAPL_CR *) dat_handle)->ib_cm_handle;
+	else
+		return DAT_INVALID_HANDLE;
+
+	dapl_os_memcpy(remote_ia_address,
+		       &ib_cm_handle->msg.daddr, sizeof(DAT_SOCK_ADDR6));
+
+	return DAT_SUCCESS;
+}
+
+/*
+ * dapls_ib_private_data_size
+ *
+ * Return the size of private data given a connection op type
+ *
+ * Input:
+ *	prd_ptr		private data pointer
+ *	conn_op		connection operation type
+ *
+ * If prd_ptr is NULL, this is a query for the max size supported by
+ * the provider, otherwise it is the actual size of the private data
+ * contained in prd_ptr.
+ *
+ *
+ * Output:
+ *	None
+ *
+ * Returns:
+ * 	length of private data
+ *
+ */
+int dapls_ib_private_data_size(IN DAPL_PRIVATE * prd_ptr,
+			       IN DAPL_PDATA_OP conn_op, IN DAPL_HCA * hca_ptr)
+{
+	int size;
+
+	switch (conn_op) {
+	case DAPL_PDATA_CONN_REQ:
+	case DAPL_PDATA_CONN_REP:
+	case DAPL_PDATA_CONN_REJ:
+	case DAPL_PDATA_CONN_DREQ:
+	case DAPL_PDATA_CONN_DREP:
+		size = DCM_MAX_PDATA_SIZE;
+		break;
+	default:
+		size = 0;
+	}			
+	return size;
+}
+
+/*
+ * Map all socket CM event codes to the DAT equivalent.
+ */
+#define DAPL_IB_EVENT_CNT	10
+
+static struct ib_cm_event_map {
+	const ib_cm_events_t ib_cm_event;
+	DAT_EVENT_NUMBER dat_event_num;
+} ib_cm_event_map[DAPL_IB_EVENT_CNT] = {
+/* 00 */ {IB_CME_CONNECTED, 
+	  DAT_CONNECTION_EVENT_ESTABLISHED},
+/* 01 */ {IB_CME_DISCONNECTED, 
+	  DAT_CONNECTION_EVENT_DISCONNECTED},
+/* 02 */ {IB_CME_DISCONNECTED_ON_LINK_DOWN,
+	  DAT_CONNECTION_EVENT_DISCONNECTED},
+/* 03 */ {IB_CME_CONNECTION_REQUEST_PENDING, 
+	  DAT_CONNECTION_REQUEST_EVENT},
+/* 04 */ {IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA,
+	  DAT_CONNECTION_REQUEST_EVENT},
+/* 05 */ {IB_CME_DESTINATION_REJECT,
+	  DAT_CONNECTION_EVENT_NON_PEER_REJECTED},
+/* 06 */ {IB_CME_DESTINATION_REJECT_PRIVATE_DATA,
+	  DAT_CONNECTION_EVENT_PEER_REJECTED},
+/* 07 */ {IB_CME_DESTINATION_UNREACHABLE, 
+	  DAT_CONNECTION_EVENT_UNREACHABLE},
+/* 08 */ {IB_CME_TOO_MANY_CONNECTION_REQUESTS,
+	  DAT_CONNECTION_EVENT_NON_PEER_REJECTED},
+/* 09 */ {IB_CME_LOCAL_FAILURE, 
+	  DAT_CONNECTION_EVENT_BROKEN}
+};
+
+/*
+ * dapls_ib_get_dat_event
+ *
+ * Return a DAT connection event given a provider CM event.
+ *
+ * Input:
+ *	ib_cm_event	event provided to the dapl callback routine
+ *	active		switch indicating active or passive connection
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	DAT_EVENT_NUMBER of translated provider value
+ */
+DAT_EVENT_NUMBER
+dapls_ib_get_dat_event(IN const ib_cm_events_t ib_cm_event,
+		       IN DAT_BOOLEAN active)
+{
+	DAT_EVENT_NUMBER dat_event_num;
+	int i;
+
+	if (ib_cm_event > IB_CME_LOCAL_FAILURE)
+		return (DAT_EVENT_NUMBER) 0;
+
+	dat_event_num = 0;
+	for (i = 0; i < DAPL_IB_EVENT_CNT; i++) {
+		if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) {
+			dat_event_num = ib_cm_event_map[i].dat_event_num;
+			break;
+		}
+	}
+	dapl_dbg_log(DAPL_DBG_TYPE_CALLBACK,
+		     "dapls_ib_get_dat_event: event translate(%s) ib=0x%x dat=0x%x\n",
+		     active ? "active" : "passive", ib_cm_event, dat_event_num);
+
+	return dat_event_num;
+}
+
+/*
+ * dapls_ib_get_cm_event
+ *
+ * Return a provider CM event given a DAT connection event.
+ *
+ * Input:
+ *	dat_event_num	DAT event we need an equivalent CM event for
+ *
+ * Output:
+ * 	none
+ *
+ * Returns:
+ * 	ib_cm_event of translated DAPL value
+ *
+ */
+ib_cm_events_t dapls_ib_get_cm_event(IN DAT_EVENT_NUMBER dat_event_num)
+{
+	ib_cm_events_t ib_cm_event;
+	int i;
+
+	ib_cm_event = 0;
+	for (i = 0; i < DAPL_IB_EVENT_CNT; i++) {
+		if (dat_event_num == ib_cm_event_map[i].dat_event_num) {
+			ib_cm_event = ib_cm_event_map[i].ib_cm_event;
+			break;
+		}
+	}
+	return ib_cm_event;
+}
+
+/* work thread for uAT, uCM, CQ, and async events */
+void cm_thread(void *arg)
+{
+	struct dapl_hca *hca = arg;
+	dp_ib_cm_handle_t cm, next;
+	struct dapl_fd_set *set;
+	char rbuf[2];
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: ENTER hca %p\n", hca);
+	set = dapl_alloc_fd_set();
+	if (!set)
+		goto out;
+
+	dapl_os_lock(&hca->ib_trans.lock);
+	hca->ib_trans.cm_state = IB_THREAD_RUN;
+
+	while (1) {
+		dapl_fd_zero(set);
+		dapl_fd_set(hca->ib_trans.scm[0], set, DAPL_FD_READ);	
+		dapl_fd_set(hca->ib_hca_handle->async_fd, set, DAPL_FD_READ);
+		dapl_fd_set(hca->ib_trans.rch->fd, set, DAPL_FD_READ);
+
+		if (!dapl_llist_is_empty(&hca->ib_trans.list))
+			next = dapl_llist_peek_head(&hca->ib_trans.list);
+		else
+			next = NULL;
+
+		while (next) {
+			cm = next;
+			next = dapl_llist_next_entry(
+					&hca->ib_trans.list,
+					(DAPL_LLIST_ENTRY *)&cm->entry);
+
+			if (cm->state == DCM_DESTROY || 
+			    hca->ib_trans.cm_state != IB_THREAD_RUN) {
+				dapl_llist_remove_entry(
+					&hca->ib_trans.list,
+					(DAPL_LLIST_ENTRY *)&cm->entry);
+				dapl_os_free(cm, sizeof(*cm));
+				continue;
+			}
+		
+			/* TODO: Check and process retries here */
+
+			continue;
+		}
+
+		/* set to exit and all resources destroyed */
+		if ((hca->ib_trans.cm_state != IB_THREAD_RUN) &&
+		    (dapl_llist_is_empty(&hca->ib_trans.list)))
+			break;
+
+		dapl_os_unlock(&hca->ib_trans.lock);
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: select sleep\n");
+		dapl_select(set);
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: select wake\n");
+
+		/* Process events: CM, ASYNC, NOTIFY THREAD */
+		if (dapl_poll(hca->ib_trans.rch->fd, 
+			      DAPL_FD_READ) == DAPL_FD_READ) {
+			ucm_recv(&hca->ib_trans);
+		}
+		if (dapl_poll(hca->ib_hca_handle->async_fd, 
+			      DAPL_FD_READ) == DAPL_FD_READ) {
+			ucm_async_event(hca);
+		}
+		while (dapl_poll(hca->ib_trans.scm[0], 
+				 DAPL_FD_READ) == DAPL_FD_READ) {
+			recv(hca->ib_trans.scm[0], rbuf, 2, 0);
+		}
+
+		dapl_os_lock(&hca->ib_trans.lock);
+		
+		/* set to exit and all resources destroyed */
+		if ((hca->ib_trans.cm_state != IB_THREAD_RUN) &&
+		    (dapl_llist_is_empty(&hca->ib_trans.list)))
+			break;
+	}
+
+	dapl_os_unlock(&hca->ib_trans.lock);
+	free(set);
+out:
+	hca->ib_trans.cm_state = IB_THREAD_EXIT;
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread(hca %p) exit\n", hca);
+}
+
+
+#ifdef DAPL_COUNTERS
+/* Debug aid: List all Connections in process and state */
+void dapls_print_cm_list(IN DAPL_IA *ia_ptr)
+{
+	/* Print in process CR's for this IA, if debug type set */
+	int i = 0;
+	dp_ib_cm_handle_t cr, next_cr;
+
+	dapl_os_lock(&ia_ptr->hca_ptr->ib_trans.lock);
+	if (!dapl_llist_is_empty((DAPL_LLIST_HEAD*)
+				 &ia_ptr->hca_ptr->ib_trans.list))
+		next_cr = dapl_llist_peek_head((DAPL_LLIST_HEAD*)
+				 &ia_ptr->hca_ptr->ib_trans.list);
+	else
+		next_cr = NULL;
+
+	printf("\n DAPL IA CONNECTIONS IN PROCESS:\n");
+	while (next_cr) {
+		cr = next_cr;
+		next_cr = dapl_llist_next_entry((DAPL_LLIST_HEAD*)
+				 &ia_ptr->hca_ptr->ib_trans.list,
+				(DAPL_LLIST_ENTRY*)&cr->entry);
+
+		printf( "  CONN[%d]: sp %p ep %p %s %s %s"
+			" dst lid %x iqp %x port %d\n",
+			i, cr->sp, cr->ep, 
+			cr->msg.saddr.ib.qp_type == IBV_QPT_RC ? "RC" : "UD",
+			dapl_cm_state_str(cr->state),
+			cr->sp ? "<-" : "->",
+			ntohs(cr->msg.daddr.ib.lid),
+			ntohl(cr->msg.daddr.ib.qpn),			
+			cr->sp ? 
+			(int)cr->sp->conn_qual : ntohs(cr->msg.dport) );
+		i++;
+	}
+	printf("\n");
+	dapl_os_unlock(&ia_ptr->hca_ptr->ib_trans.lock);
+}
+#endif
diff --git a/dapl/openib_ucm/dapl_ib_util.h b/dapl/openib_ucm/dapl_ib_util.h
new file mode 100644
index 0000000..dfee2b9
--- /dev/null
+++ b/dapl/openib_ucm/dapl_ib_util.h
@@ -0,0 +1,119 @@
+/*
+ * Copyright (c) 2009 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+
+#ifndef _DAPL_IB_UTIL_H_
+#define _DAPL_IB_UTIL_H_
+#define _OPENIB_SCM_ 
+
+#include <infiniband/verbs.h>
+#include "openib_osd.h"
+#include "dapl_ib_common.h"
+
+#define UCM_DEFAULT_CQE 500
+#define UCM_DEFAULT_QPE 500
+
+struct ib_cm_handle
+{ 
+	struct dapl_llist_entry	entry;
+	DAPL_OS_LOCK		lock;
+	int			state;
+	int			retries;
+	struct dapl_hca		*hca;
+	struct dapl_sp		*sp;	
+	struct dapl_ep 		*ep;
+	ib_cm_msg_t		msg;
+};
+
+typedef struct ib_cm_handle	*dp_ib_cm_handle_t;
+typedef dp_ib_cm_handle_t	ib_cm_srvc_handle_t;
+
+/* Definitions */
+#define IB_INVALID_HANDLE	NULL
+
+/* ib_hca_transport_t, specific to this implementation */
+typedef struct _ib_hca_transport
+{ 
+	struct	ibv_device	*ib_dev;
+	struct	dapl_hca	*hca;
+        struct  ibv_context     *ib_ctx;
+        struct ibv_comp_channel *ib_cq;
+        ib_cq_handle_t          ib_cq_empty;
+	int			destroy;
+	int			cm_state;
+	DAPL_OS_THREAD		thread;
+	DAPL_OS_LOCK		lock;	/* connect list */
+	struct dapl_llist_entry	*list;	
+	DAPL_OS_LOCK		llock;	/* listen list */
+	struct dapl_llist_entry	*llist;	
+	ib_async_handler_t	async_unafiliated;
+	void			*async_un_ctx;
+	ib_async_cq_handler_t	async_cq_error;
+	ib_async_dto_handler_t	async_cq;
+	ib_async_qp_handler_t	async_qp_error;
+	union dcm_addr		addr;	/* lid, port, qp_num, gid */
+	int			max_inline_send;
+	int			rd_atom_in;
+	int			rd_atom_out;
+	uint8_t			ack_timer;
+	uint8_t			ack_retry;
+	uint8_t			rnr_timer;
+	uint8_t			rnr_retry;
+	uint8_t			global;
+	uint8_t			hop_limit;
+	uint8_t			tclass;
+	uint8_t			mtu;
+	DAT_NAMED_ATTR		named_attr;
+	DAPL_SOCKET		scm[2];
+	int			cqe;
+	int			qpe;
+	DAPL_OS_LOCK		slock;	
+	int			s_hd;
+	int			s_tl;
+	struct ibv_pd		*pd; 
+	struct ibv_cq		*scq;
+	struct ibv_cq		*rcq;
+	struct ibv_qp		*qp;
+	struct ibv_mr		*mr_rbuf;
+	struct ibv_mr		*mr_sbuf;
+	ib_cm_msg_t		*sbuf;
+	ib_cm_msg_t		*rbuf;
+	struct ibv_comp_channel *rch;
+	struct ibv_ah		**ah;  
+	DAPL_OS_LOCK		plock;
+	uint8_t			*sid;  /* Service IDs, port space, bitarray? */
+
+} ib_hca_transport_t;
+
+/* prototypes */
+void cm_thread(void *arg);
+void ucm_async_event(struct dapl_hca *hca);
+dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep);
+void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep);
+void dapls_print_cm_list(IN DAPL_IA *ia_ptr);
+
+#endif /*  _DAPL_IB_UTIL_H_ */
+
diff --git a/dapl/openib_ucm/device.c b/dapl/openib_ucm/device.c
new file mode 100644
index 0000000..329b050
--- /dev/null
+++ b/dapl/openib_ucm/device.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright (c) 2009 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+
+#include "openib_osd.h"
+#include "dapl.h"
+#include "dapl_adapter_util.h"
+#include "dapl_ib_util.h"
+#include "dapl_osd.h"
+
+#include <stdlib.h>
+
+static void ucm_service_destroy(IN DAPL_HCA *hca);
+static int  ucm_service_create(IN DAPL_HCA *hca);
+
+static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr)
+{
+	DAPL_SOCKET listen_socket;
+	struct sockaddr_in addr;
+	socklen_t addrlen = sizeof(addr);
+	int ret;
+
+	listen_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	if (listen_socket == DAPL_INVALID_SOCKET)
+		return 1;
+
+	memset(&addr, 0, sizeof addr);
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = htonl(0x7f000001);
+	ret = bind(listen_socket, (struct sockaddr *)&addr, sizeof addr);
+	if (ret)
+		goto err1;
+
+	ret = getsockname(listen_socket, (struct sockaddr *)&addr, &addrlen);
+	if (ret)
+		goto err1;
+
+	ret = listen(listen_socket, 0);
+	if (ret)
+		goto err1;
+
+	hca_ptr->ib_trans.scm[1] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	if (hca_ptr->ib_trans.scm[1] == DAPL_INVALID_SOCKET)
+		goto err1;
+
+	ret = connect(hca_ptr->ib_trans.scm[1], 
+		      (struct sockaddr *)&addr, sizeof(addr));
+	if (ret)
+		goto err2;
+
+	hca_ptr->ib_trans.scm[0] = accept(listen_socket, NULL, NULL);
+	if (hca_ptr->ib_trans.scm[0] == DAPL_INVALID_SOCKET)
+		goto err2;
+
+	closesocket(listen_socket);
+	return 0;
+
+      err2:
+	closesocket(hca_ptr->ib_trans.scm[1]);
+      err1:
+	closesocket(listen_socket);
+	return 1;
+}
+
+static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr)
+{
+	closesocket(hca_ptr->ib_trans.scm[0]);
+	closesocket(hca_ptr->ib_trans.scm[1]);
+}
+
+
+/*
+ * dapls_ib_init, dapls_ib_release
+ *
+ * Initialize Verb related items for device open
+ *
+ * Input:
+ * 	none
+ *
+ * Output:
+ *	none
+ *
+ * Returns:
+ * 	0 success, -1 error
+ *
+ */
+int32_t dapls_ib_init(void)
+{
+	return 0;
+}
+
+int32_t dapls_ib_release(void)
+{
+	return 0;
+}
+
+#if defined(_WIN64) || defined(_WIN32)
+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	return 0;
+}
+#else				// _WIN64 || WIN32
+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	int opts;
+
+	opts = fcntl(channel->fd, F_GETFL);	/* uCQ */
+	if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " dapls_config_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n",
+			 channel->fd, opts, strerror(errno));
+		return errno;
+	}
+
+	return 0;
+}
+#endif
+
+/*
+ * dapls_ib_open_hca
+ *
+ * Open HCA
+ *
+ * Input:
+ *      *hca_name         pointer to provider device name
+ *      *ib_hca_handle_p  pointer to provide HCA handle
+ *
+ * Output:
+ *      none
+ *
+ * Return:
+ *      DAT_SUCCESS
+ *      dapl_convert_errno
+ *
+ */
+DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
+{
+	struct ibv_device **dev_list;
+	struct ibv_port_attr port_attr;
+	int i;
+	DAT_RETURN dat_status;
+
+	/* Get list of all IB devices, find match, open */
+	dev_list = ibv_get_device_list(NULL);
+	if (!dev_list) {
+		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
+			     " open_hca: ibv_get_device_list() failed for %s\n",
+			     hca_name);
+		return DAT_INTERNAL_ERROR;
+	}
+
+	for (i = 0; dev_list[i]; ++i) {
+		hca_ptr->ib_trans.ib_dev = dev_list[i];
+		if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
+			    hca_name))
+			goto found;
+	}
+
+	dapl_log(DAPL_DBG_TYPE_ERR,
+		 " open_hca: device %s not found\n", hca_name);
+	goto err;
+
+found:
+
+	hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev);
+	if (!hca_ptr->ib_hca_handle) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: dev open failed for %s, err=%s\n",
+			 ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
+			 strerror(errno));
+		goto err;
+	}
+	hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle;
+	
+	/* get lid for this hca-port, network order */
+	if (ibv_query_port(hca_ptr->ib_hca_handle,
+			   (uint8_t)hca_ptr->port_num, &port_attr)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: get lid ERR for %s, err=%s\n",
+			 ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
+			 strerror(errno));
+		goto err;
+	} else {
+		hca_ptr->ib_trans.addr.ib.lid = htons(port_attr.lid);
+		hca_ptr->ib_trans.addr.ib.port_num = hca_ptr->port_num;
+	}
+
+	/* get gid for this hca-port, network order */
+	if (ibv_query_gid(hca_ptr->ib_hca_handle,
+			  (uint8_t) hca_ptr->port_num,
+			  0, &hca_ptr->ib_trans.addr.ib.gid)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: query GID ERR for %s, err=%s\n",
+			 ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
+			 strerror(errno));
+		goto err;
+	}
+
+	/* set RC tunables via environment or default */
+	hca_ptr->ib_trans.max_inline_send =
+	    dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_IB_DEFAULT);
+	hca_ptr->ib_trans.ack_retry =
+	    dapl_os_get_env_val("DAPL_ACK_RETRY", DCM_ACK_RETRY);
+	hca_ptr->ib_trans.ack_timer =
+	    dapl_os_get_env_val("DAPL_ACK_TIMER", DCM_ACK_TIMER);
+	hca_ptr->ib_trans.rnr_retry =
+	    dapl_os_get_env_val("DAPL_RNR_RETRY", DCM_RNR_RETRY);
+	hca_ptr->ib_trans.rnr_timer =
+	    dapl_os_get_env_val("DAPL_RNR_TIMER", DCM_RNR_TIMER);
+	hca_ptr->ib_trans.global =
+	    dapl_os_get_env_val("DAPL_GLOBAL_ROUTING", DCM_GLOBAL);
+	hca_ptr->ib_trans.hop_limit =
+	    dapl_os_get_env_val("DAPL_HOP_LIMIT", DCM_HOP_LIMIT);
+	hca_ptr->ib_trans.tclass =
+	    dapl_os_get_env_val("DAPL_TCLASS", DCM_TCLASS);
+	hca_ptr->ib_trans.mtu =
+	    dapl_ib_mtu(dapl_os_get_env_val("DAPL_IB_MTU", DCM_IB_MTU));
+
+	/* initialize CM list, LISTEN, SND queue, PSP array, locks */
+	if ((dapl_os_lock_init(&hca_ptr->ib_trans.lock)) != DAT_SUCCESS)
+		goto err;
+	
+	if ((dapl_os_lock_init(&hca_ptr->ib_trans.llock)) != DAT_SUCCESS)
+		goto err;
+	
+	if ((dapl_os_lock_init(&hca_ptr->ib_trans.slock)) != DAT_SUCCESS)
+		goto err;
+
+	if ((dapl_os_lock_init(&hca_ptr->ib_trans.plock)) != DAT_SUCCESS)
+		goto err;
+	
+
+	/* initialize CM and listen lists on this HCA uCM QP */
+	dapl_llist_init_head(&hca_ptr->ib_trans.list);
+	dapl_llist_init_head(&hca_ptr->ib_trans.llist);
+
+	/* create uCM qp services */
+	if (ucm_service_create(hca_ptr))
+		goto bail;
+
+	/* initialize pipe, user level wakeup on select */
+	if (create_cr_pipe(hca_ptr)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: failed to init cr pipe - %s\n",
+			 strerror(errno));
+		goto bail;
+	}
+
+	/* create thread to process inbound connect request */
+	hca_ptr->ib_trans.cm_state = IB_THREAD_INIT;
+	dat_status = dapl_os_thread_create(cm_thread,
+					   (void *)hca_ptr,
+					   &hca_ptr->ib_trans.thread);
+	if (dat_status != DAT_SUCCESS) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: failed to create thread\n");
+		goto bail;
+	}
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+		     " open_hca: devname %s, port %d, hostname_IP %s\n",
+		     ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
+		     hca_ptr->ib_trans.addr.ib.port_num, 
+		     inet_ntoa(((struct sockaddr_in *)
+			       &hca_ptr->hca_address)->sin_addr));
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+		     " open_hca: QPN 0x%x LID 0x%x GID Subnet 0x" F64x " "
+		     "ID 0x" F64x "\n", 
+		     ntohl(hca_ptr->ib_trans.addr.ib.qpn),
+		     ntohs(hca_ptr->ib_trans.addr.ib.lid), 
+		     (unsigned long long)
+		     htonll(hca_ptr->ib_trans.addr.ib.gid.global.subnet_prefix),
+		     (unsigned long long)
+		     htonll(hca_ptr->ib_trans.addr.ib.gid.global.interface_id));
+
+	/* save LID, GID, QPN, PORT address information, for ia_queries */
+	hca_ptr->ib_trans.hca = hca_ptr;
+	hca_ptr->ib_trans.addr.ib.qp_type = IBV_QPT_UD;
+	memcpy(&hca_ptr->hca_address, 
+	       &hca_ptr->ib_trans.addr, 
+	       sizeof(union dcm_addr));
+
+	ibv_free_device_list(dev_list);
+
+	/* wait for cm_thread */
+	while (hca_ptr->ib_trans.cm_state != IB_THREAD_RUN) 
+		dapl_os_sleep_usec(1000);
+
+	return dat_status;
+
+bail:
+	ucm_service_destroy(hca_ptr);
+	ibv_close_device(hca_ptr->ib_hca_handle);
+	hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
+      
+err:
+	ibv_free_device_list(dev_list);
+	return DAT_INTERNAL_ERROR;
+}
+
+/*
+ * dapls_ib_close_hca
+ *
+ *      Close HCA
+ *
+ * Input:
+ *      DAPL_HCA   provide CA handle
+ *
+ * Output:
+ *      none
+ *
+ * Return:
+ *      DAT_SUCCESS
+ *	dapl_convert_errno 
+ *
+ */
+DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr)
+{
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p\n", hca_ptr);
+
+	if (hca_ptr->ib_trans.cm_state == IB_THREAD_RUN) {
+		hca_ptr->ib_trans.cm_state = IB_THREAD_CANCEL;
+		send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0);
+		while (hca_ptr->ib_trans.cm_state != IB_THREAD_EXIT) {
+			dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
+				" close_hca: waiting for cr_thread\n");
+			send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0);
+			dapl_os_sleep_usec(1000);
+		}
+	}
+
+	if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
+		if (ibv_close_device(hca_ptr->ib_hca_handle))
+			return (dapl_convert_errno(errno, "ib_close_device"));
+		hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
+	}
+
+	dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
+	dapl_os_lock_destroy(&hca_ptr->ib_trans.llock);
+	destroy_cr_pipe(hca_ptr); /* no longer need pipe */
+	ucm_service_destroy(hca_ptr);
+	return (DAT_SUCCESS);
+}
+
+/* Destroy uCM endpoint services, free remote_ah's array */
+static void ucm_service_destroy(IN DAPL_HCA *hca)
+{
+	ib_hca_transport_t *tp = &hca->ib_trans;
+	int msg_size = sizeof(ib_cm_msg_t);
+
+	if (tp->qp)
+		ibv_destroy_qp(tp->qp);
+
+	if (tp->scq)
+		ibv_destroy_cq(tp->scq);
+
+	if (tp->rcq)
+		ibv_destroy_cq(tp->rcq);
+
+	if (tp->rch)
+		ibv_destroy_comp_channel(tp->rch);
+
+	if (tp->mr_sbuf)
+		ibv_dereg_mr(tp->mr_sbuf);
+
+	if (tp->mr_rbuf)
+		ibv_dereg_mr(tp->mr_rbuf);
+
+	if (tp->pd)
+		ibv_dealloc_pd(tp->pd);
+
+	if (tp->ah)
+		dapl_os_free(tp->ah, (sizeof(*tp->ah) * 0xffff));
+
+	if (tp->sid)
+		dapl_os_free(tp->sid, (sizeof(*tp->sid) * 0xffff));
+
+	if (tp->rbuf)
+		dapl_os_free(tp->rbuf, (msg_size * tp->qpe));
+
+	if (tp->sbuf)
+		dapl_os_free(tp->sbuf, (msg_size * tp->qpe));
+}
+
+static int ucm_service_create(IN DAPL_HCA *hca)
+{
+        struct ibv_qp_init_attr qp_create;
+	ib_hca_transport_t *tp = &hca->ib_trans;
+	struct ibv_recv_wr recv_wr, *recv_err;
+        struct ibv_sge sge;
+	int i, mlen = sizeof(ib_cm_msg_t);
+	int hlen = sizeof(struct ibv_grh); /* hdr included with UD recv */
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ucm_create: \n");
+
+	/* get queue sizes */
+	tp->qpe = dapl_os_get_env_val("DAPL_UCM_QPE", UCM_DEFAULT_QPE);
+	tp->cqe = dapl_os_get_env_val("DAPL_UCM_CQE", UCM_DEFAULT_CQE);
+	tp->pd = ibv_alloc_pd(hca->ib_hca_handle);
+        if (!tp->pd) 
+                goto bail;
+        
+    	tp->rch = ibv_create_comp_channel(hca->ib_hca_handle);
+	if (!tp->rch) 
+		goto bail;
+
+	tp->scq = ibv_create_cq(hca->ib_hca_handle, tp->cqe, hca, NULL, 0);
+	if (!tp->scq) 
+		goto bail;
+        
+	tp->rcq = ibv_create_cq(hca->ib_hca_handle, tp->cqe, hca, tp->rch, 0);
+	if (!tp->rcq) 
+		goto bail;
+
+	if(ibv_req_notify_cq(tp->rcq, 0))
+		goto bail; 
+ 
+	dapl_os_memzero((void *)&qp_create, sizeof(qp_create));
+	qp_create.qp_type = IBV_QPT_UD;
+	qp_create.send_cq = tp->scq;
+	qp_create.recv_cq = tp->rcq;
+	qp_create.cap.max_send_wr = qp_create.cap.max_recv_wr = tp->qpe;
+	qp_create.cap.max_send_sge = qp_create.cap.max_recv_sge = 1;
+	qp_create.cap.max_inline_data = tp->max_inline_send;
+	qp_create.qp_context = (void *)hca;
+
+	tp->qp = ibv_create_qp(tp->pd, &qp_create);
+	if (!tp->qp) 
+                goto bail;
+
+	tp->ah = (ib_ah_handle_t*) dapl_os_alloc(sizeof(ib_ah_handle_t) * 0xffff);
+	tp->sid = (uint8_t*) dapl_os_alloc(sizeof(uint8_t) * 0xffff);
+	tp->rbuf = (void*) dapl_os_alloc((mlen + hlen) * tp->qpe);
+	tp->sbuf = (void*) dapl_os_alloc(mlen * tp->qpe);
+
+	if (!tp->ah || !tp->rbuf || !tp->sbuf || !tp->sid)
+		goto bail;
+
+	(void)dapl_os_memzero(tp->ah, (sizeof(ib_ah_handle_t) * 0xffff));
+	(void)dapl_os_memzero(tp->sid, (sizeof(uint8_t) * 0xffff));
+	tp->sid[0] = 1; /* resv slot 0, 0 == no ports available */
+	(void)dapl_os_memzero(tp->rbuf, ((mlen + hlen) * tp->qpe));
+	(void)dapl_os_memzero(tp->sbuf, (mlen * tp->qpe));
+
+	tp->mr_sbuf = ibv_reg_mr(tp->pd, tp->sbuf, 
+				 (mlen * tp->qpe),
+				 IBV_ACCESS_LOCAL_WRITE);
+	if (!tp->mr_sbuf)
+		goto bail;
+
+	tp->mr_rbuf = ibv_reg_mr(tp->pd, tp->rbuf, 
+				 ((mlen + hlen) * tp->qpe),
+				 IBV_ACCESS_LOCAL_WRITE);
+	if (!tp->mr_rbuf)
+		goto bail;
+	
+	/* modify UD QP: init, rtr, rts */
+	if ((dapls_modify_qp_ud(hca, tp->qp)) != DAT_SUCCESS)
+		goto bail;
+
+	/* post receive buffers, setup head, tail pointers */
+	recv_wr.next = NULL;
+	recv_wr.sg_list = &sge;
+	recv_wr.num_sge = 1;
+	sge.length = mlen + hlen;
+	sge.lkey = tp->mr_rbuf->lkey;
+
+	for (i = 0; i < tp->qpe; i++) {
+		recv_wr.wr_id = 
+			(uintptr_t)((char *)&tp->rbuf[i] + 
+				    sizeof(struct ibv_grh));
+		sge.addr = (uintptr_t) &tp->rbuf[i];
+		if (ibv_post_recv(tp->qp, &recv_wr, &recv_err))
+			goto bail;
+	}
+
+	/* save qp_num as part of ia_address, network order */
+	tp->addr.ib.qpn = htonl(tp->qp->qp_num);
+        return 0;
+bail:
+	dapl_log(DAPL_DBG_TYPE_ERR,
+		 " ucm_create_services: ERR %s\n", strerror(errno));
+	ucm_service_destroy(hca);
+	return -1;
+}
+
+void ucm_async_event(struct dapl_hca *hca)
+{
+	struct ibv_async_event event;
+	struct _ib_hca_transport *tp = &hca->ib_trans;
+
+	dapl_log(DAPL_DBG_TYPE_WARN, " async_event(%p)\n", hca);
+
+	if (!ibv_get_async_event(hca->ib_hca_handle, &event)) {
+
+		switch (event.event_type) {
+		case IBV_EVENT_CQ_ERR:
+		{
+			struct dapl_ep *evd_ptr =
+				event.element.cq->cq_context;
+
+			dapl_log(DAPL_DBG_TYPE_ERR,
+				 "dapl async_event CQ (%p) ERR %d\n",
+				 evd_ptr, event.event_type);
+
+			/* report up if async callback still setup */
+			if (tp->async_cq_error)
+				tp->async_cq_error(hca->ib_hca_handle,
+						   event.element.cq,
+						   &event, (void *)evd_ptr);
+			break;
+		}
+		case IBV_EVENT_COMM_EST:
+		{
+			/* Received msgs on connected QP before RTU */
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " async_event COMM_EST(%p) rdata beat RTU\n",
+				 event.element.qp);
+
+			break;
+		}
+		case IBV_EVENT_QP_FATAL:
+		case IBV_EVENT_QP_REQ_ERR:
+		case IBV_EVENT_QP_ACCESS_ERR:
+		case IBV_EVENT_QP_LAST_WQE_REACHED:
+		case IBV_EVENT_SRQ_ERR:
+		case IBV_EVENT_SRQ_LIMIT_REACHED:
+		case IBV_EVENT_SQ_DRAINED:
+		{
+			struct dapl_ep *ep_ptr =
+				event.element.qp->qp_context;
+
+			dapl_log(DAPL_DBG_TYPE_ERR,
+				 "dapl async_event QP (%p) ERR %d\n",
+				 ep_ptr, event.event_type);
+
+			/* report up if async callback still setup */
+			if (tp->async_qp_error)
+				tp->async_qp_error(hca->ib_hca_handle,
+						   ep_ptr->qp_handle,
+						   &event, (void *)ep_ptr);
+			break;
+		}
+		case IBV_EVENT_PATH_MIG:
+		case IBV_EVENT_PATH_MIG_ERR:
+		case IBV_EVENT_DEVICE_FATAL:
+		case IBV_EVENT_PORT_ACTIVE:
+		case IBV_EVENT_PORT_ERR:
+		case IBV_EVENT_LID_CHANGE:
+		case IBV_EVENT_PKEY_CHANGE:
+		case IBV_EVENT_SM_CHANGE:
+		{
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				 "dapl async_event: DEV ERR %d\n",
+				 event.event_type);
+
+			/* report up if async callback still setup */
+			if (tp->async_unafiliated)
+				tp->async_unafiliated(hca->ib_hca_handle,
+						      &event,
+						      tp->async_un_ctx);
+			break;
+		}
+		case IBV_EVENT_CLIENT_REREGISTER:
+			/* no need to report this event this time */
+			dapl_log(DAPL_DBG_TYPE_UTIL,
+				 " async_event: IBV_CLIENT_REREGISTER\n");
+			break;
+
+		default:
+			dapl_log(DAPL_DBG_TYPE_WARN,
+				 "dapl async_event: %d UNKNOWN\n",
+				 event.event_type);
+			break;
+
+		}
+		ibv_ack_async_event(&event);
+	}
+}
+
diff --git a/dapl/openib_ucm/linux/openib_osd.h b/dapl/openib_ucm/linux/openib_osd.h
new file mode 100644
index 0000000..191a55b
--- /dev/null
+++ b/dapl/openib_ucm/linux/openib_osd.h
@@ -0,0 +1,21 @@
+#ifndef OPENIB_OSD_H
+#define OPENIB_OSD_H
+
+#include <endian.h>
+#include <netinet/in.h>
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#define htonll(x) (x)
+#define ntohll(x) (x)
+#elif __BYTE_ORDER == __LITTLE_ENDIAN
+#define htonll(x)  bswap_64(x)
+#define ntohll(x)  bswap_64(x)
+#endif
+
+#define DAPL_SOCKET int
+#define DAPL_INVALID_SOCKET -1
+#define DAPL_FD_SETSIZE 16
+
+#define closesocket close
+
+#endif // OPENIB_OSD_H
diff --git a/dapl/openib_ucm/udapl.rc b/dapl/openib_ucm/udapl.rc
new file mode 100644
index 0000000..8550256
--- /dev/null
+++ b/dapl/openib_ucm/udapl.rc
@@ -0,0 +1,48 @@
+/*
+ * Copyright (c) 2007, 2009 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under the OpenIB.org BSD license
+ * below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+
+#include <oib_ver.h>
+
+#define VER_FILETYPE			VFT_DLL
+#define VER_FILESUBTYPE			VFT2_UNKNOWN
+
+#if DBG
+#define VER_FILEDESCRIPTION_STR		"Direct Access Provider Library v2.0 (OFA socket-cm) (Debug)"
+#define VER_INTERNALNAME_STR		"dapl2-ofa-scmd.dll"
+#define VER_ORIGINALFILENAME_STR	"dapl2-ofa-scmd.dll"
+#else
+#define VER_FILEDESCRIPTION_STR		"Direct Access Provider Library v2.0 (OFA socket-cm)"
+#define VER_INTERNALNAME_STR		"dapl2-ofa-scm.dll"
+#define VER_ORIGINALFILENAME_STR	"dapl2-ofa-scm.dll"
+#endif
+
+#include <common.ver>
diff --git a/dapl/openib_ucm/windows/openib_osd.h b/dapl/openib_ucm/windows/openib_osd.h
new file mode 100644
index 0000000..7eb3df3
--- /dev/null
+++ b/dapl/openib_ucm/windows/openib_osd.h
@@ -0,0 +1,35 @@
+#ifndef OPENIB_OSD_H
+#define OPENIB_OSD_H
+
+#ifndef FD_SETSIZE
+#define FD_SETSIZE 1024 /* Set before including winsock2 - see select help */
+#define DAPL_FD_SETSIZE FD_SETSIZE
+#endif
+
+#include <winsock2.h>
+#include <ws2tcpip.h>
+#include <io.h>
+#include <fcntl.h>
+
+#define ntohll _byteswap_uint64
+#define htonll _byteswap_uint64
+
+#define DAPL_SOCKET SOCKET
+#define DAPL_INVALID_SOCKET INVALID_SOCKET
+
+/* allow casting to WSABUF */
+struct iovec
+{
+       u_long iov_len;
+       char FAR* iov_base;
+};
+
+static int writev(DAPL_SOCKET s, struct iovec *vector, int count)
+{
+       int len, ret;
+
+       ret = WSASend(s, (WSABUF *) vector, count, &len, 0, NULL, NULL);
+       return ret ? ret : len;
+}
+
+#endif // OPENIB_OSD_H
diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index d868490..2f418fe 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -76,6 +76,7 @@
 #include <getopt.h>
 #include <inttypes.h>
 #include <unistd.h>
+#include <stdlib.h>
 
 #define DAPL_PROVIDER "ofa-v2-ib0"
 
@@ -99,14 +100,15 @@
 #define MAX_PROCS      1000
 
 /* Header files needed for DAT/uDAPL */
-#include    "dat2/udat.h"
+#include "infiniband/verbs.h"
+#include "dat2/udat.h"
 
 /* definitions */
 #define SERVER_CONN_QUAL  45248
 #define DTO_TIMEOUT       (1000*1000*5)
 #define CNO_TIMEOUT       (1000*1000*1)
 #define DTO_FLUSH_TIMEOUT (1000*1000*2)
-#define CONN_TIMEOUT      (1000*1000*10)
+#define CONN_TIMEOUT      (1000*1000*100)
 #define SERVER_TIMEOUT    DAT_TIMEOUT_INFINITE
 #define RDMA_BUFFER_SIZE  (64)
 
@@ -187,7 +189,7 @@ struct dt_time {
 	double conn;
 };
 
-struct dt_time time;
+struct dt_time ts;
 
 /* defaults */
 static int failed = 0;
@@ -207,6 +209,22 @@ static int use_cno = 0;
 static int recv_msg_index = 0;
 static int burst_msg_posted = 0;
 static int burst_msg_index = 0;
+static int ucm = 0;
+
+/* IB address structure used by DAPL uCM provider */
+union dcm_addr {
+	DAT_SOCK_ADDR6          so;
+	struct {
+		uint8_t		qp_type;
+		uint8_t		port_num;
+		uint16_t	lid;
+		uint32_t	qpn;
+		union ibv_gid	gid;
+	} ib;
+};
+
+static union dcm_addr remote;
+static union dcm_addr local;
 
 /* forward prototypes */
 const char *DT_RetToStr(DAT_RETURN ret_value);
@@ -313,9 +331,10 @@ int main(int argc, char **argv)
 	int i, c;
 	DAT_RETURN ret;
 	DAT_EP_PARAM ep_param;
+	DAT_IA_ATTR ia_attr;
 
 	/* parse arguments */
-	while ((c = getopt(argc, argv, "tscvpb:d:B:h:P:")) != -1) {
+	while ((c = getopt(argc, argv, "tscvpq:l:b:d:B:h:P:")) != -1) {
 		switch (c) {
 		case 't':
 			performance_times = 1;
@@ -340,6 +359,16 @@ int main(int argc, char **argv)
 			printf("%d Polling\n", getpid());
 			fflush(stdout);
 			break;
+		case 'q':
+			remote.ib.qpn = htonl(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
+		case 'l':
+			remote.ib.lid = htons(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
 		case 'B':
 			burst = atoi(optarg);
 			break;
@@ -389,7 +418,7 @@ int main(int argc, char **argv)
 		perror("malloc");
 		exit(1);
 	}
-	memset(&time, 0, sizeof(struct dt_time));
+	memset(&ts, 0, sizeof(struct dt_time));
 	LOGPRINTF("%d Allocated RDMA buffers (r:%p,s:%p) len %d \n",
 		  getpid(), rbuf, sbuf, buf_len);
 
@@ -398,7 +427,7 @@ int main(int argc, char **argv)
 	start = get_time();
 	ret = dat_ia_open(provider, 8, &h_async_evd, &h_ia);
 	stop = get_time();
-	time.open += ((stop - start) * 1.0e6);
+	ts.open += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d: Error Adaptor open: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -406,12 +435,34 @@ int main(int argc, char **argv)
 	} else
 		LOGPRINTF("%d Opened Interface Adaptor\n", getpid());
 
+	printf("%d query \n", getpid());
+
+	ret = dat_ia_query(h_ia, 0, DAT_IA_FIELD_ALL, &ia_attr, 0, 0);
+	if (ret != DAT_SUCCESS) {
+		fprintf(stderr, "%d: Error Adaptor query: %s\n",
+			getpid(), DT_RetToStr(ret));
+		exit(1);
+	}
+	memcpy((void*)&local,
+		(void*)ia_attr.ia_address_ptr,
+		sizeof(DAT_SOCK_ADDR6));
+
+	if (local.ib.qp_type == IBV_QPT_UD) {
+		ucm = 1;
+		printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n",
+			getpid(), ntohl(local.ib.qpn),
+			ntohs(local.ib.lid));
+		printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n",
+			getpid(), ntohl(remote.ib.qpn),
+			ntohs(remote.ib.lid));
+	}
+
 	/* Create Protection Zone */
 	start = get_time();
 	LOGPRINTF("%d Create Protection Zone\n", getpid());
 	ret = dat_pz_create(h_ia, &h_pz);
 	stop = get_time();
-	time.pzc += ((stop - start) * 1.0e6);
+	ts.pzc += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d Error creating Protection Zone: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -461,8 +512,8 @@ int main(int argc, char **argv)
 	ret = dat_ep_create(h_ia, h_pz, h_dto_rcv_evd,
 			    h_dto_req_evd, h_conn_evd, &ep_attr, &h_ep);
 	stop = get_time();
-	time.epc += ((stop - start) * 1.0e6);
-	time.total += time.epc;
+	ts.epc += ((stop - start) * 1.0e6);
+	ts.total += ts.epc;
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d Error dat_ep_create: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -570,8 +621,8 @@ complete:
 		start = get_time();
 		ret = dat_ep_free(h_ep);
 		stop = get_time();
-		time.epf += ((stop - start) * 1.0e6);
-		time.total += time.epf;
+		ts.epf += ((stop - start) * 1.0e6);
+		ts.total += ts.epf;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, "%d Error freeing EP: %s\n",
 				getpid(), DT_RetToStr(ret));
@@ -603,7 +654,7 @@ complete:
 	start = get_time();
 	ret = dat_pz_free(h_pz);
 	stop = get_time();
-	time.pzf += ((stop - start) * 1.0e6);
+	ts.pzf += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d Error freeing PZ: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -617,7 +668,7 @@ complete:
 	start = get_time();
 	ret = dat_ia_close(h_ia, DAT_CLOSE_ABRUPT_FLAG);
 	stop = get_time();
-	time.close += ((stop - start) * 1.0e6);
+	ts.close += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d: Error Adaptor close: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -640,35 +691,35 @@ complete:
 	printf("\n%d: DAPL Test Complete.\n\n", getpid());
 	printf("%d: Message RTT: Total=%10.2lf usec, %d bursts, itime=%10.2lf"
 	       " usec, pc=%d\n",
-	       getpid(), time.rtt, burst, time.rtt / burst, poll_count);
+	       getpid(), ts.rtt, burst, ts.rtt / burst, poll_count);
 	printf("%d: RDMA write:  Total=%10.2lf usec, %d bursts, itime=%10.2lf"
 	       " usec, pc=%d\n",
-	       getpid(), time.rdma_wr, burst,
-	       time.rdma_wr / burst, rdma_wr_poll_count);
+	       getpid(), ts.rdma_wr, burst,
+	       ts.rdma_wr / burst, rdma_wr_poll_count);
 	for (i = 0; i < MAX_RDMA_RD; i++) {
 		printf("%d: RDMA read:   Total=%10.2lf usec,   %d bursts, "
 		       "itime=%10.2lf usec, pc=%d\n",
-		       getpid(), time.rdma_rd_total, MAX_RDMA_RD,
-		       time.rdma_rd[i], rdma_rd_poll_count[i]);
+		       getpid(), ts.rdma_rd_total, MAX_RDMA_RD,
+		       ts.rdma_rd[i], rdma_rd_poll_count[i]);
 	}
-	printf("%d: open:      %10.2lf usec\n", getpid(), time.open);
-	printf("%d: close:     %10.2lf usec\n", getpid(), time.close);
-	printf("%d: PZ create: %10.2lf usec\n", getpid(), time.pzc);
-	printf("%d: PZ free:   %10.2lf usec\n", getpid(), time.pzf);
-	printf("%d: LMR create:%10.2lf usec\n", getpid(), time.reg);
-	printf("%d: LMR free:  %10.2lf usec\n", getpid(), time.unreg);
-	printf("%d: EVD create:%10.2lf usec\n", getpid(), time.evdc);
-	printf("%d: EVD free:  %10.2lf usec\n", getpid(), time.evdf);
+	printf("%d: open:      %10.2lf usec\n", getpid(), ts.open);
+	printf("%d: close:     %10.2lf usec\n", getpid(), ts.close);
+	printf("%d: PZ create: %10.2lf usec\n", getpid(), ts.pzc);
+	printf("%d: PZ free:   %10.2lf usec\n", getpid(), ts.pzf);
+	printf("%d: LMR create:%10.2lf usec\n", getpid(), ts.reg);
+	printf("%d: LMR free:  %10.2lf usec\n", getpid(), ts.unreg);
+	printf("%d: EVD create:%10.2lf usec\n", getpid(), ts.evdc);
+	printf("%d: EVD free:  %10.2lf usec\n", getpid(), ts.evdf);
 	if (use_cno) {
-		printf("%d: CNO create:  %10.2lf usec\n", getpid(), time.cnoc);
-		printf("%d: CNO free:    %10.2lf usec\n", getpid(), time.cnof);
+		printf("%d: CNO create:  %10.2lf usec\n", getpid(), ts.cnoc);
+		printf("%d: CNO free:    %10.2lf usec\n", getpid(), ts.cnof);
 	}
-	printf("%d: EP create: %10.2lf usec\n", getpid(), time.epc);
-	printf("%d: EP free:   %10.2lf usec\n", getpid(), time.epf);
+	printf("%d: EP create: %10.2lf usec\n", getpid(), ts.epc);
+	printf("%d: EP free:   %10.2lf usec\n", getpid(), ts.epf);
 	if (!server)
 		printf("%d: connect:   %10.2lf usec, poll_cnt=%d\n", 
-		       getpid(), time.conn, conn_poll_count);
-	printf("%d: TOTAL:     %10.2lf usec\n", getpid(), time.total);
+		       getpid(), ts.conn, conn_poll_count);
+	printf("%d: TOTAL:     %10.2lf usec\n", getpid(), ts.total);
 
 #if defined(_WIN32) || defined(_WIN64)
 	WSACleanup();
@@ -676,6 +727,17 @@ complete:
 	return (0);
 }
 
+#if defined(_WIN32) || defined(_WIN64)
+void gettimeofday(struct timeval *t, char *jnk)
+{
+	SYSTEMTIME now;
+	GetLocalTime(&now);
+	t->tv_sec = now.wMinute * 60;
+	t->tv_sec += now.wSecond;
+	t->tv_usec = now.wMilliseconds * 1000;	/* msec -> usec */
+}
+#endif
+
 double get_time(void)
 {
 	struct timeval tp;
@@ -761,7 +823,7 @@ send_msg(void *data,
 
 DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 {
-	DAT_SOCK_ADDR remote_addr;
+	DAT_IA_ADDRESS_PTR remote_addr = (DAT_IA_ADDRESS_PTR)&remote;
 	DAT_RETURN ret;
 	DAT_REGION_DESCRIPTION region;
 	DAT_EVENT event;
@@ -953,6 +1015,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 		struct addrinfo *target;
 		int rval;
 
+		if (ucm)
+			goto no_resolution;
+
 #if defined(_WIN32) || defined(_WIN64)
 		if ((rval = getaddrinfo(hostname, "ftp", NULL, &target)) != 0) {
 			printf("\n remote name resolution failed! %s\n",
@@ -972,16 +1037,15 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 		       getpid(), (rval >> 0) & 0xff, (rval >> 8) & 0xff,
 		       (rval >> 16) & 0xff, (rval >> 24) & 0xff, conn_id);
 
-		remote_addr = *((DAT_IA_ADDRESS_PTR) target->ai_addr);
-		freeaddrinfo(target);
-
+		remote_addr = (DAT_IA_ADDRESS_PTR)target->ai_addr; /* IP */
+no_resolution:
 		for (i = 0; i < 48; i++)	/* simple pattern in private data */
 			pdata[i] = i + 1;
 
 		LOGPRINTF("%d Connecting to server\n", getpid());
         	start = get_time();
 		ret = dat_ep_connect(h_ep,
-				     &remote_addr,
+				     remote_addr,
 				     conn_id,
 				     CONN_TIMEOUT,
 				     48,
@@ -993,6 +1057,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 			return (ret);
 		} else
 			LOGPRINTF("%d dat_ep_connect completed\n", getpid());
+
+		if (!ucm)
+			freeaddrinfo(target);
 	}
 
 	printf("%d Waiting for connect response\n", getpid());
@@ -1007,7 +1074,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 
 	if (!server) {
         	stop = get_time();
-        	time.conn += ((stop - start) * 1.0e6);
+        	ts.conn += ((stop - start) * 1.0e6);
 	}
 
 #ifdef TEST_REJECT_WITH_PRIVATE_DATA
@@ -1307,7 +1374,7 @@ DAT_RETURN do_rdma_write_with_msg(void)
 		return (DAT_ABORT);
 	
 	stop = get_time();
-	time.rdma_wr = ((stop - start) * 1.0e6);
+	ts.rdma_wr = ((stop - start) * 1.0e6);
 
 	/* validate event number and status */
 	printf("%d inbound rdma_write; send message arrived!\n", getpid());
@@ -1436,8 +1503,8 @@ DAT_RETURN do_rdma_read_with_msg(void)
 			return (DAT_ABORT);
 		}
 		stop = get_time();
-		time.rdma_rd[i] = ((stop - start) * 1.0e6);
-		time.rdma_rd_total += time.rdma_rd[i];
+		ts.rdma_rd[i] = ((stop - start) * 1.0e6);
+		ts.rdma_rd_total += ts.rdma_rd[i];
 
 		LOGPRINTF("%d rdma_read # %d completed\n", getpid(), i + 1);
 	}
@@ -1675,7 +1742,7 @@ DAT_RETURN do_ping_pong_msg()
 		snd_buf += buf_len;
 	}
 	stop = get_time();
-	time.rtt = ((stop - start) * 1.0e6);
+	ts.rtt = ((stop - start) * 1.0e6);
 
 	return (DAT_SUCCESS);
 }
@@ -1700,8 +1767,8 @@ DAT_RETURN register_rdma_memory(void)
 			     &rmr_context_recv,
 			     &registered_size_recv, &registered_addr_recv);
 	stop = get_time();
-	time.reg += ((stop - start) * 1.0e6);
-	time.total += time.reg;
+	ts.reg += ((stop - start) * 1.0e6);
+	ts.total += ts.reg;
 
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr,
@@ -1751,8 +1818,8 @@ DAT_RETURN unregister_rdma_memory(void)
 		start = get_time();
 		ret = dat_lmr_free(h_lmr_recv);
 		stop = get_time();
-		time.unreg += ((stop - start) * 1.0e6);
-		time.total += time.unreg;
+		ts.unreg += ((stop - start) * 1.0e6);
+		ts.total += ts.unreg;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, "%d Error deregistering recv mr: %s\n",
 				getpid(), DT_RetToStr(ret));
@@ -1801,8 +1868,8 @@ DAT_RETURN create_events(void)
 				   &h_dto_cno);
 #endif
 		stop = get_time();
-		time.cnoc += ((stop - start) * 1.0e6);
-		time.total += time.cnoc;
+		ts.cnoc += ((stop - start) * 1.0e6);
+		ts.total += ts.cnoc;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, "%d Error dat_cno_create: %s\n",
 				getpid(), DT_RetToStr(ret));
@@ -1819,8 +1886,8 @@ DAT_RETURN create_events(void)
 	    dat_evd_create(h_ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG,
 			   &h_cr_evd);
 	stop = get_time();
-	time.evdc += ((stop - start) * 1.0e6);
-	time.total += time.evdc;
+	ts.evdc += ((stop - start) * 1.0e6);
+	ts.total += ts.evdc;
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, "%d Error dat_evd_create: %s\n",
 			getpid(), DT_RetToStr(ret));
@@ -1930,8 +1997,8 @@ DAT_RETURN destroy_events(void)
 		start = get_time();
 		ret = dat_evd_free(h_dto_rcv_evd);
 		stop = get_time();
-		time.evdf += ((stop - start) * 1.0e6);
-		time.total += time.evdf;
+		ts.evdf += ((stop - start) * 1.0e6);
+		ts.total += ts.evdf;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, "%d Error freeing dto EVD: %s\n",
 				getpid(), DT_RetToStr(ret));
@@ -1962,8 +2029,8 @@ DAT_RETURN destroy_events(void)
 		start = get_time();
 		ret = dat_cno_free(h_dto_cno);
 		stop = get_time();
-		time.cnof += ((stop - start) * 1.0e6);
-		time.total += time.cnof;
+		ts.cnof += ((stop - start) * 1.0e6);
+		ts.total += ts.cnof;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, "%d Error freeing dto CNO: %s\n",
 				getpid(), DT_RetToStr(ret));
@@ -2048,6 +2115,8 @@ void print_usage(void)
 	printf("B: burst count, rdma and msgs \n");
 	printf("h: hostname/address of server, specified on client\n");
 	printf("P: provider name (default = OpenIB-cma)\n");
+	printf("l: server lid (required ucm provider)\n");
+	printf("q: server qpn (required ucm provider)\n");
 	printf("\n");
 }
 
diff --git a/test/dtest/dtestcm.c b/test/dtest/dtestcm.c
index 71d9350..5b0272a 100644
--- a/test/dtest/dtestcm.c
+++ b/test/dtest/dtestcm.c
@@ -76,6 +76,7 @@
 #include <getopt.h>
 #include <inttypes.h>
 #include <unistd.h>
+#include <stdlib.h>
 
 #define DAPL_PROVIDER "ofa-v2-mlx4_0-1"
 
@@ -96,8 +97,24 @@
 #define MAX_POLLING_CNT 50000
 
 /* Header files needed for DAT/uDAPL */
-#include    "dat2/udat.h"
-#include    "dat2/dat_ib_extensions.h"
+#include "infiniband/verbs.h"
+#include "dat2/udat.h"
+#include "dat2/dat_ib_extensions.h"
+
+/* IB address structure used by DAPL uCM provider */
+union dcm_addr { 
+	DAT_SOCK_ADDR6		so;
+	struct {
+		uint8_t		qp_type;
+		uint8_t		port_num;
+		uint16_t	lid;
+		uint32_t	qpn;
+		union ibv_gid	gid;
+	} ib;
+};
+
+static union dcm_addr remote;
+static union dcm_addr local;
 
 /* definitions */
 #define SERVER_CONN_QUAL  45248
@@ -145,7 +162,7 @@ struct dt_time {
 	double conn;
 };
 
-struct dt_time time;
+struct dt_time ts;
 
 /* defaults */
 static int connected = 0;
@@ -160,6 +177,7 @@ static int delay = 0;
 static int connections = 1000;
 static int burst = 100;
 static int port_id = SERVER_CONN_QUAL;
+static int ucm = 0;
 
 /* forward prototypes */
 const char *DT_RetToString(DAT_RETURN ret_value);
@@ -191,9 +209,10 @@ int main(int argc, char **argv)
 {
 	int i, c, len;
 	DAT_RETURN ret;
+	DAT_IA_ATTR ia_attr;
 	
 	/* parse arguments */
-	while ((c = getopt(argc, argv, "smwvub:c:d:h:P:p:")) != -1) {
+	while ((c = getopt(argc, argv, "smwvub:c:d:h:P:p:q:l:")) != -1) {
 		switch (c) {
 		case 's':
 			server = 1;
@@ -230,6 +249,16 @@ int main(int argc, char **argv)
 		case 'P':
 			strcpy(provider, optarg);
 			break;
+		case 'q':
+			remote.ib.qpn = htonl(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
+		case 'l':
+			remote.ib.lid = htons(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
 		default:
 			print_usage();
 			exit(-12);
@@ -283,14 +312,14 @@ int main(int argc, char **argv)
 		exit(1);
 	}
 	memset(h_psp, 0, len);
-	memset(&time, 0, sizeof(struct dt_time));
+	memset(&ts, 0, sizeof(struct dt_time));
 
 	/* dat_ia_open, dat_pz_create */
 	h_async_evd = DAT_HANDLE_NULL;
 	start = get_time();
 	ret = dat_ia_open(provider, 8, &h_async_evd, &h_ia);
 	stop = get_time();
-	time.open += ((stop - start) * 1.0e6);
+	ts.open += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error Adaptor open: %s\n",
 			DT_RetToString(ret));
@@ -298,12 +327,33 @@ int main(int argc, char **argv)
 	} else
 		LOGPRINTF(" Opened Interface Adaptor\n");
 
+	/* query for UCM addressing */
+	ret = dat_ia_query(h_ia, 0, DAT_IA_FIELD_ALL, &ia_attr, 0, 0);
+	if (ret != DAT_SUCCESS) {
+		fprintf(stderr, "%d: Error Adaptor query: %s\n",
+			getpid(), DT_RetToString(ret));
+		exit(1);
+	}
+	memcpy((void*)&local, 
+	       (void*)ia_attr.ia_address_ptr, 
+	        sizeof(DAT_SOCK_ADDR6));
+
+	if (local.ib.qp_type == IBV_QPT_UD) {
+		ucm = 1;
+		printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n", 
+			getpid(), ntohl(local.ib.qpn), 
+			ntohs(local.ib.lid));
+		printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n", 
+			getpid(), ntohl(remote.ib.qpn), 
+			ntohs(remote.ib.lid));
+	} 
+
 	/* Create Protection Zone */
 	start = get_time();
 	LOGPRINTF(" Create Protection Zone\n");
 	ret = dat_pz_create(h_ia, &h_pz);
 	stop = get_time();
-	time.pzc += ((stop - start) * 1.0e6);
+	ts.pzc += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error creating Protection Zone: %s\n",
 			DT_RetToString(ret));
@@ -345,8 +395,8 @@ int main(int argc, char **argv)
 				    &ep_attr, &h_ep[i]);
 	}
 	stop = get_time();
-	time.epc += ((stop - start) * 1.0e6);
-	time.total += time.epc;
+	ts.epc += ((stop - start) * 1.0e6);
+	ts.total += ts.epc;
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error dat_ep_create: %s\n",
 			DT_RetToString(ret));
@@ -447,7 +497,7 @@ complete:
 	start = get_time();
 	ret = dat_pz_free(h_pz);
 	stop = get_time();
-	time.pzf += ((stop - start) * 1.0e6);
+	ts.pzf += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error freeing PZ: %s\n",
 			DT_RetToString(ret));
@@ -462,7 +512,7 @@ complete:
 	start = get_time();
 	ret = dat_ia_close(h_ia, DAT_CLOSE_ABRUPT_FLAG);
 	stop = get_time();
-	time.close += ((stop - start) * 1.0e6);
+	ts.close += ((stop - start) * 1.0e6);
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error Adaptor close: %s\n",
 			DT_RetToString(ret));
@@ -471,25 +521,25 @@ complete:
 		LOGPRINTF(" Closed Interface Adaptor\n");
 
 	printf(" DAPL Connection Test Complete.\n");
-	printf(" open:      %10.2lf usec\n", time.open);
-	printf(" close:     %10.2lf usec\n", time.close);
-	printf(" PZ create: %10.2lf usec\n", time.pzc);
-	printf(" PZ free:   %10.2lf usec\n", time.pzf);
-	printf(" LMR create:%10.2lf usec\n", time.reg);
-	printf(" LMR free:  %10.2lf usec\n", time.unreg);
-	printf(" EVD create:%10.2lf usec\n", time.evdc);
-	printf(" EVD free:  %10.2lf usec\n", time.evdf);
-	printf(" EP create: %10.2lf usec avg\n", time.epc/connections);
-	printf(" EP free:   %10.2lf usec avg\n", time.epf/connections);
+	printf(" open:      %10.2lf usec\n", ts.open);
+	printf(" close:     %10.2lf usec\n", ts.close);
+	printf(" PZ create: %10.2lf usec\n", ts.pzc);
+	printf(" PZ free:   %10.2lf usec\n", ts.pzf);
+	printf(" LMR create:%10.2lf usec\n", ts.reg);
+	printf(" LMR free:  %10.2lf usec\n", ts.unreg);
+	printf(" EVD create:%10.2lf usec\n", ts.evdc);
+	printf(" EVD free:  %10.2lf usec\n", ts.evdf);
+	printf(" EP create: %10.2lf usec avg\n", ts.epc/connections);
+	printf(" EP free:   %10.2lf usec avg\n", ts.epf/connections);
 	if (!server) {
 		printf(" Connections: %8.2lf usec, CPS %7.2lf "
 			"Total %4.2lf secs, poll_cnt=%u, Num=%d\n", 
-		       (double)(time.conn/connections), 
-		       (double)(1/(time.conn/1000000/connections)), 
-		       (double)(time.conn/1000000), 
+		       (double)(ts.conn/connections), 
+		       (double)(1/(ts.conn/1000000/connections)), 
+		       (double)(ts.conn/1000000), 
 		       conn_poll_count, connections);
 	}
-	printf(" TOTAL:     %4.2lf sec\n",  time.total/1000000);
+	printf(" TOTAL:     %4.2lf sec\n",  ts.total/1000000);
 	fflush(stderr);	fflush(stdout);
 bail:
 	free(h_ep);
@@ -501,6 +551,19 @@ bail:
 	return (0);
 }
 
+#if defined(_WIN32) || defined(_WIN64)
+
+void gettimeofday(struct timeval *t, char *jnk)
+{
+	SYSTEMTIME now;
+	GetLocalTime(&now);
+	t->tv_sec = now.wMinute * 60;
+	t->tv_sec += now.wSecond;
+	t->tv_usec = now.wMilliseconds * 1000;	/* msec -> usec */
+}
+
+#endif
+
 double get_time(void)
 {
 	struct timeval tp;
@@ -644,7 +707,7 @@ DAT_RETURN conn_server()
 	
 DAT_RETURN conn_client() 
 {
-	DAT_SOCK_ADDR raddr;
+	DAT_IA_ADDRESS_PTR raddr = (DAT_IA_ADDRESS_PTR)&remote;
 	DAT_RETURN ret;
 	DAT_EVENT event;
 	DAT_COUNT nmore;
@@ -657,6 +720,9 @@ DAT_RETURN conn_client()
 	struct addrinfo *target;
 	int rval;
 
+	if (ucm)
+		goto no_resolution;
+
 #if defined(_WIN32) || defined(_WIN64)
 	if ((rval = getaddrinfo(hostname, "ftp", NULL, &target)) != 0) {
 		printf("\n remote name resolution failed! %s\n",
@@ -677,8 +743,9 @@ DAT_RETURN conn_client()
 		(rval >> 16) & 0xff, (rval >> 24) & 0xff, 
 		port_id);
 
-	raddr = *((DAT_IA_ADDRESS_PTR)target->ai_addr);
-	freeaddrinfo(target);
+	raddr = (DAT_IA_ADDRESS_PTR)target->ai_addr;
+	
+no_resolution:
 
 	for (i = 0; i < 48; i++) /* simple pattern in private data */
 		pdata[i] = i + 1;
@@ -692,7 +759,7 @@ DAT_RETURN conn_client()
 			else
 				conn_id = port_id;
 
-			ret = dat_ep_connect(h_ep[i+ii], &raddr, 
+			ret = dat_ep_connect(h_ep[i+ii], raddr, 
 					     conn_id, CONN_TIMEOUT,
 					     48, (DAT_PVOID) pdata, 0, 
 					     DAT_CONNECT_DEFAULT_FLAG);
@@ -790,7 +857,10 @@ DAT_RETURN conn_client()
 	}
 
       	stop = get_time();
-       	time.conn += ((stop - start) * 1.0e6);
+       	ts.conn += ((stop - start) * 1.0e6);
+
+	if (!ucm)
+		freeaddrinfo(target);
 
 	printf("\n ALL %d CONNECTED on Client!\n\n",  connections);
 
@@ -825,8 +895,8 @@ DAT_RETURN disconnect_eps(void)
 			}
 		}
 		stop = get_time();
-		time.epf += ((stop - start) * 1.0e6);
-		time.total += time.epf;
+		ts.epf += ((stop - start) * 1.0e6);
+		ts.total += ts.epf;
 		return DAT_SUCCESS;
 	}
 	
@@ -900,8 +970,8 @@ DAT_RETURN disconnect_eps(void)
 	}
 	/* free EPs */
 	stop = get_time();
-	time.epf += ((stop - start) * 1.0e6);
-	time.total += time.epf;
+	ts.epf += ((stop - start) * 1.0e6);
+	ts.total += ts.epf;
 	return DAT_SUCCESS;
 }
 
@@ -918,8 +988,8 @@ DAT_RETURN create_events(void)
 	ret = dat_evd_create(h_ia, connections, DAT_HANDLE_NULL, 
 			     DAT_EVD_CR_FLAG, &h_cr_evd);
 	stop = get_time();
-	time.evdc += ((stop - start) * 1.0e6);
-	time.total += time.evdc;
+	ts.evdc += ((stop - start) * 1.0e6);
+	ts.total += ts.evdc;
 	if (ret != DAT_SUCCESS) {
 		fprintf(stderr, " Error dat_evd_create: %s\n",
 			 DT_RetToString(ret));
@@ -1009,8 +1079,8 @@ DAT_RETURN destroy_events(void)
 		start = get_time();
 		ret = dat_evd_free(h_dto_rcv_evd);
 		stop = get_time();
-		time.evdf += ((stop - start) * 1.0e6);
-		time.total += time.evdf;
+		ts.evdf += ((stop - start) * 1.0e6);
+		ts.total += ts.evdf;
 		if (ret != DAT_SUCCESS) {
 			fprintf(stderr, " Error freeing dto EVD: %s\n",
 				 DT_RetToString(ret));
diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c
index a14785b..af87af0 100755
--- a/test/dtest/dtestx.c
+++ b/test/dtest/dtestx.c
@@ -65,6 +65,7 @@
 
 #endif
 
+#include "infiniband/verbs.h"
 #include "dat2/udat.h"
 #include "dat2/dat_ib_extensions.h"
 
@@ -178,6 +179,22 @@ int eps = 1;
 int verbose = 0;
 int counters = 0;
 int counters_ok = 0;
+static int ucm = 0;
+
+/* IB address structure used by DAPL uCM provider */
+union dcm_addr {
+	DAT_SOCK_ADDR6          so;
+	struct {
+		uint8_t		qp_type;
+		uint8_t		port_num;
+		uint16_t	lid;
+		uint32_t	qpn;
+		union ibv_gid	gid;
+	} ib;
+};
+
+static union dcm_addr remote;
+static union dcm_addr local;
 
 #define LOGPRINTF if (verbose) printf
 
@@ -392,8 +409,9 @@ void process_conn(int idx)
 
 int connect_ep(char *hostname)
 {
-	DAT_SOCK_ADDR remote_addr;
+	DAT_IA_ADDRESS_PTR remote_addr = (DAT_IA_ADDRESS_PTR)&remote;
 	DAT_EP_ATTR ep_attr;
+	DAT_IA_ATTR ia_attr;
 	DAT_RETURN status;
 	DAT_REGION_DESCRIPTION region;
 	DAT_EVENT event;
@@ -412,10 +430,26 @@ int connect_ep(char *hostname)
 	_OK(status, "dat_ia_open");
 
 	memset(&prov_attrs, 0, sizeof(prov_attrs));
-	status = dat_ia_query(ia, NULL, 0, NULL,
+	status = dat_ia_query(ia, NULL, 
+			      DAT_IA_FIELD_ALL, &ia_attr,
 			      DAT_PROVIDER_FIELD_ALL, &prov_attrs);
 	_OK(status, "dat_ia_query");
 
+	memcpy((void*)&local,
+		(void*)ia_attr.ia_address_ptr,
+		sizeof(DAT_SOCK_ADDR6));
+
+	if (local.ib.qp_type == IBV_QPT_UD) {
+		ucm = 1;
+		printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n",
+			getpid(), ntohl(local.ib.qpn),
+			ntohs(local.ib.lid));
+		printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n",
+			getpid(), ntohl(remote.ib.qpn),
+			ntohs(remote.ib.lid));
+	}
+
+
 	/* Print provider specific attributes */
 	for (i = 0; i < prov_attrs.num_provider_specific_attr; i++) {
 		LOGPRINTF(" Provider Specific Attribute[%d] %s=%s\n",
@@ -567,6 +601,9 @@ int connect_ep(char *hostname)
 	if (!server || (server && ud_test)) {
 		struct addrinfo *target;
 
+		if (ucm)
+			goto no_resolution;
+
 		if (getaddrinfo(hostname, NULL, NULL, &target) != 0) {
 			printf("Error getting remote address.\n");
 			exit(1);
@@ -579,10 +616,11 @@ int connect_ep(char *hostname)
 		       inet_ntoa(((struct sockaddr_in *)
 				  target->ai_addr)->sin_addr));
 
-		remote_addr = *((DAT_IA_ADDRESS_PTR) target->ai_addr);
-		freeaddrinfo(target);
 		strcpy((char *)buf[SND_RDMA_BUF_INDEX], "Client written data");
-
+		
+		remote_addr = (DAT_IA_ADDRESS_PTR)target->ai_addr; /* IP */
+no_resolution:
+		
 		/* one Client EP, multiple Server EPs, same conn_qual 
 		 * use private data to select EP on Server 
 		 */
@@ -596,13 +634,16 @@ int connect_ep(char *hostname)
 				pdata = 0;	/* just use first EP */
 
 			status = dat_ep_connect(ep[0],
-						&remote_addr,
+						remote_addr,
 						(server ? CLIENT_ID :
 						 SERVER_ID), CONN_TIMEOUT, 4,
 						(DAT_PVOID) & pdata, 0,
 						DAT_CONNECT_DEFAULT_FLAG);
 			_OK(status, "dat_ep_connect");
 		}
+
+		if (!ucm)
+			freeaddrinfo(target);
 	}
 
 	/* UD: process CR's starting with 2nd on server, 1st for client */
@@ -721,7 +762,19 @@ int disconnect_ep(void)
 	DAT_EVENT event;
 	DAT_COUNT nmore;
 	int i;
+	
+	if (counters) {		/* examples of query and print */
+		int ii;
+		DAT_UINT64 ia_cntrs[DCNT_IA_ALL_COUNTERS];
 
+		dat_query_counters(ia, DCNT_IA_ALL_COUNTERS, ia_cntrs, 0);
+		printf(" IA Cntrs:");
+		for (ii = 0; ii < DCNT_IA_ALL_COUNTERS; ii++)
+			printf(" " F64u "", ia_cntrs[ii]);
+		printf("\n");
+		dat_print_counters(ia, DCNT_IA_ALL_COUNTERS, 0);
+	}
+	
 	if (!ud_test) {
 		status = dat_ep_disconnect(ep[0], DAT_CLOSE_DEFAULT);
 		_OK2(status, "dat_ep_disconnect");
@@ -797,17 +850,6 @@ int disconnect_ep(void)
 	status = dat_pz_free(pz);
 	_OK2(status, "dat_pz_free");
 
-	if (counters) {		/* examples of query and print */
-		int ii;
-		DAT_UINT64 ia_cntrs[DCNT_IA_ALL_COUNTERS];
-
-		dat_query_counters(ia, DCNT_IA_ALL_COUNTERS, ia_cntrs, 0);
-		printf(" IA Cntrs:");
-		for (ii = 0; ii < DCNT_IA_ALL_COUNTERS; ii++)
-			printf(" " F64u "", ia_cntrs[ii]);
-		printf("\n");
-		dat_print_counters(ia, DCNT_IA_ALL_COUNTERS, 0);
-	}
 	status = dat_ia_close(ia, DAT_CLOSE_DEFAULT);
 	_OK2(status, "dat_ia_close");
 
@@ -1200,7 +1242,7 @@ int main(int argc, char **argv)
 	int rc;
 
 	/* parse arguments */
-	while ((rc = getopt(argc, argv, "csvumpU:h:b:P:")) != -1) {
+	while ((rc = getopt(argc, argv, "csvumpU:h:b:P:q:l:")) != -1) {
 		switch (rc) {
 		case 'u':
 			ud_test = 1;
@@ -1235,6 +1277,16 @@ int main(int argc, char **argv)
 		case 'v':
 			verbose = 1;
 			break;
+		case 'q':
+			remote.ib.qpn = htonl(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
+		case 'l':
+			remote.ib.lid = htons(strtol(optarg,NULL,0));
+			ucm = 1;
+			server = 0;
+			break;
 		default:
 			print_usage();
 			exit(-12);
-- 
1.5.2.5


From rdreier at cisco.com  Tue Aug 18 12:39:21 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Aug 2009 12:39:21 -0700
Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send
	indicate?
In-Reply-To: <4A89DD47.5030902@riorey.com> (Nitin Mehrotra's message of "Mon, 
	17 Aug 2009 18:44:23 -0400")
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
	<1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
	<9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>
	<4A89DD47.5030902@riorey.com>
Message-ID: <adapras29p2.fsf@cisco.com>


 > I am getting this error on a verb send operation and I can't figure
 > out what could be the cause; I searched for all instances of this
 > error in the IB code and while I found 4, none was illuminating.

IBV_WC_REM_OP_ERR corresponds to "Remote Operation Error," which the IB
spec describes as:

    The operation could not be completed successfully by the
    responder. Possible causes include a responder QP related error that
    prevented the responder from completing the request or a malformed
    WQE on the Receive Queue.

Usually means a memory protection problem on the remote end.

 - R.


From swise at opengridcomputing.com  Tue Aug 18 13:01:26 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 18 Aug 2009 15:01:26 -0500
Subject: [ofa-general] Re: [PATCH] krping: Add support for fast_reg_mr with
	dma_local_lkey
In-Reply-To: <20090818191203.GB20947@opengridcomputing.com>
References: <20090818191203.GB20947@opengridcomputing.com>
Message-ID: <4A8B0896.2050107@opengridcomputing.com>

Applied.

Thanks,


Steve.


From nmehrotra at riorey.com  Tue Aug 18 13:34:43 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Tue, 18 Aug 2009 16:34:43 -0400
Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send
	indicate?
In-Reply-To: <adapras29p2.fsf@cisco.com>
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>	<1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>	<9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>	<4A89DD47.5030902@riorey.com>
	<adapras29p2.fsf@cisco.com>
Message-ID: <4A8B1063.9020605@riorey.com>

Roland,

Thanks for your response. We suspected as much because of the way we 
allocate buffers, and we are reworking that code. Good to see your 
response, because it means the rework is probably not wasted effort.

Nitin

Roland Dreier wrote:
>  > I am getting this error on a verb send operation and I can't figure
>  > out what could be the cause; I searched for all instances of this
>  > error in the IB code and while I found 4, none was illuminating.
>
> IBV_WC_REM_OP_ERR corresponds to "Remote Operation Error," which the IB
> spec describes as:
>
>     The operation could not be completed successfully by the
>     responder. Possible causes include a responder QP related error that
>     prevented the responder from completing the request or a malformed
>     WQE on the Receive Queue.
>
> Usually means a memory protection problem on the remote end.
>
>  - R.


From vlad at lists.openfabrics.org  Wed Aug 19 03:04:08 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 19 Aug 2009 03:04:08 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090819-0200 daily build status
Message-ID: <20090819100409.3CEBCE2822E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From eli at mellanox.co.il  Wed Aug 19 07:38:43 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:38:43 +0300
Subject: [ofa-general] [PATCHv5 01/10] ib_core: Refine device personality
	from node type to port type
Message-ID: <20090819143843.GA8675@mtls03>

In preparation for devices that can support a different transport protocol on
each port, specifically RDMAoE, this patch defines a transport type for each of
a device's ports. As a result, rdma_node_get_transport() is no longer exported
and is used only internally by the implementation of the new API,
rdma_port_get_transport(), which returns the transport protocol of the queried
port. rdma_is_transport_supported() is also added to check whether a given
device supports a given protocol on any of its ports. All callers of
rdma_node_get_transport() are changed to use the new APIs. Also, ib_port_attr
is extended to contain an enum rdma_transport_type.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

Changes from previous version:
Define and make use of rdma_is_transport_supported(), an API that
allows the caller to check if a given device supports a given
transport protocol on any of its ports.
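
For illustration only, the per-port semantics described above can be modelled
in a few lines of user-space C. This is a sketch under stated assumptions, not
the code from this patch: the model_* types and functions are hypothetical
stand-ins for enum rdma_transport_type, rdma_port_get_transport() and
rdma_is_transport_supported(), and the device is assumed to simply store one
transport value per port.

```c
/*
 * User-space model only. Every name below is hypothetical and exists
 * just for this sketch; none of it is the kernel API.
 */
#include <stdbool.h>

#define MODEL_MAX_PORTS 4

enum model_transport {
	MODEL_TRANSPORT_IB,
	MODEL_TRANSPORT_IWARP,
	MODEL_TRANSPORT_RDMAOE
};

struct model_device {
	unsigned char phys_port_cnt;	/* ports are numbered from 1 */
	enum model_transport port_transport[MODEL_MAX_PORTS];
};

/* Transport of a single port; port_num is 1-based. */
static enum model_transport
model_port_get_transport(const struct model_device *dev,
			 unsigned char port_num)
{
	return dev->port_transport[port_num - 1];
}

/* True if at least one port of the device runs the given transport. */
static bool
model_is_transport_supported(const struct model_device *dev,
			     enum model_transport tt)
{
	unsigned char i;

	for (i = 1; i <= dev->phys_port_cnt; i++)
		if (model_port_get_transport(dev, i) == tt)
			return true;
	return false;
}

/* Number of ports a per-port client would set up for one transport. */
static int
model_count_ports(const struct model_device *dev, enum model_transport tt)
{
	unsigned char i;
	int n = 0;

	for (i = 1; i <= dev->phys_port_cnt; i++) {
		if (model_port_get_transport(dev, i) != tt)
			continue;	/* skip ports of other transports */
		n++;
	}
	return n;
}
```

With a two-port device whose port 1 is IB and port 2 is RDMAoE, the
device-level check is true for both transports, while a per-port loop such as
the one in cm_add_one() would set up only port 1.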


 drivers/infiniband/core/cm.c              |   25 +++++++++----
 drivers/infiniband/core/cma.c             |   54 +++++++++++++++--------------
 drivers/infiniband/core/mad.c             |   41 ++++++++++++++--------
 drivers/infiniband/core/multicast.c       |    4 +-
 drivers/infiniband/core/sa_query.c        |   39 +++++++++++++--------
 drivers/infiniband/core/ucm.c             |    8 +++-
 drivers/infiniband/core/ucma.c            |    2 +-
 drivers/infiniband/core/user_mad.c        |    6 +++-
 drivers/infiniband/core/verbs.c           |   25 ++++++++++++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   12 +++---
 include/rdma/ib_verbs.h                   |   11 ++++--
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |    3 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c  |    2 +-
 13 files changed, 148 insertions(+), 84 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5130fc5..d082f59 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3678,8 +3678,9 @@ static void cm_add_one(struct ib_device *ib_device)
 	unsigned long flags;
 	int ret;
 	u8 i;
+	enum rdma_transport_type tt;
 
-	if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
+	if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB))
 		return;
 
 	cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) *
@@ -3700,6 +3701,10 @@ static void cm_add_one(struct ib_device *ib_device)
 
 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
+		tt = rdma_port_get_transport(ib_device, i);
+		if (tt != RDMA_TRANSPORT_IB)
+			continue;
+
 		port = kzalloc(sizeof *port, GFP_KERNEL);
 		if (!port)
 			goto error1;
@@ -3742,9 +3747,11 @@ error1:
 	port_modify.clr_port_cap_mask = IB_PORT_CM_SUP;
 	while (--i) {
 		port = cm_dev->port[i-1];
-		ib_modify_port(ib_device, port->port_num, 0, &port_modify);
-		ib_unregister_mad_agent(port->mad_agent);
-		cm_remove_port_fs(port);
+		if (port) {
+			ib_modify_port(ib_device, port->port_num, 0, &port_modify);
+			ib_unregister_mad_agent(port->mad_agent);
+			cm_remove_port_fs(port);
+		}
 	}
 	device_unregister(cm_dev->device);
 	kfree(cm_dev);
@@ -3770,10 +3777,12 @@ static void cm_remove_one(struct ib_device *ib_device)
 
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
 		port = cm_dev->port[i-1];
-		ib_modify_port(ib_device, port->port_num, 0, &port_modify);
-		ib_unregister_mad_agent(port->mad_agent);
-		flush_workqueue(cm.wq);
-		cm_remove_port_fs(port);
+		if (port) {
+			ib_modify_port(ib_device, port->port_num, 0, &port_modify);
+			ib_unregister_mad_agent(port->mad_agent);
+			flush_workqueue(cm.wq);
+			cm_remove_port_fs(port);
+		}
 	}
 	device_unregister(cm_dev->device);
 	kfree(cm_dev);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 851de83..02fd045 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -329,24 +329,26 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 	struct cma_device *cma_dev;
 	union ib_gid gid;
 	int ret = -ENODEV;
-
-	switch (rdma_node_get_transport(dev_addr->dev_type)) {
-	case RDMA_TRANSPORT_IB:
-		ib_addr_get_sgid(dev_addr, &gid);
-		break;
-	case RDMA_TRANSPORT_IWARP:
-		iw_addr_get_sgid(dev_addr, &gid);
-		break;
-	default:
-		return -ENODEV;
-	}
+	int port;
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
-					 &id_priv->id.port_num, NULL);
-		if (!ret) {
-			cma_attach_to_dev(id_priv, cma_dev);
-			break;
+		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) {
+			switch (rdma_port_get_transport(cma_dev->device, port)) {
+			case RDMA_TRANSPORT_IB:
+				ib_addr_get_sgid(dev_addr, &gid);
+				break;
+			case RDMA_TRANSPORT_IWARP:
+				iw_addr_get_sgid(dev_addr, &gid);
+				break;
+			default:
+				return -ENODEV;
+			}
+			ret = ib_find_cached_gid(cma_dev->device, &gid,
+						 &id_priv->id.port_num, NULL);
+			if (!ret) {
+				cma_attach_to_dev(id_priv, cma_dev);
+				return ret;
+			}
 		}
 	}
 	return ret;
@@ -597,7 +599,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 	int ret = 0;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps))
 			ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
@@ -747,7 +749,7 @@ static inline int cma_user_data_offset(enum rdma_port_space ps)
 
 static void cma_cancel_route(struct rdma_id_private *id_priv)
 {
-	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (id_priv->query)
 			ib_sa_cancel_query(id_priv->query_id, id_priv->query);
@@ -843,7 +845,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 	mutex_lock(&lock);
 	if (id_priv->cma_dev) {
 		mutex_unlock(&lock);
-		switch (rdma_node_get_transport(id->device->node_type)) {
+		switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
@@ -1500,7 +1502,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
 
 	id_priv->backlog = backlog;
 	if (id->device) {
-		switch (rdma_node_get_transport(id->device->node_type)) {
+		switch (rdma_port_get_transport(id->device, id->port_num)) {
 		case RDMA_TRANSPORT_IB:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
@@ -1727,7 +1729,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 		return -EINVAL;
 
 	atomic_inc(&id_priv->refcount);
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_resolve_ib_route(id_priv, timeout_ms);
 		break;
@@ -2407,7 +2409,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 		id_priv->srq = conn_param->srq;
 	}
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_resolve_ib_udp(id_priv, conn_param);
@@ -2520,7 +2522,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 		id_priv->srq = conn_param->srq;
 	}
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
@@ -2581,7 +2583,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
 	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
@@ -2612,7 +2614,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_modify_qp_err(id_priv);
 		if (ret)
@@ -2764,7 +2766,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	list_add(&mc->list, &id_priv->mc_list);
 	spin_unlock(&id_priv->lock);
 
-	switch (rdma_node_get_transport(id->device->node_type)) {
+	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
 		break;
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..c06117c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2905,8 +2905,9 @@ static int ib_mad_port_close(struct ib_device *device, int port_num)
 static void ib_mad_init_device(struct ib_device *device)
 {
 	int start, end, i;
+	enum rdma_transport_type tt;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB))
 		return;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH) {
@@ -2918,6 +2919,10 @@ static void ib_mad_init_device(struct ib_device *device)
 	}
 
 	for (i = start; i <= end; i++) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt != RDMA_TRANSPORT_IB)
+			continue;
+
 		if (ib_mad_port_open(device, i)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d\n",
 			       device->name, i);
@@ -2941,13 +2946,15 @@ error:
 	i--;
 
 	while (i >= start) {
-		if (ib_agent_port_close(device, i))
-			printk(KERN_ERR PFX "Couldn't close %s port %d "
-			       "for agents\n",
-			       device->name, i);
-		if (ib_mad_port_close(device, i))
-			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
-			       device->name, i);
+		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) {
+			if (ib_agent_port_close(device, i))
+				printk(KERN_ERR PFX "Couldn't close %s port %d "
+				       "for agents\n",
+				       device->name, i);
+			if (ib_mad_port_close(device, i))
+				printk(KERN_ERR PFX "Couldn't close %s port %d\n",
+				       device->name, i);
+		}
 		i--;
 	}
 }
@@ -2955,6 +2962,7 @@ error:
 static void ib_mad_remove_device(struct ib_device *device)
 {
 	int i, num_ports, cur_port;
+	enum rdma_transport_type tt;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		num_ports = 1;
@@ -2964,13 +2972,16 @@ static void ib_mad_remove_device(struct ib_device *device)
 		cur_port = 1;
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		if (ib_agent_port_close(device, cur_port))
-			printk(KERN_ERR PFX "Couldn't close %s port %d "
-			       "for agents\n",
-			       device->name, cur_port);
-		if (ib_mad_port_close(device, cur_port))
-			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
-			       device->name, cur_port);
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB) {
+			if (ib_agent_port_close(device, cur_port))
+				printk(KERN_ERR PFX "Couldn't close %s port %d "
+				       "for agents\n",
+				       device->name, cur_port);
+			if (ib_mad_port_close(device, cur_port))
+				printk(KERN_ERR PFX "Couldn't close %s port %d\n",
+				       device->name, cur_port);
+		}
 	}
 }
 
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 107f170..e6c98e7 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -788,10 +788,10 @@ static void mcast_add_one(struct ib_device *device)
 	struct mcast_port *port;
 	int i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB))
 		return;
 
-	dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
+	dev = kzalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
 		      GFP_KERNEL);
 	if (!dev)
 		return;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 1865049..46899de 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -416,14 +416,16 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
 		struct ib_sa_port *port =
 			&sa_dev->port[event->element.port_num - sa_dev->start_port];
 
-		spin_lock_irqsave(&port->ah_lock, flags);
-		if (port->sm_ah)
-			kref_put(&port->sm_ah->ref, free_sm_ah);
-		port->sm_ah = NULL;
-		spin_unlock_irqrestore(&port->ah_lock, flags);
-
-		schedule_work(&sa_dev->port[event->element.port_num -
-					    sa_dev->start_port].update_task);
+		if (rdma_port_get_transport(handler->device, port->port_num) == RDMA_TRANSPORT_IB) {
+			spin_lock_irqsave(&port->ah_lock, flags);
+			if (port->sm_ah)
+				kref_put(&port->sm_ah->ref, free_sm_ah);
+			port->sm_ah = NULL;
+			spin_unlock_irqrestore(&port->ah_lock, flags);
+
+			schedule_work(&sa_dev->port[event->element.port_num -
+						    sa_dev->start_port].update_task);
+		}
 	}
 }
 
@@ -991,7 +993,7 @@ static void ib_sa_add_one(struct ib_device *device)
 	struct ib_sa_device *sa_dev;
 	int s, e, i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB))
 		return;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH)
@@ -1001,7 +1003,7 @@ static void ib_sa_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	sa_dev = kmalloc(sizeof *sa_dev +
+	sa_dev = kzalloc(sizeof *sa_dev +
 			 (e - s + 1) * sizeof (struct ib_sa_port),
 			 GFP_KERNEL);
 	if (!sa_dev)
@@ -1011,6 +1013,9 @@ static void ib_sa_add_one(struct ib_device *device)
 	sa_dev->end_port   = e;
 
 	for (i = 0; i <= e - s; ++i) {
+		if (rdma_port_get_transport(device, i + 1) != RDMA_TRANSPORT_IB)
+			continue;
+
 		sa_dev->port[i].sm_ah    = NULL;
 		sa_dev->port[i].port_num = i + s;
 		spin_lock_init(&sa_dev->port[i].ah_lock);
@@ -1039,13 +1044,15 @@ static void ib_sa_add_one(struct ib_device *device)
 		goto err;
 
 	for (i = 0; i <= e - s; ++i)
-		update_sm_ah(&sa_dev->port[i].update_task);
+		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB)
+			update_sm_ah(&sa_dev->port[i].update_task);
 
 	return;
 
 err:
 	while (--i >= 0)
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB)
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
 
 	kfree(sa_dev);
 
@@ -1065,9 +1072,11 @@ static void ib_sa_remove_one(struct ib_device *device)
 	flush_scheduled_work();
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
-		if (sa_dev->port[i].sm_ah)
-			kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) {
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
+			if (sa_dev->port[i].sm_ah)
+				kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
+		}
 	}
 
 	kfree(sa_dev);
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 51bd966..b508020 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1239,11 +1239,15 @@ static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL);
 static void ib_ucm_add_one(struct ib_device *device)
 {
 	struct ib_ucm_device *ucm_dev;
+	int i;
 
-	if (!device->alloc_ucontext ||
-	    rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!device->alloc_ucontext)
 		return;
 
+	for (i = 1; i <= device->phys_port_cnt; ++i)
+		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
+			return;
+
 	ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
 	if (!ucm_dev)
 		return;
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 4346a24..24d9510 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -614,7 +614,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 
 	resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
 	resp.port_num = ctx->cm_id->port_num;
-	switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
+	switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) {
 	case RDMA_TRANSPORT_IB:
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
 		break;
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 8c46f22..aa4eeb3 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1113,7 +1113,7 @@ static void ib_umad_add_one(struct ib_device *device)
 	struct ib_umad_device *umad_dev;
 	int s, e, i;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB))
 		return;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH)
@@ -1123,6 +1123,10 @@ static void ib_umad_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
+	for (i = s; i <= e; ++i)
+		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
+			return;
+
 	umad_dev = kzalloc(sizeof *umad_dev +
 			   (e - s + 1) * sizeof (struct ib_umad_port),
 			   GFP_KERNEL);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a7da9be..d81e217 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -77,7 +77,7 @@ enum ib_rate mult_to_ib_rate(int mult)
 }
 EXPORT_SYMBOL(mult_to_ib_rate);
 
-enum rdma_transport_type
+static enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type)
 {
 	switch (node_type) {
@@ -92,7 +92,28 @@ rdma_node_get_transport(enum rdma_node_type node_type)
 		return 0;
 	}
 }
-EXPORT_SYMBOL(rdma_node_get_transport);
+
+enum rdma_transport_type rdma_port_get_transport(struct ib_device *device,
+						 u8 port_num)
+{
+	return device->get_port_transport ?
+		device->get_port_transport(device, port_num) :
+		rdma_node_get_transport(device->node_type);
+}
+EXPORT_SYMBOL(rdma_port_get_transport);
+
+int rdma_is_transport_supported(struct ib_device *device,
+				enum rdma_transport_type transport)
+{
+	int i;
+
+	for (i = 1; i <= device->phys_port_cnt; ++i)
+		if (rdma_port_get_transport(device, i) == transport)
+			return 1;
+
+	return 0;
+}
+EXPORT_SYMBOL(rdma_is_transport_supported);
 
 /* Protection domains */
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index ab2c192..39df0f7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1337,9 +1337,6 @@ static void ipoib_add_one(struct ib_device *device)
 	struct ipoib_dev_priv *priv;
 	int s, e, p;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL);
 	if (!dev_list)
 		return;
@@ -1355,6 +1352,9 @@ static void ipoib_add_one(struct ib_device *device)
 	}
 
 	for (p = s; p <= e; ++p) {
+		if (rdma_port_get_transport(device, p) != RDMA_TRANSPORT_IB)
+			continue;
+
 		dev = ipoib_add_port("ib%d", device, p);
 		if (!IS_ERR(dev)) {
 			priv = netdev_priv(dev);
@@ -1370,12 +1370,12 @@ static void ipoib_remove_one(struct ib_device *device)
 	struct ipoib_dev_priv *priv, *tmp;
 	struct list_head *dev_list;
 
-	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
-		return;
-
 	dev_list = ib_get_client_data(device, &ipoib_client);
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
+		if (rdma_port_get_transport(device, priv->port) != RDMA_TRANSPORT_IB)
+			continue;
+
 		ib_unregister_event_handler(&priv->event_handler);
 
 		rtnl_lock();
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c179318..4cf42f3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -72,9 +72,6 @@ enum rdma_transport_type {
 	RDMA_TRANSPORT_IWARP
 };
 
-enum rdma_transport_type
-rdma_node_get_transport(enum rdma_node_type node_type) __attribute_const__;
-
 enum ib_device_cap_flags {
 	IB_DEVICE_RESIZE_MAX_WR		= 1,
 	IB_DEVICE_BAD_PKEY_CNTR		= (1<<1),
@@ -298,6 +295,7 @@ struct ib_port_attr {
 	u8			active_width;
 	u8			active_speed;
 	u8                      phys_state;
+	enum rdma_transport_type	transport;
 };
 
 enum ib_device_modify_flags {
@@ -1003,6 +1001,8 @@ struct ib_device {
 	int		           (*query_port)(struct ib_device *device,
 						 u8 port_num,
 						 struct ib_port_attr *port_attr);
+	enum rdma_transport_type   (*get_port_transport)(struct ib_device *device,
+							 u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
@@ -1213,6 +1213,11 @@ int ib_query_device(struct ib_device *device,
 int ib_query_port(struct ib_device *device,
 		  u8 port_num, struct ib_port_attr *port_attr);
 
+enum rdma_transport_type rdma_port_get_transport(struct ib_device *device,
+						 u8 port_num);
+int rdma_is_transport_supported(struct ib_device *device,
+				enum rdma_transport_type transport);
+
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 42a6f9f..769dc18 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -338,8 +338,7 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
 static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
 {
 	if ((RDMA_TRANSPORT_IWARP ==
-	     rdma_node_get_transport(xprt->sc_cm_id->
-				     device->node_type))
+	     rdma_port_get_transport(xprt->sc_cm_id->device, xprt->sc_cm_id->port_num))
 	    && sge_count > 1)
 		return 1;
 	else
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5151f9f..a5a4162 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -976,7 +976,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	/*
 	 * Determine if a DMA MR is required and if so, what privs are required
 	 */
-	switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
+	switch (rdma_port_get_transport(newxprt->sc_cm_id->device, newxprt->sc_cm_id->port_num)) {
 	case RDMA_TRANSPORT_IWARP:
 		newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
 		if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:38:59 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:38:59 +0300
Subject: [ofa-general] [PATCHv5 03/10] ib_core: RDMAoE support only QP1
Message-ID: <20090819143859.GB8675@mtls03>

Since RDMAoE uses Ethernet as its link layer, there is no need for QP0. QP1
is still needed since it handles communication between CM agents. This patch
creates only QP1 for RDMAoE ports.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
Changes from previous version:

1. Instead of returning NULL for unsupported ports (which is not
considered an error), now callers of ib_register_mad_agent() must
verify that the port/special QP is supported before calling this
function.
2. Make use of rdma_is_transport_supported() where appropriate.
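
To make the per-port decision concrete, here is a simplified user-space model
of what the MAD layer sets up for a port under this patch: QP1 always, QP0 only
on IB, and a completion queue sized for the QPs actually created. The sketch_*
names and the queue depths are placeholders chosen for the example and should
not be read as the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins; not the kernel definitions. */
enum sketch_transport {
	SKETCH_TRANSPORT_IB,
	SKETCH_TRANSPORT_IWARP,
	SKETCH_TRANSPORT_RDMAOE
};

/* Placeholder queue depths, assumed for illustration only. */
#define SKETCH_QP_SEND_SIZE 128
#define SKETCH_QP_RECV_SIZE 512

struct sketch_mad_plan {
	int create_qp0;  /* SMI QP, only meaningful on an IB link layer */
	int create_qp1;  /* GSI QP, carries CM traffic */
	int cq_size;     /* CQ entries shared by the QPs that get created */
};

static struct sketch_mad_plan sketch_plan_mad_port(enum sketch_transport tt)
{
	struct sketch_mad_plan plan;

	/* In this model the MAD layer is opened only on IB and RDMAoE ports. */
	assert(tt == SKETCH_TRANSPORT_IB || tt == SKETCH_TRANSPORT_RDMAOE);

	plan.create_qp0 = (tt == SKETCH_TRANSPORT_IB);
	plan.create_qp1 = 1;

	/* One send + one receive queue per QP; double it when QP0 exists too. */
	plan.cq_size = SKETCH_QP_SEND_SIZE + SKETCH_QP_RECV_SIZE;
	if (plan.create_qp0)
		plan.cq_size *= 2;

	return plan;
}
```

Code that later walks both QP slots then has to tolerate an empty QP0 slot,
which is what the added NULL checks in mad.c take care of.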


 drivers/infiniband/core/agent.c |   38 +++++++++++++++++++++++++-------------
 drivers/infiniband/core/mad.c   |   37 +++++++++++++++++++++++++++++--------
 2 files changed, 54 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index ae7c288..c130a4a 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -48,6 +48,8 @@
 struct ib_agent_port_private {
 	struct list_head port_list;
 	struct ib_mad_agent *agent[2];
+	struct ib_device    *device;
+	u8		     port_num;
 };
 
 static DEFINE_SPINLOCK(ib_agent_port_list_lock);
@@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int port_num)
 {
 	struct ib_agent_port_private *entry;
 
-	list_for_each_entry(entry, &ib_agent_port_list, port_list) {
-		if (entry->agent[0]->device == device &&
-		    entry->agent[0]->port_num == port_num)
+	list_for_each_entry(entry, &ib_agent_port_list, port_list)
+		if (entry->device == device && entry->port_num == port_num)
 			return entry;
-	}
+
 	return NULL;
 }
 
@@ -146,6 +147,7 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 	struct ib_agent_port_private *port_priv;
 	unsigned long flags;
 	int ret;
+	enum rdma_transport_type tt;
 
 	/* Create new device info */
 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
@@ -155,14 +157,17 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 		goto error1;
 	}
 
-	/* Obtain send only MAD agent for SMI QP */
-	port_priv->agent[0] = ib_register_mad_agent(device, port_num,
-						    IB_QPT_SMI, NULL, 0,
-						    &agent_send_handler,
-						    NULL, NULL);
-	if (IS_ERR(port_priv->agent[0])) {
-		ret = PTR_ERR(port_priv->agent[0]);
-		goto error2;
+	tt = rdma_port_get_transport(device, port_num);
+	if (tt == RDMA_TRANSPORT_IB) {
+		/* Obtain send only MAD agent for SMI QP */
+		port_priv->agent[0] = ib_register_mad_agent(device, port_num,
+							    IB_QPT_SMI, NULL, 0,
+							    &agent_send_handler,
+							    NULL, NULL);
+		if (IS_ERR(port_priv->agent[0])) {
+			ret = PTR_ERR(port_priv->agent[0]);
+			goto error2;
+		}
 	}
 
 	/* Obtain send only MAD agent for GSI QP */
@@ -175,6 +180,9 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 		goto error3;
 	}
 
+	port_priv->device = device;
+	port_priv->port_num = port_num;
+
 	spin_lock_irqsave(&ib_agent_port_list_lock, flags);
 	list_add_tail(&port_priv->port_list, &ib_agent_port_list);
 	spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
@@ -182,7 +190,8 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 	return 0;
 
 error3:
-	ib_unregister_mad_agent(port_priv->agent[0]);
+	if (tt == RDMA_TRANSPORT_IB)
+		ib_unregister_mad_agent(port_priv->agent[0]);
 error2:
 	kfree(port_priv);
 error1:
@@ -194,6 +203,9 @@ int ib_agent_port_close(struct ib_device *device, int port_num)
 	struct ib_agent_port_private *port_priv;
 	unsigned long flags;
 
+	if (rdma_port_get_transport(device, port_num) != RDMA_TRANSPORT_IB)
+		return 0;
+
 	spin_lock_irqsave(&ib_agent_port_list_lock, flags);
 	port_priv = __ib_get_agent_port(device, port_num);
 	if (port_priv == NULL) {
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index c06117c..aceae79 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2602,6 +2602,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info)
 	struct ib_mad_private *recv;
 	struct ib_mad_list_head *mad_list;
 
+	if (!qp_info->qp)
+		return;
+
 	while (!list_empty(&qp_info->recv_queue.list)) {
 
 		mad_list = list_entry(qp_info->recv_queue.list.next,
@@ -2643,6 +2646,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)
 
 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
 		qp = port_priv->qp_info[i].qp;
+		if (!qp)
+			continue;
+
 		/*
 		 * PKey index for QP1 is irrelevant but
 		 * one is needed for the Reset to Init transition
@@ -2684,6 +2690,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv)
 	}
 
 	for (i = 0; i < IB_MAD_QPS_CORE; i++) {
+		if (!port_priv->qp_info[i].qp)
+			continue;
+
 		ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL);
 		if (ret) {
 			printk(KERN_ERR PFX "Couldn't post receive WRs\n");
@@ -2762,6 +2771,9 @@ error:
 
 static void destroy_mad_qp(struct ib_mad_qp_info *qp_info)
 {
+	if (!qp_info->qp)
+		return;
+
 	ib_destroy_qp(qp_info->qp);
 	kfree(qp_info->snoop_table);
 }
@@ -2777,6 +2789,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	struct ib_mad_port_private *port_priv;
 	unsigned long flags;
 	char name[sizeof "ib_mad123"];
+	int has_smi;
 
 	/* Create new device info */
 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
@@ -2792,7 +2805,11 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
-	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
+	cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE;
+	has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB;
+	if (has_smi)
+		cq_size *= 2;
+
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
 				     NULL, port_priv, cq_size, 0);
@@ -2816,9 +2833,11 @@ static int ib_mad_port_open(struct ib_device *device,
 		goto error5;
 	}
 
-	ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
-	if (ret)
-		goto error6;
+	if (has_smi) {
+		ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
+		if (ret)
+			goto error6;
+	}
 	ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI);
 	if (ret)
 		goto error7;
@@ -2907,7 +2926,8 @@ static void ib_mad_init_device(struct ib_device *device)
 	int start, end, i;
 	enum rdma_transport_type tt;
 
-	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB))
+	if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB) &&
+	    !rdma_is_transport_supported(device, RDMA_TRANSPORT_RDMAOE))
 		return;
 
 	if (device->node_type == RDMA_NODE_IB_SWITCH) {
@@ -2920,7 +2940,7 @@ static void ib_mad_init_device(struct ib_device *device)
 
 	for (i = start; i <= end; i++) {
 		tt = rdma_port_get_transport(device, i);
-		if (tt != RDMA_TRANSPORT_IB)
+		if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE)
 			continue;
 
 		if (ib_mad_port_open(device, i)) {
@@ -2946,7 +2966,8 @@ error:
 	i--;
 
 	while (i >= start) {
-		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) {
 			if (ib_agent_port_close(device, i))
 				printk(KERN_ERR PFX "Couldn't close %s port %d "
 				       "for agents\n",
@@ -2973,7 +2994,7 @@ static void ib_mad_remove_device(struct ib_device *device)
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
 		tt = rdma_port_get_transport(device, i);
-		if (tt == RDMA_TRANSPORT_IB) {
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) {
 			if (ib_agent_port_close(device, cur_port))
 				printk(KERN_ERR PFX "Couldn't close %s port %d "
 				       "for agents\n",
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:39:12 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:12 +0300
Subject: [ofa-general] [PATCHv5 04/10] IB/umad: Enable support only for IB
	ports
Message-ID: <20090819143912.GC8675@mtls03>

Initialize the umad context for any device that has at least one IB port. Since
a device may have ports of two different protocols (for example,
RDMA_TRANSPORT_IB and RDMA_TRANSPORT_RDMAOE), ib_umad_add_one() needs to
succeed if any of the ports is IB, but ib_umad_init_port() is called only for
IB ports.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

Changes from last version:
1. Patch title changed from "Enable support for RDMAoE ports" to
"Enable support only for IB ports".
2. Do not allow userspace MADs to RDMAoE ports.
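
The init/unwind pattern this results in can be sketched in user space as
follows: only IB ports get per-port state, and on failure only the IB ports
that were already set up are torn down. All demo_* names, including the
fail_port test aid, are invented for this example and do not exist in the
kernel.

```c
#include <assert.h>

/* Hypothetical stand-ins; not the kernel definitions. */
enum demo_transport { DEMO_TRANSPORT_IB, DEMO_TRANSPORT_RDMAOE };

#define DEMO_MAX_PORTS 8

struct demo_umad_dev {
	int start_port, end_port;
	enum demo_transport transport[DEMO_MAX_PORTS]; /* index: port - start_port */
	int initialized[DEMO_MAX_PORTS];
	int fail_port;  /* port whose init is forced to fail, -1 for none */
};

static int demo_init_port(struct demo_umad_dev *d, int port)
{
	if (port == d->fail_port)
		return -1;
	d->initialized[port - d->start_port] = 1;
	return 0;
}

static void demo_kill_port(struct demo_umad_dev *d, int port)
{
	d->initialized[port - d->start_port] = 0;
}

/* Returns 0 on success, -1 after undoing any partially completed setup. */
static int demo_umad_add_one(struct demo_umad_dev *d)
{
	int i;

	assert(d->end_port - d->start_port < DEMO_MAX_PORTS);

	for (i = d->start_port; i <= d->end_port; ++i) {
		if (d->transport[i - d->start_port] != DEMO_TRANSPORT_IB)
			continue;	/* non-IB ports get no umad state */
		if (demo_init_port(d, i))
			goto err;
	}
	return 0;

err:
	/* Walk back over the ports before the failing one, IB ports only. */
	while (--i >= d->start_port)
		if (d->transport[i - d->start_port] == DEMO_TRANSPORT_IB)
			demo_kill_port(d, i);
	return -1;
}
```

Note that the teardown side has to apply the same transport test as the setup
side, otherwise it would touch ports that were never initialized.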


 drivers/infiniband/core/user_mad.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index aa4eeb3..51888eb 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1123,10 +1123,6 @@ static void ib_umad_add_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	for (i = s; i <= e; ++i)
-		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
-			return;
-
 	umad_dev = kzalloc(sizeof *umad_dev +
 			   (e - s + 1) * sizeof (struct ib_umad_port),
 			   GFP_KERNEL);
@@ -1141,8 +1137,9 @@ static void ib_umad_add_one(struct ib_device *device)
 	for (i = s; i <= e; ++i) {
 		umad_dev->port[i - s].umad_dev = umad_dev;
 
-		if (ib_umad_init_port(device, i, &umad_dev->port[i - s]))
-			goto err;
+		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB)
+			if (ib_umad_init_port(device, i, &umad_dev->port[i - s]))
+				goto err;
 	}
 
 	ib_set_client_data(device, &umad_client, umad_dev);
@@ -1151,7 +1148,8 @@ static void ib_umad_add_one(struct ib_device *device)
 
 err:
 	while (--i >= s)
-		ib_umad_kill_port(&umad_dev->port[i - s]);
+		if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB)
+			ib_umad_kill_port(&umad_dev->port[i - s]);
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
@@ -1165,7 +1163,8 @@ static void ib_umad_remove_one(struct ib_device *device)
 		return;
 
 	for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i)
-		ib_umad_kill_port(&umad_dev->port[i]);
+		if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB)
+			ib_umad_kill_port(&umad_dev->port[i]);
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:39:28 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:28 +0300
Subject: [ofa-general] [PATCHv5 05/10] ib/cm: Enable CM support for RDMAoE
Message-ID: <20090819143928.GD8675@mtls03>

CM messages can be carried over RDMAoE ports, so CM support is enabled for
them here.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

Changes from last version:
Make use of rdma_is_transport_supported()


 drivers/infiniband/core/cm.c  |    5 +++--
 drivers/infiniband/core/ucm.c |   12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index d082f59..c9f9122 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3680,7 +3680,8 @@ static void cm_add_one(struct ib_device *ib_device)
 	u8 i;
 	enum rdma_transport_type tt;
 
-	if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB))
+	if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB) &&
+	    !rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_RDMAOE))
 		return;
 
 	cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) *
@@ -3702,7 +3703,7 @@ static void cm_add_one(struct ib_device *ib_device)
 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
 	for (i = 1; i <= ib_device->phys_port_cnt; i++) {
 		tt = rdma_port_get_transport(ib_device, i);
-		if (tt != RDMA_TRANSPORT_IB)
+		if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE)
 			continue;
 
 		port = kzalloc(sizeof *port, GFP_KERNEL);
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index b508020..3ce5df2 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device)
 {
 	struct ib_ucm_device *ucm_dev;
 	int i;
+	enum rdma_transport_type tt;
 
 	if (!device->alloc_ucontext)
 		return;
 
-	for (i = 1; i <= device->phys_port_cnt; ++i)
-		if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB)
-			return;
+	for (i = 1; i <= device->phys_port_cnt; ++i) {
+		tt = rdma_port_get_transport(device, i);
+		if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE)
+			break;
+	}
+
+	if (i > device->phys_port_cnt)
+		return;
 
 	ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
 	if (!ucm_dev)
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:39:37 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:37 +0300
Subject: [ofa-general] [PATCHv5 06/10] ib_core: CMA device binding
Message-ID: <20090819143937.GE8675@mtls03>

Add support for RDMAoE device binding and IP --> GID resolution. Path resolution
and multicast joining are implemented within cma.c by filling in the responses and
pushing the callbacks to the cma work queue. IP->GID resolution always yields
IPv6 link local addresses - remote GIDs are derived from the destination MAC
address of the remote port. Multicast GIDs are always mapped to multicast MACs
as is done in IPv6; addition/removal of addresses is done by calling
dev_mc_add/delete, thus causing the netdevice driver to update the corresponding
port's configuration. IPv4 multicast is not currently supported. Some helper
functions are added to ib_addr.h.
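
The two address mappings described above can be tried outside the kernel with a small sketch. It follows the byte layout used by rdmaoe_mac_to_ll() and rdma_get_mcast_mac() in this patch, but the function names mac_to_ll_gid() and mcast_gid_to_mac() and the plain byte-array types are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an fe80::/64 link-local GID from a 48-bit MAC: split the MAC
 * around ff:fe and flip the universal/local bit of the first byte.
 */
static void mac_to_ll_gid(uint8_t gid[16], const uint8_t mac[6])
{
	assert(gid && mac);
	memset(gid, 0, 16);
	gid[0] = 0xfe;
	gid[1] = 0x80;
	memcpy(gid + 8, mac, 3);	/* upper three MAC bytes */
	gid[8] ^= 2;			/* toggle universal/local bit */
	gid[11] = 0xff;
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);	/* lower three MAC bytes */
}

/*
 * Map a multicast GID to an Ethernet multicast MAC the way IPv6 does:
 * 33:33 followed by the low 32 bits of the address.
 */
static void mcast_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	assert(gid && mac);
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(mac + 2, gid + 12, 4);
}
```

For instance, the MAC 00:02:c9:01:02:03 becomes the GID fe80::202:c9ff:fe01:203 under this layout.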

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

Changes from last version:
1. Add kref to struct cma_multicast to aid in maintaining reference
count on the object. This is to avoid freeing the object while the
worker thread is still using it.
2. Return an immediate error if we get an invalid mtu in a resolved
path
3. Don't fail resolve path if rate is 0 since this value stands for
IB_RATE_PORT_CURRENT.
4. In cma_rdmaoe_join_multicast(), fail immediately if mtu is zero.
5. Add ucma_copy_rdmaoe_route() to copy route to userspace instead of
modifying ucma_copy_ib_route().
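
Items 2 and 4 depend on how an Ethernet MTU is turned into an IB MTU, so here is a rough standalone sketch of the arithmetic in rdmaoe_get_mtu(). It returns the payload size in bytes rather than an enum ib_mtu value, and the function name rdmaoe_payload_mtu() and the constant names are made up for the example; 40 and 12 are the GRH and BTH sizes, and 28 is the atomic header allowance mentioned in the patch comment.

```c
#include <assert.h>

enum { GRH_BYTES = 40, BTH_BYTES = 12, ATOMIC_ETH_BYTES = 28 };

/*
 * Return the largest IB payload size (in bytes) that still fits in an
 * Ethernet frame of the given MTU once the IB headers are subtracted,
 * or 0 if not even 256 bytes fit. A 0 result is what makes the route
 * resolution and multicast join paths fail with -EINVAL.
 */
static int rdmaoe_payload_mtu(int eth_mtu)
{
	static const int ib_sizes[] = { 4096, 2048, 1024, 512, 256 };
	int room;
	unsigned int i;

	assert(eth_mtu >= 0);
	room = eth_mtu - GRH_BYTES - BTH_BYTES - ATOMIC_ETH_BYTES;
	for (i = 0; i < sizeof ib_sizes / sizeof ib_sizes[0]; ++i)
		if (room >= ib_sizes[i])
			return ib_sizes[i];
	return 0;
}
```

With the common 1500 byte Ethernet MTU, 1500 - 40 - 12 - 28 = 1420, which rounds down to an IB MTU of 1024.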


 drivers/infiniband/core/cma.c  |  207 ++++++++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/ucma.c |   31 ++++++
 include/rdma/ib_addr.h         |   92 ++++++++++++++++++
 3 files changed, 324 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 02fd045..6e56e27 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 #define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
+#define RDMAOE_PACKET_LIFETIME 18
 
 static void cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device);
@@ -157,6 +158,7 @@ struct cma_multicast {
 	struct list_head	list;
 	void			*context;
 	struct sockaddr_storage	addr;
+	struct kref		mcref;
 };
 
 struct cma_work {
@@ -173,6 +175,12 @@ struct cma_ndev_work {
 	struct rdma_cm_event	event;
 };
 
+struct rdmaoe_mcast_work {
+	struct work_struct	 work;
+	struct rdma_id_private	*id;
+	struct cma_multicast	*mc;
+};
+
 union cma_ip_addr {
 	struct in6_addr ip6;
 	struct {
@@ -290,6 +298,20 @@ static inline void cma_deref_dev(struct cma_device *cma_dev)
 		complete(&cma_dev->comp);
 }
 
+static inline void release_mc(struct kref *kref)
+{
+	struct cma_multicast *mc = container_of(kref, struct cma_multicast, mcref);
+	struct rdma_dev_addr *dev_addr = &mc->id_priv->id.route.addr.dev_addr;
+	u8 mac[6];
+
+	rdma_get_mcast_mac((struct in6_addr *)(&mc->multicast.ib->rec.mgid), mac);
+	rtnl_lock();
+	dev_mc_delete(dev_addr->src_dev, mac, 6, 0);
+	rtnl_unlock();
+	kfree(mc->multicast.ib);
+	kfree(mc);
+}
+
 static void cma_detach_from_dev(struct rdma_id_private *id_priv)
 {
 	list_del(&id_priv->list);
@@ -340,6 +362,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 			case RDMA_TRANSPORT_IWARP:
 				iw_addr_get_sgid(dev_addr, &gid);
 				break;
+			case RDMA_TRANSPORT_RDMAOE:
+				rdmaoe_addr_get_sgid(dev_addr, &gid);
+				break;
 			default:
 				return -ENODEV;
 			}
@@ -568,10 +593,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
 {
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
+	u16 pkey;
+
+	if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) ==
+	    RDMA_TRANSPORT_IB)
+		pkey = ib_addr_get_pkey(dev_addr);
+	else
+		pkey = 0xffff;
 
 	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
-				  ib_addr_get_pkey(dev_addr),
-				  &qp_attr->pkey_index);
+				  pkey, &qp_attr->pkey_index);
 	if (ret)
 		return ret;
 
@@ -601,6 +632,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 	id_priv = container_of(id, struct rdma_id_private, id);
 	switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps))
 			ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
 		else
@@ -828,8 +860,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 		mc = container_of(id_priv->mc_list.next,
 				  struct cma_multicast, list);
 		list_del(&mc->list);
-		ib_sa_free_multicast(mc->multicast.ib);
-		kfree(mc);
+		switch (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num)) {
+		case RDMA_TRANSPORT_IB:
+			ib_sa_free_multicast(mc->multicast.ib);
+			kfree(mc);
+			break;
+		case RDMA_TRANSPORT_RDMAOE:
+			kref_put(&mc->mcref, release_mc);
+			break;
+		default:
+			break;
+		}
 	}
 }
 
@@ -847,6 +888,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 		mutex_unlock(&lock);
 		switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) {
 		case RDMA_TRANSPORT_IB:
+		case RDMA_TRANSPORT_RDMAOE:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
@@ -1504,6 +1546,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
 	if (id->device) {
 		switch (rdma_port_get_transport(id->device, id->port_num)) {
 		case RDMA_TRANSPORT_IB:
+		case RDMA_TRANSPORT_RDMAOE:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
 				goto err;
@@ -1719,6 +1762,66 @@ static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
 	return 0;
 }
 
+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv)
+{
+	struct rdma_route *route = &id_priv->id.route;
+	struct rdma_addr *addr = &route->addr;
+	struct cma_work *work;
+	int ret;
+	struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr;
+	struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr;
+
+	if (src_addr->sin_family != dst_addr->sin_family)
+		return -EINVAL;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->id = id_priv;
+	INIT_WORK(&work->work, cma_work_handler);
+
+	route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL);
+	if (!route->path_rec) {
+		ret = -ENOMEM;
+		goto err1;
+	}
+
+	route->num_paths = 1;
+
+	rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr);
+	rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr);
+
+	route->path_rec->hop_limit = 2;
+	route->path_rec->reversible = 1;
+	route->path_rec->pkey = cpu_to_be16(0xffff);
+	route->path_rec->mtu_selector = 2;
+	route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu);
+	route->path_rec->rate_selector = 2;
+	route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev);
+	route->path_rec->packet_life_time_selector = 2;
+	route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME;
+	if (!route->path_rec->mtu) {
+		ret = -EINVAL;
+		goto err2;
+	}
+
+	work->old_state = CMA_ROUTE_QUERY;
+	work->new_state = CMA_ROUTE_RESOLVED;
+	work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
+	work->event.status = 0;
+
+	queue_work(cma_wq, &work->work);
+
+	return 0;
+
+err2:
+	kfree(route->path_rec);
+err1:
+	kfree(work);
+	return ret;
+}
+
 int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 {
 	struct rdma_id_private *id_priv;
@@ -1736,6 +1839,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 	case RDMA_TRANSPORT_IWARP:
 		ret = cma_resolve_iw_route(id_priv, timeout_ms);
 		break;
+	case RDMA_TRANSPORT_RDMAOE:
+		ret = cma_resolve_rdmaoe_route(id_priv);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -2411,6 +2517,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_resolve_ib_udp(id_priv, conn_param);
 		else
@@ -2524,6 +2631,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
 						conn_param->private_data,
@@ -2585,6 +2693,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		if (cma_is_ud_ps(id->ps))
 			ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT,
 						private_data, private_data_len);
@@ -2616,6 +2725,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 
 	switch (rdma_port_get_transport(id->device, id->port_num)) {
 	case RDMA_TRANSPORT_IB:
+	case RDMA_TRANSPORT_RDMAOE:
 		ret = cma_modify_qp_err(id_priv);
 		if (ret)
 			goto out;
@@ -2742,6 +2852,77 @@ static int cma_join_ib_multicast(struct rdma_id_private *id_priv,
 	return 0;
 }
 
+
+static void rdmaoe_mcast_work_handler(struct work_struct *work)
+{
+	struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work);
+	struct cma_multicast *mc = mw->mc;
+	struct ib_sa_multicast *m = mc->multicast.ib;
+	struct rdma_dev_addr *dev_addr = &mw->id->id.route.addr.dev_addr;
+	u8 mac[6];
+
+	mc->multicast.ib->context = mc;
+	rdma_get_mcast_mac((struct in6_addr *)(&mc->multicast.ib->rec.mgid), mac);
+	rtnl_lock();
+	dev_mc_add(dev_addr->src_dev, mac, 6, 0);
+	rtnl_unlock();
+	cma_ib_mc_handler(0, m);
+	kref_put(&mc->mcref, release_mc);
+	kfree(mw);
+}
+
+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv,
+				     struct cma_multicast *mc)
+{
+	struct rdmaoe_mcast_work *work;
+	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
+	int err;
+	struct sockaddr *addr = (struct sockaddr *)&mc->addr;
+
+	if (cma_zero_addr((struct sockaddr *)&mc->addr))
+		return -EINVAL;
+
+	/* IPv4 multicast is not supported currently */
+	if (addr->sa_family == AF_INET)
+		return -EINVAL;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL);
+	if (!mc->multicast.ib) {
+		err = -ENOMEM;
+		goto out1;
+	}
+
+	cma_set_mgid(id_priv, addr, &mc->multicast.ib->rec.mgid);
+	mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff);
+	if (id_priv->id.ps == RDMA_PS_UDP)
+		mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY);
+	mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev);
+	mc->multicast.ib->rec.hop_limit = 1;
+	mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu);
+	if (!mc->multicast.ib->rec.mtu) {
+		err = -EINVAL;
+		goto out2;
+	}
+	rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
+	work->id = id_priv;
+	work->mc = mc;
+	INIT_WORK(&work->work, rdmaoe_mcast_work_handler);
+	kref_get(&mc->mcref);
+	queue_work(cma_wq, &work->work);
+
+	return 0;
+
+out2:
+	kfree(mc->multicast.ib);
+out1:
+	kfree(work);
+	return err;
+}
+
 int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 			void *context)
 {
@@ -2770,6 +2951,10 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	case RDMA_TRANSPORT_IB:
 		ret = cma_join_ib_multicast(id_priv, mc);
 		break;
+	case RDMA_TRANSPORT_RDMAOE:
+		kref_init(&mc->mcref);
+		ret = cma_rdmaoe_join_multicast(id_priv, mc);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -2781,6 +2966,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 		spin_unlock_irq(&id_priv->lock);
 		kfree(mc);
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(rdma_join_multicast);
@@ -2801,8 +2987,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 				ib_detach_mcast(id->qp,
 						&mc->multicast.ib->rec.mgid,
 						mc->multicast.ib->rec.mlid);
-			ib_sa_free_multicast(mc->multicast.ib);
-			kfree(mc);
+			switch (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num)) {
+			case RDMA_TRANSPORT_IB:
+				ib_sa_free_multicast(mc->multicast.ib);
+				kfree(mc);
+				break;
+			case RDMA_TRANSPORT_RDMAOE:
+				kref_put(&mc->mcref, release_mc);
+				break;
+			default:
+				break;
+			}
 			return;
 		}
 	}
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 24d9510..5eb1198 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -580,6 +580,34 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
+static void ucma_copy_rdmaoe_route(struct rdma_ucm_query_route_resp *resp,
+				   struct rdma_route *route)
+{
+	struct rdma_dev_addr *dev_addr;
+
+	resp->num_paths = route->num_paths;
+	switch (route->num_paths) {
+	case 0:
+		dev_addr = &route->addr.dev_addr;
+		rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid,
+				 dev_addr->dst_dev_addr);
+		rdmaoe_addr_get_sgid(dev_addr,
+				 (union ib_gid *) &resp->ib_route[0].sgid);
+		resp->ib_route[0].pkey = cpu_to_be16(0xffff);
+		break;
+	case 2:
+		ib_copy_path_rec_to_user(&resp->ib_route[1],
+					 &route->path_rec[1]);
+		/* fall through */
+	case 1:
+		ib_copy_path_rec_to_user(&resp->ib_route[0],
+					 &route->path_rec[0]);
+		break;
+	default:
+		break;
+	}
+}
+
 static ssize_t ucma_query_route(struct ucma_file *file,
 				const char __user *inbuf,
 				int in_len, int out_len)
@@ -618,6 +646,9 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 	case RDMA_TRANSPORT_IB:
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
 		break;
+	case RDMA_TRANSPORT_RDMAOE:
+		ucma_copy_rdmaoe_route(&resp, &ctx->cm_id->route);
+		break;
 	default:
 		break;
 	}
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 483057b..ab06fe9 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -39,6 +39,8 @@
 #include <linux/netdevice.h>
 #include <linux/socket.h>
 #include <rdma/ib_verbs.h>
+#include <linux/ethtool.h>
+#include <rdma/ib_pack.h>
 
 struct rdma_addr_client {
 	atomic_t refcount;
@@ -157,4 +159,94 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr,
 	memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid);
 }
 
+static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac)
+{
+	memset(gid->raw, 0, 16);
+	*((u32 *)gid->raw) = cpu_to_be32(0xfe800000);
+	gid->raw[12] = 0xfe;
+	gid->raw[11] = 0xff;
+	memcpy(gid->raw + 13, mac + 3, 3);
+	memcpy(gid->raw + 8, mac, 3);
+	gid->raw[8] ^= 2;
+}
+
+static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr,
+					union ib_gid *gid)
+{
+	rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr);
+}
+
+static inline enum ib_mtu rdmaoe_get_mtu(int mtu)
+{
+	/*
+	 * reduce IB headers from effective RDMAoE MTU. 28 stands for
+	 * atomic header which is the biggest possible header after BTH
+	 */
+	mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28;
+
+	if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096))
+		return IB_MTU_4096;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048))
+		return IB_MTU_2048;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024))
+		return IB_MTU_1024;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512))
+		return IB_MTU_512;
+	else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256))
+		return IB_MTU_256;
+	else
+		return 0;
+}
+
+static inline int rdmaoe_get_rate(struct net_device *dev)
+{
+	struct ethtool_cmd cmd;
+
+	if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings ||
+	    dev->ethtool_ops->get_settings(dev, &cmd))
+		return IB_RATE_PORT_CURRENT;
+
+	if (cmd.speed >= 40000)
+		return IB_RATE_40_GBPS;
+	else if (cmd.speed >= 30000)
+		return IB_RATE_30_GBPS;
+	else if (cmd.speed >= 20000)
+		return IB_RATE_20_GBPS;
+	else if (cmd.speed >= 10000)
+		return IB_RATE_10_GBPS;
+	else
+		return IB_RATE_PORT_CURRENT;
+}
+
+static inline int rdma_link_local_addr(struct in6_addr *addr)
+{
+	if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) &&
+	    addr->s6_addr32[1] == 0)
+		return 1;
+
+	return 0;
+}
+
+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
+{
+	memcpy(mac, &addr->s6_addr[8], 3);
+	memcpy(mac + 3, &addr->s6_addr[13], 3);
+	mac[0] ^= 2;
+}
+
+static inline int rdma_is_multicast_addr(struct in6_addr *addr)
+{
+	return addr->s6_addr[0] == 0xff ? 1 : 0;
+}
+
+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
+{
+	int i;
+
+	mac[0] = 0x33;
+	mac[1] = 0x33;
+	for (i = 2; i < 6; ++i)
+		mac[i] = addr->s6_addr[i + 10];
+}
+
 #endif /* IB_ADDR_H */
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:39:45 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:45 +0300
Subject: [ofa-general] [PATCHv5 09/10] mlx4: Add support for RDMAoE - address
	resolution
Message-ID: <20090819143945.GF8675@mtls03>

The following patch handles address vector creation for RDMAoE ports. mlx4
needs the MAC address of the remote node to include it in the WQE of a UD QP or
in the QP context of connected QPs. Address resolution is done atomically in
the case of a link local address or a multicast GID and otherwise -EINVAL is
returned.  mlx4 transport packets were changed too to accommodate RDMAoE.
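
The resolution rule described above (link local GID or multicast GID, otherwise -EINVAL) can be sketched in userspace as follows. This is not the driver code: the name resolve_dgid_to_mac() and the raw byte-array arguments are assumptions for illustration, and the netdev lookup done by mlx4_ib_resolve_grh() is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Derive the destination MAC from a destination GID without sleeping.
 * Link-local GIDs (fe80::/64) carry the MAC in modified EUI-64 form;
 * multicast GIDs (ff..) map to 33:33 plus the low 32 bits. Anything
 * else cannot be resolved here and yields -EINVAL.
 */
static int resolve_dgid_to_mac(const uint8_t dgid[16], uint8_t mac[6],
			       int *is_mcast)
{
	static const uint8_t ll_prefix[8] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0 };

	assert(dgid && mac && is_mcast);
	*is_mcast = 0;

	if (memcmp(dgid, ll_prefix, sizeof ll_prefix) == 0) {
		memcpy(mac, dgid + 8, 3);
		memcpy(mac + 3, dgid + 13, 3);
		mac[0] ^= 2;		/* undo the universal/local flip */
		return 0;
	}

	if (dgid[0] == 0xff) {
		mac[0] = 0x33;
		mac[1] = 0x33;
		memcpy(mac + 2, dgid + 12, 4);
		*is_mcast = 1;
		return 0;
	}

	return -EINVAL;
}
```

Because both branches are pure byte manipulation, the lookup needs no neighbour table query, which is why it can run in atomic context.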

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
Changes from previous version:
Call ib_register_mad_agent() for RDMA_TRANSPORT_IB type ports.


 drivers/infiniband/hw/mlx4/ah.c      |  187 ++++++++++++++++++++++++++++------
 drivers/infiniband/hw/mlx4/mad.c     |   32 ++++--
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   19 +++-
 drivers/infiniband/hw/mlx4/qp.c      |  172 +++++++++++++++++++++----------
 drivers/net/mlx4/fw.c                |    3 +-
 include/linux/mlx4/device.h          |   31 ++++++-
 include/linux/mlx4/qp.h              |    8 +-
 7 files changed, 347 insertions(+), 105 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index c75ac94..0a015c3 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -31,63 +31,166 @@
  */
 
 #include "mlx4_ib.h"
+#include <rdma/ib_addr.h>
+#include <linux/inet.h>
+#include <linux/string.h>
 
-struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
+			u8 *mac, int *is_mcast)
 {
-	struct mlx4_dev *dev = to_mdev(pd->device)->dev;
-	struct mlx4_ib_ah *ah;
+	struct mlx4_ib_rdmaoe *rdmaoe = &dev->rdmaoe;
+	struct sockaddr_in6 s6 = {0};
+	struct net_device *netdev;
+	int ifidx;
 
-	ah = kmalloc(sizeof *ah, GFP_ATOMIC);
-	if (!ah)
-		return ERR_PTR(-ENOMEM);
+	*is_mcast = 0;
+	spin_lock(&rdmaoe->lock);
+	netdev = rdmaoe->netdevs[ah_attr->port_num - 1];
+	if (!netdev) {
+		spin_unlock(&rdmaoe->lock);
+		return -EINVAL;
+	}
+	ifidx = netdev->ifindex;
+	spin_unlock(&rdmaoe->lock);
 
-	memset(&ah->av, 0, sizeof ah->av);
+	memcpy(s6.sin6_addr.s6_addr, ah_attr->grh.dgid.raw, sizeof ah_attr->grh.dgid);
+	s6.sin6_family = AF_INET6;
+	s6.sin6_scope_id = ifidx;
+	if (rdma_link_local_addr(&s6.sin6_addr))
+		rdma_get_ll_mac(&s6.sin6_addr, mac);
+	else if (rdma_is_multicast_addr(&s6.sin6_addr)) {
+		rdma_get_mcast_mac(&s6.sin6_addr, mac);
+		*is_mcast = 1;
+	} else
+		return -EINVAL;
 
-	ah->av.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
-	ah->av.g_slid  = ah_attr->src_path_bits;
-	ah->av.dlid    = cpu_to_be16(ah_attr->dlid);
-	if (ah_attr->static_rate) {
-		ah->av.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
-		while (ah->av.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
-		       !(1 << ah->av.stat_rate & dev->caps.stat_rate_support))
-			--ah->av.stat_rate;
-	}
-	ah->av.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
+	return 0;
+}
+
+static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
+				  struct mlx4_ib_ah *ah)
+{
+	struct mlx4_dev *dev = to_mdev(pd->device)->dev;
+
+	ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
+	ah->av.ib.g_slid  = ah_attr->src_path_bits;
 	if (ah_attr->ah_flags & IB_AH_GRH) {
-		ah->av.g_slid   |= 0x80;
-		ah->av.gid_index = ah_attr->grh.sgid_index;
-		ah->av.hop_limit = ah_attr->grh.hop_limit;
-		ah->av.sl_tclass_flowlabel |=
+		ah->av.ib.g_slid   |= 0x80;
+		ah->av.ib.gid_index = ah_attr->grh.sgid_index;
+		ah->av.ib.hop_limit = ah_attr->grh.hop_limit;
+		ah->av.ib.sl_tclass_flowlabel |=
 			cpu_to_be32((ah_attr->grh.traffic_class << 20) |
 				    ah_attr->grh.flow_label);
-		memcpy(ah->av.dgid, ah_attr->grh.dgid.raw, 16);
+		memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16);
+	}
+
+	ah->av.ib.dlid    = cpu_to_be16(ah_attr->dlid);
+	if (ah_attr->static_rate) {
+		ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
+		while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
+		       !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support))
+			--ah->av.ib.stat_rate;
 	}
+	ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
 
 	return &ah->ibah;
 }
 
+static struct ib_ah *create_rdmaoe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
+				   struct mlx4_ib_ah *ah)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(pd->device);
+	struct mlx4_dev *dev = ibdev->dev;
+	u8 mac[6];
+	int err;
+	int is_mcast;
+
+	err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast);
+	if (err)
+		return ERR_PTR(err);
+
+	memcpy(ah->av.eth.mac_0_1, mac, 2);
+	memcpy(ah->av.eth.mac_2_5, mac + 2, 4);
+	ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
+	ah->av.ib.g_slid = 0x80;
+	if (ah_attr->static_rate) {
+		ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
+		while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET &&
+		       !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support))
+			--ah->av.ib.stat_rate;
+	}
+
+	/*
+	 * HW requires multicast LID so we just choose one.
+	 */
+	if (is_mcast)
+		ah->av.ib.dlid = cpu_to_be16(0xc000);
+
+	memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16);
+	ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
+
+	return &ah->ibah;
+}
+
+struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+{
+	struct mlx4_ib_ah *ah;
+	enum rdma_transport_type transport;
+	struct ib_ah *ret;
+
+	ah = kzalloc(sizeof *ah, GFP_ATOMIC);
+	if (!ah)
+		return ERR_PTR(-ENOMEM);
+
+	transport = rdma_port_get_transport(pd->device, ah_attr->port_num);
+	if (transport == RDMA_TRANSPORT_RDMAOE) {
+		if (!(ah_attr->ah_flags & IB_AH_GRH)) {
+			ret = ERR_PTR(-EINVAL);
+			goto out;
+		} else {
+			/* TBD: need to handle the case when we get called
+			in an atomic context and there we might sleep. We
+			don't expect this currently since we're working with
+			link local addresses which we can translate without
+			going to sleep */
+			ret = create_rdmaoe_ah(pd, ah_attr, ah);
+			if (IS_ERR(ret))
+				goto out;
+			else
+				return ret;
+		}
+	} else
+		return create_ib_ah(pd, ah_attr, ah); /* never fails */
+
+out:
+	kfree(ah);
+	return ret;
+}
+
 int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
 {
 	struct mlx4_ib_ah *ah = to_mah(ibah);
+	enum rdma_transport_type transport;
 
+	transport = rdma_port_get_transport(ibah->device, ah_attr->port_num);
 	memset(ah_attr, 0, sizeof *ah_attr);
-	ah_attr->dlid	       = be16_to_cpu(ah->av.dlid);
-	ah_attr->sl	       = be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28;
-	ah_attr->port_num      = be32_to_cpu(ah->av.port_pd) >> 24;
-	if (ah->av.stat_rate)
-		ah_attr->static_rate = ah->av.stat_rate - MLX4_STAT_RATE_OFFSET;
-	ah_attr->src_path_bits = ah->av.g_slid & 0x7F;
+	ah_attr->dlid = transport == RDMA_TRANSPORT_IB ? be16_to_cpu(ah->av.ib.dlid) : 0;
+	ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+	ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24;
+	if (ah->av.ib.stat_rate)
+		ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET;
+	ah_attr->src_path_bits = ah->av.ib.g_slid & 0x7F;
 
 	if (mlx4_ib_ah_grh_present(ah)) {
 		ah_attr->ah_flags = IB_AH_GRH;
 
 		ah_attr->grh.traffic_class =
-			be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20;
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20;
 		ah_attr->grh.flow_label =
-			be32_to_cpu(ah->av.sl_tclass_flowlabel) & 0xfffff;
-		ah_attr->grh.hop_limit  = ah->av.hop_limit;
-		ah_attr->grh.sgid_index = ah->av.gid_index;
-		memcpy(ah_attr->grh.dgid.raw, ah->av.dgid, 16);
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) & 0xfffff;
+		ah_attr->grh.hop_limit  = ah->av.ib.hop_limit;
+		ah_attr->grh.sgid_index = ah->av.ib.gid_index;
+		memcpy(ah_attr->grh.dgid.raw, ah->av.ib.dgid, 16);
 	}
 
 	return 0;
@@ -98,3 +201,21 @@ int mlx4_ib_destroy_ah(struct ib_ah *ah)
 	kfree(to_mah(ah));
 	return 0;
 }
+
+int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac)
+{
+	int err;
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct ib_ah_attr ah_attr = {
+		.port_num = port,
+	};
+	int is_mcast;
+
+	memcpy(ah_attr.grh.dgid.raw, gid, 16);
+	err = mlx4_ib_resolve_grh(ibdev, &ah_attr, mac, &is_mcast);
+	if (err)
+		return err;
+
+	return 0;
+}
+
diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 19e68ab..3df4f64 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -310,19 +310,25 @@ int mlx4_ib_mad_init(struct mlx4_ib_dev *dev)
 	struct ib_mad_agent *agent;
 	int p, q;
 	int ret;
+	enum rdma_transport_type tt;
 
-	for (p = 0; p < dev->num_ports; ++p)
+	for (p = 0; p < dev->num_ports; ++p) {
+		tt = rdma_port_get_transport(&dev->ib_dev, p + 1);
 		for (q = 0; q <= 1; ++q) {
-			agent = ib_register_mad_agent(&dev->ib_dev, p + 1,
-						      q ? IB_QPT_GSI : IB_QPT_SMI,
-						      NULL, 0, send_handler,
-						      NULL, NULL);
-			if (IS_ERR(agent)) {
-				ret = PTR_ERR(agent);
-				goto err;
-			}
-			dev->send_agent[p][q] = agent;
+			if (tt == RDMA_TRANSPORT_IB) {
+				agent = ib_register_mad_agent(&dev->ib_dev, p + 1,
+							      q ? IB_QPT_GSI : IB_QPT_SMI,
+							      NULL, 0, send_handler,
+							      NULL, NULL);
+				if (IS_ERR(agent)) {
+					ret = PTR_ERR(agent);
+					goto err;
+				}
+				dev->send_agent[p][q] = agent;
+			} else
+				dev->send_agent[p][q] = NULL;
 		}
+	}
 
 	return 0;
 
@@ -343,8 +349,10 @@ void mlx4_ib_mad_cleanup(struct mlx4_ib_dev *dev)
 	for (p = 0; p < dev->num_ports; ++p) {
 		for (q = 0; q <= 1; ++q) {
 			agent = dev->send_agent[p][q];
-			dev->send_agent[p][q] = NULL;
-			ib_unregister_mad_agent(agent);
+			if (agent) {
+				dev->send_agent[p][q] = NULL;
+				ib_unregister_mad_agent(agent);
+			}
 		}
 
 		if (dev->sm_ah[p])
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 8a7dd67..c644cac 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -138,6 +138,7 @@ struct mlx4_ib_qp {
 	u8			resp_depth;
 	u8			sq_no_prefetch;
 	u8			state;
+	int			mlx_type;
 };
 
 struct mlx4_ib_srq {
@@ -157,7 +158,14 @@ struct mlx4_ib_srq {
 
 struct mlx4_ib_ah {
 	struct ib_ah		ibah;
-	struct mlx4_av		av;
+	union mlx4_ext_av       av;
+};
+
+struct mlx4_ib_rdmaoe {
+	spinlock_t		lock;
+	struct net_device      *netdevs[MLX4_MAX_PORTS];
+	struct notifier_block 	nb;
+	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
 };
 
 struct mlx4_ib_dev {
@@ -175,6 +183,8 @@ struct mlx4_ib_dev {
 	spinlock_t		sm_lock;
 
 	struct mutex		cap_mask_mutex;
+
+	struct mlx4_ib_rdmaoe	rdmaoe;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)
@@ -313,9 +323,14 @@ int mlx4_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, int npages,
 int mlx4_ib_unmap_fmr(struct list_head *fmr_list);
 int mlx4_ib_fmr_dealloc(struct ib_fmr *fmr);
 
+int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
+			u8 *mac, int *is_mcast);
+
+int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac);
+
 static inline int mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah)
 {
-	return !!(ah->av.g_slid & 0x80);
+	return !!(ah->av.ib.g_slid & 0x80);
 }
 
 #endif /* MLX4_IB_H */
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 20724ae..4b391fa 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -32,6 +32,7 @@
  */
 
 #include <linux/log2.h>
+#include <linux/netdevice.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
@@ -47,14 +48,21 @@ enum {
 
 enum {
 	MLX4_IB_DEFAULT_SCHED_QUEUE	= 0x83,
-	MLX4_IB_DEFAULT_QP0_SCHED_QUEUE	= 0x3f
+	MLX4_IB_DEFAULT_QP0_SCHED_QUEUE	= 0x3f,
+	MLX4_IB_LINK_TYPE_IB		= 0,
+	MLX4_IB_LINK_TYPE_ETH		= 1
 };
 
 enum {
 	/*
 	 * Largest possible UD header: send with GRH and immediate data.
+	 * 4 bytes added to accommodate for eth header instead of lrh
 	 */
-	MLX4_IB_UD_HEADER_SIZE		= 72
+	MLX4_IB_UD_HEADER_SIZE		= 76
+};
+
+enum {
+	MLX4_RDMAOE_ETHERTYPE = 0x8915
 };
 
 struct mlx4_ib_sqp {
@@ -62,7 +70,10 @@ struct mlx4_ib_sqp {
 	int			pkey_index;
 	u32			qkey;
 	u32			send_psn;
-	struct ib_ud_header	ud_header;
+	union {
+		struct ib_ud_header	ib;
+		struct eth_ud_header	eth;
+	} hdr;
 	u8			header_buf[MLX4_IB_UD_HEADER_SIZE];
 };
 
@@ -546,9 +557,9 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		}
 	}
 
-	if (sqpn) {
+	if (sqpn)
 		qpn = sqpn;
-	} else {
+	else {
 		err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn);
 		if (err)
 			goto err_wrid;
@@ -843,6 +854,12 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port)
 static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 			 struct mlx4_qp_path *path, u8 port)
 {
+	int err;
+	int is_eth = rdma_port_get_transport(&dev->ib_dev, port) ==
+		RDMA_TRANSPORT_RDMAOE ? 1 : 0;
+	u8 mac[6];
+	int is_mcast;
+
 	path->grh_mylmc     = ah->src_path_bits & 0x7f;
 	path->rlid	    = cpu_to_be16(ah->dlid);
 	if (ah->static_rate) {
@@ -873,6 +890,21 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 	path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE |
 		((port - 1) << 6) | ((ah->sl & 0xf) << 2);
 
+	if (is_eth) {
+		if (!(ah->ah_flags & IB_AH_GRH))
+			return -1;
+
+		err = mlx4_ib_resolve_grh(dev, ah, mac, &is_mcast);
+		if (err)
+			return err;
+
+		memcpy(path->dmac_h, mac, 2);
+		memcpy(path->dmac_l, mac + 2, 4);
+		path->ackto = MLX4_IB_LINK_TYPE_ETH;
+		/* use index 0 into MAC table for RDMAoE */
+		path->grh_mylmc &= 0x80;
+	}
+
 	return 0;
 }
 
@@ -972,7 +1004,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_TIMEOUT) {
-		context->pri_path.ackto = attr->timeout << 3;
+		context->pri_path.ackto |= (attr->timeout << 3);
 		optpar |= MLX4_QP_OPTPAR_ACK_TIMEOUT;
 	}
 
@@ -1218,79 +1250,109 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 	int header_size;
 	int spc;
 	int i;
+	void *tmp;
+	struct ib_ud_header *ib = NULL;
+	struct eth_ud_header *eth = NULL;
+	struct ib_unpacked_grh *grh;
+	struct ib_unpacked_bth  *bth;
+	struct ib_unpacked_deth *deth;
 
 	send_size = 0;
 	for (i = 0; i < wr->num_sge; ++i)
 		send_size += wr->sg_list[i].length;
 
-	ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), &sqp->ud_header);
+	if (rdma_port_get_transport(sqp->qp.ibqp.device, sqp->qp.port) == RDMA_TRANSPORT_IB) {
+		ib = &sqp->hdr.ib;
+		grh = &ib->grh;
+		bth = &ib->bth;
+		deth = &ib->deth;
+		ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), ib);
+		ib->lrh.service_level   =
+			be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+		ib->lrh.destination_lid = ah->av.ib.dlid;
+		ib->lrh.source_lid      = cpu_to_be16(ah->av.ib.g_slid & 0x7f);
+	} else {
+		eth = &sqp->hdr.eth;
+		grh = &eth->grh;
+		bth = &eth->bth;
+		deth = &eth->deth;
+		ib_rdmaoe_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), eth);
+	}
 
-	sqp->ud_header.lrh.service_level   =
-		be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28;
-	sqp->ud_header.lrh.destination_lid = ah->av.dlid;
-	sqp->ud_header.lrh.source_lid      = cpu_to_be16(ah->av.g_slid & 0x7f);
 	if (mlx4_ib_ah_grh_present(ah)) {
-		sqp->ud_header.grh.traffic_class =
-			(be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff;
-		sqp->ud_header.grh.flow_label    =
-			ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff);
-		sqp->ud_header.grh.hop_limit     = ah->av.hop_limit;
-		ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24,
-				  ah->av.gid_index, &sqp->ud_header.grh.source_gid);
-		memcpy(sqp->ud_header.grh.destination_gid.raw,
-		       ah->av.dgid, 16);
+		grh->traffic_class =
+			(be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 0xff;
+		grh->flow_label    =
+			ah->av.ib.sl_tclass_flowlabel & cpu_to_be32(0xfffff);
+		grh->hop_limit     = ah->av.ib.hop_limit;
+		ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.ib.port_pd) >> 24,
+				  ah->av.ib.gid_index, &grh->source_gid);
+		memcpy(grh->destination_gid.raw,
+		       ah->av.ib.dgid, 16);
 	}
 
 	mlx->flags &= cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE);
-	mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) |
-				  (sqp->ud_header.lrh.destination_lid ==
-				   IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) |
-				  (sqp->ud_header.lrh.service_level << 8));
-	mlx->rlid   = sqp->ud_header.lrh.destination_lid;
+
+	if (ib) {
+		mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) |
+					  (ib->lrh.destination_lid ==
+					   IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) |
+					  (ib->lrh.service_level << 8));
+		mlx->rlid   = ib->lrh.destination_lid;
+	}
 
 	switch (wr->opcode) {
 	case IB_WR_SEND:
-		sqp->ud_header.bth.opcode	 = IB_OPCODE_UD_SEND_ONLY;
-		sqp->ud_header.immediate_present = 0;
+		bth->opcode	 = IB_OPCODE_UD_SEND_ONLY;
+		if (ib)
+			ib->immediate_present = 0;
+		else
+			eth->immediate_present = 0;
 		break;
 	case IB_WR_SEND_WITH_IMM:
-		sqp->ud_header.bth.opcode	 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
-		sqp->ud_header.immediate_present = 1;
-		sqp->ud_header.immediate_data    = wr->ex.imm_data;
+		bth->opcode	 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
+		if (ib) {
+			ib->immediate_present = 1;
+			ib->immediate_data    = wr->ex.imm_data;
+		} else {
+			eth->immediate_present = 1;
+			eth->immediate_data    = wr->ex.imm_data;
+		}
 		break;
 	default:
 		return -EINVAL;
 	}
 
-	sqp->ud_header.lrh.virtual_lane    = !sqp->qp.ibqp.qp_num ? 15 : 0;
-	if (sqp->ud_header.lrh.destination_lid == IB_LID_PERMISSIVE)
-		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
-	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
+	if (ib) {
+		ib->lrh.virtual_lane    = !sqp->qp.ibqp.qp_num ? 15 : 0;
+		if (ib->lrh.destination_lid == IB_LID_PERMISSIVE)
+			ib->lrh.source_lid = IB_LID_PERMISSIVE;
+	} else {
+		memcpy(eth->eth.dmac_h, ah->av.eth.mac_0_1, 2);
+		memcpy(eth->eth.dmac_h + 2, ah->av.eth.mac_2_5, 2);
+		memcpy(eth->eth.dmac_l, ah->av.eth.mac_2_5 + 2, 2);
+		tmp = to_mdev(sqp->qp.ibqp.device)->rdmaoe.netdevs[sqp->qp.port - 1]->dev_addr;
+		memcpy(eth->eth.smac_h, tmp, 2);
+		memcpy(eth->eth.smac_l, tmp + 2, 4);
+		eth->eth.type = cpu_to_be16(MLX4_RDMAOE_ETHERTYPE);
+	}
+	bth->solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
+
 	if (!sqp->qp.ibqp.qp_num)
 		ib_get_cached_pkey(ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
 		ib_get_cached_pkey(ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
-	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
-	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
-	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
-	sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ?
+	bth->pkey = cpu_to_be16(pkey);
+	bth->destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
+	bth->psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
+	deth->qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ?
 					       sqp->qkey : wr->wr.ud.remote_qkey);
-	sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num);
-
-	header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf);
-
-	if (0) {
-		printk(KERN_ERR "built UD header of size %d:\n", header_size);
-		for (i = 0; i < header_size / 4; ++i) {
-			if (i % 8 == 0)
-				printk("  [%02x] ", i * 4);
-			printk(" %08x",
-			       be32_to_cpu(((__be32 *) sqp->header_buf)[i]));
-			if ((i + 1) % 8 == 0)
-				printk("\n");
-		}
-		printk("\n");
-	}
+	deth->source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num);
+
+	if (ib)
+		header_size = ib_ud_header_pack(ib, sqp->header_buf);
+	else
+		header_size = rdmaoe_ud_header_pack(eth, sqp->header_buf);
 
 	/*
 	 * Inline data segments may not cross a 64 byte boundary.  If
@@ -1414,6 +1476,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg,
 	memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av));
 	dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey);
+	dseg->vlan = to_mah(wr->wr.ud.ah)->av.eth.vlan;
+	memcpy(dseg->mac_0_1, to_mah(wr->wr.ud.ah)->av.eth.mac_0_1, 6);
 }
 
 static void set_mlx_icrc_seg(void *dseg)
diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index cee199c..20526ce 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -96,7 +96,8 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags)
 		[20] = "Address vector port checking support",
 		[21] = "UD multicast support",
 		[24] = "Demand paging support",
-		[25] = "Router support"
+		[25] = "Router support",
+		[30] = "RDMAoE support"
 	};
 	int i;
 
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 3aff8a6..b73b5f0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -66,7 +66,8 @@ enum {
 	MLX4_DEV_CAP_FLAG_ATOMIC	= 1 << 18,
 	MLX4_DEV_CAP_FLAG_RAW_MCAST	= 1 << 19,
 	MLX4_DEV_CAP_FLAG_UD_AV_PORT	= 1 << 20,
-	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21
+	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21,
+	MLX4_DEV_CAP_FLAG_RDMAOE	= 1 << 30
 };
 
 enum {
@@ -371,6 +372,28 @@ struct mlx4_av {
 	u8			dgid[16];
 };
 
+struct mlx4_eth_av {
+	__be32		port_pd;
+	u8		reserved1;
+	u8		smac_idx;
+	u16		reserved2;
+	u8		reserved3;
+	u8		gid_index;
+	u8		stat_rate;
+	u8		hop_limit;
+	__be32		sl_tclass_flowlabel;
+	u8		dgid[16];
+	u32		reserved4[2];
+	__be16		vlan;
+	u8		mac_0_1[2];
+	u8		mac_2_5[4];
+};
+
+union mlx4_ext_av {
+	struct mlx4_av		ib;
+	struct mlx4_eth_av	eth;
+};
+
 struct mlx4_dev {
 	struct pci_dev	       *pdev;
 	unsigned long		flags;
@@ -399,6 +422,12 @@ struct mlx4_init_port_param {
 		if (((type) == MLX4_PORT_TYPE_IB ? (dev)->caps.port_mask : \
 		     ~(dev)->caps.port_mask) & 1 << ((port) - 1))
 
+#define mlx4_foreach_ib_transport_port(port, dev)			\
+	for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++)	\
+		if (((dev)->caps.port_mask & 1 << ((port) - 1)) ||	\
+		    ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE))
+
+
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		   struct mlx4_buf *buf);
 void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index bf8f119..d73534f 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -112,7 +112,9 @@ struct mlx4_qp_path {
 	u8			snooper_flags;
 	u8			reserved3[2];
 	u8			counter_index;
-	u8			reserved4[7];
+	u8			reserved4;
+	u8			dmac_h[2];
+	u8			dmac_l[4];
 };
 
 struct mlx4_qp_context {
@@ -218,7 +220,9 @@ struct mlx4_wqe_datagram_seg {
 	__be32			av[8];
 	__be32			dqpn;
 	__be32			qkey;
-	__be32			reservd[2];
+	__be16			vlan;
+	u8			mac_0_1[2];
+	u8			mac_2_5[4];
 };
 
 struct mlx4_wqe_lso_seg {
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 07:39:58 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:58 +0300
Subject: [ofa-general] [PATCHv5 10/10] mlx4: Add RDMAoE support - allow
	interfaces to correspond to each other
Message-ID: <20090819143958.GG8675@mtls03>

This patch adds RDMAoE support for mlx4. Since mlx4_ib now needs to reference
mlx4_en netdevices, a new mechanism was added. Two new fields were added to
struct mlx4_interface: a protocol identifier and a get_prot_dev method that
retrieves the corresponding protocol's net device.  An implementation of the new
verb ib_get_port_link_type() (mlx4_ib_get_port_link_type) was added.
mlx4_ib_query_port() has been modified to support eth link types. An interface
is considered to be active if its corresponding eth interface is active. Code
for setting the GID table of a port has been added. Currently, each IB port has
a single GID entry in its table, and that GID entry equals the link-local IPv6
address.
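
As a rough illustration of the last point, the user-space sketch below builds
a link-local GID from a 6-byte MAC address using the same byte layout as
mlx4_addrconf_ifid_eui48() in this patch (fe80::/64 prefix plus a modified
EUI-64 interface ID). The function name mac_to_link_local_gid() is
hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): fill a 16-byte GID with the
 * IPv6 link-local address derived from a 48-bit MAC address.
 */
void mac_to_link_local_gid(const uint8_t mac[6], uint8_t gid[16])
{
	assert(mac && gid);

	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* subnet prefix fe80::/64 */
	gid[1] = 0x80;

	memcpy(gid + 8, mac, 3);	/* upper three MAC bytes */
	gid[11] = 0xff;			/* ff:fe filler turns EUI-48 into EUI-64 */
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);	/* lower three MAC bytes */
	gid[8] ^= 2;			/* flip the universal/local bit */
}
```

In the patch itself the equivalent work is split between
mlx4_addrconf_ifid_eui48(), which writes the interface ID, and
update_ipv6_gids(), which sets the subnet prefix.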

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
Changes from previous version:
Bug fix - call flush_workqueue after unregistering notifiers.

 drivers/infiniband/hw/mlx4/main.c |  309 +++++++++++++++++++++++++++++++++----
 drivers/net/mlx4/en_main.c        |   15 ++-
 drivers/net/mlx4/en_port.c        |    4 +-
 drivers/net/mlx4/en_port.h        |    3 +-
 drivers/net/mlx4/intf.c           |   20 +++
 drivers/net/mlx4/main.c           |    6 +
 drivers/net/mlx4/mlx4.h           |    1 +
 include/linux/mlx4/cmd.h          |    1 +
 include/linux/mlx4/driver.h       |   16 ++-
 9 files changed, 335 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..1828aec 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -34,9 +34,12 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
 
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
+#include <rdma/ib_addr.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/cmd.h>
@@ -57,6 +60,15 @@ static const char mlx4_ib_version[] =
 	DRV_NAME ": Mellanox ConnectX InfiniBand driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+struct update_gid_work {
+	struct work_struct work;
+	union ib_gid gids[128];
+	int port;
+	struct mlx4_ib_dev *dev;
+};
+
+static struct workqueue_struct *wq;
+
 static void init_query_mad(struct ib_smp *mad)
 {
 	mad->base_version  = 1;
@@ -152,28 +164,19 @@ out:
 	return err;
 }
 
-static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
-			      struct ib_port_attr *props)
+static enum rdma_transport_type
+mlx4_ib_port_get_transport(struct ib_device *device, u8 port_num)
 {
-	struct ib_smp *in_mad  = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-
-	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
-
-	memset(props, 0, sizeof *props);
-
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
-	in_mad->attr_mod = cpu_to_be32(port);
+	struct mlx4_dev *dev = to_mdev(device)->dev;
 
-	err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad);
-	if (err)
-		goto out;
+	return dev->caps.port_mask & (1 << (port_num - 1)) ?
+		RDMA_TRANSPORT_IB : RDMA_TRANSPORT_RDMAOE;
+}
 
+static void ib_link_query_port(struct ib_device *ibdev, u8 port,
+			       struct ib_port_attr *props,
+			       struct ib_smp *out_mad)
+{
 	props->lid		= be16_to_cpup((__be16 *) (out_mad->data + 16));
 	props->lmc		= out_mad->data[34] & 0x7;
 	props->sm_lid		= be16_to_cpup((__be16 *) (out_mad->data + 18));
@@ -193,6 +196,67 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
 	props->subnet_timeout	= out_mad->data[51] & 0x1f;
 	props->max_vl_num	= out_mad->data[37] >> 4;
 	props->init_type_reply	= out_mad->data[41] >> 4;
+	props->transport	= RDMA_TRANSPORT_IB;
+}
+
+static void eth_link_query_port(struct ib_device *ibdev, u8 port,
+				struct ib_port_attr *props,
+				struct ib_smp *out_mad)
+{
+	struct mlx4_ib_rdmaoe *rdmaoe = &to_mdev(ibdev)->rdmaoe;
+	struct net_device *ndev;
+
+	props->port_cap_flags	= IB_PORT_CM_SUP;
+	props->gid_tbl_len	= to_mdev(ibdev)->dev->caps.gid_table_len[port];
+	props->max_msg_sz	= to_mdev(ibdev)->dev->caps.max_msg_sz;
+	props->pkey_tbl_len	= 1;
+	props->bad_pkey_cntr	= be16_to_cpup((__be16 *) (out_mad->data + 46));
+	props->qkey_viol_cntr	= be16_to_cpup((__be16 *) (out_mad->data + 48));
+	props->active_width	= 0;
+	props->active_speed	= 0;
+	props->max_mtu		= out_mad->data[41] & 0xf;
+	props->subnet_timeout	= 0;
+	props->max_vl_num	= out_mad->data[37] >> 4;
+	props->init_type_reply	= 0;
+	props->transport	= RDMA_TRANSPORT_RDMAOE;
+	spin_lock(&rdmaoe->lock);
+	ndev = rdmaoe->netdevs[port - 1];
+	if (!ndev)
+		goto out;
+
+	props->active_mtu	= rdmaoe_get_mtu(ndev->mtu);
+	props->state		= netif_running(ndev) &&  netif_oper_up(ndev) ?
+					IB_PORT_ACTIVE : IB_PORT_DOWN;
+	props->phys_state	= props->state;
+out:
+	spin_unlock(&rdmaoe->lock);
+}
+
+static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
+			      struct ib_port_attr *props)
+{
+	struct ib_smp *in_mad  = NULL;
+	struct ib_smp *out_mad = NULL;
+	int err = -ENOMEM;
+
+	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+	if (!in_mad || !out_mad)
+		goto out;
+
+	memset(props, 0, sizeof *props);
+
+	init_query_mad(in_mad);
+	in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
+	in_mad->attr_mod = cpu_to_be32(port);
+
+	err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad);
+	if (err)
+		goto out;
+
+	mlx4_ib_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB ?
+		ib_link_query_port(ibdev, port, props, out_mad) :
+		eth_link_query_port(ibdev, port, props, out_mad);
 
 out:
 	kfree(in_mad);
@@ -201,8 +265,8 @@ out:
 	return err;
 }
 
-static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
-			     union ib_gid *gid)
+static int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
+			       union ib_gid *gid)
 {
 	struct ib_smp *in_mad  = NULL;
 	struct ib_smp *out_mad = NULL;
@@ -239,6 +303,25 @@ out:
 	return err;
 }
 
+static int rdmaoe_query_gid(struct ib_device *ibdev, u8 port, int index,
+			    union ib_gid *gid)
+{
+	struct mlx4_ib_dev *dev = to_mdev(ibdev);
+
+	*gid = dev->rdmaoe.gid_table[port - 1][index];
+
+	return 0;
+}
+
+static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
+			     union ib_gid *gid)
+{
+	if (rdma_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB)
+		return __mlx4_ib_query_gid(ibdev, port, index, gid);
+	else
+		return rdmaoe_query_gid(ibdev, port, index, gid);
+}
+
 static int mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
 			      u16 *pkey)
 {
@@ -287,6 +370,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols,
 {
 	struct mlx4_cmd_mailbox *mailbox;
 	int err;
+	u8 is_eth = dev->dev->caps.port_type[port] == MLX4_PORT_TYPE_ETH;
 
 	mailbox = mlx4_alloc_cmd_mailbox(dev->dev);
 	if (IS_ERR(mailbox))
@@ -302,7 +386,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols,
 		((__be32 *) mailbox->buf)[1] = cpu_to_be32(cap_mask);
 	}
 
-	err = mlx4_cmd(dev->dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
+	err = mlx4_cmd(dev->dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT,
 		       MLX4_CMD_TIME_CLASS_B);
 
 	mlx4_free_cmd_mailbox(dev->dev, mailbox);
@@ -538,19 +622,146 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
+static void mlx4_addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+{
+	memcpy(eui, dev->dev_addr, 3);
+	memcpy(eui + 5, dev->dev_addr + 3, 3);
+	eui[3] = 0xFF;
+	eui[4] = 0xFE;
+	eui[0] ^= 2;
+}
+
+static void update_gids_task(struct work_struct *work)
+{
+	struct update_gid_work *gw = container_of(work, struct update_gid_work, work);
+	struct mlx4_cmd_mailbox *mailbox;
+	union ib_gid *gids;
+	int err;
+	struct mlx4_dev	*dev = gw->dev->dev;
+	struct ib_event event;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox)) {
+		printk(KERN_WARNING "update gid table failed %ld\n", PTR_ERR(mailbox));
+		return;
+	}
+
+	gids = mailbox->buf;
+	memcpy(gids, gw->gids, sizeof gw->gids);
+
+	err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
+		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B);
+	if (err)
+		printk(KERN_WARNING "set port command failed\n");
+	else {
+		memcpy(gw->dev->rdmaoe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids);
+		event.device = &gw->dev->ib_dev;
+		event.element.port_num = gw->port;
+		event.event    = IB_EVENT_LID_CHANGE;
+		ib_dispatch_event(&event);
+	}
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	kfree(gw);
+}
+
+static int update_ipv6_gids(struct mlx4_ib_dev *dev, int port, int clear)
+{
+	struct net_device *ndev = dev->rdmaoe.netdevs[port - 1];
+	struct update_gid_work *work;
+
+	work = kzalloc(sizeof *work, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	if (!clear) {
+		mlx4_addrconf_ifid_eui48(&work->gids[0].raw[8], ndev);
+		work->gids[0].global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	}
+
+	INIT_WORK(&work->work, update_gids_task);
+	work->port = port;
+	work->dev = dev;
+	queue_work(wq, &work->work);
+
+	return 0;
+}
+
+static void handle_en_event(struct mlx4_ib_dev *dev, int port, unsigned long event)
+{
+	switch (event) {
+	case NETDEV_UP:
+		update_ipv6_gids(dev, port, 0);
+		break;
+
+	case NETDEV_DOWN:
+		update_ipv6_gids(dev, port, 1);
+	}
+}
+
+static void netdev_added(struct mlx4_ib_dev *dev, int port)
+{
+	update_ipv6_gids(dev, port, 0);
+}
+
+static void netdev_removed(struct mlx4_ib_dev *dev, int port)
+{
+	update_ipv6_gids(dev, port, 1);
+}
+
+static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event,
+				void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mlx4_ib_dev *ibdev;
+	struct net_device *oldnd;
+	struct mlx4_ib_rdmaoe *rdmaoe;
+	int port;
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return NOTIFY_DONE;
+
+	ibdev = container_of(this, struct mlx4_ib_dev, rdmaoe.nb);
+	rdmaoe = &ibdev->rdmaoe;
+
+	spin_lock(&rdmaoe->lock);
+	mlx4_foreach_ib_transport_port(port, ibdev->dev) {
+		oldnd = rdmaoe->netdevs[port - 1];
+		rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(ibdev->dev, MLX4_PROT_EN, port);
+		if (oldnd != rdmaoe->netdevs[port - 1]) {
+			if (rdmaoe->netdevs[port - 1])
+				netdev_added(ibdev, port);
+			else
+				netdev_removed(ibdev, port);
+		}
+	}
+
+	if (dev == rdmaoe->netdevs[0])
+		handle_en_event(ibdev, 1, event);
+	else if (dev == rdmaoe->netdevs[1])
+		handle_en_event(ibdev, 2, event);
+
+	spin_unlock(&rdmaoe->lock);
+
+	return NOTIFY_DONE;
+}
+
 static void *mlx4_ib_add(struct mlx4_dev *dev)
 {
 	static int mlx4_ib_version_printed;
 	struct mlx4_ib_dev *ibdev;
 	int num_ports = 0;
 	int i;
+	int err;
+	int port;
+	struct mlx4_ib_rdmaoe *rdmaoe;
 
 	if (!mlx4_ib_version_printed) {
 		printk(KERN_INFO "%s", mlx4_ib_version);
 		++mlx4_ib_version_printed;
 	}
 
-	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
+	mlx4_foreach_ib_transport_port(i, dev)
 		num_ports++;
 
 	/* No point in registering a device with no ports... */
@@ -563,6 +774,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		return NULL;
 	}
 
+	rdmaoe = &ibdev->rdmaoe;
+
 	if (mlx4_pd_alloc(dev, &ibdev->priv_pdn))
 		goto err_dealloc;
 
@@ -607,10 +820,12 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		(1ull << IB_USER_VERBS_CMD_CREATE_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_QUERY_SRQ)		|
-		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
+		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)		|
+		(1ull << IB_USER_VERBS_CMD_GET_MAC);
 
 	ibdev->ib_dev.query_device	= mlx4_ib_query_device;
 	ibdev->ib_dev.query_port	= mlx4_ib_query_port;
+	ibdev->ib_dev.get_port_transport = mlx4_ib_port_get_transport;
 	ibdev->ib_dev.query_gid		= mlx4_ib_query_gid;
 	ibdev->ib_dev.query_pkey	= mlx4_ib_query_pkey;
 	ibdev->ib_dev.modify_device	= mlx4_ib_modify_device;
@@ -654,15 +869,26 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	ibdev->ib_dev.map_phys_fmr	= mlx4_ib_map_phys_fmr;
 	ibdev->ib_dev.unmap_fmr		= mlx4_ib_unmap_fmr;
 	ibdev->ib_dev.dealloc_fmr	= mlx4_ib_fmr_dealloc;
+	ibdev->ib_dev.get_mac		= mlx4_ib_get_mac;
+
+	mlx4_foreach_ib_transport_port(port, dev)
+		rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(dev, MLX4_PROT_EN, port);
+	spin_lock_init(&rdmaoe->lock);
+	if (dev->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE && !rdmaoe->nb.notifier_call) {
+		rdmaoe->nb.notifier_call = mlx4_ib_netdev_event;
+		err = register_netdevice_notifier(&rdmaoe->nb);
+		if (err)
+			goto err_map;
+	}
 
 	if (init_node_data(ibdev))
-		goto err_map;
+		goto err_notif;
 
 	spin_lock_init(&ibdev->sm_lock);
 	mutex_init(&ibdev->cap_mask_mutex);
 
 	if (ib_register_device(&ibdev->ib_dev))
-		goto err_map;
+		goto err_notif;
 
 	if (mlx4_ib_mad_init(ibdev))
 		goto err_reg;
@@ -678,6 +904,10 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 err_reg:
 	ib_unregister_device(&ibdev->ib_dev);
 
+err_notif:
+	unregister_netdevice_notifier(&rdmaoe->nb);
+	flush_workqueue(wq);
+
 err_map:
 	iounmap(ibdev->uar_map);
 
@@ -700,11 +930,16 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 
 	mlx4_ib_mad_cleanup(ibdev);
 	ib_unregister_device(&ibdev->ib_dev);
+	if (ibdev->rdmaoe.nb.notifier_call) {
+		unregister_netdevice_notifier(&ibdev->rdmaoe.nb);
+		flush_workqueue(wq);
+		ibdev->rdmaoe.nb.notifier_call = NULL;
+	}
+	iounmap(ibdev->uar_map);
 
-	for (p = 1; p <= ibdev->num_ports; ++p)
+	mlx4_foreach_port(p, dev, MLX4_PORT_TYPE_IB)
 		mlx4_CLOSE_PORT(dev, p);
 
-	iounmap(ibdev->uar_map);
 	mlx4_uar_free(dev, &ibdev->priv_uar);
 	mlx4_pd_free(dev, ibdev->priv_pdn);
 	ib_dealloc_device(&ibdev->ib_dev);
@@ -745,17 +980,31 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 static struct mlx4_interface mlx4_ib_interface = {
 	.add	= mlx4_ib_add,
 	.remove	= mlx4_ib_remove,
-	.event	= mlx4_ib_event
+	.event	= mlx4_ib_event,
+	.protocol	= MLX4_PROT_IB
 };
 
 static int __init mlx4_ib_init(void)
 {
-	return mlx4_register_interface(&mlx4_ib_interface);
+	int err;
+
+	wq = create_singlethread_workqueue("mlx4_ib");
+	if (!wq)
+		return -ENOMEM;
+
+	err = mlx4_register_interface(&mlx4_ib_interface);
+	if (err) {
+		destroy_workqueue(wq);
+		return err;
+	}
+
+	return 0;
 }
 
 static void __exit mlx4_ib_cleanup(void)
 {
 	mlx4_unregister_interface(&mlx4_ib_interface);
+	destroy_workqueue(wq);
 }
 
 module_init(mlx4_ib_init);
diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c
index 510633f..6f30eca 100644
--- a/drivers/net/mlx4/en_main.c
+++ b/drivers/net/mlx4/en_main.c
@@ -51,6 +51,13 @@ static const char mlx4_en_version[] =
 	DRV_NAME ": Mellanox ConnectX HCA Ethernet driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+static void *get_netdev(struct mlx4_dev *dev, void *ctx, u8 port)
+{
+	struct mlx4_en_dev *endev = ctx;
+
+	return endev->pndev[port];
+}
+
 static void mlx4_en_event(struct mlx4_dev *dev, void *endev_ptr,
 			  enum mlx4_dev_event event, int port)
 {
@@ -229,9 +236,11 @@ err_free_res:
 }
 
 static struct mlx4_interface mlx4_en_interface = {
-	.add	= mlx4_en_add,
-	.remove	= mlx4_en_remove,
-	.event	= mlx4_en_event,
+	.add		= mlx4_en_add,
+	.remove		= mlx4_en_remove,
+	.event		= mlx4_en_event,
+	.get_prot_dev	= get_netdev,
+	.protocol	= MLX4_PROT_EN,
 };
 
 static int __init mlx4_en_init(void)
diff --git a/drivers/net/mlx4/en_port.c b/drivers/net/mlx4/en_port.c
index a29abe8..a249887 100644
--- a/drivers/net/mlx4/en_port.c
+++ b/drivers/net/mlx4/en_port.c
@@ -127,8 +127,8 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn,
 	memset(context, 0, sizeof *context);
 
 	context->base_qpn = cpu_to_be32(base_qpn);
-	context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_SHIFT | base_qpn);
-	context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_SHIFT | base_qpn);
+	context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_EN_SHIFT | base_qpn);
+	context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_MODE_SHIFT | base_qpn);
 	context->intra_no_vlan = 0;
 	context->no_vlan = MLX4_NO_VLAN_IDX;
 	context->intra_vlan_miss = 0;
diff --git a/drivers/net/mlx4/en_port.h b/drivers/net/mlx4/en_port.h
index e6477f1..9354891 100644
--- a/drivers/net/mlx4/en_port.h
+++ b/drivers/net/mlx4/en_port.h
@@ -36,7 +36,8 @@
 
 
 #define SET_PORT_GEN_ALL_VALID	0x7
-#define SET_PORT_PROMISC_SHIFT	31
+#define SET_PORT_PROMISC_EN_SHIFT	31
+#define SET_PORT_PROMISC_MODE_SHIFT	30
 
 enum {
 	MLX4_CMD_SET_VLAN_FLTR  = 0x47,
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 0e7eb10..d64530e 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -159,3 +159,23 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 
 	mutex_unlock(&intf_mutex);
 }
+
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	struct mlx4_device_context *dev_ctx;
+	unsigned long flags;
+	void *result = NULL;
+
+	spin_lock_irqsave(&priv->ctx_lock, flags);
+
+	list_for_each_entry(dev_ctx, &priv->ctx_list, list)
+		if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) {
+			result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port);
+			break;
+		}
+
+	spin_unlock_irqrestore(&priv->ctx_lock, flags);
+
+	return result;
+}
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 30bea96..c72af51 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -100,6 +100,12 @@ module_param_named(use_prio, use_prio, bool, 0444);
 MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports "
 		  "(0/1, default 0)");
 
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	return mlx4_find_get_prot_dev(dev, proto, port);
+}
+EXPORT_SYMBOL(mlx4_get_prot_dev);
+
 int mlx4_check_port_params(struct mlx4_dev *dev,
 			   enum mlx4_port_type *port_type)
 {
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 5bd79c2..db068c9 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -364,6 +364,7 @@ int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port);
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 struct mlx4_dev_cap;
 struct mlx4_init_hca_param;
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 0f82293..22bd8d3 100644
--- a/include/linux/mlx4/cmd.h
+++ b/include/linux/mlx4/cmd.h
@@ -140,6 +140,7 @@ enum {
 	MLX4_SET_PORT_MAC_TABLE = 0x2,
 	MLX4_SET_PORT_VLAN_TABLE = 0x3,
 	MLX4_SET_PORT_PRIO_MAP  = 0x4,
+	MLX4_SET_PORT_GID_TABLE = 0x5,
 };
 
 struct mlx4_dev;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 53c5fdb..0083256 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -44,15 +44,23 @@ enum mlx4_dev_event {
 	MLX4_DEV_EVENT_PORT_REINIT,
 };
 
+enum mlx4_prot {
+	MLX4_PROT_IB,
+	MLX4_PROT_EN,
+};
+
 struct mlx4_interface {
-	void *			(*add)	 (struct mlx4_dev *dev);
-	void			(*remove)(struct mlx4_dev *dev, void *context);
-	void			(*event) (struct mlx4_dev *dev, void *context,
-					  enum mlx4_dev_event event, int port);
+	void *	(*add)	 (struct mlx4_dev *dev);
+	void   	(*remove)(struct mlx4_dev *dev, void *context);
+	void   	(*event) (struct mlx4_dev *dev, void *context,
+			  enum mlx4_dev_event event, int port);
+	void *	(*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port);
+	enum mlx4_prot		protocol;
 	struct list_head	list;
 };
 
 int mlx4_register_interface(struct mlx4_interface *intf);
 void mlx4_unregister_interface(struct mlx4_interface *intf);
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 #endif /* MLX4_DRIVER_H */
-- 
1.6.4


From eli at dev.mellanox.co.il  Wed Aug 19 10:19:35 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:19:35 +0300
Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support
Message-ID: <20090819171935.GA14411@mtls03>

RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using
Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet
fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned
Ethertype, a GRH, unmodified IB transport headers and payload.  IB subnet
management and SA services are not required for RDMAoE operation; Ethernet
management practices are used instead. RDMAoE encodes IP addresses into its
GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs,
standard IP to MAC mappings apply.
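
A minimal sketch of what such a mapping could look like for a multicast GID,
assuming the "standard" mapping meant here is the usual IPv6-over-Ethernet
rule (33:33 followed by the low 32 bits of the address). The function name
mcast_gid_to_mac() is hypothetical; the patches in this series may implement
the mapping differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: map a multicast GID (first byte 0xff) to an
 * Ethernet multicast MAC using the IPv6 rule 33:33:<last 4 GID bytes>.
 * Returns 0 on success, -1 if the GID is not a multicast GID.
 */
int mcast_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	assert(gid && mac);

	if (gid[0] != 0xff)
		return -1;

	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(mac + 2, gid + 12, 4);	/* low 32 bits of the GID */
	return 0;
}
```

Unicast GIDs are rejected here because, per the text above, their MAC
addresses are resolved through the host IP stack rather than computed.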

To support RDMAoE, a new transport protocol was added to the IB core. An RDMA
device can have ports with different transports, which are identified by a port
transport attribute.  The RDMA Verbs API is syntactically unmodified. For
RDMAoE ports, address handles are required to contain GIDs, while LID fields
are ignored. The Ethernet L2 information is subsequently obtained by
the vendor-specific driver (both in kernel- and user-space) while modifying QPs
to RTR and creating address handles.  As there is no SA in RDMAoE, the CMA code
is modified to fill the necessary path record attributes locally before sending
CM packets. Similarly, the CMA provides to the user the required address handle
attributes when processing SIDR requests and joining multicast groups.
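
The address-handle requirement can be sketched as a simple validity check.
The types and names below (struct ah_attr_sketch, AH_FLAG_GRH_SKETCH,
check_ah_for_port) are hypothetical stand-ins for the kernel's ib_ah_attr,
IB_AH_GRH and the check in mlx4_set_path(); the sketch returns -EINVAL,
whereas the mlx4 patch in this series returns -1 at that point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types used by the patch set. */
enum sketch_transport { SKETCH_TRANSPORT_IB, SKETCH_TRANSPORT_RDMAOE };

#define AH_FLAG_GRH_SKETCH 0x1	/* plays the role of IB_AH_GRH */

struct ah_attr_sketch {
	uint16_t dlid;		/* ignored on RDMAoE ports */
	uint8_t  ah_flags;
	uint8_t  dgid[16];	/* required on RDMAoE ports */
};

/*
 * On an RDMAoE port the destination is identified by GID only, so an
 * address handle without a GRH cannot be resolved to a MAC address.
 */
int check_ah_for_port(enum sketch_transport transport,
		      const struct ah_attr_sketch *ah)
{
	assert(ah != NULL);

	if (transport == SKETCH_TRANSPORT_RDMAOE &&
	    !(ah->ah_flags & AH_FLAG_GRH_SKETCH))
		return -EINVAL;

	return 0;
}
```

On IB ports the check passes with or without a GRH, which matches the
statement that the Verbs API itself is unchanged.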

In this patch set, an RDMAoE port is currently assigned a single GID, encoding
the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code
temporarily uses IPv6 link-local addresses as GIDs instead of the IP address
provided by the user, thereby supporting any IP address.

To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib
drivers must be loaded, and the netdevice for the corresponding RDMAoE port
must be running. Individual ports of a multi-port HCA can be independently
configured as Ethernet (with support for RDMAoE) or IB, as is already the case.
We have successfully tested MPI, SDP, RDS, and native Verbs applications over
RDMAoE.
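
For illustration, the per-port transport decision can be sketched as a bit
test on a port mask, in the spirit of mlx4_ib_port_get_transport() from PATCH
10/10, where a set bit in caps.port_mask marks an IB port. The names
port_get_transport_sketch() and the PORT_SKETCH_* constants are hypothetical,
and treating every non-IB port as RDMAoE is an assumption carried over from
that function.

```c
#include <assert.h>
#include <stdint.h>

enum port_sketch_transport { PORT_SKETCH_IB, PORT_SKETCH_RDMAOE };

/*
 * Hypothetical helper: ports are numbered from 1; bit (port_num - 1) of
 * ib_port_mask is set when the port is configured as InfiniBand.
 */
enum port_sketch_transport
port_get_transport_sketch(uint32_t ib_port_mask, unsigned int port_num)
{
	assert(port_num >= 1 && port_num <= 32);

	return (ib_port_mask & (1u << (port_num - 1))) ?
		PORT_SKETCH_IB : PORT_SKETCH_RDMAOE;
}
```

With a mask of 0x1 on a two-port HCA, port 1 would be reported as IB and
port 2 as RDMAoE.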

Following is a series of 10 patches based on version 2.6.30 of the Linux
kernel. This new series reflects changes based on feedback from the community
on the previous set of patches, and is tagged v5.

Changes from v4:
1. Added rdma_is_transport_supported() and used it to simplify conditionals
throughout the code.
2. ib_register_mad_agent() for QP0 is only called for IB ports.
3. PATCH 5/10 changed from "Enable support for RDMAoE ports" to "Enable support
only for IB ports".
4. MAD services from userspace are currently not supported for RDMAoE ports.
5. Added a kref to struct cma_multicast to maintain a reference count on the
object. This avoids freeing the object while the worker thread is still using
it.
6. Return an immediate error for an invalid MTU when resolving an RDMAoE path.
7. Don't fail path resolution if the rate is 0, since this value stands for
IB_RATE_PORT_CURRENT.
8. In cma_rdmaoe_join_multicast(), fail immediately if the MTU is zero.
9. Added ucma_copy_rdmaoe_route() instead of modifying ucma_copy_ib_route().
10. Bug fix: in PATCH 10/10, call flush_workqueue() after unregistering netdev
notifiers.
11. Multicast no longer uses the broadcast MAC.
12. No changes to patches 2, 7 and 8 from the v4 series.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---

 b/drivers/infiniband/core/agent.c           |   38 ++-
 b/drivers/infiniband/core/cm.c              |   25 +-
 b/drivers/infiniband/core/cma.c             |   54 ++--
 b/drivers/infiniband/core/mad.c             |   41 ++-
 b/drivers/infiniband/core/multicast.c       |    4 
 b/drivers/infiniband/core/sa_query.c        |   39 ++-
 b/drivers/infiniband/core/ucm.c             |    8 
 b/drivers/infiniband/core/ucma.c            |    2 
 b/drivers/infiniband/core/ud_header.c       |  111 ++++++++++
 b/drivers/infiniband/core/user_mad.c        |    6 
 b/drivers/infiniband/core/uverbs.h          |    1 
 b/drivers/infiniband/core/uverbs_cmd.c      |   32 ++
 b/drivers/infiniband/core/uverbs_main.c     |    1 
 b/drivers/infiniband/core/verbs.c           |   25 ++
 b/drivers/infiniband/hw/mlx4/ah.c           |  187 +++++++++++++---
 b/drivers/infiniband/hw/mlx4/mad.c          |   32 +-
 b/drivers/infiniband/hw/mlx4/main.c         |  309 +++++++++++++++++++++++++---
 b/drivers/infiniband/hw/mlx4/mlx4_ib.h      |   19 +
 b/drivers/infiniband/hw/mlx4/qp.c           |  172 ++++++++++-----
 b/drivers/infiniband/ulp/ipoib/ipoib_main.c |   12 -
 b/drivers/net/mlx4/en_main.c                |   15 +
 b/drivers/net/mlx4/en_port.c                |    4 
 b/drivers/net/mlx4/en_port.h                |    3 
 b/drivers/net/mlx4/fw.c                     |    3 
 b/drivers/net/mlx4/intf.c                   |   20 +
 b/drivers/net/mlx4/main.c                   |    6 
 b/drivers/net/mlx4/mlx4.h                   |    1 
 b/include/linux/mlx4/cmd.h                  |    1 
 b/include/linux/mlx4/device.h               |   31 ++
 b/include/linux/mlx4/driver.h               |   16 +
 b/include/linux/mlx4/qp.h                   |    8 
 b/include/rdma/ib_addr.h                    |   92 ++++++++
 b/include/rdma/ib_pack.h                    |   26 ++
 b/include/rdma/ib_user_verbs.h              |   21 +
 b/include/rdma/ib_verbs.h                   |   11 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |    3 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c  |    2 
 drivers/infiniband/core/cm.c                |    5 
 drivers/infiniband/core/cma.c               |  207 ++++++++++++++++++
 drivers/infiniband/core/mad.c               |   37 ++-
 drivers/infiniband/core/ucm.c               |   12 -
 drivers/infiniband/core/ucma.c              |   31 ++
 drivers/infiniband/core/user_mad.c          |   15 -
 drivers/infiniband/core/verbs.c             |   10 
 include/rdma/ib_verbs.h                     |   15 +
 45 files changed, 1440 insertions(+), 273 deletions(-)


From eli at mellanox.co.il  Wed Aug 19 10:20:07 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:20:07 +0300
Subject: [ofa-general] [PATCHv5 02/10] ib_core: Add RDMAoE transport protocol
Message-ID: <20090819172007.GB14411@mtls03>

Add a new transport protocol, RDMAoE, used for transporting Infiniband traffic
over Ethernet fabrics.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 include/rdma/ib_verbs.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4cf42f3..d9146c4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -69,7 +69,8 @@ enum rdma_node_type {
 
 enum rdma_transport_type {
 	RDMA_TRANSPORT_IB,
-	RDMA_TRANSPORT_IWARP
+	RDMA_TRANSPORT_IWARP,
+	RDMA_TRANSPORT_RDMAOE
 };
 
 enum ib_device_cap_flags {
-- 
1.6.4


From eli at mellanox.co.il  Wed Aug 19 10:20:24 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:20:24 +0300
Subject: [ofa-general] [PATCHv5 07/10] ib_core: RDMAoE UD packet packing
	support
Message-ID: <20090819172024.GC14411@mtls03>

Add support functions to aid in packing RDMAoE packets.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/ud_header.c |  111 +++++++++++++++++++++++++++++++++++
 include/rdma/ib_pack.h              |   26 ++++++++
 2 files changed, 137 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c
index 8ec7876..d04b6f2 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -80,6 +80,29 @@ static const struct ib_field lrh_table[]  = {
 	  .size_bits    = 16 }
 };
 
+static const struct ib_field eth_table[]  = {
+	{ STRUCT_FIELD(eth, dmac_h),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 32 },
+	{ STRUCT_FIELD(eth, dmac_l),
+	  .offset_words = 1,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ STRUCT_FIELD(eth, smac_h),
+	  .offset_words = 1,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ STRUCT_FIELD(eth, smac_l),
+	  .offset_words = 2,
+	  .offset_bits  = 0,
+	  .size_bits    = 32 },
+	{ STRUCT_FIELD(eth, type),
+	  .offset_words = 3,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 }
+};
+
 static const struct ib_field grh_table[]  = {
 	{ STRUCT_FIELD(grh, ip_version),
 	  .offset_words = 0,
@@ -241,6 +264,53 @@ void ib_ud_header_init(int     		    payload_bytes,
 EXPORT_SYMBOL(ib_ud_header_init);
 
 /**
+ * ib_rdmaoe_ud_header_init - Initialize UD header structure
+ * @payload_bytes:Length of packet payload
+ * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @header:Structure to initialize
+ *
+ * ib_rdmaoe_ud_header_init() initializes the grh.ip_version, grh.payload_length,
+ * grh.next_header, bth.opcode, bth.pad_count and
+ * bth.transport_header_version fields of a &struct eth_ud_header given
+ * the payload length and whether a GRH will be included.
+ */
+void ib_rdmaoe_ud_header_init(int     		    payload_bytes,
+			   int    		    grh_present,
+			   struct eth_ud_header    *header)
+{
+	int header_len;
+
+	memset(header, 0, sizeof *header);
+
+	header_len =
+		sizeof header->eth  +
+		IB_BTH_BYTES  +
+		IB_DETH_BYTES;
+	if (grh_present)
+		header_len += IB_GRH_BYTES;
+
+	header->grh_present          = grh_present;
+	if (grh_present) {
+		header->grh.ip_version      = 6;
+		header->grh.payload_length  =
+			cpu_to_be16((IB_BTH_BYTES     +
+				     IB_DETH_BYTES    +
+				     payload_bytes    +
+				     4                + /* ICRC     */
+				     3) & ~3);          /* round up */
+		header->grh.next_header     = 0x1b;
+	}
+
+	if (header->immediate_present)
+		header->bth.opcode           = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE;
+	else
+		header->bth.opcode           = IB_OPCODE_UD_SEND_ONLY;
+	header->bth.pad_count                = (4 - payload_bytes) & 3;
+	header->bth.transport_header_version = 0;
+}
+EXPORT_SYMBOL(ib_rdmaoe_ud_header_init);
+
+/**
  * ib_ud_header_pack - Pack UD header struct into wire format
  * @header:UD header struct
  * @buf:Buffer to pack into
@@ -281,6 +351,47 @@ int ib_ud_header_pack(struct ib_ud_header *header,
 EXPORT_SYMBOL(ib_ud_header_pack);
 
 /**
+ * rdmaoe_ud_header_pack - Pack UD header struct into eth wire format
+ * @header:UD header struct
+ * @buf:Buffer to pack into
+ *
+ * rdmaoe_ud_header_pack() packs the UD header structure @header into wire
+ * format in the buffer @buf.
+ */
+int rdmaoe_ud_header_pack(struct eth_ud_header *header,
+		       void                 *buf)
+{
+	int len = 0;
+
+	ib_pack(eth_table, ARRAY_SIZE(eth_table),
+		&header->eth, buf);
+	len += IB_ETH_BYTES;
+
+	if (header->grh_present) {
+		ib_pack(grh_table, ARRAY_SIZE(grh_table),
+			&header->grh, buf + len);
+		len += IB_GRH_BYTES;
+	}
+
+	ib_pack(bth_table, ARRAY_SIZE(bth_table),
+		&header->bth, buf + len);
+	len += IB_BTH_BYTES;
+
+	ib_pack(deth_table, ARRAY_SIZE(deth_table),
+		&header->deth, buf + len);
+	len += IB_DETH_BYTES;
+
+	if (header->immediate_present) {
+		memcpy(buf + len, &header->immediate_data,
+		       sizeof header->immediate_data);
+		len += sizeof header->immediate_data;
+	}
+
+	return len;
+}
+EXPORT_SYMBOL(rdmaoe_ud_header_pack);
+
+/**
  * ib_ud_header_unpack - Unpack UD header struct from wire format
  * @header:UD header struct
  * @buf:Buffer to pack into
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index d7fc45c..bf199eb 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -37,6 +37,7 @@
 
 enum {
 	IB_LRH_BYTES  = 8,
+	IB_ETH_BYTES  = 14,
 	IB_GRH_BYTES  = 40,
 	IB_BTH_BYTES  = 12,
 	IB_DETH_BYTES = 8
@@ -210,6 +211,14 @@ struct ib_unpacked_deth {
 	__be32       source_qpn;
 };
 
+struct ib_unpacked_eth {
+	u8	dmac_h[4];
+	u8	dmac_l[2];
+	u8	smac_h[2];
+	u8	smac_l[4];
+	__be16	type;
+};
+
 struct ib_ud_header {
 	struct ib_unpacked_lrh  lrh;
 	int                     grh_present;
@@ -220,6 +229,16 @@ struct ib_ud_header {
 	__be32         		immediate_data;
 };
 
+struct eth_ud_header {
+	struct ib_unpacked_eth  eth;
+	int                     grh_present;
+	struct ib_unpacked_grh  grh;
+	struct ib_unpacked_bth  bth;
+	struct ib_unpacked_deth deth;
+	int            		immediate_present;
+	__be32         		immediate_data;
+};
+
 void ib_pack(const struct ib_field        *desc,
 	     int                           desc_len,
 	     void                         *structure,
@@ -234,10 +253,17 @@ void ib_ud_header_init(int     		   payload_bytes,
 		       int    		   grh_present,
 		       struct ib_ud_header *header);
 
+void ib_rdmaoe_ud_header_init(int     		   payload_bytes,
+			   int    		   grh_present,
+			   struct eth_ud_header   *header);
+
 int ib_ud_header_pack(struct ib_ud_header *header,
 		      void                *buf);
 
 int ib_ud_header_unpack(void                *buf,
 			struct ib_ud_header *header);
 
+int rdmaoe_ud_header_pack(struct eth_ud_header *header,
+		       void                 *buf);
+
 #endif /* IB_PACK_H */
-- 
1.6.4
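
For readers following the arithmetic in the patch above, here is a small
userspace sketch of the sizes involved. The constants mirror the IB_*_BYTES
values in ib_pack.h; the function names are hypothetical and exist only for
illustration.

```c
/* Header sizes in bytes, mirroring the IB_*_BYTES enum in ib_pack.h. */
enum {
	DEMO_ETH_BYTES  = 14,	/* 6 (dmac) + 6 (smac) + 2 (type) */
	DEMO_GRH_BYTES  = 40,
	DEMO_BTH_BYTES  = 12,
	DEMO_DETH_BYTES = 8,
	DEMO_ICRC_BYTES = 4,
	DEMO_IMM_BYTES  = 4
};

/* Number of bytes a packed RDMAoE UD header occupies. */
static int demo_rdmaoe_ud_header_len(int grh_present, int immediate_present)
{
	int len = DEMO_ETH_BYTES + DEMO_BTH_BYTES + DEMO_DETH_BYTES;

	if (grh_present)
		len += DEMO_GRH_BYTES;
	if (immediate_present)
		len += DEMO_IMM_BYTES;
	return len;
}

/* Pad bytes needed to bring the payload up to a multiple of four. */
static int demo_bth_pad_count(int payload_bytes)
{
	return (4 - payload_bytes) & 3;
}

/* GRH payload length: BTH + DETH + payload + ICRC, rounded up to 4. */
static int demo_grh_payload_length(int payload_bytes)
{
	return (DEMO_BTH_BYTES + DEMO_DETH_BYTES + payload_bytes +
		DEMO_ICRC_BYTES + 3) & ~3;
}
```

For example, a header with a GRH and no immediate data packs to
14 + 40 + 12 + 8 = 74 bytes, and a 5-byte payload needs 3 pad bytes.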


From eli at mellanox.co.il  Wed Aug 19 10:20:39 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:20:39 +0300
Subject: [ofa-general] [PATCHv5 08/10] ib_core: Add API to support RDMAoE
	from userspace
Message-ID: <20090819172039.GD14411@mtls03>

Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote
port's MAC address. Port transport is also returned by ibv_query_port().
ABI version is incremented from 6 to 7.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/core/uverbs.h      |    1 +
 drivers/infiniband/core/uverbs_cmd.c  |   32 ++++++++++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_main.c |    1 +
 drivers/infiniband/core/verbs.c       |   10 ++++++++++
 include/rdma/ib_user_verbs.h          |   21 ++++++++++++++++++---
 include/rdma/ib_verbs.h               |   12 ++++++++++++
 6 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index b3ea958..e69b04c 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -194,5 +194,6 @@ IB_UVERBS_DECLARE_CMD(create_srq);
 IB_UVERBS_DECLARE_CMD(modify_srq);
 IB_UVERBS_DECLARE_CMD(query_srq);
 IB_UVERBS_DECLARE_CMD(destroy_srq);
+IB_UVERBS_DECLARE_CMD(get_mac);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 56feab6..012aadf 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -452,6 +452,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
 	resp.active_width    = attr.active_width;
 	resp.active_speed    = attr.active_speed;
 	resp.phys_state      = attr.phys_state;
+	resp.transport	     = attr.transport;
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
@@ -1824,6 +1825,37 @@ err:
 	return ret;
 }
 
+ssize_t ib_uverbs_get_mac(struct ib_uverbs_file *file, const char __user *buf,
+			  int in_len, int out_len)
+{
+	struct ib_uverbs_get_mac        cmd;
+	struct ib_uverbs_get_mac_resp   resp;
+	int              ret;
+	struct ib_pd    *pd;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	pd = idr_read_pd(cmd.pd_handle, file->ucontext);
+	if (!pd)
+		return -EINVAL;
+
+	ret = ib_get_mac(pd->device, cmd.port, cmd.gid, resp.mac);
+	put_pd_read(pd);
+	if (!ret) {
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			return -EFAULT;
+
+		return in_len;
+	}
+
+	return ret;
+}
+
 ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 			     const char __user *buf, int in_len, int out_len)
 {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index eb36a81..2641845 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -108,6 +108,7 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
 	[IB_USER_VERBS_CMD_MODIFY_SRQ]    	= ib_uverbs_modify_srq,
 	[IB_USER_VERBS_CMD_QUERY_SRQ]     	= ib_uverbs_query_srq,
 	[IB_USER_VERBS_CMD_DESTROY_SRQ]   	= ib_uverbs_destroy_srq,
+	[IB_USER_VERBS_CMD_GET_MAC]		= ib_uverbs_get_mac,
 };
 
 static struct vfsmount *uverbs_event_mnt;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index d81e217..aaa8778 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -925,3 +925,13 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
 	return qp->device->detach_mcast(qp, gid, lid);
 }
 EXPORT_SYMBOL(ib_detach_mcast);
+
+int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac)
+{
+	if (!device->get_mac)
+		return -ENOSYS;
+
+	return device->get_mac(device, port, gid, mac);
+}
+EXPORT_SYMBOL(ib_get_mac);
+
diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h
index a17f771..49eee8a 100644
--- a/include/rdma/ib_user_verbs.h
+++ b/include/rdma/ib_user_verbs.h
@@ -42,7 +42,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define IB_USER_VERBS_ABI_VERSION	6
+#define IB_USER_VERBS_ABI_VERSION	7
 
 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -81,7 +81,8 @@ enum {
 	IB_USER_VERBS_CMD_MODIFY_SRQ,
 	IB_USER_VERBS_CMD_QUERY_SRQ,
 	IB_USER_VERBS_CMD_DESTROY_SRQ,
-	IB_USER_VERBS_CMD_POST_SRQ_RECV
+	IB_USER_VERBS_CMD_POST_SRQ_RECV,
+	IB_USER_VERBS_CMD_GET_MAC
 };
 
 /*
@@ -205,7 +206,8 @@ struct ib_uverbs_query_port_resp {
 	__u8  active_width;
 	__u8  active_speed;
 	__u8  phys_state;
-	__u8  reserved[3];
+	__u8  transport;
+	__u8  reserved[2];
 };
 
 struct ib_uverbs_alloc_pd {
@@ -621,6 +623,19 @@ struct ib_uverbs_destroy_ah {
 	__u32 ah_handle;
 };
 
+struct ib_uverbs_get_mac {
+	__u64	response;
+	__u32	pd_handle;
+	__u8	port;
+	__u8	reserved[3];
+	__u8	gid[16];
+};
+
+struct ib_uverbs_get_mac_resp {
+	__u8	mac[6];
+	__u16	reserved;
+};
+
 struct ib_uverbs_attach_mcast {
 	__u8  gid[16];
 	__u32 qp_handle;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index d9146c4..bf6e860 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1131,6 +1131,9 @@ struct ib_device {
 						  struct ib_grh *in_grh,
 						  struct ib_mad *in_mad,
 						  struct ib_mad *out_mad);
+	int                        (*get_mac)(struct ib_device *device, u8 port,
+					      u8 *gid, u8 *mac);
+
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
@@ -2037,4 +2040,13 @@ int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
  */
 int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 
+/**
+  * ib_get_mac - get the mac address for the specified gid
+  * @device: IB device used for traffic
+  * @port: port number used.
+  * @gid: gid to be resolved into mac
+  * @mac: mac of the port bearing this gid
+  */
+int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac);
+
 #endif /* IB_VERBS_H */
-- 
1.6.4
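
The optional-callback dispatch that ib_get_mac() performs can be mocked in
userspace as below. The wrapper mirrors the patch (return -ENOSYS when the
device has no get_mac method); everything else is an assumption: the demo_*
names are hypothetical, and the callback assumes the GID is a link-local
address built from the MAC with the modified EUI-64 rule, which the patch
above does not itself show.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_device with only the new method. */
struct demo_device {
	int (*get_mac)(struct demo_device *dev, uint8_t port,
		       const uint8_t *gid, uint8_t *mac);
};

/* Same shape as ib_get_mac(): fail cleanly if the driver lacks the method. */
static int demo_get_mac(struct demo_device *dev, uint8_t port,
			const uint8_t *gid, uint8_t *mac)
{
	if (!dev->get_mac)
		return -ENOSYS;

	return dev->get_mac(dev, port, gid, mac);
}

/*
 * Hypothetical driver callback. Assumes the low 8 bytes of the GID are a
 * modified EUI-64 interface ID and undoes that mapping to recover the MAC.
 */
static int demo_eui64_get_mac(struct demo_device *dev, uint8_t port,
			      const uint8_t *gid, uint8_t *mac)
{
	(void)dev;
	(void)port;
	mac[0] = gid[8] ^ 0x02;	/* restore the universal/local bit */
	mac[1] = gid[9];
	mac[2] = gid[10];
	mac[3] = gid[13];	/* skip the ff:fe filler in bytes 11-12 */
	mac[4] = gid[14];
	mac[5] = gid[15];
	return 0;
}
```

A device that leaves get_mac NULL makes demo_get_mac() return -ENOSYS, which
is how a caller can tell the method is not implemented on that device.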


From devel-ofed at morey-chaisemartin.com  Wed Aug 19 11:49:39 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 19 Aug 2009 20:49:39 +0200
Subject: [ofa-general] [PATCH 0/3]  Fat-Tree code cleanup
Message-ID: <4A8C4943.6090408@morey-chaisemartin.com>

Except for the first one, the patches in this series are trivial cleanups.
The first one removes the reverse_hop parameter, which is no longer needed now that there is a current_hop parameter.

Nicolas Morey-Chaisemartin (3):
  osm_ucast_ftree.c: Removed reverse_hop parameters from
    fabric_route_upgoing_by_going_down
  osm_ucast_ftree.c: Cleaned up many comments
  osm_ucast_ftree.c: Applied osm_indent

 opensm/opensm/osm_ucast_ftree.c |  169 +++++++++++++++++----------------------
 1 files changed, 72 insertions(+), 97 deletions(-)


From devel-ofed at morey-chaisemartin.com  Wed Aug 19 11:50:04 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 19 Aug 2009 20:50:04 +0200
Subject: [ofa-general] [PATCH 1/3] osm_ucast_ftree.c: Removed reverse_hop
 parameters from fabric_route_upgoing_by_going_down
In-Reply-To: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
References: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
Message-ID: <4A8C495C.202@morey-chaisemartin.com>


The parameter was only used to calculate the number of hops done up to this point but
this is not required anymore as there is a current_hops parameter now.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas at morey-chaisemartin.com>
---
 opensm/opensm/osm_ucast_ftree.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: d432935addec203b1a0b3d8343c30016fc49f5c0.diff
Type: text/x-patch
Size: 1476 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090819/ba029f6e/attachment.bin>

From nicolas at morey-chaisemartin.com  Wed Aug 19 11:50:23 2009
From: nicolas at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 19 Aug 2009 20:50:23 +0200
Subject: [ofa-general] [PATCH 2/3] osm_ucast_ftree.c: Cleaned up many
	comments
In-Reply-To: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
References: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
Message-ID: <4A8C496F.4020705@morey-chaisemartin.com>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas at morey-chaisemartin.com>
---
 opensm/opensm/osm_ucast_ftree.c |   33 +++++++++++----------------------
 1 files changed, 11 insertions(+), 22 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 9dd7818726e5c3062bf14e0b026dfc7e12f6e8fe.diff
Type: text/x-patch
Size: 5748 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090819/99de5eb0/attachment.bin>

From devel-ofed at morey-chaisemartin.com  Wed Aug 19 11:50:52 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 19 Aug 2009 20:50:52 +0200
Subject: [ofa-general] [PATCH 2/3] osm_ucast_ftree.c: Cleaned up many
	comments
In-Reply-To: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
References: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
Message-ID: <4A8C498C.8000603@morey-chaisemartin.com>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas at morey-chaisemartin.com>
---
 opensm/opensm/osm_ucast_ftree.c |   33 +++++++++++----------------------
 1 files changed, 11 insertions(+), 22 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 9dd7818726e5c3062bf14e0b026dfc7e12f6e8fe.diff
Type: text/x-patch
Size: 5748 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090819/81c7afee/attachment.bin>

From devel-ofed at morey-chaisemartin.com  Wed Aug 19 11:51:15 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 19 Aug 2009 20:51:15 +0200
Subject: [ofa-general] [PATCH 3/3] osm_ucast_ftree.c: Applied osm_indent
In-Reply-To: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
References: <cover.1248160190.git.nicolas@morey-chaisemartin.com>
Message-ID: <4A8C49A3.7020404@morey-chaisemartin.com>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas at morey-chaisemartin.com>
---
 opensm/opensm/osm_ucast_ftree.c |  135 ++++++++++++++++++---------------------
 1 files changed, 62 insertions(+), 73 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: c09f7c217e85b91938b57660f17fa212e6d27b1d.diff
Type: text/x-patch
Size: 11863 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090819/26335e72/attachment.bin>

From kononov at ftml.net  Wed Aug 19 15:01:06 2009
From: kononov at ftml.net (Roman Kononov)
Date: Wed, 19 Aug 2009 17:01:06 -0500
Subject: [ofa-general] ibv_rc_pingpong hangs with forks
Message-ID: <4A8C7622.8080505@ftml.net>

Hello,

The attached modification to ibv_rc_pingpong makes it never complete.

Forking seems to do something bad. I noticed that forking right after 
ibv_post_recv() "cancels" the posted WR (as if it was never issued): sender 
keeps retrying.

Is it expected behavior or a bug?

This happens with ConnectX MT25418 and InfiniHost MT25208 adapters.

Extracted from OFED-1.4.tgz:
libibverbs-1.1.2
libmlx4-1.0
libmthca-1.0.5

Kernel: 2.6.30.5 x86_64 SMP

Roman Kononov

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fork-test.patch
Type: text/x-patch
Size: 904 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090819/06ed7758/attachment.bin>

From rdreier at cisco.com  Wed Aug 19 15:14:29 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 19 Aug 2009 15:14:29 -0700
Subject: [ofa-general] ibv_rc_pingpong hangs with forks
In-Reply-To: <4A8C7622.8080505@ftml.net> (Roman Kononov's message of "Wed, 19
	Aug 2009 17:01:06 -0500")
References: <4A8C7622.8080505@ftml.net>
Message-ID: <adad46rzc1m.fsf@cisco.com>


 > The attached modification to ibv_rc_pingpong makes it never complete.
 > 
 > Forking seems to do something bad. I noticed that forking right after
 > ibv_post_recv() "cancels" the posted WR (as if it was never issued):
 > sender keeps retrying.
 > 
 > Is it expected behavior or a bug?

Yes, forking is expected to cause issues.  Do things work better if you
set the environment variable IBV_FORK_SAFE=1 ?

 - R.
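
For programs that cannot rely on the caller's environment, the same setting
can be requested from inside the process. This is a minimal sketch under the
assumption that libibverbs reads the variable when it first initializes, so
it must run before any verbs call; request_fork_safe_verbs() is a
hypothetical name. Calling ibv_fork_init() from libibverbs is the other
documented way to ask for fork support.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: set IBV_FORK_SAFE for this process. Call it before
 * the first libibverbs call (for example before ibv_get_device_list()).
 * Returns 0 on success and -1 on failure, like setenv().
 */
static int request_fork_safe_verbs(void)
{
	return setenv("IBV_FORK_SAFE", "1", 1);
}
```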


From kononov at ftml.net  Wed Aug 19 16:06:21 2009
From: kononov at ftml.net (Roman Kononov)
Date: Wed, 19 Aug 2009 18:06:21 -0500
Subject: [ofa-general] ibv_rc_pingpong hangs with forks
In-Reply-To: <adad46rzc1m.fsf@cisco.com>
References: <4A8C7622.8080505@ftml.net> <adad46rzc1m.fsf@cisco.com>
Message-ID: <4A8C856D.7010705@ftml.net>

On 2009-08-19 17:14, Roland Dreier wrote:
> Yes, forking is expected to cause issues.  Do things work better if you
> set the environment variable IBV_FORK_SAFE=1 ?

Yes, indeed. This thing sucked tonnes of my blood. Thanks.

Roman


From vlad at lists.openfabrics.org  Thu Aug 20 03:00:41 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 20 Aug 2009 03:00:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090820-0200 daily build status
Message-ID: <20090820100042.02B9F10201B5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From kliteyn at dev.mellanox.co.il  Thu Aug 20 06:06:55 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Aug 2009 16:06:55 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to
 QoS level with pkey
Message-ID: <4A8D4A6F.9050404@dev.mellanox.co.il>


Hi Sasha,

Fixing a bug in matching a PR query to QoS
levels when a pkey is specified: pkeys in the QoS
policy are held without the MSB (the membership bit).

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

---
 opensm/opensm/osm_qos_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index febd7f6..9b72293 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -303,7 +303,7 @@ boolean_t osm_qos_level_has_pkey(IN const osm_qos_level_t * p_qos_level,
 		return FALSE;
 	return __is_num_in_range_arr(p_qos_level->pkey_range_arr,
 				     p_qos_level->pkey_range_len,
-				     cl_ntoh16(pkey));
+				     cl_ntoh16(ib_pkey_get_base(pkey)));
 }

 /***************************************************
-- 
1.5.1.4
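
The effect of the fix above can be reproduced with a small userspace sketch.
The demo_* names are hypothetical stand-ins (not OpenSM functions), and the
mask value 0x7fff is the 15-bit pkey base with the membership bit cleared.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PKEY_BASE_MASK 0x7fff	/* low 15 bits: pkey without MSB */

/* Clear the membership bit of a pkey held in network byte order. */
static uint16_t demo_pkey_get_base(uint16_t pkey_net)
{
	return pkey_net & htons(DEMO_PKEY_BASE_MASK);
}

/* Host-order range check, as a policy lookup would do. */
static bool demo_in_range(uint16_t val, uint16_t lo, uint16_t hi)
{
	return val >= lo && val <= hi;
}

/* Match a network-order pkey against a policy range stored without MSB. */
static bool demo_level_has_pkey(uint16_t pkey_net, uint16_t lo, uint16_t hi)
{
	return demo_in_range(ntohs(demo_pkey_get_base(pkey_net)), lo, hi);
}
```

A full-member pkey 0xffff misses a policy range of 0x7fff-0x7fff when it is
compared directly, but matches once the membership bit is masked off.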


From kliteyn at dev.mellanox.co.il  Thu Aug 20 06:07:16 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Aug 2009 16:07:16 +0300
Subject: [ofa-general] [PATCH] opensm: fixing some data types in
	osm_req_get/set
Message-ID: <4A8D4A84.3050605@dev.mellanox.co.il>

Hi Sasha,

Attribute ID and attribute modifier are used in osm_req_get/set
in network order - fixing data types.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/include/opensm/osm_sm.h |    8 ++++----
 opensm/opensm/osm_req.c        |    8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
index cc8321d..152ecd7 100644
--- a/opensm/include/opensm/osm_sm.h
+++ b/opensm/include/opensm/osm_sm.h
@@ -404,8 +404,8 @@ osm_sm_bind(IN osm_sm_t * const p_sm, IN const ib_net64_t port_guid);
 ib_api_status_t
 osm_req_get(IN osm_sm_t * sm,
 	    IN const osm_dr_path_t * const p_path,
-	    IN const uint16_t attr_id,
-	    IN const uint32_t attr_mod,
+	    IN const ib_net16_t attr_id,
+	    IN const ib_net32_t attr_mod,
 	    IN const cl_disp_msgid_t err_msg,
 	    IN const osm_madw_context_t * const p_context);
 /*
@@ -452,8 +452,8 @@ osm_req_set(IN osm_sm_t * sm,
 	    IN const osm_dr_path_t * const p_path,
 	    IN const uint8_t * const p_payload,
 	    IN const size_t payload_size,
-	    IN const uint16_t attr_id,
-	    IN const uint32_t attr_mod,
+	    IN const ib_net16_t attr_id,
+	    IN const ib_net32_t attr_mod,
 	    IN const cl_disp_msgid_t err_msg,
 	    IN const osm_madw_context_t * const p_context);
 /*
diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c
index baeeed7..f79e3ab 100644
--- a/opensm/opensm/osm_req.c
+++ b/opensm/opensm/osm_req.c
@@ -62,8 +62,8 @@
 **********************************************************************/
 ib_api_status_t osm_req_get(IN osm_sm_t * sm,
 			    IN const osm_dr_path_t * const p_path,
-			    IN const uint16_t attr_id,
-			    IN const uint32_t attr_mod,
+			    IN const ib_net16_t attr_id,
+			    IN const ib_net32_t attr_mod,
 			    IN const cl_disp_msgid_t err_msg,
 			    IN const osm_madw_context_t * const p_context)
 {
@@ -134,8 +134,8 @@ ib_api_status_t osm_req_set(IN osm_sm_t * sm,
 			    IN const osm_dr_path_t * const p_path,
 			    IN const uint8_t * const p_payload,
 			    IN const size_t payload_size,
-			    IN const uint16_t attr_id,
-			    IN const uint32_t attr_mod,
+			    IN const ib_net16_t attr_id,
+			    IN const ib_net32_t attr_mod,
 			    IN const cl_disp_msgid_t err_msg,
 			    IN const osm_madw_context_t * const p_context)
 {
-- 
1.5.1.4


From arlin.r.davis at intel.com  Thu Aug 20 11:04:56 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 20 Aug 2009 11:04:56 -0700
Subject: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep processor
	yield
Message-ID: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com>


Be thread-scheduler friendly: yield the processor while polling for completions so that other threads can run.

Signed-off-by: Stan Smith <stan.smith at intel.com>

--- a/test/dapltest/mdep/linux/dapl_mdep_user.h	Wed Aug 19 14:09:52 2009
+++ b/test/dapltest/mdep/linux/dapl_mdep_user.h	Wed Aug 19 13:32:36 2009
@@ -200,4 +200,9 @@
 
 #define DT_Mdep_flush() fflush(NULL)
 
+/*
+ * Release processor to reschedule
+ */
+#define DT_Mdep_yield pthread_yield
+
 #endif

--- a/test/dapltest/mdep/solaris/dapl_mdep_user.h	Thu Aug 20 08:49:11 2009
+++ b/test/dapltest/mdep/solaris/dapl_mdep_user.h	Wed Aug 19 16:23:28 2009
@@ -74,6 +74,10 @@
 #define DT_Mdep_printf printf
 #define DT_Mdep_flush() fflush(NULL)
 
+/*
+ * Release processor to reschedule
+ */
+#define DT_Mdep_yield pthread_yield
 
 /*
  * Locks

--- a/test/dapltest/mdep/windows/dapl_mdep_user.h	Wed Aug 19 14:08:50 2009
+++ b/test/dapltest/mdep/windows/dapl_mdep_user.h	Tue Aug 18 13:57:09 2009
@@ -80,6 +80,11 @@
 #define DT_Mdep_flush() fflush(NULL)
 
 /*
+ * Release processor to reschedule
+ */
+#define DT_Mdep_yield() Sleep(0)
+
+/*
  * Locks
  */
 

--- a/test/dapltest/test/dapl_test_util.c	Wed Aug 19 14:20:07 2009
+++ b/test/dapltest/test/dapl_test_util.c	Wed Aug 19 14:20:00 2009
@@ -415,7 +415,7 @@
 		  DAT_EVD_HANDLE evd_handle,
 		  DAT_DTO_COMPLETION_EVENT_DATA * dto_statusp)
 {
-	for (;;) {
+	for (;;DT_Mdep_yield()) {
 		DAT_RETURN ret;
 		DAT_EVENT event;
 

From rdreier at cisco.com  Thu Aug 20 11:24:41 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 11:24:41 -0700
Subject: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep
	processor yield
In-Reply-To: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> (Arlin
	Davis's message of "Thu, 20 Aug 2009 11:04:56 -0700")
References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com>
Message-ID: <adamy5uxs0m.fsf@cisco.com>


 > +#define DT_Mdep_yield pthread_yield

Be aware that on Linux I believe this turns into sched_yield(), which
basically means "put me at the end of the thread list", i.e. wait for
everyone else to get a turn, i.e. possibly huge latency...


From stan.smith at intel.com  Thu Aug 20 11:43:13 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Thu, 20 Aug 2009 11:43:13 -0700
Subject: [ofw] Re: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for
	mdep	processor yield
In-Reply-To: <adamy5uxs0m.fsf@cisco.com>
References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com>
	<adamy5uxs0m.fsf@cisco.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com>

Roland Dreier wrote:
>  > +#define DT_Mdep_yield pthread_yield
>
> Be aware that on Linux I believe this turns into sched_yield(), which
> basically means "put me at the end of the thread list" ie wait for
> everyone else to get a turn ie possibly huge latency...

Is sleep(0) a preferred way to go?


From rdreier at cisco.com  Thu Aug 20 11:45:38 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 11:45:38 -0700
Subject: [ofw] Re: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for
	mdep	processor yield
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com>
	(Stan Smith's message of "Thu, 20 Aug 2009 11:43:13 -0700")
References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com>
	<adamy5uxs0m.fsf@cisco.com>
	<3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com>
Message-ID: <adahbw2xr1p.fsf@cisco.com>


 > Is sleep(0) a preferred way to go?

I think the best solution is not coding spin-loops.  Not sure what
sleep(0) ends up turning into, but if you can tell the system "I'm
waiting for this object, wake me up when it's available" then that
should produce the best behavior.

 - R.


From rdreier at cisco.com  Thu Aug 20 13:46:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 13:46:53 -0700
Subject: [ofa-general][PATCH] mlx4_core: Avoid double icms free
In-Reply-To: <4A8805D6.10803@mellanox.co.il> (Yevgeny Petrilin's message of
	"Sun, 16 Aug 2009 16:12:54 +0300")
References: <4A8805D6.10803@mellanox.co.il>
Message-ID: <adabpmaxlfm.fsf@cisco.com>

thanks, applied.


From worleys at gmail.com  Thu Aug 20 14:17:01 2009
From: worleys at gmail.com (Chris Worley)
Date: Thu, 20 Aug 2009 15:17:01 -0600
Subject: [ofa-general] iSER issues in RHEL 5.3
Message-ID: <f3177b9e0908201417i1b496287u509fc306980671ef@mail.gmail.com>

Configuration: RHEL 5.3, 2.6.18-128 kernel, OFED 1.4.1

With one lun exported, running "fio" with 1MB blocks, I get sporadic
errors on the initiator like:

sd 4:0:0:1: timing out command, waited 360s
sd 4:0:0:1: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdb, sector 3921920
sd 4:0:0:1: timing out command, waited 360s
sd 4:0:0:1: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdb, sector 3962880

No problems reported on the target.

With multiple LUNs exported, tgtd hangs on the target as soon as I
try to benchmark the LUNs (login was successful).

iSCSI seems fine (but slow), iSER seems problematic.

Any ideas?

What are good known configurations (distros/kernel/OFED) for iSER?

Thanks,

Chris


From rdreier at cisco.com  Thu Aug 20 14:33:41 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 14:33:41 -0700
Subject: [ofa-general] Better way to get sufficient EQ context memory?
Message-ID: <ada4os2xj9m.fsf@cisco.com>

Eli, it occurs to me that since we're doing more than one page for EQ
context now, we might as well use the normal ICM table stuff that
everything else uses.  Seems the code becomes much simpler and I don't
think there's any real overhead added... thoughts?

(Christoph, I tested this with "possible_cpus=32" and it still works for
me -- if you get a chance on your Dell systems that would be helpful too)

commit 58cafda0c3010fc2cdb0fc9be3fbd6d09640dd6f
Author: Roland Dreier <rolandd at cisco.com>
Date:   Thu Aug 20 14:26:21 2009 -0700

    mlx4_core: Allocate and map sufficient ICM memory for EQ context
    
    The current implementation allocates a single host page for EQ context
    memory, which was OK when we only allocated a few EQs.  However, since
    we now allocate an EQ for each CPU core, this patch removes the
    hard-coded limit (which we exceed with 4 KB pages and 128 byte EQ
    context entries with 32 CPUs) and uses the same ICM table code as all
    other context tables.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/net/mlx4/eq.c   |   42 ------------------------------------------
 drivers/net/mlx4/main.c |    9 ++++++---
 drivers/net/mlx4/mlx4.h |    7 +------
 3 files changed, 7 insertions(+), 51 deletions(-)

diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index c11a052..d7974a6 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -525,48 +525,6 @@ static void mlx4_unmap_clr_int(struct mlx4_dev *dev)
 	iounmap(priv->clr_base);
 }
 
-int mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt)
-{
-	struct mlx4_priv *priv = mlx4_priv(dev);
-	int ret;
-
-	/*
-	 * We assume that mapping one page is enough for the whole EQ
-	 * context table.  This is fine with all current HCAs, because
-	 * we only use 32 EQs and each EQ uses 64 bytes of context
-	 * memory, or 1 KB total.
-	 */
-	priv->eq_table.icm_virt = icm_virt;
-	priv->eq_table.icm_page = alloc_page(GFP_HIGHUSER);
-	if (!priv->eq_table.icm_page)
-		return -ENOMEM;
-	priv->eq_table.icm_dma  = pci_map_page(dev->pdev, priv->eq_table.icm_page, 0,
-					       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-	if (pci_dma_mapping_error(dev->pdev, priv->eq_table.icm_dma)) {
-		__free_page(priv->eq_table.icm_page);
-		return -ENOMEM;
-	}
-
-	ret = mlx4_MAP_ICM_page(dev, priv->eq_table.icm_dma, icm_virt);
-	if (ret) {
-		pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE,
-			       PCI_DMA_BIDIRECTIONAL);
-		__free_page(priv->eq_table.icm_page);
-	}
-
-	return ret;
-}
-
-void mlx4_unmap_eq_icm(struct mlx4_dev *dev)
-{
-	struct mlx4_priv *priv = mlx4_priv(dev);
-
-	mlx4_UNMAP_ICM(dev, priv->eq_table.icm_virt, 1);
-	pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE,
-		       PCI_DMA_BIDIRECTIONAL);
-	__free_page(priv->eq_table.icm_page);
-}
-
 int mlx4_alloc_eq_table(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 5c1afe0..528f89b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -525,7 +525,10 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap,
 		goto err_unmap_aux;
 	}
 
-	err = mlx4_map_eq_icm(dev, init_hca->eqc_base);
+	err = mlx4_init_icm_table(dev, &priv->eq_table.table,
+				  init_hca->eqc_base, dev_cap->eqc_entry_sz,
+				  dev->caps.num_eqs, dev->caps.num_eqs,
+				  0, 0);
 	if (err) {
 		mlx4_err(dev, "Failed to map EQ context memory, aborting.\n");
 		goto err_unmap_cmpt;
@@ -668,7 +671,7 @@ err_unmap_mtt:
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table);
 
 err_unmap_eq:
-	mlx4_unmap_eq_icm(dev);
+	mlx4_cleanup_icm_table(dev, &priv->eq_table.table);
 
 err_unmap_cmpt:
 	mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table);
@@ -698,11 +701,11 @@ static void mlx4_free_icms(struct mlx4_dev *dev)
 	mlx4_cleanup_icm_table(dev, &priv->qp_table.qp_table);
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.dmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table);
+	mlx4_cleanup_icm_table(dev, &priv->eq_table.table);
 	mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->cq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->srq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->qp_table.cmpt_table);
-	mlx4_unmap_eq_icm(dev);
 
 	mlx4_UNMAP_ICM_AUX(dev);
 	mlx4_free_icm(dev, priv->fw.aux_icm, 0);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 5bd79c2..bc72d6e 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -205,9 +205,7 @@ struct mlx4_eq_table {
 	void __iomem	      **uar_map;
 	u32			clr_mask;
 	struct mlx4_eq	       *eq;
-	u64			icm_virt;
-	struct page	       *icm_page;
-	dma_addr_t		icm_dma;
+	struct mlx4_icm_table	table;
 	struct mlx4_icm_table	cmpt_table;
 	int			have_irq;
 	u8			inta_pin;
@@ -373,9 +371,6 @@ u64 mlx4_make_profile(struct mlx4_dev *dev,
 		      struct mlx4_dev_cap *dev_cap,
 		      struct mlx4_init_hca_param *init_hca);
 
-int mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt);
-void mlx4_unmap_eq_icm(struct mlx4_dev *dev);
-
 int mlx4_cmd_init(struct mlx4_dev *dev);
 void mlx4_cmd_cleanup(struct mlx4_dev *dev);
 void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param);


From rdreier at cisco.com  Thu Aug 20 15:00:14 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 15:00:14 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list
	to vger?
Message-ID: <adavdkiw3gx.fsf@cisco.com>

Lately, I've had a few emails that I thought would have been of interest
to both lkml and also to general at lists.openfabrics.org.  I've held back
on cross-posting them because I know that general@ is subscribers-only,
and the bounce messages are quite annoying to replies coming from lkml.

The general@ list is subscribers-only because the openfabrics.org
sysadmin team is already overworked without trying to keep an open list
spam free.  (I say that with no intention to criticize the
openfabrics.org admins -- they do a terrific job of keeping things
running with the limited resources available; it's more a testament to
how impressive the vger mailing list admins are)

I've also noticed one or two messages about the possibility of moving
another moderated list to vger.  Certainly I prefer open lists that
don't require subscriptions to post.

So with that background, what would people think about creating an open
vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
currently on general at lists.openfabrics.org?  (The transition plan would
probably be to keep the general@ list for a month or two, with frequent
announcements of the new list, until archives etc. have caught up with
the switch)

Thanks,
  Roland


From Jeffrey.C.Becker at nasa.gov  Thu Aug 20 15:02:01 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Thu, 20 Aug 2009 15:02:01 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing
	list	to vger?
In-Reply-To: <adavdkiw3gx.fsf@cisco.com>
References: <adavdkiw3gx.fsf@cisco.com>
Message-ID: <4A8DC7D9.10005@nasa.gov>

Roland Dreier wrote:
> Lately, I've had a few emails that I thought would have been of interest
> to both lkml and also to general at lists.openfabrics.org.  I've held back
> on cross-posting them because I know that general@ is subscribers-only,
> and the bounce messages are quite annoying to replies coming from lkml.
>
> The general@ list is subscribers-only because the openfabrics.org
> sysadmin team is already overworked without trying to keep an open list
> spam free.  (I say that with no intention to criticize the
> openfabrics.org admins -- they do a terrific job of keeping things
> running with the limited resources available; it's more a testament to
> how impressive the vger mailing list admins are)
>
> I've also noticed one or two messages about the possibility of moving
> another moderated list to vger.  Certainly I prefer open lists that
> don't require subscriptions to post.
>
> So with that background, what would people think about creating an open
> vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> currently on general at lists.openfabrics.org?  (The transition plan would
> probably be to keep the general@ list for a month or two, with frequent
> announcements of the new list, until archives etc. have caught up with
> the switch)
>   

+1

-jeff

> Thanks,
>   Roland


From jsquyres at cisco.com  Thu Aug 20 15:07:24 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 20 Aug 2009 18:07:24 -0400
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA
	mailinglist	to vger?
In-Reply-To: <4A8DC7D9.10005@nasa.gov>
References: <adavdkiw3gx.fsf@cisco.com> <4A8DC7D9.10005@nasa.gov>
Message-ID: <463C72F9-8011-4F6B-BE5B-619AA958B171@cisco.com>

+1

On Aug 20, 2009, at 6:02 PM, Jeff Becker wrote:

> Roland Dreier wrote:
> > Lately, I've had a few emails that I thought would have been of interest
> > to both lkml and also to general at lists.openfabrics.org.  I've held back
> > on cross-posting them because I know that general@ is subscribers-only,
> > and the bounce messages are quite annoying to replies coming from lkml.
> >
> > The general@ list is subscribers-only because the openfabrics.org
> > sysadmin team is already overworked without trying to keep an open list
> > spam free.  (I say that with no intention to criticize the
> > openfabrics.org admins -- they do a terrific job of keeping things
> > running with the limited resources available; it's more a testament to
> > how impressive the vger mailing list admins are)
> >
> > I've also noticed one or two messages about the possibility of moving
> > another moderated list to vger.  Certainly I prefer open lists that
> > don't require subscriptions to post.
> >
> > So with that background, what would people think about creating an open
> > vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> > currently on general at lists.openfabrics.org?  (The transition plan would
> > probably be to keep the general@ list for a month or two, with frequent
> > announcements of the new list, until archives etc. have caught up with
> > the switch)
> >
>
> +1
>
> -jeff
>
> > Thanks,
> >   Roland


-- 
Jeff Squyres
jsquyres at cisco.com


From robert.j.woodruff at intel.com  Thu Aug 20 15:41:39 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 20 Aug 2009 15:41:39 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing
	list	to vger?
In-Reply-To: <adavdkiw3gx.fsf@cisco.com>
References: <adavdkiw3gx.fsf@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com>

Roland wrote,  

>Lately, I've had a few emails that I thought would have been of interest
>to both lkml and also to general at lists.openfabrics.org.  I've held back
>on cross-posting them because I know that general@ is subscribers-only,
>and the bounce messages are quite annoying to replies coming from lkml.

>The general@ list is subscribers-only because the openfabrics.org
>sysadmin team is already overworked without trying to keep an open list
>spam free.  (I say that with no intention to criticize the
>openfabrics.org admins -- they do a terrific job of keeping things
>running with the limited resources available; it's more a testament to
>how impressive the vger mailing list admins are)

>I've also noticed one or two messages about the possibility of moving
>another moderated list to vger.  Certainly I prefer open lists that
>don't require subscriptions to post.

>So with that background, what would people think about creating an open
>vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
>currently on general at lists.openfabrics.org?  (The transition plan would
>probably be to keep the general@ list for a month or two, with frequent
>announcements of the new list, until archives etc. have caught up with
>the switch)

The one question I have is whether the kernel.org people want to
see all of the traffic that we currently have on the OpenFabrics
general list for all of the user-space components.

I do not think it would be good if we had to have one list on vger for
kernel work and another one for all the user-space work.  Other than
that, I do not really care where the general development list is hosted.

my 2 cents,
woody


From davem at davemloft.net  Thu Aug 20 16:08:00 2009
From: davem at davemloft.net (David Miller)
Date: Thu, 20 Aug 2009 16:08:00 -0700 (PDT)
Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing
	list to vger?
In-Reply-To: <adavdkiw3gx.fsf@cisco.com>
References: <adavdkiw3gx.fsf@cisco.com>
Message-ID: <20090820.160800.50693597.davem@davemloft.net>

From: Roland Dreier <rdreier at cisco.com>
Date: Thu, 20 Aug 2009 15:00:14 -0700

> linux-rdma at vger.kernel.org

It's there, ready and waiting, should you choose to use it :-)


From jgunthorpe at obsidianresearch.com  Thu Aug 20 17:04:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 20 Aug 2009 18:04:31 -0600
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
Message-ID: <20090821000431.GA5713@obsidianresearch.com>

Check that the format of the multicast link address is correct before
taking it from dev->mc_list to priv->multicast_list. This way we never
try to send a bogus address to the SA, which prevents badness from an
erroneous 'ip maddr addr add', broken bonding drivers, or whatever.

Signed-off-by: Jason Gunthorpe <jgunthorpe at obsidianresearch.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

Same problem Moni was working on, but let's just address it directly.

There is work to try and fix the bonding driver but no fixed version
is in mainline yet. This is a cheap and simple workaround that is
worth having even once the driver is fixed.

Despite this, I think it is still necessary to do something like Moni
was trying - to prevent the MCG join queue from head of line blocking
on a single bad SA response. This can happen even if everything is
correct.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 425e311..973a24b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -758,6 +758,20 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 	}
 }
 
+static int check_mcast(const u8 *addr,unsigned int addrlen,
+		       const u8 *broadcast)
+{
+	if (addrlen != 20)
+		return 0;
+	/* QPN, scope, reserved, signature upper */
+	if (memcmp(addr,broadcast,6) != 0)
+		return 0;
+	/* signature lower, pkey */
+	if (memcmp(addr + 7,broadcast+7,3) != 0)
+		return 0;
+	return 1;
+}
+
 void ipoib_mcast_restart_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
@@ -791,6 +805,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	for (mclist = dev->mc_list; mclist; mclist = mclist->next) {
 		union ib_gid mgid;
 
+		if (!check_mcast(mclist->dmi_addr,mclist->dmi_addrlen,
+				 dev->broadcast))
+			continue;
+
 		memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid);
 
 		mcast = __ipoib_mcast_find(dev, &mgid);
-- 
1.5.4.2


From jenos at ncsa.uiuc.edu  Fri Aug 21 00:10:12 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Fri, 21 Aug 2009 02:10:12 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
Message-ID: <4A8E4854.2060909@ncsa.uiuc.edu>

Fedora 10 has been GA for almost a year, and Fedora 9 is no longer maintained.
The lack of OFED support for FC10 creates a tough spot when trying to stay
secure.  Is there *any* version (1.5, etc.) that will even build on FC10?
thx-

    Jeremy


From rdreier at cisco.com  Fri Aug 21 02:10:04 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Aug 2009 02:10:04 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing
	list	to vger?
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com>
	(Robert J. Woodruff's message of "Thu, 20 Aug 2009 15:41:39 -0700")
References: <adavdkiw3gx.fsf@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com>
Message-ID: <adaiqghwn0z.fsf@cisco.com>


 > The one question I would have would be do the kernel.org people want to
 > see all of the traffic that we currently have on the open fabrics 
 > general list for all of the user-space components? 

I don't believe there's any problem with that.  vger already hosts quite
a few lists that are userspace only (eg git) or span user and kernel (eg
alsa and kvm).  And in any case the total traffic (# of subscribers, #
of messages) that the current general@ list generates is pretty minimal
compared to what vger already handles, so I think there's no problem
with having a linux-rdma at vger.kernel.org carry everything that
general at lists.openfabrics.org does today.

 - R.


From vlad at lists.openfabrics.org  Fri Aug 21 03:03:36 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 21 Aug 2009 03:03:36 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090821-0200 daily build status
Message-ID: <20090821100336.A7718E282A2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From jsquyres at cisco.com  Fri Aug 21 07:44:47 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 21 Aug 2009 10:44:47 -0400
Subject: [ofa-general] Update on Roland's ummunotify kernel module
Message-ID: <912100AC-5AEE-4793-9A83-F62424CB027A@cisco.com>

Roland has pushed his new Linux "ummunotify" kernel module upstream (i.e.,
it's in his -next git branch):

http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=2fadea9acc19674c07ae7a9d90758f4b9b793940

It's not yet guaranteed that it will be accepted, but it looks good so  
far.  With some bug fixes from Pasha/Mellanox and Lenny+Mike/Voltaire,  
I think it's ready for wide-spread testing (I mailed some OMPI  
community members yesterday asking for specific testing).  I'm asking  
all MPI implementors to give the prototype code a whirl to shake out  
any remaining design bugs.  Others are welcome to review the design  
concepts and code as well; the more eyes, the better.  Bug fixes are  
easy later; design flaws are [much] better to be fixed now.

I describe the issue that we're fixing in my new MPI-themed blog:

    http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking

The HG where this OMPI work is being done is here:

    http://bitbucket.org/jsquyres/ummunot/

You need to have a very recent Linux kernel (2.6.31+) and Roland's
ummunotify module installed/running.  Build the OMPI HG tree with the
"--enable-mca-no-build=memory-ptmalloc2" configure option to disable
ptmalloc2 and enable the ummunotify stuff.

This hack-ish "disable ptmalloc2" step is only necessary while we're
shaking out the design issues.  I'm halfway through merging the
ummunot+ptmalloc2 code into a new opal/mca/memory component named "linux".
This component will choose at run time whether to use ptmalloc2 or the  
ummunotify stuff (i.e., the --enable-mca-no-build... step won't be  
necessary when all is said and done; a default OMPI Linux build will  
do the Right Things).

Thanks.

-- 
Jeff Squyres
jsquyres at cisco.com


From arlin.r.davis at intel.com  Fri Aug 21 12:56:10 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Fri, 21 Aug 2009 12:56:10 -0700
Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.22 release
Message-ID: <E3280858FA94444CA49D2BA02341C9835A17285D@orsmsx506.amr.corp.intel.com>

 
New release for uDAPL 2.0 available on the OFA download page and in my git tree.

New UCM provider uses its own CM protocol on top of IB-UD queue pairs.
During device open, this provider creates a UD queue pair and
returns local address information via dat_ia_query. This 24 byte
opaque address must be exchanged out-of-band before connecting to a
server via dat_ep_connect. This provider is targeted for MPI
implementations that already exchange address information
during boot/init phase and offers better scaling than existing
scm and cma providers.

md5sum: 9a8be3e780a6105fb4d9c85dacf556af dapl-2.0.22.tar.gz 

Summary of changes for last 2 releases: 

2.0.22
v2 - ucm: new provider using DAPL based IB-UD cm mechanism for MPI 
v2 - dapltest: add processor yield when polling for completions

2.0.21
v2 - scm: Fix disconnect. QP's need to move to ERROR state in 
v2 - dtest: modify dtest.c to cleanup CNO wait code and consolidate into 
v2 - common: CNO events, once triggered will not be returned during the cno wait. 
v2 - scm, cma: CNO support broken in both CMA and SCM providers. 
v2 - common osd: include winsock2.h for IPv6 definitions. 
v2 - common osd: include ws2tcpip.h for sockaddr_in6 definitions. 
v2 - DAPL introduced the concept of directly waiting on the CQ for 
v2 - dapltest: Implement a malloc() threshold for the completion reaping. 
v2 - scm: handle connected state when freeing CM objects 
v2 - scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings. 
v2 - scm: set TCP_NODELAY sockopt on the server side for sends. 
v2 - windows: remove obsolete files in dapl/udapl source tree 
v2 - dtestcm: add UD type QP option to test 
v2 - scm: destroy QP called before disconnect 
v2 - cma: add support for rdma_cm TIME_WAIT event. 
v2 - scm: remove old udapl_scm code replaced by openib_scm. 
v2 - winof: fix build issues after consolidating cma, scm code base. 
v2 - cma: lock held when exiting as a result of a rdma_create_event_channel failure 
v2 - windows: all dlist functions have been moved to the header file. 
v2 - dtestcm windows: add build infrastructure for new dtestcm test suite 
v2 - openib_common: reorganize provider code base to share common mem, cq, qp, dto 
v2 - scm: fixes and optimizations for connection scaling 
v2 - scm: double the default fd_set_size 
v2 - scm: EP reference in CR should be cleared during ep_destroy 
v2 - dtestx: fix conn establishment event checking 
v2 - dtestcm: new test to measure dapl connection rates. 

Vlad, please pull new v2 package into OFED 1.5 beta build and install the following:
 
dapl-2.0.22-1 
dapl-utils-2.0.22-1 
dapl-devel-2.0.22-1 
dapl-debuginfo-2.0.22-1 
compat-dapl-1.2.14-1 
compat-dapl-devel-1.2.14-1 

See http://www.openfabrics.org/downloads/dapl/ for more details.

-arlin


From rdreier at cisco.com  Fri Aug 21 15:02:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Aug 2009 15:02:48 -0700
Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing
	list to vger?
In-Reply-To: <20090820.160800.50693597.davem@davemloft.net> (David Miller's
	message of "Thu, 20 Aug 2009 16:08:00 -0700 (PDT)")
References: <adavdkiw3gx.fsf@cisco.com>
	<20090820.160800.50693597.davem@davemloft.net>
Message-ID: <adar5v4vn93.fsf@cisco.com>


 > > linux-rdma at vger.kernel.org

 > It's there, ready and waiting, should you choose to use it :-)

Thanks!

 - R.


From vlad at lists.openfabrics.org  Sat Aug 22 03:01:21 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 22 Aug 2009 03:01:21 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090822-0200 daily build status
Message-ID: <20090822100121.61CC6E28204@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From ogerlitz at voltaire.com  Sat Aug 22 23:02:29 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 23 Aug 2009 09:02:29 +0300
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing
	list	to vger?
In-Reply-To: <adavdkiw3gx.fsf@cisco.com>
References: <adavdkiw3gx.fsf@cisco.com>
Message-ID: <4A90DB75.9070303@voltaire.com>

Roland Dreier wrote:
> what would people think about creating an open vger list (say, linux-rdma at vger.kernel.org) to carry the discussion currently on general at lists.openfabrics.org? 
Yes, let's do that.

Or.


From ogerlitz at voltaire.com  Sat Aug 22 23:04:52 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 23 Aug 2009 09:04:52 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query
	to	QoS level with pkey
In-Reply-To: <4A8D4A6F.9050404@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>
Message-ID: <4A90DC04.3020906@voltaire.com>

Yevgeny Kliteynik wrote:
> Fixing a bug in matching PR query to QoS levels when pkey specified - pkeys in QoS
> policy are held w/o the MSB.
>   
Hi Yevgeny, so what's the impact of this bug in the field? does it 
create false positives or false negatives?

Or.


From bart.vanassche at gmail.com  Sun Aug 23 01:10:17 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sun, 23 Aug 2009 10:10:17 +0200
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adabpmi3uun.fsf_-_@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
Message-ID: <e2e108260908230110y422fcc23md13c7d94e4541fdc@mail.gmail.com>

On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier<rdreier at cisco.com> wrote:
> How about this approach?  Basically it just open-codes delayed work by
> splitting the timer and the work struct, and switches to mod_timer()
> instead of del_timer() + add_timer().  It passes very light testing here
> (basically I started ipoib and nothing blew up).
[ ... ]

Hello Roland,

I'm now using the SRP initiator from a kernel compiled from the
http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
repository, and the lockdep complaints also occur on this system. The
system even deadlocks during boot about one out of two times. Do you
already know when you will have the time to commit the
locking-inversion fixes to the infiniband.git repository ?

Thanks,

Bart.


From tziporet at dev.mellanox.co.il  Sun Aug 23 01:16:24 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 23 Aug 2009 11:16:24 +0300
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A8E4854.2060909@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu>
Message-ID: <4A90FAD8.6000701@mellanox.co.il>

Jeremy Enos wrote:
> Coming up on a year of Fedora 10 GA...  Fedora 9 no longer maintained. 
> No OFED support for FC10 yet creates a tough spot if trying to stay
> secure.  Is there *any* version (1.5, etc) that will even build on FC10? 
> thx-
>
>     Jeremy
>
>
>   

I think OFED 1.5 might work on it, but I'm not sure. Which kernel version
does FC10 use?
In general, OFED 1.5 supports FC11.

Tziporet


From tziporet at dev.mellanox.co.il  Sun Aug 23 01:21:14 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 23 Aug 2009 11:21:14 +0300
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing
	list	to vger?
In-Reply-To: <adavdkiw3gx.fsf@cisco.com>
References: <adavdkiw3gx.fsf@cisco.com>
Message-ID: <4A90FBFA.3010504@mellanox.co.il>

Roland Dreier wrote:
> Lately, I've had a few emails that I thought would have been of interest
> to both lkml and also to general at lists.openfabrics.org.  I've held back
> on cross-posting them because I know that general@ is subscribers-only,
> and the bounce messages are quite annoying to replies coming from lkml.
>
> The general@ list is subscribers-only because the openfabrics.org
> sysadmin team is already overworked without trying to keep an open list
> spam free.  (I say that with no intention to criticize the
> openfabrics.org admins -- they do a terrific job of keeping things
> running with the limited resources available; it's more a testament to
> how impressive the vger mailing list admins are)
>
> I've also noticed one or two messages about the possibility of moving
> another moderated list to vger.  Certainly I prefer open lists that
> don't require subscriptions to post.
>
> So with that background, what would people think about creating an open
> vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> currently on general at lists.openfabrics.org?  (The transition plan would
> probably be to keep the general@ list for a month or two, with frequent
> announcements of the new list, until archives etc. have caught up with
> the switch)
>
>
>   
Very good initiative
Tziporet


From kliteyn at dev.mellanox.co.il  Sun Aug 23 02:04:09 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Aug 2009 12:04:09 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query
	to	QoS level with pkey
In-Reply-To: <4A90DC04.3020906@voltaire.com>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>
	<4A90DC04.3020906@voltaire.com>
Message-ID: <4A910609.3040305@dev.mellanox.co.il>

Or Gerlitz wrote:
> Yevgeny Kliteynik wrote:
>> Fixing a bug in matching PR query to QoS levels when pkey specified - 
>> pkeys in QoS
>> policy are held w/o the MSB.
>>   
> Hi Yevgeny, so what's the impact of this bug in the field? does it 
> create false positives or false negatives?

False negatives. PR queries with PKeys (e.g. 
IPoIB interfaces) weren't matched to their rules.
 
-- Yevgeny 

> Or.
> 
> 


From vlad at lists.openfabrics.org  Sun Aug 23 03:00:42 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 23 Aug 2009 03:00:42 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090823-0200 daily build status
Message-ID: <20090823100042.8E634E2816D@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Sun Aug 23 04:08:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 14:08:05 +0300
Subject: [ofa-general] Re: [PATCH] opensm: fixing some data types in
	osm_req_get/set
In-Reply-To: <4A8D4A84.3050605@dev.mellanox.co.il>
References: <4A8D4A84.3050605@dev.mellanox.co.il>
Message-ID: <20090823110804.GC9547@me>

On 16:07 Thu 20 Aug     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Attribute ID and attribute modifier are used in osm_req_get/set
> in network order - fixing data types.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 23 04:10:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 14:10:30 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_qos_policy.c: matching PR query
 to QoS level with pkey
In-Reply-To: <4A8D4A6F.9050404@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>
Message-ID: <20090823111030.GD9547@me>

On 16:06 Thu 20 Aug     , Yevgeny Kliteynik wrote:
> 
> Hi Sasha,
> 
> Fixing a bug in matching PR query to QoS
> levels when pkey specified - pkeys in QoS
> policy are held w/o the MSB.
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 23 04:52:58 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 14:52:58 +0300
Subject: [ofa-general] Re: [PATCH 3/5 V2] libibnetdisc: make all fields of
 ibnd_fabric_t public
In-Reply-To: <20090817140338.edd83fe0.weiny2@llnl.gov>
References: <20090813204251.df6446c1.weiny2@llnl.gov>
	<20090816114127.GW25501@me>
	<20090817140338.edd83fe0.weiny2@llnl.gov>
Message-ID: <20090823115258.GF9547@me>

On 14:03 Mon 17 Aug     , Ira Weiny wrote:
> 
> You are right, good catch.  I just copied it blindly with HTSZ which must be
> there.
> 
> git am is not working now on the last two patches [4/5 and 5/5] so I am
> sending new versions of them so that they apply cleanly.
> 
> V2 below,
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 13 Aug 2009 20:08:51 -0700
> Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public
> 
> 	In addition clean up the name of the chassis struct
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 23 05:06:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 15:06:09 +0300
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc:
 Introduce a context object.
In-Reply-To: <20090817083023.da17378b.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
Message-ID: <20090823120609.GG9547@me>

Hi Ira,

On 08:30 Mon 17 Aug     , Ira Weiny wrote:
> 
> The immediate benefit is coming with the multi-threaded implementation where
> I plan on adding the following function.[*]

OK, but could we first discuss how the multithreading architecture will
be implemented in libibnetdisc: the goals (in particular, is it support
for multithreaded apps or just a multithreaded discovery function),
interaction with the caller application, etc.?

One desired feature I can think of would be to keep the API simple for
single-threaded use.

Sasha


From sashak at voltaire.com  Sun Aug 23 08:01:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 18:01:27 +0300
Subject: [ofa-general] Re: [PATCH v2] opensm/complib: account for nsec
 overflow in timeout values
In-Reply-To: <20090813090602.226b2695.weiny2@llnl.gov>
References: <20090806183716.c08bbea3.weiny2@llnl.gov>
	<20090813113620.GV25501@me>
	<20090813090602.226b2695.weiny2@llnl.gov>
Message-ID: <20090823150127.GI9547@me>

On 09:06 Thu 13 Aug     , Ira Weiny wrote:
> > > @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
> > >  	} else {
> > >  		/* Get the current time */
> > >  		if (gettimeofday(&curtime, NULL) == 0) {
> > > -			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> > > -			timeout.tv_nsec =
> > > -			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> > > +			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
> > 
> > Do you really need fixed size (uint32_t) variable here?
> 
> Well I need at least int32_t.  I chose unsigned because we are not trying to go back in time.  I don't like leaving this as "int".  As rare as it might be, a compiler could chose 16bits for an int and that is not big enough, right?

Right, but tv_nsec field of struct timespec has 'long' type, not 'int'.

Actually my question was more about using a *fixed* size type - I think
we should avoid fixed size types where they are not really needed
(they are needed for protocol structures, etc.).

So I'm changing this uint32_t to unsigned long which should be fine.
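
To make the overflow concrete: the old code could produce a tv_nsec of
1000000000 or more, which pthread_cond_timedwait() rejects with EINVAL.
The following is only a sketch of the normalization, not the applied
patch; the helper name and signature are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: add wait_us microseconds to *now and store a
 * normalized absolute timeout (tv_nsec < 1000000000) in *out. */
static void abs_timeout_from_us(const struct timeval *now, uint32_t wait_us,
                                struct timespec *out)
{
    unsigned long n_sec;

    assert(now->tv_usec >= 0 && now->tv_usec < 1000000);

    /* Both terms are below 1000000 usec, so the sum in nsec is below
     * 2000000000 and fits in an unsigned long even if long is 32 bits. */
    n_sec = ((unsigned long)now->tv_usec + (wait_us % 1000000)) * 1000;

    /* Carry whole seconds out of the nanosecond part. */
    out->tv_sec = now->tv_sec + (time_t)(wait_us / 1000000) +
                  (time_t)(n_sec / 1000000000UL);
    out->tv_nsec = (long)(n_sec % 1000000000UL);
}
```

For example, with now = 100 s + 999999 us and wait_us = 1500001, the
unnormalized code would yield tv_nsec = 1500000000; the version above
yields tv_sec = 102 and tv_nsec = 500000000.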

> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 6 Aug 2009 18:31:46 -0700
> Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From worleys at gmail.com  Sun Aug 23 08:15:13 2009
From: worleys at gmail.com (Chris Worley)
Date: Sun, 23 Aug 2009 09:15:13 -0600
Subject: [ofa-general] WinOF 2.1 RC3 issues
Message-ID: <f3177b9e0908230815w5b772772yddde3b338f26fd2f@mail.gmail.com>

1) SRP driver says "cannot start".
2) (trivial) x86_64 Installs into x86 program files.
3) Uninstall hangs interminably.


From rdreier at cisco.com  Sun Aug 23 08:21:46 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 23 Aug 2009 08:21:46 -0700
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <e2e108260908230110y422fcc23md13c7d94e4541fdc@mail.gmail.com>
	(Bart Van Assche's message of "Sun, 23 Aug 2009 10:10:17 +0200")
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<e2e108260907222335g33901fa4k523dd98624c30a25@mail.gmail.com>
	<adatz0mi03d.fsf@cisco.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
	<e2e108260908230110y422fcc23md13c7d94e4541fdc@mail.gmail.com>
Message-ID: <adad46mv9md.fsf@cisco.com>


 > I'm now using the SRP initiator from a kernel compiled from the
 > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
 > repository, and the lockdep complaints also occur on this system. The
 > system even deadlocks during boot about one out of two times. Do you
 > already know when you will have the time to commit the
 > locking-inversion fixes to the infiniband.git repository ?

Everything I know of is already in my tree now.  And I just checked
http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=shortlog;h=for-next
and I see both "IB/mad: Fix possible lock-lock-timer deadlock" and
"IPoIB: Drop priv->lock before calling ipoib_send()" there.  Those are
all the lockdep-related things I know of.

I have a hard time imagining that either of those issues could cause a
deadlock on half of boots either.  Are you sure the deadlock you see is
related to one of those fixes?

 - R.


From bugzilla-daemon at bugzilla.kernel.org  Sun Aug 23 09:42:06 2009
From: bugzilla-daemon at bugzilla.kernel.org (bugzilla-daemon at bugzilla.kernel.org)
Date: Sun, 23 Aug 2009 16:42:06 GMT
Subject: [ofa-general] [Bug 14042] New: mlx4: device driver tries to sync DMA
 memory it has not allocated
Message-ID: <bug-14042-11804@http.bugzilla.kernel.org/>

http://bugzilla.kernel.org/show_bug.cgi?id=14042

           Summary: mlx4: device driver tries to sync DMA memory it has
                    not allocated
           Product: Drivers
           Version: 2.5
    Kernel Version: 2.6.30.4
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
        AssignedTo: drivers_infiniband-rdma at kernel-bugs.osdl.org
        ReportedBy: bart.vanassche at gmail.com
        Regression: No


The following message was generated while booting a system with 2.6.30.4 kernel
compiled with CONFIG_DMA_API_DEBUG=y and before any out-of-tree kernel modules
were loaded:

------------[ cut here ]------------
WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0()
Hardware name: P5Q DELUXE
mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it has
not allocated [device address=0x0000000139482000] [size=4096 bytes]
Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel
snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801
rtc_core hid_belkin mlx4_core(
+) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev
serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx xor
raid0 sd_mod crc_t10dif
ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic
ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal
processor thermal_sys hwmon
Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6
Call Trace:
 [<ffffffff8039bc7c>] ? check_sync+0x47c/0x4b0
 [<ffffffff80248b48>] warn_slowpath_common+0x78/0xd0
 [<ffffffff80248bfc>] warn_slowpath_fmt+0x3c/0x40
 [<ffffffff80517769>] ? _spin_lock_irqsave+0x49/0x60
 [<ffffffff8039b8ab>] ? check_sync+0xab/0x4b0
 [<ffffffff8039bc7c>] check_sync+0x47c/0x4b0
 [<ffffffff802724ac>] ? mark_held_locks+0x6c/0x90
 [<ffffffff8039be1d>] debug_dma_sync_single_for_cpu+0x1d/0x20
 [<ffffffffa024a969>] mlx4_write_mtt+0x159/0x1e0 [mlx4_core]
 [<ffffffffa0243c02>] mlx4_create_eq+0x222/0x650 [mlx4_core]
 [<ffffffff8027281d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa02441f5>] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core]
 [<ffffffffa0248b08>] mlx4_setup_hca+0x98/0x550 [mlx4_core]
 [<ffffffffa0249891>] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core]
 [<ffffffffa0249331>] __mlx4_init_one+0x371/0x920 [mlx4_core]
 [<ffffffffa024df18>] mlx4_init_one+0x22/0x44 [mlx4_core]
 [<ffffffff8025cd90>] ? do_work_for_cpu+0x0/0x30
 [<ffffffff803a43e2>] local_pci_probe+0x12/0x20
 [<ffffffff8025cda3>] do_work_for_cpu+0x13/0x30
 [<ffffffff802613e6>] kthread+0x56/0x90
 [<ffffffff8020cffa>] child_rip+0xa/0x20
 [<ffffffff8020c9c0>] ? restore_args+0x0/0x30
 [<ffffffff80261390>] ? kthread+0x0/0x90
 [<ffffffff8020cff0>] ? child_rip+0x0/0x20
---[ end trace 4480af29bc755c6a ]---

-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bart.vanassche at gmail.com  Sun Aug 23 11:53:44 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sun, 23 Aug 2009 20:53:44 +0200
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock
	(cancel_delayed_work inside spinlock)
In-Reply-To: <adad46mv9md.fsf@cisco.com>
References: <e2e108260907100955s128cb2bcha028ef938c6651ac@mail.gmail.com>
	<e2e108260908060258p54fe7030pc1231f8d757756b7@mail.gmail.com>
	<adaws5gg71x.fsf@cisco.com>
	<e2e108260908070258s5ac9cc1ak386b6d9aed397b3c@mail.gmail.com>
	<ada8whr8kf7.fsf@cisco.com>
	<2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
	<adazla76rh4.fsf@cisco.com> <adabpmi3uun.fsf_-_@cisco.com>
	<e2e108260908230110y422fcc23md13c7d94e4541fdc@mail.gmail.com>
	<adad46mv9md.fsf@cisco.com>
Message-ID: <e2e108260908231153u215d1942uf511b576fff126d6@mail.gmail.com>

On Sun, Aug 23, 2009 at 5:21 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  > I'm now using the SRP initiator from a kernel compiled from the
>  > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
>  > repository, and the lockdep complaints also occur on this system. The
>  > system even deadlocks during boot about one out of two times. Do you
>  > already know when you will have the time to commit the
>  > locking-inversion fixes to the infiniband.git repository ?
>
> Everything I know of is already in my tree now.  And I just checked
> http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=shortlog;h=for-next
> and I see both "IB/mad: Fix possible lock-lock-timer deadlock" and
> "IPoIB: Drop priv->lock before calling ipoib_send()" there.  Those are
> all the lockdep-related things I know of.
>
> I have a hard time imagining that either of those issues could cause a
> deadlock on half of boots either.  Are you sure the deadlock you see is
> related to one of those fixes?

After switching from the master branch to the for-next branch, I now
also see the patches mentioned above.

And apparently the phenomenon I observed during boot was not a
deadlock but some other strange phenomenon. See also
http://bugzilla.kernel.org/show_bug.cgi?id=14043 for the details.

Bart.


From eli at dev.mellanox.co.il  Mon Aug 24 01:22:18 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 24 Aug 2009 11:22:18 +0300
Subject: [ofa-general] Better way to get sufficient EQ context memory?
In-Reply-To: <ada4os2xj9m.fsf@cisco.com>
References: <ada4os2xj9m.fsf@cisco.com>
Message-ID: <20090824082218.GA16493@mtls03>

On Thu, Aug 20, 2009 at 02:33:41PM -0700, Roland Dreier wrote:
> Eli, it occurs to me that since we're doing more than one page for EQ
> context now, we might as well use the normal ICM table stuff that
> everything else uses.  Seems the code becomes much simpler and I don't
> think there's any real overhead added... thoughts?
Yes, it looks cleaner, and it works well on my 4-core system. Let's wait
for Christoph's approval on his system.

> 
> (Christoph, I tested this with "possible_cpus=32" and it still works for
> me -- if you get a chance on your Dell systems that would be helpful too)
> 


From monis at Voltaire.COM  Mon Aug 24 01:23:01 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 11:23:01 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090821000431.GA5713@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>
Message-ID: <4A924DE5.30505@Voltaire.COM>

Jason Gunthorpe wrote:
> Check that the format of the multicast link address is correct before
> taking it from dev->mc_list to priv->multicast_list. This way we never
> try to send a bogus address to the SA, and prevents badness from
> erroneous 'ip maddr addr add', broken bonding drivers, or whatever.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe at obsidianresearch.com>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   18 ++++++++++++++++++
>  1 files changed, 18 insertions(+), 0 deletions(-)
> 
> Same problem Moni was working on, but lets just address it directly.
> 
> There is work to try and fix the bonding driver but no fixed version
> is in mainline yet. This is a cheap and simple work around that is
> worth having even once the driver is fixed.
> 
The fix has been available since at least 2.6.31-rc2. However, I still need to check your claim that it doesn't work for you.

> Despite this, I think it is still necessary to do something like Moni
> was trying - to prevent the MCG join queue from head of line blocking
> on a single bad SA response. This can happen even if everything is
> correct.
> 
I'll resend my patch, since it has been a long time since I last sent it.


From monis at Voltaire.COM  Mon Aug 24 01:36:44 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 11:36:44 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090821000431.GA5713@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>
Message-ID: <4A92511C.90402@Voltaire.COM>

Jason Gunthorpe wrote:
> Check that the format of the multicast link address is correct before
> taking it from dev->mc_list to priv->multicast_list. This way we never
> try to send a bogus address to the SA, and prevents badness from
> erroneous 'ip maddr addr add', broken bonding drivers, or whatever.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe at obsidianresearch.com>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   18 ++++++++++++++++++
>  1 files changed, 18 insertions(+), 0 deletions(-)
> 
> Same problem Moni was working on, but lets just address it directly.
> 
> There is work to try and fix the bonding driver but no fixed version
> is in mainline yet. This is a cheap and simple work around that is
> worth having even once the driver is fixed.
> 
> Despite this, I think it is still necessary to do something like Moni
> was trying - to prevent the MCG join queue from head of line blocking
> on a single bad SA response. This can happen even if everything is
> correct.
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 425e311..973a24b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -758,6 +758,20 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
>  	}
>  }
>  
> +static int check_mcast(const u8 *addr,unsigned int addrlen,
> +		       const u8 *broadcast)
> +{
> +	if (addrlen != 20)
> +		return 0;
> +	/* QPN, scope, reserved, signature upper */
> +	if (memcmp(addr,broadcast,6) != 0)
> +		return 0;
> +	/* signature lower, pkey */
> +	if (memcmp(addr + 7,broadcast+7,3) != 0)
> +		return 0;
> +	return 1;
> +}
> +
>  void ipoib_mcast_restart_task(struct work_struct *work)
>  {
>  	struct ipoib_dev_priv *priv =
> @@ -791,6 +805,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>  	for (mclist = dev->mc_list; mclist; mclist = mclist->next) {
>  		union ib_gid mgid;
>  
> +		if (!check_mcast(mclist->dmi_addr,mclist->dmi_addrlen,
> +				 dev->broadcast))
> +			continue;
> +
>  		memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid);
>  
>  		mcast = __ipoib_mcast_find(dev, &mgid);

Why not check validity with something that looks like the reverse operation 
of the kernel function that maps ip -> link mcast addresses? 

For example, this is the function for IPv4:

static inline void ip_ib_mc_map(__be32 naddr, const unsigned char *broadcast, char *buf)
{
        __u32 addr;
        unsigned char scope = broadcast[5] & 0xF;

        buf[0]  = 0;            /* Reserved */
        buf[1]  = 0xff;         /* Multicast QPN */
        buf[2]  = 0xff;
        buf[3]  = 0xff;
        addr    = ntohl(naddr);
        buf[4]  = 0xff;
        buf[5]  = 0x10 | scope; /* scope from broadcast address */
        buf[6]  = 0x40;         /* IPv4 signature */
        buf[7]  = 0x1b;
        buf[8]  = broadcast[8];         /* P_Key */
        buf[9]  = broadcast[9];
        buf[10] = 0;
        buf[11] = 0;
        buf[12] = 0;
        buf[13] = 0;
        buf[14] = 0;
        buf[15] = 0;
        buf[19] = addr & 0xff;
        addr  >>= 8;
        buf[18] = addr & 0xff;
        addr  >>= 8;
        buf[17] = addr & 0xff;
        addr  >>= 8;
        buf[16] = addr & 0x0f;
}


From vlad at lists.openfabrics.org  Mon Aug 24 02:44:41 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 24 Aug 2009 02:44:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090824-0200 daily build status
Message-ID: <20090824094441.95F30E61E7C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26

Failed:
Build failed on i686 with linux-2.6.19
Build failed on i686 with linux-2.6.18
Build failed on i686 with linux-2.6.21.1
Build failed on i686 with linux-2.6.22
Build failed on i686 with linux-2.6.24
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-128.el5
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-128.el5'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-93.el5
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-93.el5'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.21.1
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.21.1'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.20
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.20'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.22
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.22'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1313: warning: pointer targets in passing argument 2 of 'memcpy_toiovec' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_bz_setup':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1430: warning: pointer targets in assignment differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1313: warning: pointer targets in passing argument 2 of 'memcpy_toiovec' differ in signedness
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_bz_setup':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1430: warning: pointer targets in assignment differ in signedness
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.21.1
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.22
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.23
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.23'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From monis at Voltaire.COM  Mon Aug 24 04:20:41 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 14:20:41 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <4A92511C.90402@Voltaire.COM>
References: <20090821000431.GA5713@obsidianresearch.com>
	<4A92511C.90402@Voltaire.COM>
Message-ID: <4A927789.2060300@Voltaire.COM>


> 
> Why not check validity with something that looks like the reverse operation 
> of the kernel function that maps ip -> link mcast addresses? 

Sorry. Please ignore this comment. Jason's patch does exactly that.
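
To make the equivalence concrete, here is a minimal userspace sketch (not
kernel code). It copies the IPv4 mapping and Jason's check_mcast() from
earlier in this thread, with kernel types replaced by plain C types. The
name ip_ib_mc_map_sketch(), its host-byte-order argument, and the
example_broadcast value (link-local scope, P_Key 0xffff) are assumptions
made purely for illustration.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_ADDR_LEN 20	/* length of an IPoIB link-layer address */

/* Assumed example: IPoIB broadcast address, link-local scope, P_Key 0xffff. */
static const unsigned char example_broadcast[IPOIB_ADDR_LEN] = {
	0x00, 0xff, 0xff, 0xff,			/* reserved + multicast QPN */
	0xff, 0x12, 0x40, 0x1b,			/* 0xff, scope, IPv4 signature */
	0xff, 0xff,				/* P_Key */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0xff, 0xff, 0xff, 0xff			/* group bits */
};

/*
 * Userspace variant of the IPv4 mapping quoted in this thread.
 * Unlike the kernel function, addr is taken in host byte order.
 */
void ip_ib_mc_map_sketch(uint32_t addr, const unsigned char *broadcast,
			 unsigned char *buf)
{
	unsigned char scope = broadcast[5] & 0xF;

	buf[0] = 0;			/* Reserved */
	buf[1] = 0xff;			/* Multicast QPN */
	buf[2] = 0xff;
	buf[3] = 0xff;
	buf[4] = 0xff;
	buf[5] = 0x10 | scope;		/* scope from broadcast address */
	buf[6] = 0x40;			/* IPv4 signature */
	buf[7] = 0x1b;
	buf[8] = broadcast[8];		/* P_Key */
	buf[9] = broadcast[9];
	memset(buf + 10, 0, 6);
	buf[19] = addr & 0xff;		/* low 28 bits of the group address */
	buf[18] = (addr >> 8) & 0xff;
	buf[17] = (addr >> 16) & 0xff;
	buf[16] = (addr >> 24) & 0x0f;
}

/* Same logic as check_mcast() in Jason's patch, with plain C types. */
int check_mcast(const unsigned char *addr, unsigned int addrlen,
		const unsigned char *broadcast)
{
	if (addrlen != IPOIB_ADDR_LEN)
		return 0;
	/* reserved, QPN, 0xff, scope */
	if (memcmp(addr, broadcast, 6) != 0)
		return 0;
	/* signature lower, pkey (byte 6 is not compared) */
	if (memcmp(addr + 7, broadcast + 7, 3) != 0)
		return 0;
	return 1;
}
```

An address produced by the mapping passes the check, while a 6-byte
Ethernet-style address or an address with a different P_Key is rejected.
The check does not compare byte 6, presumably because the upper signature
byte differs between IPv4 (0x40) and IPv6 (0x60) mappings.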


From eli at dev.mellanox.co.il  Mon Aug 24 05:13:07 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 24 Aug 2009 15:13:07 +0300
Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support
In-Reply-To: <20090819171935.GA14411@mtls03>
References: <20090819171935.GA14411@mtls03>
Message-ID: <20090824121307.GA3919@mtls03>

Roland,

what about this series of patches? Would you like me to re-create them
over your xrc branch or would you rather take them before xrc?

On Wed, Aug 19, 2009 at 08:19:35PM +0300, Eli Cohen wrote:
> RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using
> Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet
> fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned
> Ethertype, a GRH, unmodified IB transport headers and payload.  IB subnet
> management and SA services are not required for RDMAoE operation; Ethernet
> management practices are used instead. RDMAoE encodes IP addresses into its
> GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs,
> standard IP to MAC mappings apply.
> 
> To support RDMAoE, a new transport protocol was added to the IB core. An RDMA
> device can have ports with different transports, which are identified by a port
> transport attribute.  The RDMA Verbs API is syntactically unmodified. When
> referring to RDMAoE ports, Address handles are required to contain GIDs while
> LID fields are ignored. The Ethernet L2 information is subsequently obtained by
> the vendor-specific driver (both in kernel- and user-space) while modifying QPs
> to RTR and creating address handles.  As there is no SA in RDMAoE, the CMA code
> is modified to fill the necessary path record attributes locally before sending
> CM packets. Similarly, the CMA provides to the user the required address handle
> attributes when processing SIDR requests and joining multicast groups.
> 
> In this patch set, an RDMAoE port is currently assigned a single GID, encoding
> the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code
> temporarily uses IPv6 link-local addresses as GIDs instead of the IP address
> provided by the user, thereby supporting any IP address.
> 
> To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib
> drivers must be loaded, and the netdevice for the corresponding RDMAoE port
> must be running. Individual ports of a multi port HCA can be independently
> configured as Ethernet (with support for RDMAoE) or IB, as is already the case.
> We have successfully tested MPI, SDP, RDS, and native Verbs applications over
> RDMAoE.
> 
> Following is a series of 10 patches based on version 2.6.30 of the Linux
> kernel. This new series reflects changes based on feedback from the community
> on the previous set of patches, and is tagged v5.
> 
> Changes from v4:
> 1. Added rdma_is_transport_supported() and used it to simplify conditionals
> throughout the code.
> 2. ib_register_mad_agent() for QP0 is only called for IB ports.
> 3. PATCH 5/10 changed from "Enable support for RDMAoE ports" to "Enable
> support only for IB ports".
> 4. MAD services from userspace are currently not supported for RDMAoE ports.
> 5. Add kref to struct cma_multicast to aid in maintaining reference count on
> the object. This is to avoid freeing the object while the worker thread is
> still using it.
> 6. Return immediate error for invalid MTU when resolving an RDMAoE path.
> 7. Don't fail resolve path if rate is 0 since this value stands for
> IB_RATE_PORT_CURRENT.
> 8. In cma_rdmaoe_join_multicast(), fail immediately if mtu is zero.
> 9. Add ucma_copy_rdmaoe_route() instead of modifying ucma_copy_ib_route().
> 10. Bug fix: in PATCH 10/10, call flush_workqueue after unregistering netdev
> notifiers.
> 11. Multicast no longer uses the broadcast MAC.
> 12. No changes to patches 2, 7 and 8 from the v4 series.
> 
> Signed-off-by: Eli Cohen <eli at mellanox.co.il>
> ---
> 
>  b/drivers/infiniband/core/agent.c           |   38 ++-
>  b/drivers/infiniband/core/cm.c              |   25 +-
>  b/drivers/infiniband/core/cma.c             |   54 ++--
>  b/drivers/infiniband/core/mad.c             |   41 ++-
>  b/drivers/infiniband/core/multicast.c       |    4 
>  b/drivers/infiniband/core/sa_query.c        |   39 ++-
>  b/drivers/infiniband/core/ucm.c             |    8 
>  b/drivers/infiniband/core/ucma.c            |    2 
>  b/drivers/infiniband/core/ud_header.c       |  111 ++++++++++
>  b/drivers/infiniband/core/user_mad.c        |    6 
>  b/drivers/infiniband/core/uverbs.h          |    1 
>  b/drivers/infiniband/core/uverbs_cmd.c      |   32 ++
>  b/drivers/infiniband/core/uverbs_main.c     |    1 
>  b/drivers/infiniband/core/verbs.c           |   25 ++
>  b/drivers/infiniband/hw/mlx4/ah.c           |  187 +++++++++++++---
>  b/drivers/infiniband/hw/mlx4/mad.c          |   32 +-
>  b/drivers/infiniband/hw/mlx4/main.c         |  309 +++++++++++++++++++++++++---
>  b/drivers/infiniband/hw/mlx4/mlx4_ib.h      |   19 +
>  b/drivers/infiniband/hw/mlx4/qp.c           |  172 ++++++++++-----
>  b/drivers/infiniband/ulp/ipoib/ipoib_main.c |   12 -
>  b/drivers/net/mlx4/en_main.c                |   15 +
>  b/drivers/net/mlx4/en_port.c                |    4 
>  b/drivers/net/mlx4/en_port.h                |    3 
>  b/drivers/net/mlx4/fw.c                     |    3 
>  b/drivers/net/mlx4/intf.c                   |   20 +
>  b/drivers/net/mlx4/main.c                   |    6 
>  b/drivers/net/mlx4/mlx4.h                   |    1 
>  b/include/linux/mlx4/cmd.h                  |    1 
>  b/include/linux/mlx4/device.h               |   31 ++
>  b/include/linux/mlx4/driver.h               |   16 +
>  b/include/linux/mlx4/qp.h                   |    8 
>  b/include/rdma/ib_addr.h                    |   92 ++++++++
>  b/include/rdma/ib_pack.h                    |   26 ++
>  b/include/rdma/ib_user_verbs.h              |   21 +
>  b/include/rdma/ib_verbs.h                   |   11 
>  b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |    3 
>  b/net/sunrpc/xprtrdma/svc_rdma_transport.c  |    2 
>  drivers/infiniband/core/cm.c                |    5 
>  drivers/infiniband/core/cma.c               |  207 ++++++++++++++++++
>  drivers/infiniband/core/mad.c               |   37 ++-
>  drivers/infiniband/core/ucm.c               |   12 -
>  drivers/infiniband/core/ucma.c              |   31 ++
>  drivers/infiniband/core/user_mad.c          |   15 -
>  drivers/infiniband/core/verbs.c             |   10 
>  include/rdma/ib_verbs.h                     |   15 +
>  45 files changed, 1440 insertions(+), 273 deletions(-)
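
The addressing scheme in the quoted cover letter (a port GID that encodes the
netdev's IPv6 link-local address, and the standard IP-to-MAC mapping for
multicast GIDs) can be sketched in plain C. This is only an illustration under
stated assumptions: it assumes the link-local address was autoconfigured from
the MAC using the modified EUI-64 rule (RFC 4291, Appendix A) and shows only
the IPv6 multicast mapping (RFC 2464). The function names are hypothetical and
are not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build fe80::/64 plus a modified EUI-64 interface ID from a 48-bit MAC.
 * Assumption: the netdev's link-local address was autoconfigured from its
 * MAC; how the patches actually obtain the address is not shown here.
 */
void mac_to_link_local_gid(const uint8_t mac[6], uint8_t gid[16])
{
    assert(mac != NULL && gid != NULL);
    memset(gid, 0, 16);
    gid[0] = 0xfe;              /* link-local prefix fe80::/64 */
    gid[1] = 0x80;
    gid[8] = mac[0] ^ 0x02;     /* flip the universal/local bit */
    gid[9] = mac[1];
    gid[10] = mac[2];
    gid[11] = 0xff;             /* ff:fe is inserted in the middle */
    gid[12] = 0xfe;
    gid[13] = mac[3];
    gid[14] = mac[4];
    gid[15] = mac[5];
}

/*
 * Map an IPv6 multicast address (used as a multicast GID) to an Ethernet
 * multicast MAC: 33:33 followed by the low 32 bits of the address.
 */
void ipv6_mcast_to_mac(const uint8_t mgid[16], uint8_t mac[6])
{
    assert(mgid != NULL && mac != NULL);
    mac[0] = 0x33;
    mac[1] = 0x33;
    memcpy(&mac[2], &mgid[12], 4);
}
```

For example, MAC 00:02:c9:01:02:03 yields fe80::202:c9ff:fe01:203, and the
all-nodes group ff02::1 maps to 33:33:00:00:00:01.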


From monis at Voltaire.COM  Mon Aug 24 06:51:03 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 16:51:03 +0300
Subject: [ofa-general] [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast
 address in the join list stop subsequent joins 
Message-ID: <4A929AC7.4060402@Voltaire.COM>

Hi Roland

http://lists.openfabrics.org/pipermail/general/2009-July/060496.html
The discussion in the link above didn't end with a decision. You were asking
about a way to inject illegal mcast addresses from userspace into ib_ipoib, and
Jason pointed out such a way (described below). Could you please review the patch?

thanks

 MoniS
-------------------

An illegal multicast address can be handed to IPoIB from userspace. For example,
the command "ip maddr add 33:33:00:00:00:01 dev ib0" injects an illegal multicast
address into IPoIB, which starts a join task for this address. However, once
an illegal multicast address is passed to IPoIB, it blocks the join attempts of
all subsequent requests. That happens because IPoIB joins multicast addresses
in the order they arrived and doesn't handle the next address until the join
for the current address succeeds.

This patch moves the multicast address to the end of the list after a join attempt.
Even if the join fails, the next attempt will be with a different address.

Signed-off-by: Moni Shoua <monis at voltaire.com>
--

 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a0e9753..3c3c63d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -379,6 +379,7 @@ static int ipoib_mcast_join_complete(int status,
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_mcast *next_mcast;
 
 	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
 			mcast->mcmember.mgid.raw, status);
@@ -427,9 +428,17 @@ static int ipoib_mcast_join_complete(int status,
 
 	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
-				   mcast->backoff * HZ);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		list_for_each_entry(next_mcast, &priv->multicast_list, list) {
+			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &next_mcast->flags)
+			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &next_mcast->flags)
+			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &next_mcast->flags))
+				break;
+		}
+		if (&next_mcast->list != &priv->multicast_list)
+			queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+				next_mcast->backoff * HZ);
+	}
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
@@ -570,13 +579,16 @@ void ipoib_mcast_join_task(struct work_struct *work)
 				break;
 			}
 		}
-		spin_unlock_irq(&priv->lock);
 
 		if (&mcast->list == &priv->multicast_list) {
 			/* All done */
+			spin_unlock_irq(&priv->lock);
 			break;
 		}
 
+		list_move_tail(&mcast->list, &priv->multicast_list);
+		spin_unlock_irq(&priv->lock);
+
 		ipoib_mcast_join(dev, mcast, 1);
 		return;
 	}
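
The effect of the list_move_tail() call in the patch above can be illustrated
with a small userspace model. This is a sketch, not the driver code: the list
helpers are simplified re-implementations of the kernel's list primitives, and
struct mcast with its joinable and attached fields is hypothetical. It shows
that rotating each attempted entry to the tail lets the remaining addresses
join even when one address always fails.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static inline void list_init(struct list_head *h) { h->next = h->prev = h; }

static inline void list_add_tail(struct list_head *e, struct list_head *head)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

static inline void list_move_tail(struct list_head *e, struct list_head *head)
{
    e->prev->next = e->next;    /* unlink e from its current position */
    e->next->prev = e->prev;
    list_add_tail(e, head);     /* and re-queue it at the end */
}

struct mcast {
    struct list_head list;  /* first member, so a list pointer casts to mcast */
    int joinable;           /* hypothetical: 0 models an address the SA rejects */
    int attached;
};

/*
 * Return the first unattached entry and rotate it to the tail, so the next
 * call starts with a different address. Returns NULL when all are attached.
 */
struct mcast *next_join_candidate(struct list_head *head)
{
    struct list_head *p;

    for (p = head->next; p != head; p = p->next) {
        struct mcast *m = (struct mcast *)p;
        if (!m->attached) {
            list_move_tail(p, head);
            return m;
        }
    }
    return NULL;
}

/* Make up to 'attempts' join attempts; return how many entries attached. */
int run_joins(struct list_head *head, int attempts)
{
    int attached = 0;

    assert(head != NULL);
    while (attempts-- > 0) {
        struct mcast *m = next_join_candidate(head);
        if (m == NULL)
            break;
        if (m->joinable) {
            m->attached = 1;
            attached++;
        }
    }
    return attached;
}
```

With a list of three entries whose first entry never joins, three attempts are
enough to attach the other two; without the rotation every attempt would be
spent on the first entry.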


From jenos at ncsa.uiuc.edu  Mon Aug 24 07:16:38 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Mon, 24 Aug 2009 09:16:38 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A90FAD8.6000701@mellanox.co.il>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il>
Message-ID: <4A92A0C6.9030501@ncsa.uiuc.edu>

2.6.27.29-170.2.79 is the current fc10 x64 kernel.  I had tried the 
latest tarball for 1.5- perhaps that's too late?  I can try something 
older but would be great to have a starting point.  Thx-

    Jeremy

Tziporet Koren wrote:
> Jeremy Enos wrote:
>> Coming up on a year of Fedora 10 GA...  Fedora 9 no longer 
>> maintained. No OFED support for FC10 yet creates a tough spot if 
>> trying to stay
>> secure.  Is there *any* version (1.5, etc) that will even build on 
>> FC10? thx-
>>
>>     Jeremy
>>
>>
>>   
>
> I think OFED 1.5 might work on it, but I'm not sure. Which kernel version
> does FC10 use?
> In general, OFED 1.5 supports FC11.
>
> Tziporet
>
>


From jgunthorpe at obsidianresearch.com  Mon Aug 24 09:15:09 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 24 Aug 2009 10:15:09 -0600
Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad
	muticast address in the join list stop subsequent joins
In-Reply-To: <4A929AC7.4060402@Voltaire.COM>
References: <4A929AC7.4060402@Voltaire.COM>
Message-ID: <20090824161509.GC4973@obsidianresearch.com>

On Mon, Aug 24, 2009 at 04:51:03PM +0300, Moni Shoua wrote:

> http://lists.openfabrics.org/pipermail/general/2009-July/060496.html
> The discussion in the link above didn't end with a decision. You were asking
> about a way to inject illegal mcast addresses from userspace into ib_ipoib, and
> Jason pointed out such a way (described below). Could you please review the patch?

FWIW, upon looking at this more closely, I would rather see this patch
of yours fix the timeout problem. This actually has nothing to do with
illegal addresses but with any situation where the SA returns failure
(e.g. MLID exhaustion).

There is already a per-event increasing back off, it just needs a
little fussing to keep track of time properly and sort the list by
expiration.

Jason


From jgl at johngroves.net  Mon Aug 24 09:37:51 2009
From: jgl at johngroves.net (John Groves)
Date: Mon, 24 Aug 2009 11:37:51 -0500
Subject: [ofa-general] OFED Source Code Cross Reference Server Announcement
Message-ID: <b062d32b0908240937s648c14a2q329371e415badf64@mail.gmail.com>

I'm pleased to announce that System Fabric Works is hosting a code cross
reference server for the OFED distributions at
http://SystemFabricWorks.com/ofed-xr.html.
We've used the LXR indexing engine, which will already be familiar to most
Linux kernel developers.  The code can be browsed and searched, and symbols
appear as hyper links that retrieve all references to the symbols.

We already have many of the recent OFED distributions indexed.  Feel free to
send questions, suggestions or problem reports directly to me.

Regards,
John Groves
System Fabric Works <http://systemfabricworks.com/>
John at SystemFabricWorks.com

From yosefe at voltaire.com  Mon Aug 24 09:48:51 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Mon, 24 Aug 2009 19:48:51 +0300
Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad
	muticast address in the join list stop subsequent joins
In-Reply-To: <20090824161509.GC4973@obsidianresearch.com>
References: <4A929AC7.4060402@Voltaire.COM>
	<20090824161509.GC4973@obsidianresearch.com>
Message-ID: <4A92C473.4090208@voltaire.com>

On 24/08/09 19:15, Jason Gunthorpe wrote:
> On Mon, Aug 24, 2009 at 04:51:03PM +0300, Moni Shoua wrote:
> 
>> http://lists.openfabrics.org/pipermail/general/2009-July/060496.html
>> The discussion in the link above didn't end with a decision. You were asking
>> about a way to inject illegal mcast addresses from userspace into ib_ipoib, and
>> Jason pointed out such a way (described below). Could you please review the patch?
> 
> FWIW, upon looking at this more closely, I would rather see this patch
> of yours fix the timeout problem. This actually has nothing to do with
> illegal addresses but with any situation where the SA returns failure
> (e.g. MLID exhaustion).
> 
> There is already a per-event increasing back off, it just needs a
> little fussing to keep track of time properly and sort the list by
> expiration.
> 
> Jason

Are you suggesting sorting the list each time we add or remove an entry,
or searching for the correct location to insert the new entry? I'm afraid that
would add too much complexity and be inefficient (in O() terms).

Moreover, I believe that moving a failed mcast entry to the end of the list
behaves the same as always joining the least-backoff-value mcast entry (since
everybody starts with the same backoff).

BTW Moni - Do send-only joins need the same solution too?

--Yossi


From jgunthorpe at obsidianresearch.com  Mon Aug 24 10:18:58 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 24 Aug 2009 11:18:58 -0600
Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad
	muticast address in the join list stop subsequent joins
In-Reply-To: <4A92C473.4090208@voltaire.com>
References: <4A929AC7.4060402@Voltaire.COM>
	<20090824161509.GC4973@obsidianresearch.com>
	<4A92C473.4090208@voltaire.com>
Message-ID: <20090824171858.GJ406@obsidianresearch.com>

On Mon, Aug 24, 2009 at 07:48:51PM +0300, Yossi Etigin wrote:

> Are you suggesting sorting the list each time we add or remove an entry,
> or searching for the correct location to insert the new entry? I'm afraid that
> would add too much complexity and be inefficient (in O() terms).

1) This is an unlikely failure path. 2) It is only O(n) to insert an
entry into the proper place in an already sorted linked list. 3) I
think you can do it in about four lines of code.

list_for_each_entry_reverse(i, &priv->multicast_list, list)
   if (i->xx < mcast->xx)
       break;
list_move(&mcast->list, &i->list);  /* after i, or at the head if none is smaller */

Which is actually O(1) in the most common cases.
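
To make the cost argument concrete, here is a self-contained userspace sketch
of a sorted insert that scans backwards from the tail. The struct node type,
its expires key, and the returned step count are hypothetical and exist only
for this illustration; the kernel list API is not used.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list; the head is a sentinel node. */
struct node {
    struct node *next, *prev;
    unsigned long expires;      /* hypothetical sort key: next retry time */
};

void list_init(struct node *head)
{
    head->next = head->prev = head;
}

/*
 * Insert n so the list stays in ascending order of 'expires', scanning
 * backwards from the tail. Returns the number of entries skipped, which is
 * 0 when n belongs at the tail.
 */
int insert_sorted(struct node *head, struct node *n)
{
    struct node *pos;
    int steps = 0;

    assert(head != NULL && n != NULL);
    pos = head->prev;
    while (pos != head && pos->expires > n->expires) {
        pos = pos->prev;
        steps++;
    }
    /* Link n right after pos; pos is the sentinel if n is the smallest. */
    n->prev = pos;
    n->next = pos->next;
    pos->next->prev = n;
    pos->next = n;
    return steps;
}
```

An entry whose expiry is later than everything already queued is linked in
without skipping any entries, which is the common case when a freshly failed
join gets the latest retry time.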

> Moreover, I believe that moving a failed mcast entry to the end of
> the list behaves the same as always joining the least-backoff-value
> mcast entry (since everybody starts with the same backoff).

Nope.. New entries can be added at any time which unsorts things.

Jason


From yosefe at voltaire.com  Mon Aug 24 10:23:36 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Mon, 24 Aug 2009 20:23:36 +0300
Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad
	muticast address in the join list stop subsequent joins
In-Reply-To: <20090824171858.GJ406@obsidianresearch.com>
References: <4A929AC7.4060402@Voltaire.COM>	<20090824161509.GC4973@obsidianresearch.com>
	<4A92C473.4090208@voltaire.com>
	<20090824171858.GJ406@obsidianresearch.com>
Message-ID: <4A92CC98.4090508@voltaire.com>

On 24/08/09 20:18, Jason Gunthorpe wrote:
> 1) This is an unlikely failure path. 2) It is only O(n) to insert an
> entry into the proper place in an already sorted linked list. 3) I
> think you can do it in about four lines of code.
> 
> list_for_each_entry_reverse(i, &priv->multicast_list, list)
>    if (i->xx < mcast->xx)
>        break;
> list_move(&mcast->list, &i->list);  /* after i, or at the head if none is smaller */
> 


So you suggest putting these 4 lines in place of list_move_tail()?


From cl at linux-foundation.org  Mon Aug 24 10:23:33 2009
From: cl at linux-foundation.org (Christoph Lameter)
Date: Mon, 24 Aug 2009 13:23:33 -0400 (EDT)
Subject: [ofa-general] Re: Better way to get sufficient EQ context memory?
In-Reply-To: <ada4os2xj9m.fsf@cisco.com>
References: <ada4os2xj9m.fsf@cisco.com>
Message-ID: <alpine.DEB.1.10.0908241322540.11714@gentwo.org>

On Thu, 20 Aug 2009, Roland Dreier wrote:

> (Christoph, I tested this with "possible_cpus=32" and it still works for
> me -- if you get a chance on your Dell systems that would be helpful too)

Works here.

Tested-by: Christoph Lameter <cl at linux-foundation.org>


From Rafael.Tinoco at Sun.COM  Mon Aug 24 11:46:04 2009
From: Rafael.Tinoco at Sun.COM (Rafael David Tinoco)
Date: Mon, 24 Aug 2009 15:46:04 -0300
Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH
	topology.
Message-ID: <4A92DFEC.3010300@Sun.COM>

Hello,

I'm installing an HPC cluster using 2 Sun Blade 6048s with QNEMs (2
ASICs each, 8 QNEMs).
They are configured in a MESH topology.
I'm using CentOS 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.

I'm booting via PXE over IB; my initrd image brings up the ib0 interface,
fetches the squashfs image and mounts it with aufs.

The problem is that when booting more than 60 nodes, I start to get the errors
below on the subnet manager.
The problem seems to be intermittent, because each time it reports
errors on a different path.

Any ideas?

Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713838 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713842 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713847 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713866 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713871 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
Aug 24 15:36:19 714805 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:19 714822 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 21. Adding to light sweep sampling list
Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:19 714831 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 25. Adding to light sweep sampling list
Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- 
dropping
Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR 
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
                Initial path = 0,0,0,0,0,0
                Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: 
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
                base_ver................0x1
                mgmt_class..............0x81
                class_ver...............0x1
                method..................0x1 (SubnGet)
                D bit...................0x0
                status..................0x0
                hop_ptr.................0x0
                hop_count...............0x5
                trans_id................0x36595
                attr_id.................0x15 (PortInfo)
                resv....................0x0
                attr_mod................0x0
                m_key...................0x0000000000000000
                dr_slid.................65535
                dr_dlid.................65535

                Initial path: 0,1,15,15,15,19
                Return path:  0,0,0,0,0,0
                Reserved:     [0][0][0][0][0][0][0]

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- 
dropping
Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR 
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
                Initial path = 0,0,0,0,0,0
                Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: 
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
                base_ver................0x1
                mgmt_class..............0x81
                class_ver...............0x1
                method..................0x1 (SubnGet)
                D bit...................0x0
                status..................0x0
                hop_ptr.................0x0
                hop_count...............0x5
                trans_id................0x36596
                attr_id.................0x15 (PortInfo)
                resv....................0x0
....
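
When reading dumps like the one above, the hop count can be checked against
the printed path: the log shows "0,1,15,15,15" as a 4 hop path and the
six-entry initial path "0,1,15,15,15,19" with hop_count 0x5, so the count is
the number of entries minus the leading 0. A small helper for that check,
with a hypothetical name and interface, could look like this:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a comma-separated directed-route path such as "0,1,15,15,15,19"
 * into ports[]. Returns the hop count (entries minus the leading entry),
 * or -1 if the string is malformed or has more than max_ports entries.
 */
int parse_dr_path(const char *s, unsigned char *ports, int max_ports)
{
    const char *p = s;
    int n = 0;

    assert(s != NULL && ports != NULL);
    while (*p != '\0') {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p || v < 0 || v > 255 || n >= max_ports)
            return -1;
        ports[n++] = (unsigned char)v;
        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;      /* trailing comma */
        } else if (*end == '\0') {
            p = end;
        } else {
            return -1;          /* unexpected character */
        }
    }
    return n > 0 ? n - 1 : -1;
}
```

The last entry of the failing initial path (19 here) is the switch port named
in the preceding ERR 3315 message, which helps tie each timeout to a link.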



From stan.smith at intel.com  Mon Aug 24 18:01:51 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Mon, 24 Aug 2009 18:01:51 -0700
Subject: [ofa-general] WinOF 2.1 RC 4 available for download.
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1912C58BC901@orsmsx501.amr.corp.intel.com>


WinOF 2.1 Release Candidate #4 (RC4) is available @ http://www.openfabrics.org/downloads/WinOF/v2.1-RC4/

<Refresh browser view before download> just in case.

Please address comments and concerns to the ofw at lists.openfabrics.org


Changes since RC3
-----------------

  All RC4 WinOF installers are signed with a new OpenFabrics Alliance Digital-ID
  from Verisign. Only unattended installs (HPC compute nodes generally) are
  affected. Console installs will be prompted for an answer to the question
  'Do you trust drivers from the OpenFabrics Alliance 3rd party SW Publisher?'
  Check the 'always trust the OpenFabrics Alliance SW Publisher' box
  and answer OK to install drivers.

Server 2008 HPC installs only:

  The implication for HPC compute node installation is that the head-node script
  '%ProgramFiles(x86)%\WinOF\HPC\cert-add.bat' must be run to install the 'new'
  OFA cert in the compute nodes' cert store prior to performing an unattended
  WinOF install; otherwise the unattended install attempts to prompt for an
  answer to 'Do you trust the OpenFabrics Alliance 3rd party SW Publisher?'
  and fails as it's unattended (no one home to answer).
  See Release_notes.htm (Server 2008 install) and HPC\ReadMe-HPC.txt for
  details.

SVN Commits

2381 - [WinOF] RC4 staging; again.
    Modified : /gen1/branches/WOF2-1/WinOF/buildrelease.bat
    Modified : /gen1/branches/WOF2-1/WinOF/Wix/ReadMe_release.txt
    Modified : /gen1/branches/WOF2-1/WinOF/Wix/Release_notes.htm

2380 - [LIBRDMACM] Fix a potential race with ucma_init() and calls that check whether the library
       is ready for use.
    Modified : /gen1/branches/WOF2-1/ulp/librdmacm/src/cma.cpp

2379 - [WinOF] OFA Digital-ID expired 8/20/09, added new OFA cert signature so the 'new' cert can
       be added to compute nodes cert store.
    Modified : /gen1/branches/WOF2-1/WinOF/WIX/HPC/cert-add.bat

2373 - [WinOF] RC4 staging.
    Modified : /gen1/branches/WOF2-1/WinOF/buildrelease.bat
    Modified : /gen1/branches/WOF2-1/WinOF/Wix/common/WinOF_cfg.inc
    Modified : /gen1/branches/WOF2-1/WinOF/Wix/ReadMe_release.txt
    Modified : /gen1/branches/WOF2-1/WinOF/Wix/Release_notes.htm

2372 - [DAPL2] Completion Channel refactoring
    Modified : /gen1/branches/WOF2-1/etc/user/comp_channel.cpp
    Modified : /gen1/branches/WOF2-1/inc/user/comp_channel.h
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/dapl/openib_cma/device.c
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/dapl/openib_scm/device.c

2371 - [DAPL2] dapltest.exe: yield the processor as the Windows thread scheduler will starve
               other threads unlike the Linux scheduler.
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/linux/dapl_mdep_user.h
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/solaris/dapl_mdep_user.h
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/windows/dapl_mdep_user.h
    Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/test/dapl_test_util.c

2370,2369 [WIX]
    wix\win7\x64\wof.wxs: added comment indicating Win7_x64 installer was for Server 2008 R2 also.
    WIX/build-OFA-dist.bat: update usage instructions to reflect WinOF is now under Trunk\ and no longer under branches\

2368 - [DAPL2] dt-cli.bat: %ERRORLEVEL% inside a for loop does not evaluate as expected; change to !ERRORLEVEL!
2367 - [DOCS] manual.htm: Update DAPL provider text w.r.t. names in DAT.conf file.
2364 - [IPOIB] IPoIB PXE boot support: Clear remainder of chaddr
    The IPoIB PXE boot firmware (gPXE) now sends the 8-byte port GUID in
    the DHCP chaddr field.  WinOF replaces the first 6 bytes of chaddr
    with the Ethernet-style MAC address, but leaves the remainder untouched.
    This results in trailing garbage after the Ethernet-style MAC in the modified chaddr.
    Fix by explicitly zeroing the remainder of chaddr.
    Modified : /gen1/branches/WOF2-1/ulp/ipoib/kernel/ipoib_port.c
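
    The chaddr fix in 2364 amounts to the following, shown here as a plain C
    sketch rather than the actual ipoib_port.c change. The function name is
    hypothetical; the 16-byte chaddr size is the field length from the
    BOOTP/DHCP header (RFC 2131).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DHCP_CHADDR_LEN 16  /* chaddr field size in the BOOTP/DHCP header */
#define ETH_MAC_LEN     6

/*
 * Overwrite chaddr with a 6-byte Ethernet-style MAC and zero the rest, so
 * no bytes of a previously stored 8-byte port GUID are left behind.
 */
void set_chaddr_mac(uint8_t chaddr[DHCP_CHADDR_LEN],
                    const uint8_t mac[ETH_MAC_LEN])
{
    assert(chaddr != NULL && mac != NULL);
    memcpy(chaddr, mac, ETH_MAC_LEN);
    memset(chaddr + ETH_MAC_LEN, 0, DHCP_CHADDR_LEN - ETH_MAC_LEN);
}
```

    Without the memset(), bytes 6 and 7 of the original GUID would survive
    after the MAC, which is the trailing garbage described above.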

2363 - [IBAT] allow simultaneous IBAT device access from user mode by adding RW sharing attributes to CreateFileW() call.
    Modified : /gen1/branches/WOF2-1/core/ibat/user/ibat.cpp
2362 - [LIBRDMACM] retry IBAT call on E_PENDING return.
    Modified : /gen1/branches/WOF2-1/ulp/librdmacm/src/cma.cpp

2361 - [MLX4] on catastrophic error, dump error buffer before reset. [winof: 2358]
    Modified : /gen1/branches/WOF2-1/hw/mlx4/kernel/bus/net/catas.c

2360 - [MLX4] bug fix in error flow: doesn't return error on allocation failure. [winof: 2356]
    Modified : /gen1/branches/WOF2-1/hw/mlx4/kernel/bus/core/l2w_umem.c

2359 - [IBAL] Fix to 2226: cause an asynchronous event to be handled immediately (and not after SMI_POLL_INTERVAL, which is 20 secs).
    Modified : /gen1/branches/WOF2-1/core/al/al_ci_ca_shared.c
    Modified : /gen1/branches/WOF2-1/core/al/kernel/al_pnp.c

2353 - [ND provider] Patch to fix 2333: remove the facility to define MaxDataInlineSize from the application, because it breaks the MS API.
   [ND provider] Improved latency of ND provider by using INLINE send. [winof: 2333, 2352]
   This patch adds usage of INLINE DATA facility of Mellanox HCAs for improving latency of ND provider.
   Ideas of the patch:
    - by default, ND provider will create QP with inline data of 160 bytes;
    (this can enlarge user's QP size)
    - one can change this default by defining environment variable IBNDPROV_MAX_INLINE_SIZE;

   Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdEndpoint.cpp
   Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdEndpoint.h
   Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdProv.cpp
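
   The environment override described for the ND provider can be pictured
   with the sketch below. Only the variable name IBNDPROV_MAX_INLINE_SIZE and
   the 160-byte default come from the notes above; the function names and the
   parsing rules (decimal digits only, fall back to the default on anything
   else) are assumptions for illustration, not the provider's actual code.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_MAX_INLINE_SIZE 160UL   /* default stated in the notes */

/*
 * Parse a decimal size. Returns the default when the value is missing,
 * empty, contains a non-digit character, or overflows unsigned long.
 */
unsigned long parse_inline_size(const char *value)
{
    const char *p;
    unsigned long v;

    if (value == NULL || *value == '\0')
        return DEFAULT_MAX_INLINE_SIZE;
    for (p = value; *p != '\0'; p++)
        if (!isdigit((unsigned char)*p))
            return DEFAULT_MAX_INLINE_SIZE;
    errno = 0;
    v = strtoul(value, NULL, 10);
    if (errno == ERANGE)
        return DEFAULT_MAX_INLINE_SIZE;
    return v;
}

/* Read the override from the environment, if one is set. */
unsigned long max_inline_size(void)
{
    return parse_inline_size(getenv("IBNDPROV_MAX_INLINE_SIZE"));
}
```

   A value of 0 is passed through unchanged here; whether that would disable
   inline sends in the provider is not stated in the notes.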

2351 - [IPOIB] Prevent a BSOD which happens when restarting the opensm more than once
              (if the local endpoint was not in the lid_endpts list).
    Modified : /gen1/branches/WOF2-1/ulp/ipoib/kernel/ipoib_port.c
    Modified : /gen1/trunk/ulp/ipoib/kernel/ipoib_port.c


WinOF 2.1 Summary
-----------------

1) The WinOF 2.1 release is based on openib-windows source svn revision
   (branches\WOF2-1 svn.2381).

   Last WinOF release (2.0.2) based on svn.1975.

2) Bug fixes in

   IB Core
   IPoIB
   WSD
   SRP
   DAT/DAPL
   WinVerbs
   WinMAD
   OFED (Open Fabrics Enterprise Distribution [Linux]) verbs API
   OFED Diagnostic utilities
   WinOF Installer

3) Integrated Functionality

  - OFED Compatibility layers allow for easy porting of OFED applications
    into the WinOF environment.
        libibverbs - OFED verbs API library.
        libmad - InfiniBand MAD (Management Datagram) library.
        libumad - IB MAD exported user-mode interface library.
        librdmacm - OFED RDMA CM (Communications Manager).

  - OFED Fabric Diagnostics available ( for usage info, see --help ).
       ibaddr - query InfiniBand address(es)
       ibnetdiscover - generate a fabric topology.
       iblinkinfo - report link info for all links in the fabric
       ibping - ping an InfiniBand address
       ibportstate - manage IB port (physical) state and link speed
       ibqueryerrors - query and report non-zero IB port counters
       ibroute - query InfiniBand switch forwarding tables
       ibstat - display HCA information.
       ibsysstat - system status for an InfiniBand address
       ibtracert - trace InfiniBand path
       saquery - SA (Subnet Administrator) query test.
       sminfo - query InfiniBand SMInfo attributes
       smpdump - dump InfiniBand subnet management attributes
       smpquery - query InfiniBand subnet management attributes
       vendstat - query InfiniBand vendor specific functions

4) New Functionality

  - All WinOF installs now utilize the Windows Driver Store along with the
     Plug-n-Play (PNP) subsystem to install the correct HCA driver(s).
     Selection of a specific Mellanox HCA device type is no longer required.

  - Server 2008-HPC install support has been enhanced to provide a no-drivers
     installed mode to ease WinOF installation when drivers have been previously
     installed with WDM (Windows Deployment Manager) node templates.
     From an msiexec.exe command line when NODRV=1, device driver '.inf' files
     are not processed during the WinOF install.
     The base assumption is the WDM node provisioning template (see cluster
     Manager) will install HCA drivers. All other WinOF files are installed
     to the standard WinOF location '%ProgramFiles(x86)%\WinOF'.
     When uninstalling a WinOF install which was done with NODRV=1, you MUST
     include NODRV=1 on the msiexec.exe uninstall command line or the uninstall
     will uninstall HCA drivers installed via WDM templates.

     Incorporating a msiexec based WinOF install into a node provisioning
     template works well.
     See examples '%ProgramFiles(x86)%\WinOF\HPC\ReadMe-HPC.txt'

     For 'first' time HPC WinOF installs or node provisioning with WinOF drivers
     via WDM, the batch script cert-add.bat, in '%ProgramFiles(x86)%\WinOF\HPC',
     should be used to extract the 3rd Party Software Publisher certificate
     from the WinOF_2-1_wlh_x64.msi installer and insert it into the certificate
     store of every compute node.
     Suggested procedure: install WinOF on the head node, then run 'cert-add.bat'
     from the head node; this requires a common share visible from all remote
     nodes prior to execution.
     For WDM node provisioning, invoke cert-add.bat followed by
     WinOF-Install.bat from the node provisioning template.

     Examples

        unattended install (for use with clusrun.bat)
           start/wait msiexec /I WOF.msi /quiet NODRV=1

        console based non-interactive install with progress window:
           start/wait msiexec /I WOF.msi /passive NODRV=1

        Install selectable features (No drivers):
          start/wait msiexec /I WOF.msi NODRV=1

        Extract WinOF install files (aka driver files for WDM install)
          start/wait msiexec /A WinOF_wlh_x64.msi TARGETDIR=%TEMP%

          The folder %TEMP%\PFiles\WinOF will be created.

        console based unattended uninstall with auto-reboot:
          start/wait msiexec /X WOF.msi /passive

        clusrun unattended uninstall with auto-reboot
          start/wait msiexec /X WOF.msi /quiet /forcereboot

  - Subnet Management started as a local Windows Service from a command line:
        start/wait msiexec /I WOF.msi /passive OSMS=1

  - HCA drivers now load WinVerbs and WinMad filter drivers by default.


  - ndinstall.exe command line interface changes - see manual.htm

5) Technology Previews

   DAPL2 Socket-CM provider
    Uses IPv4 sockets to exchange Queue pair setup information, thus bypassing IB
    Path Record lookups.

   DAPL rdma-CM
    Compatible with OFED rdma-CM interfaces; facilitates IB application portability
    between Linux/OFED and Windows/WinOF.


6) Vista installs Only:

  Vista installs must be performed from an Administrator privileged command
  window. Right-clicking the .msi installer file for a Vista installation
  will fail due to insufficient privileges to install the HCA driver!
  From the Administrator privileged cmd-window (Interactive install) run

    start/wait msiexec /I WinOF_wlh_xxx.msi
          -or-
    a quiet, default install:  start/wait msiexec /I WinOF_wlh_xxx.msi /passive


**** WARNING ****

After the WinOF.msi file has started installation execution, an errant
"Welcome to the Found New Hardware Wizard" window 'may' popup.

Just ignore or 'cancel' the errant FNHW popup window in order to proceed with
the installation. XP requires a cancel, for WLH & WNET, the notifiers will
disappear on their own.

You do need to answer 'Yes' or 'Continue' to those popup windows which refer to
non-WHQL'ed drivers.

If the install appears to hang, look around for popup windows requesting input
which are covered by other windows. Such is the case on Server 2008 initial
install - Answer 'yes' to always trust the OpenFabrics Alliance as a SW publisher.


Please:
  read the Release_notes.htm file!
  make 'sure' your HCA firmware is recent; vstat.exe displays HCA firmware
  version.

Thank you,

WinOF Developers.


From weiny2 at llnl.gov  Mon Aug 24 18:52:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Aug 2009 18:52:06 -0700
Subject: [ofa-general] Combined DR path with empty DR path,
	what is the expected behavior?
Message-ID: <20090824185206.39e5e377.weiny2@llnl.gov>

If I send a combined DR path with a start LID but an empty (0 length) DR
path, what is the expected behavior?

I know this could be specified with LID routing, but I don't see anywhere in
the specification which says this is an error.  I do however seem to have 2
different implementations on 2 different switches.  For example:

I have Switch A (Lid 1) and Switch B (Lid 7).  I attempt to query PortInfo of
Port 1 of each switch using the LID followed by an empty DR path.

17:55:22 > ./smpquery -c portinfo 1 0 1
ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1)
./smpquery: iberror: failed: operation portinfo: port info query failed


17:55:31 > ./smpquery -c portinfo 7 0 1
# Port info: Lid 7 port 1
Mkey:............................0x0000000000000000
GidPrefix:.......................0x0000000000000000
...
<normal output snipped>

Detecting this special case in libibmad and turning the packet into a LID
routed one succeeds but I wonder if this is an error in the SMI?  I also
notice this is an error on the HCA I am running from (lid 2).

17:57:42 > ./smpquery -c portinfo 2 0 1
ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2)
./smpquery: iberror: failed: operation portinfo: port info query failed
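
The special-case handling described above (detect an empty DR part and send
a LID routed packet instead) might look roughly like the sketch below. This is
an illustration only: the names Route, parse_dr_path and choose_route are made
up and are not libibmad's API, and it assumes a DR path string such as "0" or
"0,1" in which the first entry is always 0 and every further entry is one hop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    kind: str              # "lid", "dr" or "combined" (hypothetical labels)
    lid: Optional[int]     # start LID, if any
    dr_path: List[int]     # entry 0 is 0; remaining entries are exit ports


def parse_dr_path(text: str) -> List[int]:
    """Turn a path string like "0,1,15" into a list of port numbers."""
    return [int(part) for part in text.split(",")]


def choose_route(lid: Optional[int], dr_text: str) -> Route:
    """Pick a routing mode; an empty DR part after a LID becomes LID routed."""
    path = parse_dr_path(dr_text)
    hop_count = len(path) - 1          # "0" alone means zero hops
    if lid is not None and hop_count == 0:
        return Route("lid", lid, [])   # the workaround: drop the empty DR part
    if lid is None:
        return Route("dr", None, path) # plain directed route, e.g. -D 0
    return Route("combined", lid, path)


print(choose_route(7, "0").kind)
print(choose_route(None, "0").kind)
print(choose_route(1, "0,3").kind)
```

Whether such a rewrite belongs in the library at all depends on the spec
question raised in this mail; the sketch only shows where the decision would
be made.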

Running with a simple DR path works, I guess because this is the loopback case
mentioned on page 805.

17:58:16 > ./smpquery -D portinfo 0 1
# Port info: DR path slid 65535; dlid 65535; 0 port 1
Mkey:............................0x0000000000000000
GidPrefix:.......................0x2007000000000000
...
<snip>

I guess that the comment "Since each part may be empty, there are eight
combinations, although only four are really useful:" on line 36 Page 805 can
be interpreted to mean that only those 4 combinations need to be supported.
Is this true?

On the other hand I think strictly this should be supported.  Item 4 of C14-9
(line 24 page 810) requires the SMI to handle the packet if the HopPointer
equals HopCount +1, which it is in my case (HopCount == 0, HopPointer == 1).
Then after processing the SMI should return the packet as specified in C14-13
item 3 on line 9 page 812.

Am I wrong?  In the end it does not matter as I have to make the software work
for all the hardware I have; so I will change the software.  However, I wonder
where exactly the spec falls on this, because I think it will influence where
the fix resides.  If the spec does not allow this then I think it is fine to
have libibmad return an error since the user specified an invalid combined DR
path.  However, if this should be legal I think libibmad should work around
the bad hardware out there.

Thoughts?
Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From ofedrnicuser at yahoo.com  Mon Aug 24 21:25:51 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Mon, 24 Aug 2009 21:25:51 -0700 (PDT)
Subject: [ofa-general] when to use get_dma_mr,
	which doesn't take physical buffer list size argument
Message-ID: <612829.47304.qm@web111214.mail.gq1.yahoo.com>

Hi,

When should get_dma_mr() be used instead of reg_phys_mr()?

Looking at the current implementations of get_dma_mr() in the Chelsio and NetEffect drivers, they seem to register all possible system memory with the device: 4GB for Chelsio and 64GB for NetEffect.

Isn't this a hole when the system has less memory than the address bus can cover? (Say, 2GB of physical memory while we register 4GB?)

Is get_dma_mr() used by kernel space ULPs with only SGEs (stag=0) instead of stag based send(), recv()?

Can a user of get_dma_mr() re-register a smaller memory region? If yes, how?

Is get_dma_mr() not equivalent to
9.2.6.1 Allocate Non-Shared Memory Region STag of the iWARP RDMAC specification?

Regards,
Bill


From vlad at dev.mellanox.co.il  Tue Aug 25 02:02:19 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 25 Aug 2009 12:02:19 +0300
Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.22 release
In-Reply-To: <E3280858FA94444CA49D2BA02341C9835A17285D@orsmsx506.amr.corp.intel.com>
References: <E3280858FA94444CA49D2BA02341C9835A17285D@orsmsx506.amr.corp.intel.com>
Message-ID: <4A93A89B.7080200@dev.mellanox.co.il>

Davis, Arlin R wrote:
> Vlad, please pull new v2 package into OFED 1.5 beta build and install the following:
>  
> dapl-2.0.22-1 
> dapl-utils-2.0.22-1 
> dapl-devel-2.0.22-1 
> dapl-debuginfo-2.0.22-1 
> compat-dapl-1.2.14-1 
> compat-dapl-devel-1.2.14-1 
>
> See http://www.openfabrics.org/downloads/dapl/ for more details.
>
> -arlin
>
>   
Done,

Regards,
Vladimir


From vlad at lists.openfabrics.org  Tue Aug 25 03:08:34 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 25 Aug 2009 03:08:34 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090825-0200 daily build status
Message-ID: <20090825100835.0CC58E61DD3@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From kovlensky at interia.pl  Tue Aug 25 09:25:16 2009
From: kovlensky at interia.pl (kovlensky at interia.pl)
Date: 25 Aug 2009 18:25:16 +0200
Subject: [ofa-general] ofed 1.3.2 opensmd failover
Message-ID: <20090825162517.7955C21C827@f28.poczta.interia.pl>

Hi all,

Quick question - is there a need to run anything except opensmd daemons to provide failover capability on an IB network in OFED 1.3? I'm aware that when the master manager dies the standby one comes in and manages the network, but that does not necessarily mean that LIDs are preserved, especially for nodes joining in. I used to run sldd.sh for distributing the LID list on OFED 1.2.5, but while this script seems to be in place, no one mentions the necessity for it.

So subnet manager failover is provided by running a standby opensm. And how is LID preservation provided?

Regards,

Zdenek Kovlensky



From akepner at sgi.com  Tue Aug 25 14:12:50 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 25 Aug 2009 14:12:50 -0700
Subject: [ofa-general] [TRIVIAL PATCH] ibutils: fix regexp for pkey matching
Message-ID: <20090825211250.GI16590@sgi.com>


There's an error in a regular expression for matching pkeys 
in ibdebug.tcl. The following fixes it.

Signed-off-by: Arthur Kepner <akepner at sgi.com>
---


diff -rup a/ibutils-1.2/ibdiag/src/ibdebug.tcl b/ibutils-1.2/ibdiag/src/ibdebug.tcl
--- a/ibutils-1.2/ibdiag/src/ibdebug.tcl	2009-08-25 12:38:45.646392453 -0700
+++ b/ibutils-1.2/ibdiag/src/ibdebug.tcl	2009-08-25 12:39:23.180706933 -0700
@@ -3048,7 +3048,7 @@ proc GetPortPkeys {drPath portNum numPKe
 	    continue
 	}
 	foreach pkey $pkeyTable {
-	    if {[regexp {^0x[0-9a-fA-F]$} $pkey]} {
+	    if {[regexp {^0x[0-9a-fA-F]+$} $pkey]} {
 		lappend pkeys $pkey
 	    }
 	}
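
For anyone who wants to see the effect of the one-character change, here is a
small sketch of the two patterns side by side. It uses Python's re module
rather than Tcl (the two behave the same for this simple pattern), and the
sample table values are invented for illustration, not taken from ibdiag
output.

```python
import re

# Pattern before the patch: "0x" followed by exactly one hex digit.
OLD = re.compile(r"^0x[0-9a-fA-F]$")
# Pattern after the patch: "0x" followed by one or more hex digits.
NEW = re.compile(r"^0x[0-9a-fA-F]+$")


def filter_pkeys(table, pattern):
    """Keep only the entries that the pattern accepts as pkeys."""
    return [entry for entry in table if pattern.match(entry)]


# Hypothetical pkey table entries, including two that should be rejected.
table = ["0xffff", "0x7fff", "0x0", "0", "0xzz"]

print(filter_pkeys(table, OLD))   # ['0x0']
print(filter_pkeys(table, NEW))   # ['0xffff', '0x7fff', '0x0']
```

With the old pattern a normal 16-bit pkey such as 0xffff is silently dropped,
which is the error the patch fixes.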


From hal.rosenstock at gmail.com  Tue Aug 25 14:59:16 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Aug 2009 17:59:16 -0400
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To: <20090825162517.7955C21C827@f28.poczta.interia.pl>
References: <20090825162517.7955C21C827@f28.poczta.interia.pl>
Message-ID: <f0e08f230908251459j2657ef17o1a0b7c5abc836267@mail.gmail.com>

On 8/25/09, kovlensky at interia.pl <kovlensky at interia.pl> wrote:
>
> Hi all,
>
> Quick question - is there a need to run anything except opensmd daemons to
> provide failover capability on an IB network in OFED 1.3?


In terms of SM failover, modulo bugs fixed relative to this feature since
OFED 1.3 (there are a couple of things here which may affect your
environment if I recall correctly), you only need to run more than 1 SM for
this (one will become master, the other standby).

> I'm aware that when master manager dies standby one comes in and manages
> the network, but that does not necessarily mean that lids are preserved,
> especially for nodes joining in. I used to run sldd.sh for distributing lids
> list on ofed 1.2.5, but while this script seems to be in place no one
> mentions necessity for it.
>
> So subnet manager failover is provided by running standby opensm.
> And how LID preservation is provided?


If you want LIDs to be preserved, the guid2lid file needs to be sync'd
(copied from the master SM once it's fully assembled to the node which is
running the standby SM). That's what the sldd.sh script does.
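
As a rough illustration of the "copy when changed" idea (not the contents of
sldd.sh, which are not shown in this thread), a sync step could be sketched
as below. The function name and both file paths are hypothetical; in a real
setup the standby copy lives on another node, so the copy would go over a
shared filesystem or a remote copy tool rather than a local rename.

```python
import os
import tempfile


def sync_guid2lid(master_copy, standby_copy):
    """Make standby_copy identical to master_copy; return True if it changed."""
    with open(master_copy, "rb") as f:
        data = f.read()
    try:
        with open(standby_copy, "rb") as f:
            if f.read() == data:
                return False            # already in sync, nothing to do
    except FileNotFoundError:
        pass                            # standby has no file yet

    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(standby_copy))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, standby_copy)
    return True
```

Running this periodically on the node that holds the master's file would keep
the standby's view current; how often to run it is a site decision.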

-- Hal

> Regards,
>
> Zdenek Kovlensky
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com  Tue Aug 25 15:04:55 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Aug 2009 18:04:55 -0400
Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH 
	topology.
In-Reply-To: <4A92DFEC.3010300@Sun.COM>
References: <4A92DFEC.3010300@Sun.COM>
Message-ID: <f0e08f230908251504m4aec4233ke6aa5b009ce1232c@mail.gmail.com>

On 8/24/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>
> Hello,
>
> I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics
> each, 8 qnems).
> They are configured in a MESH topology.
> I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>
> I'm booting PXE from IB, my initrd image is bringing the ib0 interface,
> getting the squashfs image and mounting with aufs.
>
> The problem is.. When booting more than 60 nodes, I start to get the errors
> below on the subnet manager.
> And the problem seems to be intermittent, because each time it gives errors
> on a different path.
>
> Any ideas ?
>
> Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of
> node:b03n06 HCA-1
> Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of
> node:b03n04 HCA-1
> Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of
> node:b03n11 HCA-1
> Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of
> node:b03n08 HCA-1
> Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008daedd LID range [83,83] of
> node:b03n12 HCA-1
> Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
> Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) --
> dropping
> Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
> Hop Ptr: 0x0
> Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>                 Initial path = 0,0,0,0,0,0
>                 Return path  = 0,0,0,0,0,0
> Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
>                 base_ver................0x1
>                 mgmt_class..............0x81
>                 class_ver...............0x1
>                 method..................0x1 (SubnGet)
>                 D bit...................0x0
>                 status..................0x0
>                 hop_ptr.................0x0
>                 hop_count...............0x5
>                 trans_id................0x36595
>                 attr_id.................0x15 (PortInfo)
>                 resv....................0x0
>                 attr_mod................0x0
>                 m_key...................0x0000000000000000
>                 dr_slid.................65535
>                 dr_dlid.................65535
>
>                 Initial path: 0,1,15,15,15,19
>                 Return path:  0,0,0,0,0,0
>                 Reserved:     [0][0][0][0][0][0][0]
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
> Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) --
> dropping
> Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
> Hop Ptr: 0x0
> Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>                 Initial path = 0,0,0,0,0,0
>                 Return path  = 0,0,0,0,0,0
> Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
>                 base_ver................0x1
>                 mgmt_class..............0x81
>                 class_ver...............0x1
>                 method..................0x1 (SubnGet)
>                 D bit...................0x0
>                 status..................0x0
>                 hop_ptr.................0x0
>                 hop_count...............0x5
>                 trans_id................0x36596
>                 attr_id.................0x15 (PortInfo)
>                 resv....................0x0
> ....
>

These errors are transient as you indicate. They mean that some node has
brought the link physically up but there is no SMA at the remote side of the
link. The different paths are paths to the HCAs. This occurs during PXE boot
as the node transitions from the boot ROM to the Linux environment.

Other than these messages, do things seem to work in terms of the end nodes?

-- Hal


From kalaiya.1 at buckeyemail.osu.edu  Tue Aug 25 16:03:41 2009
From: kalaiya.1 at buckeyemail.osu.edu (MANIKANTAN KALAIYA)
Date: Tue, 25 Aug 2009 23:03:41 +0000
Subject: [ofa-general] Number of devices returned by ibv_get_device_list() 
In-Reply-To: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
Message-ID: <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com>

Resending to the mailing list...

We have OFED 1.3.1 installed; one of its sub packages is libibverbs version 1.1.1. We have a small program that lists the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.

The system has two IB cards, and the value returned by ibv_get_device_list() in 'num_devices' is two, as expected.

However, when we disable one of the cards using the modprobe command, the program continues to report two cards present (monitoring is continuous in a while loop).
Killing and restarting the sample test process results in the correct number of IB cards being reported (it returns one after the restart). One of the prior versions was known to report the correct number of IB cards without requiring a restart of the program.

We would like to determine the number of cards present without having to go through a restart. Any input on this behavior is appreciated.

modprobe command - "sudo modprobe -r ib_mthca"

Test program:
=================================================
#include <stdio.h>
#include <unistd.h>             /* sleep() */
#include <infiniband/verbs.h>

int main(int argc, char **argv)
{
    int num_devices;
    struct ibv_device      **dev_list;

    while(1) {

        /* Re-query the device list on every iteration. */
        dev_list = ibv_get_device_list(&num_devices);

        if (dev_list && num_devices != 0) {
            printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
        }
        else {
            printf("IB ADAPTER UNAVAILABLE\n");
        }
        sleep(2);
        /* ibv_get_device_list() can return NULL; free only a valid list. */
        if (dev_list)
            ibv_free_device_list(dev_list);
    }

    return(0);
}
=================================================

Thanks,
Mani.

From hal.rosenstock at gmail.com  Tue Aug 25 16:15:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Aug 2009 19:15:19 -0400
Subject: [ofa-general] Combined DR path with empty DR path, what is the 
	expected behavior?
In-Reply-To: <20090824185206.39e5e377.weiny2@llnl.gov>
References: <20090824185206.39e5e377.weiny2@llnl.gov>
Message-ID: <f0e08f230908251615x79f2f87cwcba95c0f7e743bfe@mail.gmail.com>

On 8/24/09, Ira Weiny <weiny2 at llnl.gov> wrote:

> If I send a combined DR path with a start lid but an empty (0 length) DR
> path.


Hop Count 0 ?


> What is the expected behavior?


Not sure what you mean by expected here. Are you referring to expectation
based on the spec ?


> I know this could be specified with LID routing, but I don't see anywhere
> in
> the specification which says this is an error.


I don't think it should be an error (certainly not for the form you are
using: a LID routed part followed by a DR part), but a null DR part is a little
funny/odd.


> I do however seem to have 2
> different implementations on 2 different switches.  For example:
>
> I have Switch A (Lid 1) and Switch B (Lid 7).  I attempt to query PortInfo
> of
> Port 1 of each switch using the LID followed by an empty DR path.
>
> 17:55:22 > ./smpquery -c portinfo 1 0 1
> ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1)
> ./smpquery: iberror: failed: operation portinfo: port info query failed


Is this a timeout ?


> 17:55:31 > ./smpquery -c portinfo 7 0 1
> # Port info: Lid 7 port 1
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0x0000000000000000
> ...
> <normal output snipped>
>
> Detecting this special case in libibmad and turning the packet into a LID
> routed one


Ugh... Is this special case really needed ? I don't think the underlying
issue is understood sufficiently yet.


> succeeds but I wonder if this is an error in the SMI?


Switch SMI ? Is this a proprietary implementation ?


>   I also notice this is an error on the HCA I am running from (lid 2).


Is this HCA node OpenIB based ?

> 17:57:42 > ./smpquery -c portinfo 2 0 1
> ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2)
> ./smpquery: iberror: failed: operation portinfo: port info query failed


Is this also a timeout ?

Also, does the result differ based on where you source these from
(locally vs. remotely)?


> Running with a simple DR path works,


You're referring to the same DR path here that fails in the combined route
examples above, right ?


> I guess because this is the loopback case mentioned on page 805.


Yes but that's the high level requirement rather than the SMI rules which
make that work.


> 17:58:16 > ./smpquery -D portinfo 0 1
> # Port info: DR path slid 65535; dlid 65535; 0 port 1
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0x2007000000000000
> ...
> <snip>
>
> I guess that the comment "Since each part may be empty, there are eight
> combinations, although only four are really useful:" on line 36 Page 805
> can
> be interpreted to mean that only those 4 combinations need to be supported.
> Is this true?


Not all 4 combinations are supported/known to work. When this was added for
ibportstate, the only combined routing form that was important was LID
routed part followed by a DR part.


> On the other hand I think strictly this should be supported.


In an ideal world yes but are they all required or is it just the one form
most heavily used ?


>   Item 4 of C14-9
> (line 24 page 810) requires the SMI to handle the packet if the HopPointer
> equals HopCount +1, which it is in my case (HopCount == 0, HopPointer == 1)


By handle, this means "The SMI *shall* output the packet on the port whose
number is in the entry indexed by Hop Pointer in the Initial Path. If that
port number is invalid, the SMI *shall* discard the SMP."

Are you sure the Hop Pointer is 1 ? Where do you see this ?

If so, what's the initial path at this point (or more specifically index 1
of the initial path) ? I think that needs to be port 0 (if a switch) but
this is a little weird as I would think it should be handed to the SMA which
is different cases in the spec.


> Then after processing


by the SMA and doing the required returning initialization

> the SMI should return the packet as specified in C14-13
> item 3 on line 9 page 812.


I'm not sure it would use this case in the case of an empty DR path on
return.

> Am I wrong?  In the end it does not matter as I have to make the software
> work
> for all the hardware I have; so I will change the software.


IMO it does matter as to where the problem lies (SMI or otherwise) and how
the layers are comprised in the implementation.

> However, I wonder
> where exactly the spec falls on this, because I think it will influence
> where
> the fix resides.  If the spec does not allow this then I think it is fine
> to
> have libibmad return an error since the user specified an invalid combined
> DR
> path.  However, if this should be legal I think libibmad should work around
> the bad hardware out there.


Is it hardware or firmware that needs fixing ? I think it may depend on the
specific workaround for this as to whether it is acceptable as it might harm
something else or might violate the spec.

-- Hal


> Thoughts?
> Ira
>
> --
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2 at llnl.gov
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From hnrose at comcast.net  Tue Aug 25 16:20:24 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 25 Aug 2009 19:20:24 -0400
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: Only change method when >
	rather than >=
Message-ID: <20090825232024.GA17650@comcast.net>


Also, cosmetic formatting change to combine lines like:
	uint16_t host_attr;
	host_attr = cl_ntoh16(attr);

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 23392a4..3692474 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -458,12 +458,12 @@ const char *ib_get_sa_method_str(IN uint8_t method)
 {
 	if (method & 0x80) {
 		method = method & 0x7f;
-		if (method >= OSM_SA_METHOD_STR_UNKNOWN_VAL)
+		if (method > OSM_SA_METHOD_STR_UNKNOWN_VAL)
 			method = OSM_SA_METHOD_STR_UNKNOWN_VAL;
 		/* it is a response - use the response table */
 		return (__ib_sa_resp_method_str[method]);
 	} else {
-		if (method >= OSM_SA_METHOD_STR_UNKNOWN_VAL)
+		if (method > OSM_SA_METHOD_STR_UNKNOWN_VAL)
 			method = OSM_SA_METHOD_STR_UNKNOWN_VAL;
 		return (__ib_sa_method_str[method]);
 	}
@@ -475,7 +475,7 @@ const char *ib_get_sm_method_str(IN uint8_t method)
 {
 	if (method & 0x80)
 		method = (method & 0x0F) | 0x10;
-	if (method >= OSM_SM_METHOD_STR_UNKNOWN_VAL)
+	if (method > OSM_SM_METHOD_STR_UNKNOWN_VAL)
 		method = OSM_SM_METHOD_STR_UNKNOWN_VAL;
 	return (__ib_sm_method_str[method]);
 }
@@ -484,10 +484,9 @@ const char *ib_get_sm_method_str(IN uint8_t method)
  **********************************************************************/
 const char *ib_get_sm_attr_str(IN ib_net16_t attr)
 {
-	uint16_t host_attr;
-	host_attr = cl_ntoh16(attr);
+	uint16_t host_attr = cl_ntoh16(attr);
 
-	if (host_attr >= OSM_SM_ATTR_STR_UNKNOWN_VAL)
+	if (host_attr > OSM_SM_ATTR_STR_UNKNOWN_VAL)
 		host_attr = OSM_SM_ATTR_STR_UNKNOWN_VAL;
 
 	return (__ib_sm_attr_str[host_attr]);
@@ -497,10 +496,9 @@ const char *ib_get_sm_attr_str(IN ib_net16_t attr)
  **********************************************************************/
 const char *ib_get_sa_attr_str(IN ib_net16_t attr)
 {
-	uint16_t host_attr;
-	host_attr = cl_ntoh16(attr);
+	uint16_t host_attr = cl_ntoh16(attr);
 
-	if (host_attr >= OSM_SA_ATTR_STR_UNKNOWN_VAL)
+	if (host_attr > OSM_SA_ATTR_STR_UNKNOWN_VAL)
 		host_attr = OSM_SA_ATTR_STR_UNKNOWN_VAL;
 
 	return (__ib_sa_attr_str[host_attr]);
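
For readers following the patch, here is a minimal standalone sketch of the clamp-then-index pattern it touches. The table, its contents, and the names demo_method_str, DEMO_UNKNOWN_VAL, demo_get_method_str and demo_variants_agree are hypothetical stand-ins, not the real OpenSM tables. The sketch assumes that the UNKNOWN value is itself a valid index whose entry is the catch-all string; under that assumption the `>` and `>=` comparisons return the same string for every input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: the last entry is the catch-all string. */
static const char *const demo_method_str[] = {
	"Invalid",	/* 0 */
	"Get",		/* 1 */
	"Set",		/* 2 */
	"UNKNOWN"	/* 3 == DEMO_UNKNOWN_VAL */
};

#define DEMO_UNKNOWN_VAL 3

/* Clamp with '>' as in the patch: only values past the last entry change. */
const char *demo_get_method_str(uint8_t method)
{
	if (method > DEMO_UNKNOWN_VAL)
		method = DEMO_UNKNOWN_VAL;
	return demo_method_str[method];
}

/* Clamp with '>=' as before the patch: the boundary value is reassigned
 * to itself, so the returned string is identical. */
const char *demo_get_method_str_old(uint8_t method)
{
	if (method >= DEMO_UNKNOWN_VAL)
		method = DEMO_UNKNOWN_VAL;
	return demo_method_str[method];
}

/* Returns 1 if both variants agree for every possible uint8_t input. */
int demo_variants_agree(void)
{
	unsigned int m;

	for (m = 0; m <= UINT8_MAX; m++)
		if (strcmp(demo_get_method_str((uint8_t)m),
			   demo_get_method_str_old((uint8_t)m)) != 0)
			return 0;
	return 1;
}
```

Calling demo_variants_agree() walks all 256 input values and compares the two variants.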


From kalaiya.1 at buckeyemail.osu.edu  Tue Aug 25 15:55:54 2009
From: kalaiya.1 at buckeyemail.osu.edu (MANIKANTAN KALAIYA)
Date: Tue, 25 Aug 2009 22:55:54 +0000
Subject: [ofa-general] Number of devices returned by ibv_get_device_list() 
Message-ID: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>


We have OFED 1.3.1 installed; one of its subpackages is libibverbs version 1.1.1. We have a small program that lists the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.

The system has two IB cards, the value returned by ibv_get_device_list() in 'num_devices' is two, as expected.

However, when we disable one of the cards using the modprobe command, the program continues to report two cards present (monitoring is continuous in a while loop).
Killing and restarting the sample test process results in the correct number of IB cards being reported (it returns one after the restart). One of the prior versions was known to report the correct number of IB cards without requiring a restart of the program.

We would like to determine the number of cards present without having to go through a restart. Any input on this behavior is appreciated.

modprobe command - "sudo modprobe -r ib_mthca"

Test program:
=================================================
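#include <stdio.h>
#include <unistd.h>             /* sleep() */
#include <infiniband/verbs.h>

int main(int argc, char **argv)
{
    int num_devices;
    struct ibv_device      **dev_list;

    while(1) {

        /* ibv_get_device_list() returns NULL on failure, so do not
         * trust num_devices unless the list pointer is valid. */
        num_devices = 0;
        dev_list = ibv_get_device_list(&num_devices);

        if (dev_list && num_devices != 0) {
            printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
        }
        else {
            printf("IB ADAPTER UNAVAILABLE\n");
        }
        sleep(2);
        if (dev_list)
            ibv_free_device_list(dev_list);
    }

    return(0);
}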
=================================================

Thanks,
Mani.
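
One possible cross-check that does not depend on the library's device list is to count the entries under sysfs directly. The sketch below is a workaround under stated assumptions, not a description of libibverbs behavior: it assumes the kernel exposes one entry per IB device under /sys/class/infiniband, and the helper name count_dir_entries is made up for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Assumed sysfs location with one entry per IB device. */
#define IB_SYSFS_CLASS_DIR "/sys/class/infiniband"

/*
 * Count entries in a directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (for example when no
 * IB driver is loaded and the class directory does not exist).
 */
int count_dir_entries(const char *path)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dir);
    return count;
}
```

Calling count_dir_entries(IB_SYSFS_CLASS_DIR) inside the polling loop re-reads sysfs on every iteration, so the count does not depend on any state cached in the process.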

From jenos at ncsa.uiuc.edu  Tue Aug 25 17:31:30 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Tue, 25 Aug 2009 19:31:30 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A92A0C6.9030501@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il>
	<4A92A0C6.9030501@ncsa.uiuc.edu>
Message-ID: <4A948262.7030508@ncsa.uiuc.edu>

The latest available 1.5 tarball fails to build with the error below.  This is an fc10 x64
machine, up to date as of last week.
thx-

    Jeremy

Failed to build ofa_kernel RPM
See /tmp/OFED.26978.logs/ofa_kernel.rpmbuild.log

[root at ac27 OFED-1.5-20090825-0729]# tail -50
/tmp/OFED.26978.logs/ofa_kernel.rpmbuild.log
mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions ; rm -f
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/*
make -f scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5
make -f scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband
make -f scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
  gcc
-Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d 
-nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/4.3.2/include
-D__KERNEL__ \
-D__OFED_BUILD__ \
-include include/linux/autoconf.h \
-include
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/
\
 \
 \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \
-I/usr/local/include/scst \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \
-Iinclude \
 \
-I/usr/src/kernels/2.6.27.29-170.2.79.fc10.x86_64/arch/x86/include \
 -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing
-fno-common -Werror-implicit-function-declaration
-fno-delete-null-pointer-checks -Os -m64 -mtune=generic -mno-red-zone
-mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args
-DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare
-fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
-Iinclude/asm-x86/mach-default -fno-stack-protector
-fno-omit-frame-pointer -fno-optimize-sibling-calls -g
-Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow
-DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" 
-D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c
In file included from
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h:7,
                 from include/asm/paravirt.h:33,
                 from include/asm/page.h:159,
                 from include/asm/pda.h:9,
                 from include/asm/current.h:20,
                 from include/asm/processor.h:16,
                 from include/linux/prefetch.h:15,
                 from include/linux/list.h:7,
                 from include/linux/mutex.h:14,
                 from
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:37:
include/asm/topology.h: In function 'cpu_to_node':
include/asm/topology.h:93: error: implicit declaration of function 'cpu_pda'
include/asm/topology.h:93: error: invalid type argument of '->' (have 'int')
include/asm/topology.h: In function 'early_cpu_to_node':
include/asm/topology.h:102: error: invalid type argument of '->' (have
'int')
make[4]: ***
[/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o]
Error 1
make[3]: ***
[/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core] Error 2
make[2]: ***
[/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband] Error 2
make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.27.29-170.2.79.fc10.x86_64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.ynduey (%build)


RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.ynduey (%build)
[root at ac27 OFED-1.5-20090825-0729]# uname -r
2.6.27.29-170.2.79.fc10.x86_64
[root at ac27 OFED-1.5-20090825-0729]#


Jeremy Enos wrote:
> 2.6.27.29-170.2.79 is the current fc10 x64 kernel.  I had tried the
> latest tarball for 1.5- perhaps that's too late?  I can try something
> older but would be great to have a starting point.  Thx-
>
>    Jeremy
>
> Tziporet Koren wrote:
>> Jeremy Enos wrote:
>>> Coming up on a year of Fedora 10 GA...  Fedora 9 no longer
>>> maintained. No OFED support for FC10 yet creates a tough spot if
>>> trying to stay
>>> secure.  Is there *any* version (1.5, etc) that will even build on
>>> FC10? thx-
>>>
>>>     Jeremy
>>>
>>>
>>>   
>>
>> I think OFED 1.5 might work on it, but I am not sure. Which kernel version
>> does FC10 use?
>> In general OFED 1.5 supports FC11
>>
>> Tziporet
>>
>>
>


From weiny2 at llnl.gov  Tue Aug 25 17:55:43 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 25 Aug 2009 17:55:43 -0700
Subject: [ofa-general] Combined DR path with empty DR path, what is the
	expected behavior?
In-Reply-To: <f0e08f230908251615x79f2f87cwcba95c0f7e743bfe@mail.gmail.com>
References: <20090824185206.39e5e377.weiny2@llnl.gov>
	<f0e08f230908251615x79f2f87cwcba95c0f7e743bfe@mail.gmail.com>
Message-ID: <20090825175543.4f929646.weiny2@llnl.gov>

On Tue, 25 Aug 2009 19:15:19 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On 8/24/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> 
> > If I send a combined DR path with a start lid but an empty (0 length) DR
> > path.
> 
> 
> Hop Count 0 ?

Yes

> 
> 
> > What is the expected behavior?
> 
> 
> Not sure what you mean by expected here. Are you referring to expectation
> based on the spec ?
> 

yes

> 
> > I know this could be specified with LID routing, but I don't see anywhere
> > in
> > the specification which says this is an error.
> 
> 
> I don't think it should be an error (certainly not for the form you are
> using LID routed part followed by a DR part) but a null DR part is a little
> funny/odd.

Yea I know.  It turns out that the new iblinkinfo issues queries like this
when it recurses back from the last DR portion of the combined
route path.  It only showed up as an error when using the -S <guid> option of
iblinkinfo with this new switch I have.  Works fine with the old switches.

> 
> > I do however seem to have 2
> > different implementations on 2 different switches.  For example:
> >
> > I have Switch A (Lid 1) and Switch B (Lid 7).  I attempt to query PortInfo
> > of
> > Port 1 of each switch using the LID followed by an empty DR path.
> >
> > 17:55:22 > ./smpquery -c portinfo 1 0 1
> > ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1)
> > ./smpquery: iberror: failed: operation portinfo: port info query failed
> 
> 
> Is this a timeout ?

yes

16:26:25 > ./smpquery -e -c portinfo 1 0 1
ibwarn: [27150] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [27150] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [27150] _do_madrpc: timeout after 3 retries, 3000 ms
ibwarn: [27150] mad_rpc: _do_madrpc failed; dport (Lid 1)
./smpquery: iberror: failed: operation portinfo: port info query failed


> 
> 
> > 17:55:31 > ./smpquery -c portinfo 7 0 1
> > # Port info: Lid 7 port 1
> > Mkey:............................0x0000000000000000
> > GidPrefix:.......................0x0000000000000000
> > ...
> > <normal output snipped>
> >
> > Detecting this special case in libibmad and turning the packet into a LID
> > routed one
> 
> 
> Ugh... Is this special case really needed ? I don't think the underlying
> issue is understood sufficiently yet.

Well I just did it to prove that what I was doing would work with a "simple"
lid routed packet.  Like I said it might be that this portid which is being
specified to libibmad by libibnetdisc is not valid.  If that is true then
libibnetdisc should detect when the DR path is empty and go back to LID routed
requests.  That is a valid fix in my mind.

> 
> > succeeds but I wonder if this is an error in the SMI?
> 
> 
> Switch SMI ? Is this a proprietary implementation ?
> 

Yes I see the bug with 2 different vendors' switches.  One is managed and the
other is not.  My "old" switches (3 different vendors) do not show this
behavior.  (Just to be clear, I now have 5 switches in my 5 node cluster!
;-)

> 
> 
> >   I also notice this is an error on the HCA I am running from (lid 2).
> 
> 
> Is this HCA node OpenIB based ?

yes

> 
> > 17:57:42 > ./smpquery -c portinfo 2 0 1
> > ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2)
> > ./smpquery: iberror: failed: operation portinfo: port info query failed
> 
> 
> Is this also a timeout ?

yes

> 
> Also, does the result differ based on where you source these from
> (locally v. remotely)?

Same result local and remote.

> 
> 
> 
> > Running with a simple DR path works,
> 
> 
> You're referring to the same DR path here that fails in the combined route
> examples above, right ?
> 

No. the example below is a DR path with Hop Count == 0 but without the initial
LID routing.

> 
> > I guess because this is the loopback case mentioned on page 805.
> 
> 
> Yes but that's the high level requirement rather than the SMI rules which
> make that work.
> 
> 
> 
> > 17:58:16 > ./smpquery -D portinfo 0 1
> > # Port info: DR path slid 65535; dlid 65535; 0 port 1
> > Mkey:............................0x0000000000000000
> > GidPrefix:.......................0x2007000000000000
> > ...
> > <snip>
> >
> > I guess that the comment "Since each part may be empty, there are eight
> > combinations, although only four are really useful:" on line 36 Page 805
> > can
> > be interpreted to mean that only those 4 combinations need to be supported.
> > Is this true?
> 
> 
> Not all 4 combinations are supported/known to work. When this was added for
> ibportstate, the only combined routing form that was important was LID
> routed part followed by a DR part.
> 

When you say "known to work", do you mean implemented in the diags?  Or known to
work on all hardware?

> 
> > On the other hand I think strictly this should be supported.
> 
> 
> In an ideal world yes but are they all required or is it just the one form
> most heavily used ?

That is what I am unclear on.  Does the spec require all 8 combinations
to work?  I don't see a specific compliance statement which says that and I
am not sure if C14-9 and C14-13 cover all 8 combinations.
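
To make the count concrete: a combined route has three optional parts (a first LID-routed part, a DR part, a second LID-routed part), which gives 2^3 = 8 combinations. The sketch below only enumerates them; it makes no claim about which ones the spec requires, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One bit per optional part of a combined route (names are illustrative). */
enum {
	PART_LID1 = 1 << 0,	/* first LID-routed part */
	PART_DR   = 1 << 1,	/* directed-route part */
	PART_LID2 = 1 << 2	/* second LID-routed part */
};

#define NUM_COMBINATIONS 8	/* 2^3: each of the three parts present or empty */

/* Write a label such as "LID1+DR" for a mask into out; "(empty)" if no part is set. */
void describe_route(unsigned int mask, char *out, size_t len)
{
	assert(out != NULL && len > 0);
	assert(mask < NUM_COMBINATIONS);

	snprintf(out, len, "%s%s%s%s%s",
		 (mask & PART_LID1) ? "LID1" : "",
		 ((mask & PART_LID1) && (mask & (PART_DR | PART_LID2))) ? "+" : "",
		 (mask & PART_DR) ? "DR" : "",
		 ((mask & PART_DR) && (mask & PART_LID2)) ? "+" : "",
		 (mask & PART_LID2) ? "LID2" : "");
	if (out[0] == '\0')
		snprintf(out, len, "(empty)");
}
```

Looping mask from 0 to NUM_COMBINATIONS - 1 and printing each label lists all eight forms. The query failing in this thread corresponds to PART_LID1 alone, sent with the directed route SMP class.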

> 
> >   Item 4 of C14-9
> > (line 24 page 810) requires the SMI to handle the packet if the HopPointer
> > equals HopCount +1, which it is in my case (HopCount == 0, HopPointer == 1)
> 
> 
> By handle, this means "The SMI *shall *output the packet on the port whose
> number is in the entry indexed by Hop Pointer in the Initial Path. If that
> port number is invalid, the SMI *shall *discard the SMP."
> 
> Are you sure the Hop Pointer is 1 ? Where do you see this ?

No, I was wrong.  I think I read the wrong madeye packet, as I see the packet
right before this one did have a hop pointer of 1.  I added some debug prints
to mad_encode to get the following output:

17:26:10 > ./smpquery -e -c portinfo 1 0 1
trid 2a0f0cb5; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0
trid 2a0f0cb6; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0
trid 2a0f0cb7; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0
ibwarn: [27322] _do_madrpc: recv failed: Connection timed out
ibwarn: [27322] mad_rpc: _do_madrpc failed; dport (Lid 1)
./smpquery: iberror: failed: operation portinfo: port info query failed

madeye for these packets:

Aug 25 17:28:03 woprjr0 Madeye:recv SMP
Aug 25 17:28:03 woprjr0 MAD version....0x1
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:28:03 woprjr0 Class version..0x1
Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response)
Aug 25 17:28:03 woprjr0 Status.........0x8000
Aug 25 17:28:03 woprjr0 Hop pointer....0x1
Aug 25 17:28:03 woprjr0 Hop counter....0x0
Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5
Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info)
Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
Aug 25 17:28:03 woprjr0 Mkey...........0x0
Aug 25 17:28:03 woprjr0 DR SLID........0xffff
Aug 25 17:28:03 woprjr0 DR DLID........0xffff
Aug 25 17:28:03 woprjr0 Madeye:sent SMP
Aug 25 17:28:03 woprjr0 MAD version....0x1
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:28:03 woprjr0 Class version..0x1
Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
Aug 25 17:28:03 woprjr0 Status.........0x00
Aug 25 17:28:03 woprjr0 Hop pointer....0x1
Aug 25 17:28:03 woprjr0 Hop counter....0x0
Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5
Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info)
Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
Aug 25 17:28:03 woprjr0 Mkey...........0x0
Aug 25 17:28:03 woprjr0 DR SLID........0xffff
Aug 25 17:28:03 woprjr0 DR DLID........0xffff
Aug 25 17:28:03 woprjr0 Madeye:recv SMP
Aug 25 17:28:03 woprjr0 MAD version....0x1
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:28:03 woprjr0 Class version..0x1
Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response)
Aug 25 17:28:03 woprjr0 Status.........0x8000
Aug 25 17:28:03 woprjr0 Hop pointer....0x1
Aug 25 17:28:03 woprjr0 Hop counter....0x0
Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6
Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
Aug 25 17:28:03 woprjr0 Mkey...........0x0
Aug 25 17:28:03 woprjr0 DR SLID........0xffff
Aug 25 17:28:03 woprjr0 DR DLID........0xffff
Aug 25 17:28:03 woprjr0 Madeye:sent SMP
Aug 25 17:28:03 woprjr0 MAD version....0x1
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:28:03 woprjr0 Class version..0x1
Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
Aug 25 17:28:03 woprjr0 Status.........0x00
Aug 25 17:28:03 woprjr0 Hop pointer....0x1
Aug 25 17:28:03 woprjr0 Hop counter....0x0
Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6
Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
Aug 25 17:28:03 woprjr0 Mkey...........0x0
Aug 25 17:28:03 woprjr0 DR SLID........0xffff
Aug 25 17:28:03 woprjr0 DR DLID........0xffff
Aug 25 17:28:03 woprjr0 Madeye:sent SMP
Aug 25 17:28:03 woprjr0 MAD version....0x1
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:28:03 woprjr0 Class version..0x1
Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
Aug 25 17:28:03 woprjr0 Status.........0x00
Aug 25 17:28:03 woprjr0 Hop pointer....0x0
Aug 25 17:28:03 woprjr0 Hop counter....0x0
Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb7
Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
Aug 25 17:28:03 woprjr0 Attr modifier..0x0001
Aug 25 17:28:03 woprjr0 Mkey...........0x0
Aug 25 17:28:03 woprjr0 DR SLID........0x02
Aug 25 17:28:03 woprjr0 DR DLID........0xffff

No response is shown for trid 0x1b9d2a0f0cb7...

As an aside I see the hop pointer is set to 1 at a lower level since
mad_encode does not do it.

So I guess the proper case for C14-9 would be "3) If Hop Pointer is equal to
Hop Count".  (They are both 0.)
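
The case selection can be sketched as a small classifier. This is a reading aid under stated assumptions, not SMI code: the function name c14_9_item is hypothetical, items 3 and 4 follow the wording quoted in this thread, and items 1, 2 and 5 are a paraphrase that should be checked against the specification text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return which numbered item of C14-9 matches a request's Hop Pointer
 * and Hop Count. Items 3 and 4 follow the wording quoted in this
 * thread; items 1, 2 and 5 are a paraphrase and should be checked
 * against the specification text.
 */
int c14_9_item(uint8_t hop_ptr, uint8_t hop_cnt)
{
	if (hop_cnt != 0 && hop_ptr == 0)
		return 1;	/* start of the directed-route part */
	if (hop_cnt != 0 && hop_ptr < hop_cnt)
		return 2;	/* intermediate hop */
	if (hop_ptr == hop_cnt)
		return 3;	/* end of the DR part; Hop Pointer is then incremented */
	if (hop_ptr == hop_cnt + 1)
		return 4;	/* Hop Pointer equals Hop Count + 1 */
	return 5;		/* Hop Pointer beyond Hop Count + 1 */
}
```

For the failing request (Hop Pointer 0, Hop Count 0) this returns 3; for Hop Pointer 1 with Hop Count 0 it returns 4.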

> 
> If so, what's the initial path at this point (or more specifically index 1
> of the initial path) ? I think that needs to be port 0 (if a switch) but
> this is a little weird as I would think it should be handed to the SMA which
> is different cases in the spec.

Yes I think I was wrong on the case.  But still, wouldn't the SMI detect that
this is the end of the DR path and simply hand it to the SMA?

> 
> 
> > Then after processing
> 
> 
> by the SMA and doing the required returning initialization
> 
> > the SMI should return the packet as specified in C14-13
> > item 3 on line 9 page 812.
> 
> 
> I'm not sure it would use this case in the case of an empty DR path on
> return.

Actually I think it will use this.  C14-9 item 3) states "the Hop Pointer
shall be incremented by 1"  Therefore when the response is handed back to the
SMI the Hop pointer will be 1 and the hop count 0.  And the SMI uses the
DRSLID to send the packet back to the requester.

> 
> > Am I wrong?  In the end it does not matter as I have to make the software
> > work
> > for all the hardware I have; so I will change the software.
> 
> 
> IMO it does matter as to where the problem lies (SMI or otherwise) and how
> the layers are comprised in the implementation.

Agreed.  I am mainly confused because I have 2 different implementations of
this.  My "old" switches seem to handle this case just fine.  My "new"
switches do not.  So I am really wondering what is going on.

Here is the above output for the same query which works with an "old" switch.

17:28:04 > ./smpquery -e -c portinfo 7 0 1
...
trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0
...

Aug 25 17:46:40 woprjr0 Madeye:sent SMP
Aug 25 17:46:40 woprjr0 MAD version....0x1
Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:46:40 woprjr0 Class version..0x1
Aug 25 17:46:40 woprjr0 Method.........0x1 (Get)
Aug 25 17:46:40 woprjr0 Status.........0x00
Aug 25 17:46:40 woprjr0 Hop pointer....0x0
Aug 25 17:46:40 woprjr0 Hop counter....0x0
Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
Aug 25 17:46:40 woprjr0 Mkey...........0x0
Aug 25 17:46:40 woprjr0 DR SLID........0x02
Aug 25 17:46:40 woprjr0 DR DLID........0xffff
Aug 25 17:46:40 woprjr0 Madeye:recv SMP
Aug 25 17:46:40 woprjr0 MAD version....0x1
Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
Aug 25 17:46:40 woprjr0 Class version..0x1
Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response)
Aug 25 17:46:40 woprjr0 Status.........0x8000
Aug 25 17:46:40 woprjr0 Hop pointer....0x0
Aug 25 17:46:40 woprjr0 Hop counter....0x0
Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
Aug 25 17:46:40 woprjr0 Mkey...........0x0
Aug 25 17:46:40 woprjr0 DR SLID........0x02
Aug 25 17:46:40 woprjr0 DR DLID........0xffff

Hop Pointer and Count are both 0 and things work just fine...

> 
> > However, I wonder
> > where exactly the spec falls on this, because I think it will influence
> > where
> > the fix resides.  If the spec does not allow this then I think it is fine
> > to
> > have libibmad return an error since the user specified an invalid combined
> > DR
> > path.  However, if this should be legal I think libibmad should work around
> > the bad hardware out there.
> 
> 
> Is it hardware or firmware that needs fixing ? I think it may depend on the
> specific workaround for this as to whether it is acceptable as it might harm
> something else or might violate the spec.

I agree, however, if the switch hardware needs fixing I fear it is too late
for the ones I have.  Firmware might be upgradable although I have had issues
with un-managed switches in the past.

So where do we put the fix in software?

Ira

> -- Hal
> 
> 
> > Thoughts?
> > Ira
> >
> > --
> > Ira Weiny
> > Math Programmer/Computer Scientist
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From poknam at gmail.com  Tue Aug 25 19:15:07 2009
From: poknam at gmail.com (PN)
Date: Wed, 26 Aug 2009 10:15:07 +0800
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To: <f0e08f230908251459j2657ef17o1a0b7c5abc836267@mail.gmail.com>
References: <20090825162517.7955C21C827@f28.poczta.interia.pl> 
	<f0e08f230908251459j2657ef17o1a0b7c5abc836267@mail.gmail.com>
Message-ID: <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com>

Hi,

I can think of a situation in which all servers have dual-port IB cards and
need OpenSM failover to achieve HA.
As far as I know, OpenSM can only bind to 1 port at a time, so do I need to start 2
OpenSM instances on server A and 2 on server B? Will they use the same guid2lid
file? Do I need to set something in the config file, or will they
automatically communicate with each other?
Do I need to run sldd.sh manually, or will it automatically sync with the other
OpenSM?

Thanks a lot.

Regards,
PN


2009/8/26 Hal Rosenstock <hal.rosenstock at gmail.com>

>
>
> On 8/25/09, kovlensky at interia.pl <kovlensky at interia.pl> wrote:
>>
>> Hi all,
>>
>> Quick question - is there a need to run anything except opensmd daemons to
>> provide failover capability on ib network in ofed 1.3?
>
>
> In terms of SM failover, modulo bugs fixed relative to this feature since
> OFED 1.3 (there are a couple of things here which may affect your
> environment if I recall correctly), you only need to run more than 1 SM for
> this (one will become master, the other standby).
>
>> I'm aware that when the master manager dies the standby one comes in and manages
>> the network, but that does not necessarily mean that lids are preserved,
>> especially for nodes joining in. I used to run sldd.sh for distributing the lids
>> list on ofed 1.2.5, but while this script seems to be in place no one
>> mentions the necessity for it.
>
>
> So subnet manager failover is provided by running standby opensm.
>
>
> And how is LID preservation provided?
>
>
> If you want LIDs to be preserved, the guid2lid file needs to be sync'd
> (copied from the master SM once it's fully assembled to the node which is
> running the standby SM). That's what the sldd.sh script does.
>
> -- Hal
>
> Regards,
>>
>> Zdenek Kovlensky
>>


-- 
Best Regards,
PN Lai
HPC Specialist
Galactic Computng Corp.
Tel: 86-755-26733939 ext 826
Mobile: 86-13823161729
Fax: 86-755-26733780
URL: http://www.galactic.com.hk

From Rafael.Tinoco at Sun.COM  Tue Aug 25 19:33:34 2009
From: Rafael.Tinoco at Sun.COM (Rafael David Tinoco)
Date: Tue, 25 Aug 2009 23:33:34 -0300
Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH
	topology.
In-Reply-To: <f0e08f230908251504m4aec4233ke6aa5b009ce1232c@mail.gmail.com>
References: <4A92DFEC.3010300@Sun.COM>
	<f0e08f230908251504m4aec4233ke6aa5b009ce1232c@mail.gmail.com>
Message-ID: <4A949EFE.9000302@Sun.COM>

Hello Hal,

Below...

Hal Rosenstock wrote:
>
>
> On 8/24/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>
>     Hello,
>
>     I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs
>     (2 asics each, 8 qnems).
>     They are configured in a MESH topology.
>     I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>
>     I'm booting PXE from IB, my initrd image is bringing up the ib0
>     interface, getting the squashfs image and mounting with aufs.
>
>     The problem is.. When booting more than 60 nodes, I start to get
>     the errors below on the subnet manager.
>     And the problem seems to be intermittent, because each time it
>     gives errors on a different path.
>
>     Any ideas ?
>
>     Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice:
>     Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
>     GID:fe80::5080:200:8d:9931
>     Aug 24 15:36:19 713838 [48D7D940] 0x02 ->
>     __osm_state_mgr_report_new_ports: Discovered new port with
>     GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
>     Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice:
>     Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
>     GID:fe80::5080:200:8d:9931
>     Aug 24 15:36:19 713842 [48D7D940] 0x02 ->
>     __osm_state_mgr_report_new_ports: Discovered new port with
>     GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
>     Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice:
>     Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
>     GID:fe80::5080:200:8d:9931
>     Aug 24 15:36:19 713847 [48D7D940] 0x02 ->
>     __osm_state_mgr_report_new_ports: Discovered new port with
>     GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
>     Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice:
>     Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
>     GID:fe80::5080:200:8d:9931
>     Aug 24 15:36:19 713866 [48D7D940] 0x02 ->
>     __osm_state_mgr_report_new_ports: Discovered new port with
>     GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
>     Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice:
>     Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
>     GID:fe80::5080:200:8d:9931
>     Aug 24 15:36:19 713871 [48D7D940] 0x02 ->
>     __osm_state_mgr_report_new_ports: Discovered new port with
>     GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
>     Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
>     Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
>     __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
>     for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
>     NEM I4A) port 19. Adding to light sweep sampling list
>     Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4
>     hop path:
>                     Path = 0,1,15,15,15
>     Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
>     __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
>     for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
>     NEM I4A) port 21. Adding to light sweep sampling list
>     Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4
>     hop path:
>                     Path = 0,1,15,15,15
>     Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
>     __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
>     for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
>     NEM I4A) port 25. Adding to light sweep sampling list
>     Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4
>     hop path:
>                     Path = 0,1,15,15,15
>     Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409:
>     send completed with error (method=0x1 attr=0x15
>     trans_id=0x4700036595) -- dropping
>     Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411:
>     DR SMP Hop Ptr: 0x0
>     Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop
>     path:
>                     Initial path = 0,0,0,0,0,0
>                     Return path  = 0,0,0,0,0,0
>     Aug 24 15:36:20 514333 [4977E940] 0x01 ->
>     __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
>     (IB_TIMEOUT)
>     Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
>                     base_ver................0x1
>                     mgmt_class..............0x81
>                     class_ver...............0x1
>                     method..................0x1 (SubnGet)
>                     D bit...................0x0
>                     status..................0x0
>                     hop_ptr.................0x0
>                     hop_count...............0x5
>                     trans_id................0x36595
>                     attr_id.................0x15 (PortInfo)
>                     resv....................0x0
>                     attr_mod................0x0
>                     m_key...................0x0000000000000000
>                     dr_slid.................65535
>                     dr_dlid.................65535
>
>                     Initial path: 0,1,15,15,15,19
>                     Return path:  0,0,0,0,0,0
>                     Reserved:     [0][0][0][0][0][0][0]
>
>                     00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                     00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                     00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                     00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>     Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409:
>     send completed with error (method=0x1 attr=0x15
>     trans_id=0x4700036596) -- dropping
>     Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411:
>     DR SMP Hop Ptr: 0x0
>     Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop
>     path:
>                     Initial path = 0,0,0,0,0,0
>                     Return path  = 0,0,0,0,0,0
>     Aug 24 15:36:20 514375 [4977E940] 0x01 ->
>     __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
>     (IB_TIMEOUT)
>     Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
>                     base_ver................0x1
>                     mgmt_class..............0x81
>                     class_ver...............0x1
>                     method..................0x1 (SubnGet)
>                     D bit...................0x0
>                     status..................0x0
>                     hop_ptr.................0x0
>                     hop_count...............0x5
>                     trans_id................0x36596
>                     attr_id.................0x15 (PortInfo)
>                     resv....................0x0
>     ....
>
>  
> These errors are transient as you indicate. They mean that some node 
> has brought the link physically up but there is no SMA at the remote 
> side of the link. The different paths are paths to the HCAs. This 
> occurs during PXE boot as the node transitions from the boot ROM to 
> the Linux environment.
>  
They are transient, but sometimes opensm hangs on the same message 
and loops through these error messages.
At first I was using the CentOS 5.3 kernel with updates, and IPoIB 
stopped working after these messages.
Using the "vanilla" CentOS 5.3 kernel solved this issue.
But SOMETIMES, when booting the nodes, these messages appear and don't go away.
> Other than these messages, do things seem to work in terms of the end 
> nodes ?
They seem to work with the vanilla kernel. Even with the messages, there have 
been no problems reaching the nodes so far.

Tks

Rafael Tinoco
>  
> -- Hal
>

From ogerlitz at voltaire.com  Wed Aug 26 00:04:57 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 26 Aug 2009 10:04:57 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR	query
	to	QoS level with pkey
In-Reply-To: <4A910609.3040305@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>	<4A90DC04.3020906@voltaire.com>
	<4A910609.3040305@dev.mellanox.co.il>
Message-ID: <4A94DE99.5050308@voltaire.com>

Yevgeny Kliteynik wrote:
> False negatives. PR queries with PKeys (e.g. IPoIB interfaces) weren't 
> matched to their rules.
Yevgeny,

Our understanding is that the bug comes into play only for queries done 
on a partial membership pkey, do you agree?

Or.


From sashak at voltaire.com  Tue Aug 25 12:01:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 25 Aug 2009 22:01:41 +0300
Subject: [ofa-general] Re: [PATCHv3] opensm: Parallelize (Stripe) LFT sets
	across switches
In-Reply-To: <20090807110811.GA23431@comcast.net>
References: <20090807110811.GA23431@comcast.net>
Message-ID: <20090825190141.GG28379@me>

Hi Hal,

On 07:08 Fri 07 Aug     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

I'm applying this patch as it is, but have a couple of comments below.
Actually, I have even prepared patches addressing those comments and will
post them to the list soon; please review.

> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 78a7031..e28752a 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c

[snip...]

> @@ -516,6 +471,101 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
>  	OSM_LOG_EXIT(p_mgr->p_log);
>  }
>  
> +static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item,
> +				  IN void *context)
> +{
> +	osm_ucast_mgr_t *p_mgr = context;
> +	osm_switch_t *const p_sw = (osm_switch_t *) p_map_item;
> +
> +	set_fwd_tbl_top(p_mgr, p_sw);
> +}
> +
> +static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm,
> +				    IN uint8_t * p_block,
> +				    IN osm_dr_path_t * p_path,
> +				    IN uint16_t block_id_ho,
> +				    IN osm_madw_context_t * p_context)
> +{
> +	ib_api_status_t status;
> +	boolean_t sts;
> +
> +	OSM_LOG_ENTER(p_sm->p_log);
> +
> +	for (;
> +	     (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block));
> +	     block_id_ho++) {
> +		if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> +		    !memcmp(p_block,
> +			    p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> +			    IB_SMP_DATA_SIZE))
> +			continue;

This function is called in a loop with the block number incremented. Inside,
it loops by itself looking for a changed block, so the caller will repeat
this scan again and again. It would be really nice to avoid such useless
work. I prepared a patch, please review.
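
One way to avoid the repeated scan is to let the helper advance the caller's block index, so each block is compared only once. The following is a minimal stand-alone sketch of that idea, not the patch referred to above (which is not shown in this message). The names `next_changed_block` and `count_changed_blocks` are hypothetical, and `BLOCK_SIZE` is an assumed stand-in for the per-block data size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 64	/* assumed bytes per LFT block in this sketch */

/*
 * Hypothetical helper: starting at *p_block_id, find the next block whose
 * new contents differ from the current ones. On success *p_block_id is
 * left at the changed block, so the caller can resume from the following
 * block instead of rescanning unchanged ones.
 */
bool next_changed_block(const unsigned char *cur, const unsigned char *new_tbl,
			size_t num_blocks, size_t *p_block_id)
{
	for (; *p_block_id < num_blocks; (*p_block_id)++) {
		size_t off = *p_block_id * BLOCK_SIZE;

		if (memcmp(cur + off, new_tbl + off, BLOCK_SIZE) != 0)
			return true;
	}
	return false;
}

/* Count the changed blocks while visiting each block exactly once. */
size_t count_changed_blocks(const unsigned char *cur,
			    const unsigned char *new_tbl, size_t num_blocks)
{
	size_t block_id = 0, changed = 0;

	while (next_changed_block(cur, new_tbl, num_blocks, &block_id)) {
		changed++;	/* a real caller would send the block here */
		block_id++;	/* resume after the block just handled */
	}
	return changed;
}
```

Passing the index by pointer is the design choice that matters here: the scan position survives between calls, so the total work stays linear in the number of blocks.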

> @@ -940,6 +1025,9 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
>  
>  	osm->routing_engine_used = osm_routing_engine_type(r->name);
>  
> +	if (r->ucast_build_fwd_tables)
> +		osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
> +

Any reason not to simplify (and unify) the fwd table decision flow across
routing engines with and without the ucast_build_fwd_tables method?

Patch to follow.

Sasha


From kliteyn at dev.mellanox.co.il  Wed Aug 26 01:00:53 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 26 Aug 2009 11:00:53 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR	query
	to	QoS level with pkey
In-Reply-To: <4A94DE99.5050308@voltaire.com>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>	<4A90DC04.3020906@voltaire.com>
	<4A910609.3040305@dev.mellanox.co.il>
	<4A94DE99.5050308@voltaire.com>
Message-ID: <4A94EBB5.7050107@dev.mellanox.co.il>

Or Gerlitz wrote:
> Yevgeny Kliteynik wrote:
>> False negatives. PR queries with PKeys (e.g. IPoIB interfaces) weren't 
>> matched to their rules.
> Yevgeny,
> 
> Our understanding is that the bug comes into play only for queries done 
> on a partial membership pkey, do you agree?

Nope, just the other way around.
When a pkey is defined in the QoS policy, it is stored
internally w/o the MSB (the membership bit).
When a query comes with a full-member pkey (such as 0xFFFF
for IPoIB), this pkey is not matched to the stored QoS
policy rule.
The fix was to treat any pkey that comes from a request as
a partial-membership pkey. Note that this is done for
QoS policy rule matching only. The two sides of this PR
query still have to comply with the usual IB spec pkey rules.
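
The matching rule described here can be sketched as follows. This is a minimal illustration, not the opensm code: the helper names `pkey_base`, `qos_rule_matches_pkey` and `naive_match` are hypothetical, and pkeys are assumed to be in host byte order. The only fact relied on is that the most significant bit of the 16-bit pkey is the membership bit (1 = full, 0 = limited).

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_BASE_MASK 0x7FFFu	/* low 15 bits; bit 15 is the membership bit */

/* Hypothetical helper: strip the membership bit, keeping the 15-bit base. */
uint16_t pkey_base(uint16_t pkey)
{
	return (uint16_t)(pkey & PKEY_BASE_MASK);
}

/*
 * Hypothetical matcher: compare a request pkey against a stored rule pkey
 * on the 15-bit base only, i.e. treat the request as partial membership.
 */
bool qos_rule_matches_pkey(uint16_t stored_pkey, uint16_t request_pkey)
{
	return pkey_base(stored_pkey) == pkey_base(request_pkey);
}

/* Plain equality, shown for contrast: misses full-member request pkeys. */
bool naive_match(uint16_t stored_pkey, uint16_t request_pkey)
{
	return stored_pkey == request_pkey;
}
```

With a rule stored as 0x7FFF, a request carrying 0xFFFF matches under the base comparison but not under plain equality, which is the false negative described above.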

-- Yevgeny
 
> Or.
> 
> 


From ogerlitz at voltaire.com  Wed Aug 26 02:07:51 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 26 Aug 2009 12:07:51 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090821000431.GA5713@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>
Message-ID: <4A94FB67.6050600@voltaire.com>

Jason Gunthorpe wrote:
> Check that the format of the multicast link address is correct before taking it from dev->mc_list to priv->multicast_list. This way we never try to send a bogus address to the SA, which prevents badness from erroneous 'ip maddr addr add', broken bonding drivers, or whatever.

Jason,

This is a great (and simple!) idea, let's go for it.
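
For readers following along, a format check of this kind could look roughly like the sketch below. It is a user-space illustration only, not the patch under discussion (which is not quoted in this message). The function name `mcast_addr_looks_valid` is hypothetical, and the byte layout is an assumption: a 20-byte IPoIB link address made of 4 bytes of flags/QPN followed by a 16-byte multicast GID, compared against the interface broadcast address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define IPOIB_ADDR_LEN 20	/* assumed: 4 bytes flags/QPN + 16-byte GID */

/*
 * Hypothetical check: accept a multicast link address only if it has the
 * expected length and shares the fixed fields of the broadcast address.
 */
bool mcast_addr_looks_valid(const unsigned char *addr, size_t addr_len,
			    const unsigned char *broadcast)
{
	if (addr_len != IPOIB_ADDR_LEN)
		return false;
	/* Bytes 0-5 (assumed): reserved QPN, 0xff prefix, flags/scope. */
	if (memcmp(addr, broadcast, 6) != 0)
		return false;
	/*
	 * Bytes 7-9 (assumed): low signature byte and pkey. Byte 6 is
	 * skipped because it differs between IPv4 and IPv6 groups.
	 */
	if (memcmp(addr + 7, broadcast + 7, 3) != 0)
		return false;
	return true;
}
```

An Ethernet-style 6-byte address, or a 20-byte address that does not start like the broadcast address, is rejected before it can reach the list of groups to join.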

> Same problem Moni was working on, but let's just address it directly. There is work to try and fix the bonding driver but no fixed version is in mainline yet. This is a cheap and simple workaround that is worth having even once the driver is fixed.
Moni,

isn't Jason's approach enough for the bonding case?! I saw that your 
patch ("bonding: clean muticast addresses when device changes type", 
commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10) is present in net-next 
and maybe also in mainline .31-rcX. However, it has the 
side effect of, e.g., losing routes already set for the bond 
while adding the underlying IPoIB devices, so if Jason's patch is enough 
we can just ask to revert the bonding fix, saying we have something better.

Or.


From o.w.saastad at usit.uio.no  Wed Aug 26 02:09:21 2009
From: o.w.saastad at usit.uio.no (Ole Widar Saastad)
Date: Wed, 26 Aug 2009 11:09:21 +0200
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
Message-ID: <1251277761.28564.45.camel@pyren.uio.no>

I am experiencing problems using the InfiniPath cards with the OFED
stack (details are given below).

There seems to be a problem somewhere when the MPI packet size grows above 2k.
As I recall, this is the changeover point from one transport mechanism to
another?

The test is easy to run; it is just a bandwidth program.
(I got far better latency using the PathScale stack than with OFED. Is this
something that will be looked into in newer releases?)

The two nodes in the nodes.txt file are compute-1-0 and compute-1-1. They are
connected to a SilverStorm switch.

[olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o
Resolution (usec): 2.145767
Benchmark ping-pong
===================
        lenght     iterations   elapsed time  transfer rate        latency
       (bytes)        (count)      (seconds)     (Mbytes/s)         (usec)
--------------------------------------------------------------------------
             0          10046          0.121          0.000          6.011
             1          10261          0.124          0.166          6.026
<cut a few lines>
          1024           7695          0.140        112.615          9.093
          1536           6260          0.133        144.469         10.632
          2048           5275          0.128        168.420         12.160
[0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.  

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated). 
[olews at login-0-2 bandwidth]$ 
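
As a worked example of the timeout formula printed in the Open MPI message above, the small sketch below evaluates 4.096 microseconds * 2^btl_openib_ib_timeout. The function name `ack_timeout_usec` is hypothetical, and the 0..31 range check is an assumption about the size of the exponent field rather than something stated in the output.

```c
/*
 * Hypothetical helper: local ACK timeout in microseconds for a given
 * btl_openib_ib_timeout exponent, using the formula quoted in the
 * Open MPI message. Returns a negative value for out-of-range input.
 */
double ack_timeout_usec(unsigned int exponent)
{
	/* Assumption: the exponent fits in 5 bits (0..31). */
	if (exponent > 31)
		return -1.0;
	return 4.096 * (double)(1UL << exponent);
}
```

For the default exponent of 10 this evaluates to 4.096 * 1024 = 4194.304 microseconds, i.e. roughly 4.2 ms.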


Background information :


07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02)
        Subsystem: QLogic, Corp. InfiniPath PE-800
        Flags: bus master, fast devsel, latency 0, IRQ 66
        Memory at fde00000 (64-bit, non-prefetchable) [size=2M]
        Capabilities: [40] Power Management version 2
        Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
        Capabilities: [70] Express Endpoint IRQ 0

compute-1-0.local# uname -a
Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
compute-1-0.local# 


compute-1-0.local# rpm -qa| grep ofed
libibverbs-utils-1.1.2-1.ofed1.4.2
librdmacm-utils-1.0.8-1.ofed1.4.2
libcxgb3-1.2.2-1.ofed1.4.2
ofed-scripts-1.4.2-0
libmlx4-1.0-1.ofed1.4.2
libibverbs-devel-1.1.2-1.ofed1.4.2
ofed-docs-1.4.2-0
ibvexdmtools-0.0.1-1.ofed1.4.2
libmthca-1.0.5-1.ofed1.4.2
libipathverbs-1.1-1.ofed1.4.2
mstflint-1.4-1.ofed1.4.2
libibumad-1.2.3_20090314-1.ofed1.4.2
libnes-0.6-1.ofed1.4.2
libibcommon-1.1.2_20090314-1.ofed1.4.2
libibverbs-1.1.2-1.ofed1.4.2
librdmacm-1.0.8-1.ofed1.4.2
qlgc_vnic_daemon-0.0.1-1.ofed1.4.2
compute-1-0.local# 

OpenMPI is :
openmpi-1.2.8 compiled for gcc.

-- 
Ole W. Saastad, dr. scient.
Scientific Computing Group, USIT, University of Oslo
http://hpc.uio.no


From kliteyn at dev.mellanox.co.il  Wed Aug 26 02:25:45 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 26 Aug 2009 12:25:45 +0300
Subject: [ofa-general] [TRIVIAL PATCH] ibutils: fix regexp for pkey
	matching
In-Reply-To: <20090825211250.GI16590@sgi.com>
References: <20090825211250.GI16590@sgi.com>
Message-ID: <4A94FF99.5030501@dev.mellanox.co.il>

akepner at sgi.com wrote:
> There's an error in a regular expression for matching pkeys 
> in ibdebug.tcl. The following fixes it.

Thanks. Applied.

-- Yevgeny
 
> Signed-off-by: Arthur Kepner <akepner at sgi.com>
> ---


From sneha0930 at gmail.com  Wed Aug 26 02:34:31 2009
From: sneha0930 at gmail.com (Sneha Mistry)
Date: Wed, 26 Aug 2009 15:04:31 +0530
Subject: [ofa-general] OFED-1.5-alpha4 installation problem
Message-ID: <fde1733a0908260234s4fff91f4oe16e3186b708d5dc@mail.gmail.com>

Hi,

I am a newbie to InfiniBand and am trying to install OFED-1.5-alpha4 on
openSUSE 10.3.
The kernel version is 2.6.26-2-686.

But it gives me an error message.

Failed to build ofa_kernel RPM
See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log

Regards,
sgm


From keshetti.mahesh at gmail.com  Wed Aug 26 02:43:30 2009
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Wed, 26 Aug 2009 15:13:30 +0530
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
Message-ID: <829ded920908260243g6a9e5217h4886cb7ec460fc35@mail.gmail.com>

There was a similar thread "Retry count error with ipath on OFED-1.3"
dated 27 May 2008.
And it turned out to be some hardware problem with Infinipath cards.

- Mahesh


From vlad at lists.openfabrics.org  Wed Aug 26 03:09:28 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 26 Aug 2009 03:09:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090826-0200 daily build status
Message-ID: <20090826100928.40648E28249@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sneha0930 at gmail.com  Wed Aug 26 03:17:51 2009
From: sneha0930 at gmail.com (Sneha Mistry)
Date: Wed, 26 Aug 2009 15:47:51 +0530
Subject: [ofa-general] Fwd: OFED-1.5-alpha4 installation problem
In-Reply-To: <fde1733a0908260234s4fff91f4oe16e3186b708d5dc@mail.gmail.com>
References: <fde1733a0908260234s4fff91f4oe16e3186b708d5dc@mail.gmail.com>
Message-ID: <fde1733a0908260317p3f754642jfd28077f93c15bd4@mail.gmail.com>

Hi,

I am a newbie to InfiniBand and am trying to install OFED-1.5-alpha4 on
openSUSE 10.3.
The kernel version is 2.6.26-2-686.

But it gives me an error message.

Failed to build ofa_kernel RPM
See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log

I checked the release notes; they say SUSE 10.3 is supported.

The output of uname -a is
Linux linux-ljhr 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC
i686 i686 i386 GNU/Linux

The last few lines of the log are given below.

make[1]: Entering directory `/usr/src/linux-2.6.22.5-31-obj/i386/default'
make -C ../../../linux-2.6.22.5-31
O=../linux-2.6.22.5-31-obj/i386/default modules
make -C /usr/src/linux-2.6.22.5-31-obj/i386/default \
	KBUILD_SRC=/usr/src/linux-2.6.22.5-31 \
	KBUILD_EXTMOD="/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5" -f
/usr/src/linux-2.6.22.5-31/Makefile modules
test -e include/linux/autoconf.h -a -e include/config/auto.conf || (		\
	echo;								\
	echo "  ERROR: Kernel configuration is invalid.";		\
	echo "         include/linux/autoconf.h or include/config/auto.conf
are missing.";	\
	echo "         Run 'make oldconfig && make prepare' on kernel src to
fix it.";	\
	echo;								\
	/bin/false)
mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions
rm -f /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/*
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
  gcc -m32 -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d
 -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.2.1/include
-D__KERNEL__ \
-D__OFED_BUILD__ \
-include include/linux/autoconf.h \
-include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.22_suse10_3/include/
\
 \
 \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \
-I/usr/local/include/scst \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \
-Iinclude \
-Iinclude2 -I/usr/src/linux-2.6.22.5-31/include \
-I/usr/src/linux-2.6.22.5-31/arch//include \
   -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs
-Werror-implicit-function-declaration -fno-strict-aliasing -fno-common
-Os -pipe -msoft-float -mregparm=3 -freg-struct-return
-mpreferred-stack-boundary=2 -march=i586 -mtune=generic -ffreestanding
-maccumulate-outgoing-args -DCONFIG_AS_CFI=1
-DCONFIG_AS_CFI_SIGNAL_FRAME=1
-I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-generic
-Iinclude/asm-i386/mach-generic
-I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-default
-Iinclude/asm-i386/mach-default -fomit-frame-pointer -g
-fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign
-DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)"
-D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.tmp_addr.o
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c
In file included from
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_addr.h:41,
                 from
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:46:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
function ‘ib_dma_mapping_error’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677:
warning: passing argument 1 of ‘dma_mapping_error’ makes integer from
pointer without a cast
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677:
error: too many arguments to function ‘dma_mapping_error’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716:
warning: ‘struct dma_attrs’ declared inside parameter list
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716:
warning: its scope is only this definition or declaration, which is
probably not what you want
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
function ‘ib_dma_map_single_attrs’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1718:
error: implicit declaration of function ‘dma_map_single_attrs’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1725:
warning: ‘struct dma_attrs’ declared inside parameter list
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
function ‘ib_dma_unmap_single_attrs’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1727:
error: implicit declaration of function ‘dma_unmap_single_attrs’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1728:
warning: ‘return’ with a value, in function returning void
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1803:
warning: ‘struct dma_attrs’ declared inside parameter list
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
function ‘ib_dma_map_sg_attrs’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1805:
error: implicit declaration of function ‘dma_map_sg_attrs’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1811:
warning: ‘struct dma_attrs’ declared inside parameter list
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
function ‘ib_dma_unmap_sg_attrs’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1813:
error: implicit declaration of function ‘dma_unmap_sg_attrs’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
In function ‘rdma_translate_ip’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
error: ‘init_net’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
error: (Each undeclared identifier is reported only once
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
error: for each function it appears in.)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:123:
error: too many arguments to function ‘ip_dev_find’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:33:
error: macro "for_each_netdev" passed 2 arguments, but takes just 1
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:
error: ‘for_each_netdev’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:
error: expected ‘;’ before ‘{’ token
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
In function ‘addr_send_arp’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
error: ‘init_net’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
warning: passing argument 2 of ‘ip_route_output_key’ from incompatible
pointer type
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
error: too many arguments to function ‘ip_route_output_key’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:206:
error: too many arguments to function ‘ip6_route_output’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
In function ‘addr4_resolve_remote’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
error: ‘init_net’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
warning: passing argument 2 of ‘ip_route_output_key’ from incompatible
pointer type
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
error: too many arguments to function ‘ip_route_output_key’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
In function ‘addr6_resolve_remote’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281:
error: ‘init_net’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281:
error: too many arguments to function ‘ip6_route_output’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
In function ‘addr_resolve_local’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368:
error: ‘init_net’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368:
error: too many arguments to function ‘ip_dev_find’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:372:
error: implicit declaration of function ‘ipv4_is_zeronet’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:376:
error: implicit declaration of function ‘ipv4_is_loopback’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:33:
error: macro "for_each_netdev" passed 2 arguments, but takes just 1
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:
error: ‘for_each_netdev’ undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:395:
error: expected ‘;’ before ‘if’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:410:
error: implicit declaration of function ‘ipv6_addr_loopback’
make[6]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o]
Error 1
make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core]
Error 2
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband]
Error 2
make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2
make[2]: *** [modules] Error 2
make[1]: *** [modules] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.22.5-31-obj/i386/default'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.64786 (%build)


RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.64786 (%build)

Regards,
sgm


From fenkes at de.ibm.com  Wed Aug 26 04:37:55 2009
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 26 Aug 2009 13:37:55 +0200
Subject: [ofa-general] [PATCH] IB/ehca: Construct MAD redirect replies from
	request MAD
Message-ID: <200908261337.56128.fenkes@de.ibm.com>

The old code used a lot of hardcoded values, which might not be valid in all
environments (especially routed fabrics or partitioned subnets). Copy as
much information as possible from the incoming request to prevent that.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---

Hal, Jason -- here's the change I promised. Looks okay to you?
Roland -- if Hal and Jason don't object, please queue this up for the next
kernel. Thanks!

Regards,
  Joachim

 drivers/infiniband/hw/ehca/ehca_sqp.c |   47 ++++++++++++++++++++++++++++----
 1 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c
index c568b28..8c1213f 100644
--- a/drivers/infiniband/hw/ehca/ehca_sqp.c
+++ b/drivers/infiniband/hw/ehca/ehca_sqp.c
@@ -125,14 +125,30 @@ struct ib_perf {
 	u8 data[192];
 } __attribute__ ((packed));
 
+/* TC/SL/FL packed into 32 bits, as in ClassPortInfo */
+struct tcslfl {
+	u32 tc:8;
+	u32 sl:4;
+	u32 fl:20;
+} __attribute__ ((packed));
+
+/* IP Version/TC/FL packed into 32 bits, as in GRH */
+struct vertcfl {
+	u32 ver:4;
+	u32 tc:8;
+	u32 fl:20;
+} __attribute__ ((packed));
 
 static int ehca_process_perf(struct ib_device *ibdev, u8 port_num,
+			     struct ib_wc *in_wc, struct ib_grh *in_grh,
 			     struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
 	struct ib_perf *in_perf = (struct ib_perf *)in_mad;
 	struct ib_perf *out_perf = (struct ib_perf *)out_mad;
 	struct ib_class_port_info *poi =
 		(struct ib_class_port_info *)out_perf->data;
+	struct tcslfl *tcslfl =
+		(struct tcslfl *)&poi->redirect_tcslfl;
 	struct ehca_shca *shca =
 		container_of(ibdev, struct ehca_shca, ib_device);
 	struct ehca_sport *sport = &shca->sport[port_num - 1];
@@ -158,10 +174,29 @@ static int ehca_process_perf(struct ib_device *ibdev, u8 port_num,
 		poi->base_version = 1;
 		poi->class_version = 1;
 		poi->resp_time_value = 18;
-		poi->redirect_lid = sport->saved_attr.lid;
-		poi->redirect_qp = sport->pma_qp_nr;
+
+		/* copy local routing information from WC where applicable */
+		tcslfl->sl         = in_wc->sl;
+		poi->redirect_lid  =
+			sport->saved_attr.lid | in_wc->dlid_path_bits;
+		poi->redirect_qp   = sport->pma_qp_nr;
 		poi->redirect_qkey = IB_QP1_QKEY;
-		poi->redirect_pkey = IB_DEFAULT_PKEY_FULL;
+
+		ehca_query_pkey(ibdev, port_num, in_wc->pkey_index,
+				&poi->redirect_pkey);
+
+		/* if request was globally routed, copy route info */
+		if (in_grh) {
+			struct vertcfl *vertcfl =
+				(struct vertcfl *)&in_grh->version_tclass_flow;
+			memcpy(poi->redirect_gid, in_grh->dgid.raw,
+			       sizeof(poi->redirect_gid));
+			tcslfl->tc        = vertcfl->tc;
+			tcslfl->fl        = vertcfl->fl;
+		} else
+			/* else only fill in default GID */
+			ehca_query_gid(ibdev, port_num, 0,
+				       (union ib_gid *)&poi->redirect_gid);
 
 		ehca_dbg(ibdev, "ehca_pma_lid=%x ehca_pma_qp=%x",
 			 sport->saved_attr.lid, sport->pma_qp_nr);
@@ -183,8 +218,7 @@ perf_reply:
 
 int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
 		     struct ib_wc *in_wc, struct ib_grh *in_grh,
-		     struct ib_mad *in_mad,
-		     struct ib_mad *out_mad)
+		     struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
 	int ret;
 
@@ -196,7 +230,8 @@ int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
 		return IB_MAD_RESULT_SUCCESS;
 
 	ehca_dbg(ibdev, "port_num=%x src_qp=%x", port_num, in_wc->src_qp);
-	ret = ehca_process_perf(ibdev, port_num, in_mad, out_mad);
+	ret = ehca_process_perf(ibdev, port_num, in_wc, in_grh,
+				in_mad, out_mad);
 
 	return ret;
 }
-- 
1.6.0.4
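
The two structs added above rely on the compiler's bitfield layout, which is implementation-defined. As a rough illustration only (not part of the patch), the same TC/SL/FL handling can be written with shifts and masks. The helper names below are hypothetical, and the sketch assumes both 32-bit words have already been converted to host byte order by the caller.

```c
#include <stdint.h>

/* GRH word, host order: version (4 bits) | traffic class (8) | flow label (20) */
static uint8_t grh_tclass(uint32_t vertcfl)
{
	return (uint8_t)((vertcfl >> 20) & 0xff);
}

static uint32_t grh_flow(uint32_t vertcfl)
{
	return vertcfl & 0xfffff;
}

/* ClassPortInfo word, host order: traffic class (8) | SL (4) | flow label (20) */
static uint32_t pack_tcslfl(uint8_t tc, uint8_t sl, uint32_t fl)
{
	return ((uint32_t)tc << 24) | ((uint32_t)(sl & 0xf) << 20) | (fl & 0xfffff);
}
```

For example, a GRH word of 0x6ab12345 (version 6, TC 0xab, FL 0x12345) combined with SL 3 packs to 0xab312345.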


From jean-vincent.ficet at bull.net  Wed Aug 26 05:03:04 2009
From: jean-vincent.ficet at bull.net (Vincent Ficet)
Date: Wed, 26 Aug 2009 14:03:04 +0200
Subject: [ofa-general] [PATCH] Duplicated file man/umad_get_mad.3 in
	libibumad/Makefile.am
Message-ID: <4A952478.7060407@bull.net>


Hello,

The file man/umad_get_mad.3 was listed twice in libibumad/Makefile.am, resulting in the following error:

/usr/bin/install: will not overwrite just-created `/home/vficet/work/infiniband/I686/usr/share/man/man3/umad_get_mad.3' with `man/umad_get_mad.3'

This patch removes the duplicated entry.

Cheers,

Vincent


Signed-off-by: Jean-Vincent Ficet <jean-vincent.ficet at bull.net>
---
 libibumad/Makefile.am |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Duplicated-file-man-umad_get_mad.3-in-libibumad-Make.patch
Type: text/x-patch
Size: 622 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090826/3be3676d/attachment.bin>

From hnrose at comcast.net  Wed Aug 26 07:02:02 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 26 Aug 2009 10:02:02 -0400
Subject: [ofa-general] [PATCH] libibmad: Add support for MulticastFDBTop
Message-ID: <20090826140202.GA19158@comcast.net>


Add support for SwitchInfo:MulticastFDBTop and
PortInfo:CapabilityMask.IsMulticastFDBTopSupported

Added by MgtWG errata #4505-4508

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 3093fbd..5f3b52b 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
  * Copyright (c) 2009 HNR Consulting.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -400,6 +401,7 @@ enum MAD_FIELDS {
 	IB_SW_FILTER_RAW_INB_F,
 	IB_SW_FILTER_RAW_OUTB_F,
 	IB_SW_ENHANCED_PORT0_F,
+	IB_SW_MCAST_FDB_TOP_F,
 	IB_SW_LAST_F,
 
 	/*
diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
index 051c708..d97d359 100644
--- a/libibmad/src/dump.c
+++ b/libibmad/src/dump.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
  * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -518,6 +519,8 @@ void mad_dump_portcapmask(char *buf, int bufsz, void *val, int valsz)
 	if (mask & (1 << 27))
 		s += sprintf(s,
 			     "\t\t\t\tIsLinkSpeedWidthPairsTableSupported\n");
+	if (mask & (1 << 30))
+		s += sprintf(s, "\t\t\t\tIsMulticastFDBTopSupported\n");
 
 	if (s != buf)
 		*(--s) = 0;
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index c8e4e79..5f30116 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
  * Copyright (c) 2009 HNR Consulting.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -206,6 +207,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint},
 	{BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint},
 	{BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint},
+	{BITSOFFS(144, 16), "MulticastFDBTop", mad_dump_hex},
 	{0, 0},			/* IB_SW_LAST_F */
 
 	/*
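
For readers who want to see what the two new fields amount to on the wire, here is a minimal standalone sketch. It is not libibmad code: the function names are made up, and it assumes the SwitchInfo attribute payload starts at byte 0 of the buffer, so that bit offset 144 with width 16 corresponds to bytes 18 and 19, and that the capability mask has already been converted to host byte order.

```c
#include <stdint.h>

/* PortInfo:CapabilityMask bit 30, i.e. 0x40000000 in host order */
#define CAP_MCAST_FDB_TOP (1u << 30)

static int has_mcast_fdb_top(uint32_t capmask_host_order)
{
	return (capmask_host_order & CAP_MCAST_FDB_TOP) != 0;
}

/* Read the 16-bit big-endian MulticastFDBTop from a raw SwitchInfo payload */
static uint16_t switchinfo_mcast_top(const uint8_t *si)
{
	return (uint16_t)((si[18] << 8) | si[19]);
}
```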


From hnrose at comcast.net  Wed Aug 26 07:04:50 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 26 Aug 2009 10:04:50 -0400
Subject: [ofa-general] [PATCH] opensm: Add infrastructure support for
	MulticastFDBTop
Message-ID: <20090826140450.GC19158@comcast.net>


Add support for SwitchInfo:MulticastFDBTop
Added by MgtWG errata #4505-4508

Add OpenSM infrastructure support to ib_types.h and osm_helper.c

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index fe3f051..e1e2bdb 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
@@ -4492,7 +4492,7 @@ typedef struct _ib_port_info {
 #define IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL (CL_HTON32(0x08000000))
 #define IB_PORT_CAP_RESV28        (CL_HTON32(0x10000000))
 #define IB_PORT_CAP_RESV29        (CL_HTON32(0x20000000))
-#define IB_PORT_CAP_RESV30        (CL_HTON32(0x40000000))
+#define IB_PORT_CAP_HAS_MCAST_FDB_TOP (CL_HTON32(0x40000000))
 #define IB_PORT_CAP_RESV31        (CL_HTON32(0x80000000))
 
 /****f* IBA Base: Types/ib_port_info_get_port_state
@@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info {
 	ib_net16_t lids_per_port;
 	ib_net16_t enforce_cap;
 	uint8_t flags;
+	uint8_t resvd;
+	ib_net16_t mcast_top;
 } PACK_SUFFIX ib_switch_info_t;
 #include <complib/cl_packoff.h>
 /************/
@@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record {
 	ib_net16_t lid;
 	uint16_t resv0;
 	ib_switch_info_t switch_info;
-	uint8_t pad[3];
+	uint8_t pad[1];
 } PACK_SUFFIX ib_switch_info_record_t;
 #include <complib/cl_packoff.h>
 
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 23392a4..b8a6523 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
@@ -766,9 +766,9 @@ static void dbg_get_capabilities_str(IN char *p_buf, IN const uint32_t buf_size,
 				&total_len) != IB_SUCCESS)
 			return;
 	}
-	if (p_pi->capability_mask & IB_PORT_CAP_RESV30) {
+	if (p_pi->capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
 		if (dbg_do_line(&p_local, buf_size, p_prefix_str,
-				"IB_PORT_CAP_RESV30\n",
+				"IB_PORT_CAP_HAS_MCAST_FDB_TOP\n",
 				&total_len) != IB_SUCCESS)
 			return;
 	}
@@ -1514,7 +1514,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log,
 			"\t\t\t\tlife_state..............0x%X\n"
 			"\t\t\t\tlids_per_port...........%u\n"
 			"\t\t\t\tpartition_enf_cap.......0x%X\n"
-			"\t\t\t\tflags...................0x%X\n",
+			"\t\t\t\tflags...................0x%X\n"
+			"\t\t\t\tmcast_top...............0x%X\n",
 			cl_ntoh16(p_si->lin_cap),
 			cl_ntoh16(p_si->rand_cap),
 			cl_ntoh16(p_si->mcast_cap),
@@ -1524,7 +1525,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log,
 			p_si->def_mcast_not_port,
 			p_si->life_state,
 			cl_ntoh16(p_si->lids_per_port),
-			cl_ntoh16(p_si->enforce_cap), p_si->flags);
+			cl_ntoh16(p_si->enforce_cap), p_si->flags,
+			cl_ntoh16(p_si->mcast_top));
 	}
 }
 

From hnrose at comcast.net  Wed Aug 26 07:03:50 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 26 Aug 2009 10:03:50 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibroute: Add support for
	MulticastFDBTop
Message-ID: <20090826140350.GB19158@comcast.net>


Add support for SwitchInfo:MulticastFDBTop
Added by MgtWG errata #4505-4508 and #4640

If MulticastFDBTop is set to other than 0, only fetch MulticastForwardingTable
blocks up through MulticastFDBTop rather than MulticastFDBCap

If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 106c934..f3ebe56 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 	char *s;
 	uint64_t nodeguid;
 	uint32_t mod;
-	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
+	unsigned block, i, j, e, nports, cap, top, chunks,
+		 startblock, lastblock;
 	int n = 0;
 
 	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
 		return s;
 
 	mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
+	mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
 
 	if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
 		endlid = IB_MIN_MCAST_LID + cap - 1;
+	if (!dump_all && top && top < endlid) {
+		if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff)
+			IBWARN("illegal top mlid %x", top);
+		else
+			endlid = top;
+	}
 
 	if (!startlid)
 		startlid = IB_MIN_MCAST_LID;
@@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 		printf(" MLid\n");
 	}
 	if (ibverbose)
-		printf("Switch multicast mlid capability is %d\n", cap);
+		printf("Switch multicast mlid capability is %d top is %d\n",
+		       cap, top);
 
 	chunks = ALIGN(nports + 1, 16) / 16;
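
The end-LID selection in the first hunk can be read in isolation as the following sketch. It mirrors the logic of the patch but is a standalone illustration: the function name is hypothetical, IB_MIN_MCAST_LID is assumed to be 0xc000, and the IBWARN call is reduced to a comment.

```c
#include <stdint.h>

#define MIN_MCAST_LID 0xc000u	/* assumed value of IB_MIN_MCAST_LID */

static unsigned clamp_endlid(unsigned endlid, unsigned cap, unsigned top,
			     int dump_all)
{
	/* default and upper bound come from MulticastFDBCap */
	if (!endlid || endlid > MIN_MCAST_LID + cap - 1)
		endlid = MIN_MCAST_LID + cap - 1;

	/* a non-zero MulticastFDBTop below that bound shortens the range */
	if (!dump_all && top && top < endlid) {
		if (top < MIN_MCAST_LID - 1 || top == 0xffff)
			return endlid;	/* the patch warns and keeps endlid */
		endlid = top;
	}
	return endlid;
}
```

With cap = 1024, a top of 0xc010 yields 0xc010, a top of 0 leaves the cap-based 0xc3ff, and a top of 0xbfff yields 0xbfff, which is below the first multicast LID and so corresponds to the "no entries" case.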
 

From dorfman.eli at gmail.com  Wed Aug 26 07:37:30 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 26 Aug 2009 17:37:30 +0300
Subject: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery
	from switch node.
Message-ID: <4A9548AA.4020900@gmail.com>

Subject: [PATCH] Fix IB network discovery from switch node.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index c69467e..779e659 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -590,13 +590,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 	if (!port)
 		goto error;
 
-	rc = get_remote_node(ibmad_port, fabric, node, port, from,
-			     mad_get_field(node->info, 0,
-					   IB_NODE_LOCAL_PORT_F), 0);
-	if (rc < 0)
-		goto error;
-	if (rc > 0)		/* non-fatal error, nothing more to be done */
-		return ((ibnd_fabric_t *) fabric);
+	if (node->node.type != IB_NODE_SWITCH) { 
+		rc = get_remote_node(ibmad_port, fabric, node, port, from,
+				     mad_get_field(node->info, 0,
+						   IB_NODE_LOCAL_PORT_F), 0);
+		if (rc < 0)
+			goto error;
+		if (rc > 0)		/* non-fatal error, nothing more to be done */
+			return ((ibnd_fabric_t *) fabric);
+	}
 
 	for (dist = 0; dist <= max_hops; dist++) {
 
-- 
1.5.5


From hal.rosenstock at gmail.com  Wed Aug 26 07:55:41 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 26 Aug 2009 10:55:41 -0400
Subject: [ofa-general] Combined DR path with empty DR path, what is the 
	expected behavior?
In-Reply-To: <20090825175543.4f929646.weiny2@llnl.gov>
References: <20090824185206.39e5e377.weiny2@llnl.gov>
	<f0e08f230908251615x79f2f87cwcba95c0f7e743bfe@mail.gmail.com>
	<20090825175543.4f929646.weiny2@llnl.gov>
Message-ID: <f0e08f230908260755q24b29657t8149e1aa55224fd0@mail.gmail.com>

On 8/25/09, Ira Weiny <weiny2 at llnl.gov> wrote:
>
> On Tue, 25 Aug 2009 19:15:19 -0400
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
> > On 8/24/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> >
> > > If I send a combined DR path with a start lid but an empty (0 length) DR
> > > path.
> >
> >
> > Hop Count 0 ?
>
> Yes
>
> >
> >
> > > What is the expected behavior?
> >
> >
> > Not sure what you mean by expected here. Are you referring to expectation
> > based on the spec ?
> >
>
> yes
>
> >
> > > I know this could be specified with LID routing, but I don't see
> > > anywhere in the specification which says this is an error.
> >
> > I don't think it should be an error (certainly not for the form you are
> > using LID routed part followed by a DR part) but a null DR part is a
> > little funny/odd.
>
> Yea I know.  It turns out that the new iblinkinfo issues queries like this
> when it recurses back from the last DR portion of the combined route path.
> It only showed up as an error when using the -S <guid> option of
> iblinkinfo with this new switch I have.  Works fine with the old switches.
>
> >
> > > I do however seem to have 2
> > > different implementations on 2 different switches.  For example:
> > >
> > > I have Switch A (Lid 1) and Switch B (Lid 7).  I attempt to query
> > > PortInfo of Port 1 of each switch using the LID followed by an empty
> > > DR path.
> > >
> > > 17:55:22 > ./smpquery -c portinfo 1 0 1
> > > ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1)
> > > ./smpquery: iberror: failed: operation portinfo: port info query failed
> >
> >
> > Is this a timeout ?
>
> yes
>
> 16:26:25 > ./smpquery -e -c portinfo 1 0 1
> ibwarn: [27150] _do_madrpc: retry 1 (timeout 1000 ms)
> ibwarn: [27150] _do_madrpc: retry 2 (timeout 1000 ms)
> ibwarn: [27150] _do_madrpc: timeout after 3 retries, 3000 ms
> ibwarn: [27150] mad_rpc: _do_madrpc failed; dport (Lid 1)
> ./smpquery: iberror: failed: operation portinfo: port info query failed
>
>
> >
> >
> > > 17:55:31 > ./smpquery -c portinfo 7 0 1
> > > # Port info: Lid 7 port 1
> > > Mkey:............................0x0000000000000000
> > > GidPrefix:.......................0x0000000000000000
> > > ...
> > > <normal output snipped>
> > >
> > > Detecting this special case in libibmad and turning the packet into a
> > > LID routed one
> >
> >
> > Ugh... Is this special case really needed ? I don't think the underlying
> > issue is understood sufficiently yet.
>
> Well I just did it to prove that what I was doing would work with a
> "simple" lid routed packet.  Like I said it might be that this portid which
> is being specified to libibmad by libibnetdisc is not valid.  If that is
> true then libibnetdisc should detect when the DR path is empty and go back
> to LID routed requests.  That is a valid fix in my mind.


Sure; there's no real need for combined route when the DR path is empty but
it should work (at least with switches).
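
As a rough sketch of that fallback, the choice of routing mode could look like the code below. Everything here is hypothetical: struct route and pick_route() are illustration-only names, not the libibmad or libibnetdisc API, and a LID of 0 is assumed to mean "no LID part".

```c
/* Hypothetical description of a destination: optional LID plus DR hop count */
struct route {
	unsigned lid;		/* 0 means no LID-routed part (assumption) */
	unsigned dr_hops;	/* length of the directed-route part */
};

enum route_kind { ROUTE_LID, ROUTE_DR, ROUTE_COMBINED };

static enum route_kind pick_route(const struct route *r)
{
	if (r->lid && r->dr_hops)
		return ROUTE_COMBINED;	/* LID part followed by a DR part */
	if (r->lid)
		return ROUTE_LID;	/* empty DR part: plain LID routing */
	return ROUTE_DR;		/* pure DR; 0 hops is the local node */
}
```

With this, "portinfo 1 0 1" style requests (LID 1, empty DR path) would go out LID routed instead of as a combined route.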

>
> > > succeeds but I wonder if this is an error in the SMI?
> >
> >
> > Switch SMI ? Is this a proprietary implementation ?
> >
>
> Yes I see the bug with 2 different vendors' switches.  One is managed and
> the other is not.  My "old" switches (3 different vendors) do not show this
> behavior.  (Just to be clear I now have 5 switches in my 5 node cluster!
> ;-)
>
> >
> >
> > >   I also notice this is an error on the HCA I am running from (lid 2).
> >
> >
> > Is this HCA node OpenIB based ?
>
> yes


If I recall correctly, there is something in the spec that makes combined
routing not be allowed on HCA (and router) ports so this seems correct. I
can dig this out if really needed.

>
> > > 17:57:42 > ./smpquery -c portinfo 2 0 1
> > > ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2)
> > > ./smpquery: iberror: failed: operation portinfo: port info query failed
> >
> >
> > Is this also a timeout ?
>
> yes
>
> >
> > Also, does the result differ based on where you source these from
> > (locally vs. remotely)?
>
> Same result local and remote.
>
> >
> >
> >
> > > Running with a simple DR path works,
> >
> >
> > You're referring to the same DR path here that fails in the combined
> > route examples above, right ?
> >
>
> No. The example below is a DR path with Hop Count == 0 but without the
> initial LID routing.
>
> >
> > > I guess because this is the loopback case mentioned on page 805.
> >
> >
> > Yes but that's the high level requirement rather than the SMI rules which
> > make that work.
> >
> >
> >
> > > 17:58:16 > ./smpquery -D portinfo 0 1
> > > # Port info: DR path slid 65535; dlid 65535; 0 port 1
> > > Mkey:............................0x0000000000000000
> > > GidPrefix:.......................0x2007000000000000
> > > ...
> > > <snip>
> > >
> > > I guess that the comment "Since each part may be empty, there are
> > > eight combinations, although only four are really useful:" on line 36
> > > Page 805 can be interpreted to mean that only those 4 combinations need
> > > to be supported.  Is this true?
> >
> >
> > Not all 4 combinations are supported/known to work. When this was added
> > for ibportstate, the only combined routing form that was important was
> > LID routed part followed by a DR part.
> >
>
> When you say "known to work" you mean implemented with the diags?  Or known
> to work in all hardware?


The former with most hardware up to some time ago. Note there is no
compliance testing of combined routing and heavy reliance on this makes some
a little nervous.

>
> > > On the other hand I think strictly this should be supported.
> >
> >
> > In an ideal world yes but are they all required or is it just the one
> > form most heavily used ?
>
> That is what I am unclear on.  Does the spec require that all 8
> combinations are required to work?  I don't see a specific compliance which
> says that and I am not sure if C14-9 and C14-13 cover all 8 combinations.


I don't think there's any compliance on this. It all appears to be
informative text. Perhaps a shortcoming of the spec. So there's nothing
definitive. It just says there are 8 combinations (2**3 as there are 3 parts
with 2 possibilities in each part) and that only 4 are really useful.

>
> > >   Item 4 of C14-9
> > > (line 24 page 810) requires the SMI to handle the packet if the
> > > HopPointer equals HopCount +1, which it is in my case (HopCount == 0,
> > > HopPointer == 1)
> >
> >
> > By handle, this means "The SMI *shall* output the packet on the port
> > whose number is in the entry indexed by Hop Pointer in the Initial Path.
> > If that port number is invalid, the SMI *shall* discard the SMP."
> >
> > Are you sure the Hop Pointer is 1 ? Where do you see this ?
>
> No I was wrong.  I think I read the wrong madeye packet as I see the packet
> right before this one did have a hop pointer of 1.  I added some debug
> prints to mad_encode to get the following output:
>
> 17:26:10 > ./smpquery -e -c portinfo 1 0 1
> trid 2a0f0cb5; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0
> trid 2a0f0cb6; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0
> trid 2a0f0cb7; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0
> ibwarn: [27322] _do_madrpc: recv failed: Connection timed out
> ibwarn: [27322] mad_rpc: _do_madrpc failed; dport (Lid 1)
> ./smpquery: iberror: failed: operation portinfo: port info query failed
>
> madeye for these packets:
>
> Aug 25 17:28:03 woprjr0 Madeye:recv SMP
> Aug 25 17:28:03 woprjr0 MAD version....0x1
> Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:28:03 woprjr0 Class version..0x1
> Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response)
> Aug 25 17:28:03 woprjr0 Status.........0x8000
> Aug 25 17:28:03 woprjr0 Hop pointer....0x1
> Aug 25 17:28:03 woprjr0 Hop counter....0x0
> Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5
> Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info)
> Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
> Aug 25 17:28:03 woprjr0 Mkey...........0x0
> Aug 25 17:28:03 woprjr0 DR SLID........0xffff
> Aug 25 17:28:03 woprjr0 DR DLID........0xffff
> Aug 25 17:28:03 woprjr0 Madeye:sent SMP
> Aug 25 17:28:03 woprjr0 MAD version....0x1
> Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:28:03 woprjr0 Class version..0x1
> Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
> Aug 25 17:28:03 woprjr0 Status.........0x00
> Aug 25 17:28:03 woprjr0 Hop pointer....0x1
> Aug 25 17:28:03 woprjr0 Hop counter....0x0
> Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5
> Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info)
> Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
> Aug 25 17:28:03 woprjr0 Mkey...........0x0
> Aug 25 17:28:03 woprjr0 DR SLID........0xffff
> Aug 25 17:28:03 woprjr0 DR DLID........0xffff
> Aug 25 17:28:03 woprjr0 Madeye:recv SMP
> Aug 25 17:28:03 woprjr0 MAD version....0x1
> Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:28:03 woprjr0 Class version..0x1
> Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response)
> Aug 25 17:28:03 woprjr0 Status.........0x8000
> Aug 25 17:28:03 woprjr0 Hop pointer....0x1
> Aug 25 17:28:03 woprjr0 Hop counter....0x0
> Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6
> Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
> Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
> Aug 25 17:28:03 woprjr0 Mkey...........0x0
> Aug 25 17:28:03 woprjr0 DR SLID........0xffff
> Aug 25 17:28:03 woprjr0 DR DLID........0xffff
> Aug 25 17:28:03 woprjr0 Madeye:sent SMP
> Aug 25 17:28:03 woprjr0 MAD version....0x1
> Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:28:03 woprjr0 Class version..0x1
> Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
> Aug 25 17:28:03 woprjr0 Status.........0x00
> Aug 25 17:28:03 woprjr0 Hop pointer....0x1
> Aug 25 17:28:03 woprjr0 Hop counter....0x0
> Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6
> Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
> Aug 25 17:28:03 woprjr0 Attr modifier..0x0000
> Aug 25 17:28:03 woprjr0 Mkey...........0x0
> Aug 25 17:28:03 woprjr0 DR SLID........0xffff
> Aug 25 17:28:03 woprjr0 DR DLID........0xffff
> Aug 25 17:28:03 woprjr0 Madeye:sent SMP
> Aug 25 17:28:03 woprjr0 MAD version....0x1
> Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:28:03 woprjr0 Class version..0x1
> Aug 25 17:28:03 woprjr0 Method.........0x1 (Get)
> Aug 25 17:28:03 woprjr0 Status.........0x00
> Aug 25 17:28:03 woprjr0 Hop pointer....0x0
> Aug 25 17:28:03 woprjr0 Hop counter....0x0
> Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb7
> Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info)
> Aug 25 17:28:03 woprjr0 Attr modifier..0x0001
> Aug 25 17:28:03 woprjr0 Mkey...........0x0
> Aug 25 17:28:03 woprjr0 DR SLID........0x02
> Aug 25 17:28:03 woprjr0 DR DLID........0xffff
>
> No response is shown for trid 0x1b9d2a0f0cb7...
>
> As an aside I see the hop pointer is set to 1 at a lower level since
> mad_encode does not do it.


Right; the SMI would do that.

> So I guess the proper case for C14-9 would be "3) If Hop Pointer is equal to
> Hop Count".  (They are both 0.)


I'm not sure; maybe C14-9 4)

>
> > If so, what's the initial path at this point (or more specifically index
> > 1 of the initial path) ? I think that needs to be port 0 (if a switch) but
> > this is a little weird as I would think it should be handed to the SMA
> > which is different cases in the spec.
>
> Yes I think I was wrong on the case.  But still wouldn't the SMI detect
> that this is the end of the DRPath and simply hand it to the SMA?


Yes, that's what should happen.


>
> >
> > > Then after processing
> >
> >
> > by the SMA and doing the required returning initialization
> >
> > > the SMI should return the packet as specified in C14-13
> > > item 3 on line 9 page 812.
> >
> >
> > I'm not sure it would use this case in the case of an empty DR path on
> > return.
>
> Actually I think it will use this.  C14-9 item 3) states "the Hop Pointer
> shall be incremented by 1"  Therefore when the response is handed back to
> the SMI the Hop pointer will be 1 and the hop count 0.  And the SMI uses the
> DRSLID to send the packet back to the requester.


It goes up to the SMA and then when the response is to be made it goes
through returning SMI initialization and handling.

-- Hal

>
> > > Am I wrong?  In the end it does not matter as I have to make the
> > > software work for all the hardware I have; so I will change the software.
> >
> >
> > IMO it does matter as to where the problem lies (SMI or otherwise) and
> > how the layers are comprised in the implementation.
>
> Agreed.  I am mainly confused because I have 2 different implementations of
> this.  My "old" switches seem to handle this case just fine.  My "new"
> switches do not.  So I am really wondering what is going on.
>
> Here is the above output for the same query which works with an "old"
> switch.
>
> 17:28:04 > ./smpquery -e -c portinfo 7 0 1
> ...
> trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0
> ...
>
> Aug 25 17:46:40 woprjr0 Madeye:sent SMP
> Aug 25 17:46:40 woprjr0 MAD version....0x1
> Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:46:40 woprjr0 Class version..0x1
> Aug 25 17:46:40 woprjr0 Method.........0x1 (Get)
> Aug 25 17:46:40 woprjr0 Status.........0x00
> Aug 25 17:46:40 woprjr0 Hop pointer....0x0
> Aug 25 17:46:40 woprjr0 Hop counter....0x0
> Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
> Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
> Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
> Aug 25 17:46:40 woprjr0 Mkey...........0x0
> Aug 25 17:46:40 woprjr0 DR SLID........0x02
> Aug 25 17:46:40 woprjr0 DR DLID........0xffff
> Aug 25 17:46:40 woprjr0 Madeye:recv SMP
> Aug 25 17:46:40 woprjr0 MAD version....0x1
> Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
> Aug 25 17:46:40 woprjr0 Class version..0x1
> Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response)
> Aug 25 17:46:40 woprjr0 Status.........0x8000
> Aug 25 17:46:40 woprjr0 Hop pointer....0x0
> Aug 25 17:46:40 woprjr0 Hop counter....0x0
> Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
> Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
> Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
> Aug 25 17:46:40 woprjr0 Mkey...........0x0
> Aug 25 17:46:40 woprjr0 DR SLID........0x02
> Aug 25 17:46:40 woprjr0 DR DLID........0xffff
>
> Hop Pointer and Count are both 0 and things work just fine...
>
> >
> > > However, I wonder where exactly the spec falls on this, because I think
> > > it will influence where the fix resides.  If the spec does not allow this
> > > then I think it is fine to have libibmad return an error since the user
> > > specified an invalid combined DR path.  However, if this should be legal I
> > > think libibmad should work around the bad hardware out there.
> >
> >
> > Is it hardware or firmware that needs fixing? I think it may depend on the
> > specific workaround for this as to whether it is acceptable, as it might harm
> > something else or might violate the spec.
>
> I agree, however, if the switch hardware needs fixing I fear it is too late
> for the ones I have.  Firmware might be upgradable although I have had issues
> with un-managed switches in the past.
>
> So where do we put the fix in software?
>
> Ira
>
> > -- Hal
> >
> >
> > Thoughts?
> > > Ira
> > >
> > > --
> > > Ira Weiny
> > > Math Programmer/Computer Scientist
> > > Lawrence Livermore National Lab
> > > 925-423-8008
> > > weiny2 at llnl.gov
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >
> > > To unsubscribe, please visit
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> >
>
>
> --
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2 at llnl.gov
>

From hal.rosenstock at gmail.com  Wed Aug 26 08:15:03 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 26 Aug 2009 11:15:03 -0400
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect
	replies from request MAD
In-Reply-To: <200908261337.56128.fenkes@de.ibm.com>
References: <200908261337.56128.fenkes@de.ibm.com>
Message-ID: <f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>

On 8/26/09, Joachim Fenkes <fenkes at de.ibm.com> wrote:
>
> The old code used a lot of hardcoded values, which might not be valid in all
> environments (especially routed fabrics or partitioned subnets). Copy as
> much information as possible from the incoming request to prevent that.
>
> Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
> ---
>
> Hal, Jason -- here's the change I promised. Looks okay to you?
> Roland -- if Hal and Jason don't object, please queue this up for the next
> kernel. Thanks!


Thanks for doing this. It looks sane to me. The only issue I recall that
appears to be remaining is a better setting of ClassPortInfo:RespTimeValue
rather than hardcoding it. Perhaps using the value from PortInfo is the way to
go. Ideally it would be the value from the port to which the requester
is being redirected, but that might not be so easy to get from this port.
(I guess that could be an SA Get of PortInfoRecord for that port, but that is a
larger change, and the value is likely to be the same as that of the local port
issuing the redirect response.)
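
To put the hardcoded value in perspective, here is a small sketch of how a
RespTimeValue translates into a time, assuming the usual encoding of
4.096 usec * 2^RespTimeValue for this 5-bit field. The function name is
hypothetical and not part of the ehca driver.

```c
#include <stdint.h>

/*
 * Convert a 5-bit RespTimeValue into nanoseconds.
 * Assumed encoding: 4.096 usec * 2^RespTimeValue, i.e. 4096 ns << value.
 */
static uint64_t resp_time_ns(uint8_t resp_time_value)
{
	return UINT64_C(4096) << (resp_time_value & 0x1f);
}
```

With the value 18 used in the patch, this works out to 2^30 ns, or roughly
1.07 seconds.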

-- Hal

> Regards,
> Joachim
>
> drivers/infiniband/hw/ehca/ehca_sqp.c |   47
> ++++++++++++++++++++++++++++----
> 1 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c
> b/drivers/infiniband/hw/ehca/ehca_sqp.c
> index c568b28..8c1213f 100644
> --- a/drivers/infiniband/hw/ehca/ehca_sqp.c
> +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c
> @@ -125,14 +125,30 @@ struct ib_perf {
>        u8 data[192];
> } __attribute__ ((packed));
>
> +/* TC/SL/FL packed into 32 bits, as in ClassPortInfo */
> +struct tcslfl {
> +       u32 tc:8;
> +       u32 sl:4;
> +       u32 fl:20;
> +} __attribute__ ((packed));
> +
> +/* IP Version/TC/FL packed into 32 bits, as in GRH */
> +struct vertcfl {
> +       u32 ver:4;
> +       u32 tc:8;
> +       u32 fl:20;
> +} __attribute__ ((packed));
>
> static int ehca_process_perf(struct ib_device *ibdev, u8 port_num,
> +                            struct ib_wc *in_wc, struct ib_grh *in_grh,
>                             struct ib_mad *in_mad, struct ib_mad *out_mad)
> {
>        struct ib_perf *in_perf = (struct ib_perf *)in_mad;
>        struct ib_perf *out_perf = (struct ib_perf *)out_mad;
>        struct ib_class_port_info *poi =
>                (struct ib_class_port_info *)out_perf->data;
> +       struct tcslfl *tcslfl =
> +               (struct tcslfl *)&poi->redirect_tcslfl;
>        struct ehca_shca *shca =
>                container_of(ibdev, struct ehca_shca, ib_device);
>        struct ehca_sport *sport = &shca->sport[port_num - 1];
> @@ -158,10 +174,29 @@ static int ehca_process_perf(struct ib_device *ibdev,
> u8 port_num,
>                poi->base_version = 1;
>                poi->class_version = 1;
>                poi->resp_time_value = 18;
> -               poi->redirect_lid = sport->saved_attr.lid;
> -               poi->redirect_qp = sport->pma_qp_nr;
> +
> +               /* copy local routing information from WC where applicable
> */
> +               tcslfl->sl         = in_wc->sl;
> +               poi->redirect_lid  =
> +                       sport->saved_attr.lid | in_wc->dlid_path_bits;
> +               poi->redirect_qp   = sport->pma_qp_nr;
>                poi->redirect_qkey = IB_QP1_QKEY;
> -               poi->redirect_pkey = IB_DEFAULT_PKEY_FULL;
> +
> +               ehca_query_pkey(ibdev, port_num, in_wc->pkey_index,
> +                               &poi->redirect_pkey);
> +
> +               /* if request was globally routed, copy route info */
> +               if (in_grh) {
> +                       struct vertcfl *vertcfl =
> +                               (struct vertcfl
> *)&in_grh->version_tclass_flow;
> +                       memcpy(poi->redirect_gid, in_grh->dgid.raw,
> +                              sizeof(poi->redirect_gid));
> +                       tcslfl->tc        = vertcfl->tc;
> +                       tcslfl->fl        = vertcfl->fl;
> +               } else
> +                       /* else only fill in default GID */
> +                       ehca_query_gid(ibdev, port_num, 0,
> +                                      (union ib_gid *)&poi->redirect_gid);
>
>                ehca_dbg(ibdev, "ehca_pma_lid=%x ehca_pma_qp=%x",
>                         sport->saved_attr.lid, sport->pma_qp_nr);
> @@ -183,8 +218,7 @@ perf_reply:
>
> int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
>                     struct ib_wc *in_wc, struct ib_grh *in_grh,
> -                    struct ib_mad *in_mad,
> -                    struct ib_mad *out_mad)
> +                    struct ib_mad *in_mad, struct ib_mad *out_mad)
> {
>        int ret;
>
> @@ -196,7 +230,8 @@ int ehca_process_mad(struct ib_device *ibdev, int
> mad_flags, u8 port_num,
>                return IB_MAD_RESULT_SUCCESS;
>
>        ehca_dbg(ibdev, "port_num=%x src_qp=%x", port_num, in_wc->src_qp);
> -       ret = ehca_process_perf(ibdev, port_num, in_mad, out_mad);
> +       ret = ehca_process_perf(ibdev, port_num, in_wc, in_grh,
> +                               in_mad, out_mad);
>
>        return ret;
> }
> --
> 1.6.0.4
>
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>

From hal.rosenstock at gmail.com  Wed Aug 26 08:20:16 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 26 Aug 2009 11:20:16 -0400
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To: <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com>
References: <20090825162517.7955C21C827@f28.poczta.interia.pl>
	<f0e08f230908251459j2657ef17o1a0b7c5abc836267@mail.gmail.com>
	<92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com>
Message-ID: <f0e08f230908260820n2e2a37d0jdefc3e687d0248af@mail.gmail.com>

On 8/25/09, PN <poknam at gmail.com> wrote:
>
> HI,
>
> I can think of a situation in which all servers have dual port IB cards and
> need failover of OpenSM to achieve HA.
> As far as I know, OpenSM can only bind to 1 port at a time,


Yes.

> so do I need to start 2 OpenSM in server A and 2 OpenSM in server B?


That would be one valid configuration. I'm assuming all ports are connected
to same subnet.

> Will they use the same guid2lid file?


Depends how the OpenSM configuration is done.

> Do I need to set something in the config file or will they automatically
> communicate with each other?


What communication are you referring to? They all need to share the same
subnet prefix.


> Do I need to run sldd.sh manually or it will automatically sync with other
> OpenSM?


You can manually copy the guid2lid file around to the appropriate
places. I'm not that familiar with sldd.sh, but I think it can either be run
manually or made to run automatically; I don't know the details.
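
After copying, one simple way to confirm that two copies of the guid2lid file
are in sync is a byte-for-byte comparison. The following is only a sketch; the
helper name is hypothetical, and it assumes both copies are reachable as local
paths (for example after fetching the master's copy to the standby node).

```c
#include <stdio.h>

/*
 * Return 1 if the two files have identical contents, 0 if they differ,
 * and -1 if either file cannot be opened.
 */
static int files_identical(const char *path_a, const char *path_b)
{
	FILE *a = fopen(path_a, "rb");
	FILE *b = fopen(path_b, "rb");
	int ca, cb, same = 1;

	if (!a || !b) {
		if (a)
			fclose(a);
		if (b)
			fclose(b);
		return -1;
	}
	/* Compare one byte at a time until a mismatch or end of file. */
	do {
		ca = fgetc(a);
		cb = fgetc(b);
		if (ca != cb)
			same = 0;
	} while (same && ca != EOF);
	fclose(a);
	fclose(b);
	return same;
}
```

Where the guid2lid file lives depends on how OpenSM is configured, so the two
paths are left to the caller.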

-- Hal


> Thanks a lot.
>
> Regards,
> PN
>
>
>
>
> 2009/8/26 Hal Rosenstock <hal.rosenstock at gmail.com>
>
>>
>>
>>  On 8/25/09, kovlensky at interia.pl <kovlensky at interia.pl> wrote:
>>>
>>> Hi all,
>>>
>>> Quick question - is there a need to run anything except opensmd daemons
>>> to provide failover capability on the IB network in OFED 1.3?
>>
>>
>> In terms of SM failover, modulo bugs fixed relative to this feature since
>> OFED 1.3 (there are a couple of things here which may affect your
>> environment if I recall correctly), you only need to run more than 1 SM for
>> this (one will become master, the other standby).
>>
>>> I'm aware that when the master manager dies the standby one comes in and manages
>>> the network, but that does not necessarily mean that LIDs are preserved,
>>> especially for nodes joining in. I used to run sldd.sh for distributing the LID
>>> list on OFED 1.2.5, but while this script seems to be in place, no one
>>> mentions the necessity for it.
>>
>>
>> So subnet manager failover is provided by running standby opensm.
>>
>>
>> And how LID preservation is provided?
>>
>>
>> If you want LIDs to be preserved, the guid2lid file needs to be sync'd
>> (copied from the master SM once it's fully assembled to the node which is
>> running the standby SM). That's what the sldd.sh script does.
>>
>> -- Hal
>>
>> Regards,
>>>
>>> Zdenek Kovlensky
>>>
>
>
>
> --
> Best Regards,
> PN Lai
> HPC Specialist
> Galactic Computng Corp.
> Tel: 86-755-26733939 ext 826
> Mobile: 86-13823161729
> Fax: 86-755-26733780
> URL: http://www.galactic.com.hk
>

From hal.rosenstock at gmail.com  Wed Aug 26 08:23:53 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 26 Aug 2009 11:23:53 -0400
Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH 
	topology.
In-Reply-To: <4A949EFE.9000302@Sun.COM>
References: <4A92DFEC.3010300@Sun.COM>
	<f0e08f230908251504m4aec4233ke6aa5b009ce1232c@mail.gmail.com>
	<4A949EFE.9000302@Sun.COM>
Message-ID: <f0e08f230908260823t1d3a42edr14d8fcc32f102cc2@mail.gmail.com>

Hi Rafael,

On 8/25/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>
> Hello Hal,
>
> Below...
>
> Hal Rosenstock wrote:
>
>
>
> On 8/24/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>>
>> Hello,
>>
>> I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics
>> each, 8 qnems).
>> They are configured in a MESH topology.
>> I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>>
>> I'm booting PXE from IB, my initrd image is bringing the ib0 interface,
>> getting the squashfs image and mounting with aufs.
>>
>> The problem is: when booting more than 60 nodes, I start to get the errors
>> below on the subnet manager.
>> And the problem seems to be intermittent, because each time it gives errors
>> on a different path.
>>
>> Any ideas ?
>>
>> Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713838 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
>> Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713842 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
>> Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713847 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
>> Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713866 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
>> Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713871 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
>> Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
>> Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
>> completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) --
>> dropping
>> Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
>> Hop Ptr: 0x0
>> Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>>                 Initial path = 0,0,0,0,0,0
>>                 Return path  = 0,0,0,0,0,0
>> Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
>> ERR 3113: MAD completed in error (IB_TIMEOUT)
>> Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
>>                 base_ver................0x1
>>                 mgmt_class..............0x81
>>                 class_ver...............0x1
>>                 method..................0x1 (SubnGet)
>>                 D bit...................0x0
>>                 status..................0x0
>>                 hop_ptr.................0x0
>>                 hop_count...............0x5
>>                 trans_id................0x36595
>>                 attr_id.................0x15 (PortInfo)
>>                 resv....................0x0
>>                 attr_mod................0x0
>>                 m_key...................0x0000000000000000
>>                 dr_slid.................65535
>>                 dr_dlid.................65535
>>
>>                 Initial path: 0,1,15,15,15,19
>>                 Return path:  0,0,0,0,0,0
>>                 Reserved:     [0][0][0][0][0][0][0]
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>> Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
>> completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) --
>> dropping
>> Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
>> Hop Ptr: 0x0
>> Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>>                 Initial path = 0,0,0,0,0,0
>>                 Return path  = 0,0,0,0,0,0
>> Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
>> ERR 3113: MAD completed in error (IB_TIMEOUT)
>> Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
>>                 base_ver................0x1
>>                 mgmt_class..............0x81
>>                 class_ver...............0x1
>>                 method..................0x1 (SubnGet)
>>                 D bit...................0x0
>>                 status..................0x0
>>                 hop_ptr.................0x0
>>                 hop_count...............0x5
>>                 trans_id................0x36596
>>                 attr_id.................0x15 (PortInfo)
>>                 resv....................0x0
>> ....
>>
>
> These errors are transient as you indicate. They mean that some node has
> brought the link physically up but there is no SMA at the remote side of the
> link. The different paths are paths to the HCAs. This occurs during PXE boot
> as the node transitions from the boot ROM to the Linux environment.
>
>
> They are transient, but sometimes opensm hangs with the same message and
> loops these error messages.
>

Are you sure OpenSM hangs? If so, any idea where?

> First I was using the CentOS 5.3 kernel with updates, and IPoIB stopped
> working after these messages.
>

Any specifics?

> Using the "vanilla" CentOS 5.3 kernel solved this issue.
> But SOMETIMES, when booting the nodes, these messages appear and don't go away.
>

In those cases, do the nodes successfully boot up?


> Other than these messages, do things seem to work in terms of the end
> nodes?
>
> They seem to work with vanilla kernel. Even with the messages, no problems
> reaching the nodes so far.
>

Do your ULPs work (like IPoIB, etc.)?

-- Hal

> Thanks
>
> Rafael Tinoco
>
>
> -- Hal
>
>
>

From dave.olson at qlogic.com  Wed Aug 26 08:36:49 2009
From: dave.olson at qlogic.com (Dave Olson)
Date: Wed, 26 Aug 2009 08:36:49 -0700
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
In-Reply-To: <1251277761.28564.45.camel@pyren.uio.no>
References: <1251277761.28564.45.camel@pyren.uio.no>
Message-ID: <alpine.LFD.1.10.0908260832340.460@topaz.mv.qlogic.com>

On Wed, 26 Aug 2009, Ole Widar Saastad wrote:

| I am experiencing problems using the Infinipath cards and the OFED
| stack. (details are given below). 
| 
| It seems to be a problem somewhere when the MPI packet size grows above 2k.
| As I recall, this is the changeover point from one transport mechanism to
| another?

The problem is that openmpi releases prior to fairly recent ones had
problems with MTUs that didn't match the config file (Mellanox
as well as InfiniPath).  Since InfiniPath cards default to a 4KB MTU in the
config file, this showed up most on our cards if you were
running with a 2K MTU.  So you either need to fix
	mca-btl-openib-device-params.ini
on all nodes to say 2048, or override it in your local configs or
on the command line.

More recent versions of openmpi have this fixed (1.3.2 for sure,
maybe all 1.3, I don't remember).
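
As a rough illustration of the mismatch, the check below compares an MTU in
bytes (as written in the config file) against the InfiniBand MTU enumeration
reported for a port (1 = 256 up to 5 = 4096). The function names are
hypothetical; this is a sketch of the comparison only and does not parse
mca-btl-openib-device-params.ini.

```c
/*
 * Map the InfiniBand MTU enumeration (1..5) to a size in bytes:
 * 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096.
 * Returns -1 for values outside that range.
 */
static int ib_mtu_code_to_bytes(int code)
{
	if (code < 1 || code > 5)
		return -1;
	return 128 << code;
}

/*
 * Return 1 if the configured MTU (in bytes) matches the port's
 * active MTU code, 0 otherwise.
 */
static int mtu_config_matches(int config_mtu_bytes, int active_mtu_code)
{
	int active_bytes = ib_mtu_code_to_bytes(active_mtu_code);

	return active_bytes > 0 && active_bytes == config_mtu_bytes;
}
```

A config entry of 4096 against a port running with MTU code 4 (2048 bytes)
would be flagged as a mismatch, which is the case described above.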

Dave Olson
dave.olson at qlogic.com


From poknam at gmail.com  Wed Aug 26 08:44:11 2009
From: poknam at gmail.com (PN)
Date: Wed, 26 Aug 2009 23:44:11 +0800
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To: <f0e08f230908260820n2e2a37d0jdefc3e687d0248af@mail.gmail.com>
References: <20090825162517.7955C21C827@f28.poczta.interia.pl> 
	<f0e08f230908251459j2657ef17o1a0b7c5abc836267@mail.gmail.com> 
	<92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com> 
	<f0e08f230908260820n2e2a37d0jdefc3e687d0248af@mail.gmail.com>
Message-ID: <92daa7bf0908260844s7d0d5fat1215283cbc66965e@mail.gmail.com>

2009/8/26 Hal Rosenstock <hal.rosenstock at gmail.com>

>
>
> On 8/25/09, PN <poknam at gmail.com> wrote:
>>
>> HI,
>>
>> I can think of a situation in which all servers have dual port IB cards
>> and need failover of OpenSM to achieve HA.
>> As far as I know, OpenSM can only bind to 1 port at a time,
>
>
> Yes.
>
>  so do I need to start 2 OpenSM in server A and 2 OpenSM in server B?
>
>
> That would be one valid configuration. I'm assuming all ports are connected
> to same subnet.
>

In some cases, I will use IB bonding. In other cases, I may use one
port for computation and the other port to connect to the storage.
I'm not sure which configuration will provide better performance.


> Will they use the same guid2lid file?
>
>
> Depends how the OpenSM configuration is done.
>
>> Do I need to set something in the config file or will they automatically
>> communicate with each other?
>
>
> What communication are you referring to? They all need to share the same
> subnet prefix.
>

I mean the handover mechanism. I remember that in the previous OpenSM config
file (in OFED 1.2.x/1.3.x), there was a field listing all the subnet managers
in the subnet, but this field is omitted in the new version. I wonder whether
all the OpenSM instances will automatically discover each other and perform
the handover correctly.

Thanks.

PN


>
>
>
>> Do I need to run sldd.sh manually or it will automatically sync with other
>> OpenSM?
>
>
> You can manually copy the guid2lid file around to the appropriate
> places. I'm not that familiar with sldd.sh, but I think it can either be run
> manually or made to run automatically; I don't know the details.
>
> -- Hal
>
>
> Thanks a lot.
>>
>> Regards,
>> PN
>>
>>
>>
>>
>> 2009/8/26 Hal Rosenstock <hal.rosenstock at gmail.com>
>>
>>>
>>>
>>>  On 8/25/09, kovlensky at interia.pl <kovlensky at interia.pl> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Quick question - is there a need to run anything except opensmd daemons
>>>> to provide failover capability on the IB network in OFED 1.3?
>>>
>>>
>>> In terms of SM failover, modulo bugs fixed relative to this feature since
>>> OFED 1.3 (there are a couple of things here which may affect your
>>> environment if I recall correctly), you only need to run more than 1 SM for
>>> this (one will become master, the other standby).
>>>
>>>> I'm aware that when the master manager dies the standby one comes in and manages
>>>> the network, but that does not necessarily mean that LIDs are preserved,
>>>> especially for nodes joining in. I used to run sldd.sh for distributing the LID
>>>> list on OFED 1.2.5, but while this script seems to be in place, no one
>>>> mentions the necessity for it.
>>>
>>>
>>> So subnet manager failover is provided by running standby opensm.
>>>
>>>
>>> And how LID preservation is provided?
>>>
>>>
>>> If you want LIDs to be preserved, the guid2lid file needs to be sync'd
>>> (copied from the master SM once it's fully assembled to the node which is
>>> running the standby SM). That's what the sldd.sh script does.
>>>
>>> -- Hal
>>>
>>> Regards,
>>>>
>>>> Zdenek Kovlensky
>>>>
>>
>>
>>
>> --
>> Best Regards,
>> PN Lai
>> HPC Specialist
>> Galactic Computng Corp.
>> Tel: 86-755-26733939 ext 826
>> Mobile: 86-13823161729
>> Fax: 86-755-26733780
>> URL: http://www.galactic.com.hk
>>
>
>


-- 
Best Regards,
PN Lai
HPC Specialist
Galactic Computng Corp.
Tel: 86-755-26733939 ext 826
Mobile: 86-13823161729
Fax: 86-755-26733780
URL: http://www.galactic.com.hk

From hnrose at comcast.net  Wed Aug 26 08:54:47 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 26 Aug 2009 11:54:47 -0400
Subject: [ofa-general] [PATCH] opensm/ib_types.h: Add CounterSelect2 field to
	PortCounters attribute
Message-ID: <20090826155447.GA25235@comcast.net>


Per MgtWG RefID #4527

Also, cosmetic commentary change

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index fe3f051..42ec794 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -4377,8 +4377,8 @@ ib_node_info_get_vendor_id(IN const ib_node_info_t * const p_ni)
 
 #include <complib/cl_packon.h>
 typedef struct _ib_node_desc {
-	// Node String is an array of UTF-8 character that
-	// describes the node in text format
+	// Node String is an array of UTF-8 characters
+	// that describe the node in text format
 	// Note that this string is NOT NULL TERMINATED!
 	uint8_t description[IB_NODE_DESCRIPTION_SIZE];
 } PACK_SUFFIX ib_node_desc_t;
@@ -7737,9 +7737,9 @@ typedef struct _ib_port_counters {
 	ib_net16_t xmit_discards;
 	uint8_t xmit_constraint_err;
 	uint8_t rcv_constraint_err;
-	uint8_t res1;
+	uint8_t counter_select2;
 	uint8_t link_int_buffer_overrun;
-	ib_net16_t res2;
+	ib_net16_t resv;
 	ib_net16_t vl15_dropped;
 	ib_net32_t xmit_data;
 	ib_net32_t rcv_data;


From hnrose at comcast.net  Wed Aug 26 09:12:23 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 26 Aug 2009 12:12:23 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/perfquery.c: Indicate whether
	PortXmitWait counter is supported
Message-ID: <20090826161223.GA30257@comcast.net>


Indicate extended vs. normal port counters in output.
Also, some cosmetic formatting changes and a comment typo fix.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 39ae2f6..0fd083e 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
  * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2009 HNR Consulting.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -277,8 +278,8 @@ static void output_aggregate_perfcounters_ext(ib_portid_t * portid)
 
 	mad_dump_perfcounters_ext(buf, sizeof buf, pc, sizeof pc);
 
-	printf("# Port counters: %s port %d\n%s", portid2str(portid), ALL_PORTS,
-	       buf);
+	printf("# Port extended counters: %s port %d\n%s", portid2str(portid),
+	       ALL_PORTS, buf);
 }
 
 static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
@@ -291,7 +292,8 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
 				   IB_GSI_PORT_COUNTERS, srcport))
 			IBERROR("perfquery");
 		if (!(cap_mask & 0x1000)) {
-			/* if PortCounters:PortXmitWait not suppported clear this counter */
+			/* if PortCounters:PortXmitWait not supported clear this counter */
+			IBWARN("PortXmitWait not indicated so ignore this counter");
 			perf_count.xmtwait = 0;
 			mad_encode_field(pc, IB_PC_XMT_WAIT_F,
 					 &perf_count.xmtwait);
@@ -316,9 +318,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
 						  sizeof pc);
 	}
 
-	if (!aggregate)
-		printf("# Port counters: %s port %d\n%s", portid2str(portid),
-		       port, buf);
+	if (!aggregate) {
+		if (extended)
+			printf("# Port extended counters: %s port %d\n%s",
+			       portid2str(portid), port, buf);
+		else
+			printf("# Port counters: %s port %d\n%s",
+			       portid2str(portid), port, buf);
+	}
 }
 
 static void reset_counters(int extended, int timeout, int mask,
@@ -421,9 +428,8 @@ static int process_opt(void *context, int ch, char *optarg)
 
 int main(int argc, char **argv)
 {
-	int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS,
-		IB_PERFORMANCE_CLASS
-	};
+	int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS,
+			       IB_PERFORMANCE_CLASS};
 	ib_portid_t portid = { 0 };
 	int mask = 0xffff;
 	uint16_t cap_mask;
@@ -553,7 +559,6 @@ int main(int argc, char **argv)
 		goto done;
 
 do_reset:
-
 	if (argc <= 2 && !extended && (cap_mask & 0x1000))
 		mask |= (1 << 16);	/* reset portxmitwait */
 

From weiny2 at llnl.gov  Wed Aug 26 10:29:57 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 26 Aug 2009 10:29:57 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/libibnetdisc: add missing
 '\n' to error message
Message-ID: <20090826102957.bed66987.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 21 Aug 2009 15:01:00 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index c69467e..bbb0fbb 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -615,7 +615,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port,
 				if (get_port_info(ibmad_port, fabric,
 						  &port_buf, i, path)) {
 					IBND_ERROR
-					    ("can't reach node %s port %d",
+					    ("can't reach node %s port %d\n",
 					     portid2str(path), i);
 					continue;
 				}
-- 
1.5.4.5


From weiny2 at llnl.gov  Wed Aug 26 10:31:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 26 Aug 2009 10:31:20 -0700
Subject: [ofa-general] Combined DR path with empty DR path, what is the
	expected behavior?
In-Reply-To: <f0e08f230908260755q24b29657t8149e1aa55224fd0@mail.gmail.com>
References: <20090824185206.39e5e377.weiny2@llnl.gov>
	<f0e08f230908251615x79f2f87cwcba95c0f7e743bfe@mail.gmail.com>
	<20090825175543.4f929646.weiny2@llnl.gov>
	<f0e08f230908260755q24b29657t8149e1aa55224fd0@mail.gmail.com>
Message-ID: <20090826103120.569b5deb.weiny2@llnl.gov>

On Wed, 26 Aug 2009 10:55:41 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On 8/25/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> >
> > On Tue, 25 Aug 2009 19:15:19 -0400
> > Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> >
> > > On 8/24/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> > >

[snip]

> > >
> > >
> > > Not all 4 combinations are supported/known to work. When this was added
> > for
> > > ibportstate, the only combined routing form that was important was LID
> > > routed part followed by a DR part.
> > >
> >
> > When you say "known to work" you mean implemented with the diags?  Or known
> > to
> > work in all hardware?
> 
> 
> The former with most hardware up to some time ago. Note there is no
> compliance testing of combined routing and heavy reliance on this makes some
> a little nervous.

Ok, Good to know.  With this, and the rest of your response, in mind I went
ahead and created a patch to libibnetdisc which will go back to LID routing
when the Hop Count is returned to 0.  Patch to follow.

> 
> >
> > > > On the other hand I think strictly this should be supported.
> > >
> > >
> > > In an ideal world yes but are they all required or is it just the one
> > form
> > > most heavily used ?
> >
> > That is what I am unclear on.  Does the spec require that all 8
> > combinations
> > are required to work?  I don't see a specific compliance which says that
> > and I
> > am not sure if C14-9 and C14-13 cover all 8 combinations.
> 
> 
> I don't think there's any compliance on this. It all appears to be
> informative text. Perhaps a shortcoming of the spec. So there's nothing
> definitive. It just says there are 8 combinations (2**3 as there are 3 parts
> with 2 possibilities in each part) and that only 4 are really useful.

Well I agree that only 4 are "useful".  It is just the algorithm which
libibnetdisc used which resulted in this "weird" case.

[snip]

> 
> >
> > > If so, what's the initial path at this point (or more specifically index
> > 1
> > > of the initial path) ? I think that needs to be port 0 (if a switch) but
> > > this is a little weird as I would think it should be handed to the SMA
> > which
> > > is different cases in the spec.
> >
> > Yes I think I was wrong on the case.  But still wouldn't the SMI detect
> > that
> > this is the end of the DRPath and simply hand it to the SMA.
> 
> 
> Yes, that's what should happen.

I am going to take this up with the switch vendors and see what their
interpretation is.  For the time being I think my patch will fix libibnetdisc
(iblinkinfo).

Thanks again!
Ira

> 
> >
> > >
> > > > Then after processing
> > >
> > >
> > > by the SMA and doing the required returning initialization
> > >
> > > the SMI should return the packet as specified in C14-13
> > > > item 3 on line 9 page 812.
> > >
> > >
> > > I'm not sure it would use this case in the case of an empty DR path on
> > > return.
> >
> > Actually I think it will use this.  C14-9 item 3) states "the Hop Pointer
> > shall be incremented by 1"  Therefore when the response is handed back to
> > the
> > SMI the Hop pointer will be 1 and the hop count 0.  And the SMI uses the
> > DRSLID to send the packet back to the requester.
> 
> 
> It goes up to the SMA and then when the response is to be made it goes
> through returning SMI initialization and handling.
> 
> -- Hal
> 
> >
> > > Am I wrong?  In the end it does not matter as I have to make the software
> > > > work
> > > > for all the hardware I have; so I will change the software.
> > >
> > >
> > > IMO it does matter as to where the problem lies (SMI or otherwise) and
> > how
> > > the layers are comprised in the implementation.
> >
> > Agreed.  I am mainly confused because I have 2 different implementations of
> > this.  My "old" switches seem to handle this case just fine.  My "new"
> > switches do not.  So I am really wondering what is going on.
> >
> > Here is the above output for the same query which works with an "old"
> > switch.
> >
> > 17:28:04 > ./smpquery -e -c portinfo 7 0 1
> > ...
> > trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt
> > 0
> > ...
> >
> > Aug 25 17:46:40 woprjr0 Madeye:sent SMP
> > Aug 25 17:46:40 woprjr0 MAD version....0x1
> > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
> > Aug 25 17:46:40 woprjr0 Class version..0x1
> > Aug 25 17:46:40 woprjr0 Method.........0x1 (Get)
> > Aug 25 17:46:40 woprjr0 Status.........0x00
> > Aug 25 17:46:40 woprjr0 Hop pointer....0x0
> > Aug 25 17:46:40 woprjr0 Hop counter....0x0
> > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
> > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
> > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
> > Aug 25 17:46:40 woprjr0 Mkey...........0x0
> > Aug 25 17:46:40 woprjr0 DR SLID........0x02
> > Aug 25 17:46:40 woprjr0 DR DLID........0xffff
> > Aug 25 17:46:40 woprjr0 Madeye:recv SMP
> > Aug 25 17:46:40 woprjr0 MAD version....0x1
> > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP)
> > Aug 25 17:46:40 woprjr0 Class version..0x1
> > Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response)
> > Aug 25 17:46:40 woprjr0 Status.........0x8000
> > Aug 25 17:46:40 woprjr0 Hop pointer....0x0
> > Aug 25 17:46:40 woprjr0 Hop counter....0x0
> > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de
> > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info)
> > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001
> > Aug 25 17:46:40 woprjr0 Mkey...........0x0
> > Aug 25 17:46:40 woprjr0 DR SLID........0x02
> > Aug 25 17:46:40 woprjr0 DR DLID........0xffff
> >
> > Hop Pointer and Count are both 0 and things work just fine...
> >
> > >
> > > However, I wonder
> > > > where exactly the spec falls on this, because I think it will influence
> > > > where
> > > > the fix resides.  If the spec does not allow this then I think it is
> > fine
> > > > to
> > > > have libibmad return an error since the user specified an invalid
> > combined
> > > > DR
> > > > path.  However, if this should be legal I think libibmad should work
> > around
> > > > the bad hardware out there.
> > >
> > >
> > > Is it hardware or firmware that needs fixing ? I think it may depend on
> > the
> > > specific workaround for this as to whether it is acceptable as it might
> > harm
> > > something else or might violate the spec.
> >
> > I agree, however, if the switch hardware needs fixing I fear it is too late
> > for the ones I have.  Firmware might be upgradable although I have had
> > issues
> > with un-managed switches in the past.
> >
> > So where do we put the fix in software?
> >
> > Ira
> >
> > > -- Hal
> > >
> > >
> > > Thoughts?
> > > > Ira
> > > >
> > > > --
> > > > Ira Weiny
> > > > Math Programmer/Computer Scientist
> > > > Lawrence Livermore National Lab
> > > > 925-423-8008
> > > > weiny2 at llnl.gov
> > >
> >
> >
> > --
> > Ira Weiny
> > Math Programmer/Computer Scientist
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov
> >
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From weiny2 at llnl.gov  Wed Aug 26 10:31:42 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 26 Aug 2009 10:31:42 -0700
Subject: [ofa-general] [PATCH] libibnetdisc: add retract_dpath function
Message-ID: <20090826103142.660ac83b.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 26 Aug 2009 09:25:00 -0700
Subject: [PATCH] libibnetdisc: add retract_dpath function

	When using combined routing some switches do not handle Hop Count of 0
	well.  Detect when the drpath count is 0 and return to lid based
	routing in this case.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index c69467e..da8572c 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -175,6 +175,16 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
+static void retract_dpath(ib_portid_t * path)
+{
+	path->drpath.cnt--;	/* restore path */
+	if (path->drpath.cnt == 0 && path->lid) {
+		/* return to lid based routing on this path */
+		path->drpath.drslid = 0;
+		path->drpath.drdlid = 0;
+	}
+}
+
 static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
@@ -502,7 +512,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
 		IBND_ERROR("Query remote node (%s) failed, skipping port\n",
 			   portid2str(path));
-		path->drpath.cnt--;	/* restore path */
+		retract_dpath(path);
 		return 1;	/* positive == non-fatal error */
 	}
 
@@ -530,7 +540,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	link_ports(node, port, remotenode, remoteport);
 
 error:
-	path->drpath.cnt--;	/* restore path */
+	retract_dpath(path);
 	return (rc);
 }
 
-- 
1.5.4.5
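
For readers following the logic rather than the diff, here is a minimal
standalone sketch of what retract_dpath does.  The toy_* structs are
hypothetical stand-ins that carry only the fields touched in the patch; they
are not the real ib_portid_t / ib_dr_path_t definitions from libibmad.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the libibmad types; only the fields the
 * patch touches are modelled here. */
struct toy_drpath {
	int cnt;		/* number of directed-route hops */
	uint16_t drslid;
	uint16_t drdlid;
};

struct toy_portid {
	int lid;		/* non-zero when a LID-routed part exists */
	struct toy_drpath drpath;
};

/* Drop the last hop.  If that leaves an empty DR path on top of a LID,
 * clear the DR LIDs so the path is treated as plain LID routed again. */
void toy_retract_dpath(struct toy_portid *path)
{
	path->drpath.cnt--;	/* restore path */
	if (path->drpath.cnt == 0 && path->lid) {
		/* return to lid based routing on this path */
		path->drpath.drslid = 0;
		path->drpath.drdlid = 0;
	}
}
```

With lid 7 and a one-hop DR part, the call leaves cnt at 0 and both DR LIDs
cleared.  With lid 0 (pure directed route) only cnt changes.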


From jgunthorpe at obsidianresearch.com  Wed Aug 26 11:04:57 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 26 Aug 2009 12:04:57 -0600
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <4A94FB67.6050600@voltaire.com>
References: <20090821000431.GA5713@obsidianresearch.com>
	<4A94FB67.6050600@voltaire.com>
Message-ID: <20090826180457.GR406@obsidianresearch.com>

On Wed, Aug 26, 2009 at 12:07:51PM +0300, Or Gerlitz wrote:

> isn't Jason's approach enough for the bonding case?! I saw that your  
> patch ("bonding: clean muticast addresses when device changes type"

I think working versions of all three patches are required:
 1) Fix the bonding driver. Otherwise the right groups might not be
    joined.
 2) Check the address format, to protect against 'ip maddr add' and
    other wackiness
 3) Fix the timeout handling, so mlid exhaustion and other SA side
    errors are handled elegantly.

All are bugs..

> and maybe also in mainline .31-rcX . However, it has the  
> down-side-effect of e.g. losing routes already set for the bond
> while adding the underlying IPoIB devices, so if Jason's patch is
> enough

Is this true? That is pretty ugly, but probably manageable..

-- 
Jason Gunthorpe <jgunthorpe at obsidianresearch.com>        (780)4406067x832
Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada


From ralph.campbell at qlogic.com  Wed Aug 26 12:01:27 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 26 Aug 2009 12:01:27 -0700
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
In-Reply-To: <1251277761.28564.45.camel@pyren.uio.no>
References: <1251277761.28564.45.camel@pyren.uio.no>
Message-ID: <1251313287.3535.237.camel@chromite.mv.qlogic.com>

Is your switch configured for 4K MTU?
The default openmpi parameter for QLogic is to use a 4K MTU.
Try using a 2K MTU with:
"mpirun -mca btl_openib_mtu=4 ..." and see if that works.


On Wed, 2009-08-26 at 02:09 -0700, Ole Widar Saastad wrote:
> I am experiencing problems using the Infinipath cards and the OFED
> stack. (details are given below). 
> 
> It seems to be a problem somewhere when mpi packet size grows above 2k.
> As I recall, this is the changeover from one transport mechanism to
> another?
> 
> The test is easy to run and to test, it is just a bandwidth program :
> (I got far better latency using the Pathscale stack than the OFED. Is this
> something that will be looked into in the newer releases?).
> 
> The two nodes in the nodes.txt file are compute-1-0 and compute-1-1. They are
> connected to a SilverStorm switch.
> 
> [olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o
> Resolution (usec): 2.145767
> Benchmark ping-pong
> ===================
>         lenght     iterations   elapsed time  transfer rate        latency
>        (bytes)        (count)      (seconds)     (Mbytes/s)         (usec)
> --------------------------------------------------------------------------
>              0          10046          0.121          0.000          6.011
>              1          10261          0.124          0.166          6.026
> <cut a few lines>
>           1024           7695          0.140        112.615          9.093
>           1536           6260          0.133        144.469         10.632
>           2048           5275          0.128        168.420         12.160
> [0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.  
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> 
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10).  The actual timeout value used is calculated as:
> 
>      4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> --------------------------------------------------------------------------
> mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated). 
> [olews at login-0-2 bandwidth]$ 
> 
> 
> Background information :
> 
> 
> 07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02)
>         Subsystem: QLogic, Corp. InfiniPath PE-800
>         Flags: bus master, fast devsel, latency 0, IRQ 66
>         Memory at fde00000 (64-bit, non-prefetchable) [size=2M]
>         Capabilities: [40] Power Management version 2
>         Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
>         Capabilities: [70] Express Endpoint IRQ 0
> 
> compute-1-0.local# uname -a
> Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> compute-1-0.local# 
> 
> 
> compute-1-0.local# rpm -qa| grep ofed
> libibverbs-utils-1.1.2-1.ofed1.4.2
> librdmacm-utils-1.0.8-1.ofed1.4.2
> libcxgb3-1.2.2-1.ofed1.4.2
> ofed-scripts-1.4.2-0
> libmlx4-1.0-1.ofed1.4.2
> libibverbs-devel-1.1.2-1.ofed1.4.2
> ofed-docs-1.4.2-0
> ibvexdmtools-0.0.1-1.ofed1.4.2
> libmthca-1.0.5-1.ofed1.4.2
> libipathverbs-1.1-1.ofed1.4.2
> mstflint-1.4-1.ofed1.4.2
> libibumad-1.2.3_20090314-1.ofed1.4.2
> libnes-0.6-1.ofed1.4.2
> libibcommon-1.1.2_20090314-1.ofed1.4.2
> libibverbs-1.1.2-1.ofed1.4.2
> librdmacm-1.0.8-1.ofed1.4.2
> qlgc_vnic_daemon-0.0.1-1.ofed1.4.2
> compute-1-0.local# 
> 
> OpenMPI is :
> openmpi-1.2.8 compiled for gcc.
> 


From ralph.campbell at qlogic.com  Wed Aug 26 12:06:37 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 26 Aug 2009 12:06:37 -0700
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
In-Reply-To: <1251313287.3535.237.camel@chromite.mv.qlogic.com>
References: <1251277761.28564.45.camel@pyren.uio.no>
	<1251313287.3535.237.camel@chromite.mv.qlogic.com>
Message-ID: <1251313597.3535.239.camel@chromite.mv.qlogic.com>

Sorry, I meant "mpirun -mca btl_openib_mtu 4 ..." (no equal).

On Wed, 2009-08-26 at 12:01 -0700, Ralph Campbell wrote:
> Is your switch configured for 4K MTU?
> The default openmpi parameter for QLogic is to use a 4K MTU.
> Try using a 2K MTU with:
> "mpirun -mca btl_openib_mtu=4 ..." and see if that works.
> 
> 


From weiny2 at llnl.gov  Wed Aug 26 16:40:26 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 26 Aug 2009 16:40:26 -0700
Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5]
 infiniband-diags/libibnetdisc: Introduce a context object.)
In-Reply-To: <20090823120609.GG9547@me>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
	<20090823120609.GG9547@me>
Message-ID: <20090826164026.8dcce4b2.weiny2@llnl.gov>

On Sun, 23 Aug 2009 15:06:09 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 08:30 Mon 17 Aug     , Ira Weiny wrote:
> > 
> > The immediate benefit is coming with the multi-threaded implementation where
> > I plan on adding the following function.[*]
> 
> Ok, but could we discuss first how will multithreading architecture be

Of course!  :-)  But first I would like to mention some numbers from the
prototype code I have.  When running on a small fabric the additional overhead
of thread creation actually slows down the scan.  :-(

Current master:         Threaded version:
real    0m0.101s         0m0.266s
user    0m0.000s         0m0.000s
sys     0m0.011s         0m0.014s


But, as expected, on a large system (1152 nodes) there is a decent speed up.

Current Master:         Threaded version:
real    0m3.046s         0m1.748s
user    0m0.073s         0m0.331s
sys     0m0.158s         0m0.822s

However, the biggest speed up comes when there are errors on the fabric.  This
is the same 1152 node cluster with just 14 "bad" ports on the fabric.  This is
of course because the scan continues "around" the bad ports.

Current Master:         Threaded version:
real    0m33.051s        0m5.609s
user    0m0.071s         0m0.353s
sys     0m0.156s         0m1.113s

Since you are usually running these tools when things are bad I think there is
a big gain here.  Even running with a faster timeout of 200ms results in a big
difference.

Current Master:        Threaded version:
real    0m9.149s        0m2.223s
user    0m0.016s        0m0.374s
sys     0m0.372s        0m1.056s

With that in mind...

> implemented with libibnetdisc: goals (in particular is it support for
> multithreaded apps or just multithreaded discovery function), interaction
> with caller application, etc.?

My initial goal was to make libibnetdisc safe for multithreaded apps and
make a multithreaded discovery function.  However, since libibmad itself is
not thread safe, and you expressed a desire to keep it that way[*], I reduced
that goal to just making the discovery function multithreaded (using
mad_[send|receive]_via).

Although I don't like this restriction I can see it as a valid design decision
as long as it is documented that the discover function is not thread safe in
regards to the ibmad_port object.  This is because the ibnd_discover_fabric
uses libibmad calls and would require a complicated API to allow the user app
to synchronize with those calls.

In order to make things thread safe for the user apps as well as the library I
can see 3 options.

   1) make libibmad thread safe (which you were hesitant to do)

   2) add a thread safe interface to libibmad.  User apps will need to know to
      use this interface while using libibnetdisc and libibnetdisc will use
      this interface.

   3) Create a wrapper lib which is thread safe.  In this case the apps and
      libibnetdisc would call into this wrapper lib and we would have to
      change the API to libibnetdisc.

Right now I have the multithreaded discover code separated out somewhat.  I
think it would not be hard to extract the multithreaded parts and either
create the wrapper lib or extend libibmad with thread safe calls.

That said, I personally do not like option 2.  I think it further complicates
an already overly complex API in libibmad.  As far as option 1 vs 3 I can see
arguments for and against each.  1 makes things very nice because it would be
taken care of for all apps currently using libibmad.  On the down side it
would add some overhead for single threaded apps.  Although I do not believe
too much.[$]

The downside of 3 is that to be done correctly it would change the
libibnetdisc API and apps which use it.

> 
> One of the desired feature of this I could think would be to keep API
> simple for single threaded stuff.

Agreed.  I don't think the API is going to get too complicated.  A big reason
for adding the context is to allow the API to be flexible without breaking
things.

Ira

[*] http://lists.openfabrics.org/pipermail/general/2009-July/060677.html

   "madrpc() is too primitive interface for such applications. There would be
   better to use umad_send/recv() directly or may be mad_send_via().  Example
   is mcast_storm.c distributed with ibsim."

[$] It is my opinion that mad_rpc is _not_ primitive.  In my mind it _is_
   the wrapper around the primitive umad_send/recv calls.  If you are
   interested perhaps I can try to explain what I wanted to do in the library
   to make it thread safe more clearly.  The point I might not have made clear
   was that I don't think the library will have to do any threading on its
   own, just some locks and storing of responses.  Of course the down side to
   this is the libibmad code would be slightly slower.  But I don't think by
   very much.
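
To make the "just some locks and storing of responses" idea in [$] a bit more
concrete, here is a minimal sketch.  Everything in it is hypothetical:
toy_rpc_unsafe stands in for a non-thread-safe call that returns its reply in
shared storage, and toy_rpc_locked is an illustrative wrapper, not a proposed
libibmad API.  It only shows the pattern of serializing the call and copying
the reply out before the lock is released.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a non-thread-safe RPC call: the reply lives
 * in one shared static buffer, so concurrent callers would race on it. */
static char toy_reply[64];

static const char *toy_rpc_unsafe(const char *request)
{
	snprintf(toy_reply, sizeof(toy_reply), "reply:%s", request);
	return toy_reply;
}

static pthread_mutex_t toy_rpc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe wrapper: serialize the call and store the reply in
 * caller-owned memory before dropping the lock.
 * Returns 0 on success, -1 if 'out' has no room at all. */
int toy_rpc_locked(const char *request, char *out, size_t outlen)
{
	if (outlen == 0)
		return -1;

	pthread_mutex_lock(&toy_rpc_lock);
	snprintf(out, outlen, "%s", toy_rpc_unsafe(request));
	pthread_mutex_unlock(&toy_rpc_lock);
	return 0;
}
```

The cost for a single threaded caller is one uncontended lock/unlock pair and
one extra copy per call, which is the overhead referred to above.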


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From jgunthorpe at obsidianresearch.com  Wed Aug 26 17:24:20 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 26 Aug 2009 18:24:20 -0600
Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5]
	infiniband-diags/libibnetdisc: Introduce a context object.)
In-Reply-To: <20090826164026.8dcce4b2.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
	<20090823120609.GG9547@me>
	<20090826164026.8dcce4b2.weiny2@llnl.gov>
Message-ID: <20090827002420.GT406@obsidianresearch.com>

On Wed, Aug 26, 2009 at 04:40:26PM -0700, Ira Weiny wrote:

> Of course!  :-)  But first I would like to mention some numbers from the
> prototype code I have.  When running on a small fabric the additional overhead
> of thread creation actually slows down the scan.  :-(

It seems strange to me to thread something like this (and a lot of hard
work)..

FSM multiplexing the recv path usually gives much better performance,
something like net discovery is quite easy..

main loop:
 fill tx queue from next list
 receive replies and correlate with next list

each entry:
 add to next list additional ports

Repeat until dead.

Where a 'next list' would be a set of actions along the lines of
'query node' or 'query port' the action on a 'query node' completion
is to generate 'query port' next list items for all the ports, and on
'query port' completion is to generate 'query node' items for all
enabled ports..

libumad is nonblocking, parallel, etc...

Jason
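
The loop described above can be sketched roughly as follows.  This is only an
illustration under loud assumptions: the fabric is a hard-coded toy table
(toy_peer), no MADs are sent, and each action "completes" immediately, whereas
a real implementation would keep several sends outstanding through libumad
and match replies by transaction ID.  All names are hypothetical.

```c
#include <stddef.h>

/* Toy fabric: toy_peer[n][p] is the node reached through port p of
 * node n, or -1 if the port is down.  Purely illustrative data. */
#define TOY_NODES 4
#define TOY_PORTS 3

static const int toy_peer[TOY_NODES][TOY_PORTS] = {
	{  1,  2, -1 },	/* node 0 */
	{  0,  3, -1 },	/* node 1 */
	{  0,  3, -1 },	/* node 2 */
	{  1,  2, -1 },	/* node 3 */
};

enum action_type { QUERY_NODE, QUERY_PORT };

struct action {
	enum action_type type;
	int node;
	int port;	/* only meaningful for QUERY_PORT */
};

/* Each node is queried once and each port once, which bounds the list. */
#define MAX_ACTIONS (TOY_NODES + TOY_NODES * TOY_PORTS)

/* Returns the number of nodes discovered from 'start' (0..TOY_NODES-1). */
int toy_discover(int start)
{
	struct action next[MAX_ACTIONS];	/* the 'next list' */
	int seen[TOY_NODES] = { 0 };
	size_t head = 0, tail = 0;
	int found = 0;

	seen[start] = 1;
	next[tail++] = (struct action){ QUERY_NODE, start, 0 };

	/* main loop: take the next action, complete it, queue follow-ups */
	while (head < tail) {
		struct action a = next[head++];
		int p, peer;

		if (a.type == QUERY_NODE) {
			found++;
			/* node reply: queue a port query for every port */
			for (p = 0; p < TOY_PORTS; p++)
				next[tail++] =
				    (struct action){ QUERY_PORT, a.node, p };
		} else {
			/* port reply: queue a node query for a new live peer */
			peer = toy_peer[a.node][a.port];
			if (peer >= 0 && !seen[peer]) {
				seen[peer] = 1;
				next[tail++] =
				    (struct action){ QUERY_NODE, peer, 0 };
			}
		}
	}
	return found;
}
```

A port that never answers would simply produce no follow-up actions, so the
rest of the list keeps draining around it.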


From FENKES at de.ibm.com  Thu Aug 27 02:44:30 2009
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Thu, 27 Aug 2009 11:44:30 +0200
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect
	replies from request MAD
In-Reply-To: <f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>
References: <200908261337.56128.fenkes@de.ibm.com>
	<f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>
Message-ID: <OFF2257478.FF0FEABE-ONC125761F.00344884-C125761F.00358310@de.ibm.com>

Hal Rosenstock <hal.rosenstock at gmail.com> wrote on 26.08.2009 17:15:03:

> Thanks for doing this. It looks sane to me. The only issue I recall that
> appears to be remaining is a better setting of ClassPortInfo:RespTimeValue
> rather than hardcoding. Perhaps using the value from PortInfo is the way to go
> (ideally it would be that value from the port to which the requester is
> being redirected, but that might not be so easy to get from this port).

I don't think that effort will be necessary or even legal. The requestor 
will react to the redirection with another Get(ClassPortInfo) to the 
redirection target, which will reply with its own RespTimeValue, so our 
driver should speak for itself. Since we don't know when our MAD 
processing and sending of the response is going to be scheduled (we're not 
running on real-time constraints here), we play it safe and return 18, 
which amounts to roughly a second. 

Make sense?

Regards
  Joachim
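
For reference, the "18 amounts to roughly a second" arithmetic can be checked
with a few lines of C.  The sketch assumes the usual InfiniBand encoding of
4.096 usec * 2^value and a 5-bit field; the function name is made up for
illustration and is not part of any driver or library.

```c
/* Convert an InfiniBand-style timeout exponent into microseconds:
 * time = 4.096 usec * 2^exponent.  The value is assumed to be a 5-bit
 * field, so anything above 31 is masked off. */
double ib_timeout_usec(unsigned int exponent)
{
	return 4.096 * (double)(1ULL << (exponent & 0x1f));
}
```

ib_timeout_usec(18) gives 1073741.824 usec, i.e. about 1.07 seconds.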


From vlad at lists.openfabrics.org  Thu Aug 27 03:05:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 27 Aug 2009 03:05:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090827-0200 daily build status
Message-ID: <20090827100516.402E4E30266@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From aaron.knister at gmail.com  Thu Aug 27 05:30:52 2009
From: aaron.knister at gmail.com (Aaron Knister)
Date: Thu, 27 Aug 2009 08:30:52 -0400
Subject: [ofa-general] IPoIB connected vs datagram
Message-ID: <4A967C7C.7080509@gmail.com>

Hi!

I'm having some strange problems on an InfiniBand fabric at work. We
have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
IB switch. There are also several Sun "thumpers" running Solaris that
are connected to the InfiniBand fabric; however, their HCAs are only
SDR. There are several 20-odd-terabyte NFS mounts exported from the
thumpers and mounted on the compute nodes over IPoIB (we're not using
NFS RDMA). OpenSM is running on the head node and all of the compute
nodes for redundancy's sake. Things were running OK until yesterday, when
a user crashed the head node by sucking up all of its memory; at the
time, the head node's subnet manager was in the master state. A
different node quickly picked up subnet management until the head node
was rebooted, at which point the head node became the subnet master again.

Since logging back in to the cluster after rebooting the head node, the
NFS mounts from the thumpers have been hanging periodically all over the
place. I know that two of the thumpers and their NFS exports are being
hit with an aggregate of about 120MB/s of NFS traffic from about 30 or
so compute nodes, so I'm sure that's not helping things. However, one of
the other thumpers, which has no active jobs hitting its exports,
periodically shows NFS server "not responding" messages on the
clients/compute nodes. I checked the log files for the past week: these
"NFS server not responding" messages all started after the head node crash
yesterday. From what I've been told, every time this happens the only
fix is to reboot the switch.

Of course, any general debugging suggestions would be appreciated, but I
have a few specific questions regarding IPoIB and connected vs datagram.
All of the compute nodes and the head node (running OFED 1.4) are using
"connected mode" for IPoIB:

[root at headnode ~]# cat /sys/class/net/ib0/mode
connected

and the mtu of the interface is 65520
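
The same two values can be collected in one step on any Linux node. A small
sketch, assuming the interface is named ib0 and that the IPoIB driver exposes
the "mode" file shown above next to the standard "mtu" file (the function
name and the sysfs_root parameter are only for illustration):

```python
from pathlib import Path


def ipoib_settings(ifname: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read the IPoIB mode and interface MTU for one interface from sysfs."""
    base = Path(sysfs_root) / ifname
    mode = (base / "mode").read_text().strip()      # "connected" or "datagram"
    mtu = int((base / "mtu").read_text().strip())   # e.g. 65520 or 2044
    return {"interface": ifname, "mode": mode, "mtu": mtu}


if __name__ == "__main__":
    try:
        print(ipoib_settings("ib0"))
    except OSError as exc:
        # No ib0 on this machine, or the files are not readable.
        print(f"could not read IPoIB settings: {exc}")
```

This only reports the local node; it does not answer the question of how to
see the remote (Solaris) side's settings.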

I don't know how to determine whether the Solaris systems (the thumpers)
are using connected mode, but their MTUs are 2044, which leads me to believe
they're probably not. I cannot log into these machines as I don't manage
them, but is there a way to determine the IPoIB MTU using an ib*
utility? Or am I misunderstanding IPoIB, and such information wouldn't
be useful?

And lastly, I recall that with TCP over Ethernet, if you have the MTU
set to, say, 9000 and try to sling data to a box with an MTU of 1500, you
get some weird performance hits. Is it likely that the compute nodes' use
of the larger MTU + connected mode, paired with the thumpers' much smaller
MTU + probably datagram mode, could be causing timeouts under heavy load?
Does anybody think that setting the compute/head nodes to datagram mode
and subsequently dropping the MTU to 2044 would help my situation?

Again, any suggestions are greatly appreciated, and thanks in advance 
for any replies!

-Aaron


From hal.rosenstock at gmail.com  Thu Aug 27 06:31:40 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 27 Aug 2009 09:31:40 -0400
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect
	replies from request MAD
In-Reply-To: <OFF2257478.FF0FEABE-ONC125761F.00344884-C125761F.00358310@de.ibm.com>
References: <200908261337.56128.fenkes@de.ibm.com>
	<f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>
	<OFF2257478.FF0FEABE-ONC125761F.00344884-C125761F.00358310@de.ibm.com>
Message-ID: <f0e08f230908270631j3e159f3fgb0034eb41acdac7b@mail.gmail.com>

On 8/27/09, Joachim Fenkes <FENKES at de.ibm.com> wrote:
>
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote on 26.08.2009 17:15:03:
>
> > Thanks for doing this. It looks sane to me. The only issue I recall that
> > appears to be remaining is a better setting of
> > ClassPortInfo:RespTimeValue rather than hardcoding. Perhaps using the
> > value from PortInfo is the way to go (ideally it would be that value from
> > the port to which the requester is being redirected, but that might not
> > be so easy to get from this port).
>
> I don't think that effort will be necessary or even legal. The requestor
> will react to the redirection with another Get(ClassPortInfo) to the
> redirection target, which will reply with its own RespTimeValue, so our
> driver should speak for itself.


I overreached with my comment on how this works.

> Since we don't know when our MAD
> processing and sending of the response is going to be scheduled (we're not
> running on real-time constraints here), we play it safe and return 18,
> which amounts to roughly a second.
>
> Make sense?


I don't think it should be hard coded. IMO it would be better to default to
18 and make it adjustable somehow (via a (dynamic) module parameter?).

-- Hal


> Regards
> Joachim
>

From niftyompi at niftyegg.com  Thu Aug 27 08:26:34 2009
From: niftyompi at niftyegg.com (Nifty Tom Mitchell)
Date: Thu, 27 Aug 2009 08:26:34 -0700
Subject: [ofa-general] IPoIB connected vs datagram
In-Reply-To: <4A967C7C.7080509@gmail.com>
References: <4A967C7C.7080509@gmail.com>
Message-ID: <20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net>

Look at the MTU choices again.
With InfiniBand the "true" MTU is fixed at 2K (or 4K) and is often limited
to 2K by the switch firmware.  Larger MTUs are thus synthetic and force
software to assemble and disassemble the transfers.  On a fabric the large
MTU for IPoIB works well because the fabric is quite reliable.  When data
is routed to another network with a smaller MTU, software needs to assemble
and disassemble the fragments.  Fragmentation can be expensive.  Dropped
bits combined with fragmentation are a major performance hit.  Normal MTU
discovery should make fragmentation go away.

Ethernet jumbo packets (larger than 1500) are real on the wire.
This is not the case on IB where the MTU is fixed.

Is the NFS over UDP or TCP?
What are the NFS read/write sizes set to?

Double check routes (traceroute).  Dynamic routes and mixed MTUs are a tangle.
The minimum MTU for a route can be discovered with ping and the do-not-fragment flag
as long as ICMP packets are not filtered.
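
A rough sketch of that probe, assuming Linux iputils ping (where "-M do"
sets the do-not-fragment behaviour and "-s" gives the ICMP payload size) and
plain IPv4 without options.  The host name "nfs-server" is a placeholder and
the helper name is made up; the payload is the MTU minus 20 bytes of IPv4
header and 8 bytes of ICMP header:

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo header


def df_ping_command(host: str, mtu: int) -> list:
    """Build a ping command whose packet exactly fills `mtu` with DF set."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU too small for an IPv4 ICMP echo")
    return ["ping", "-c", "1", "-M", "do", "-s", str(payload), host]


# "nfs-server" is a placeholder host name, not one from this thread.
for mtu in (65520, 2044):
    print(mtu, " ".join(df_ping_command("nfs-server", mtu)))
```

If the probe at the larger size fails while the one at 2044 succeeds, the
path MTU lies somewhere between the two.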

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?


From aaron.knister at gmail.com  Thu Aug 27 08:41:40 2009
From: aaron.knister at gmail.com (Aaron Knister)
Date: Thu, 27 Aug 2009 11:41:40 -0400
Subject: [ofa-general] IPoIB connected vs datagram
In-Reply-To: <20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net>
References: <4A967C7C.7080509@gmail.com>
	<20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net>
Message-ID: <eafd71280908270841n1bd55dcai5035408f71a6ca0b@mail.gmail.com>

Thanks for the reply!

Good to know about the "true" MTU vs the synthetic MTU. I wasn't aware of
that.

The NFS is NFS over TCP and the read/write sizes are both set to 32768.

I don't have any routes that I know of on the IB fabric; a traceroute seemed
to verify this. I used tracepath to show me the MTU information between the
two hosts. On the second attempt it looks like it "discovered" the correct
MTU:

[root at headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)                       0.133ms pmtu 65520
 1:  thumper1-ib (10.0.1.245)                0.161ms reached
     Resume: pmtu 2044 hops 1 back 1
[root at headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)                       0.122ms pmtu 2044
 1:  thumper1-ib (10.0.1.245)                0.121ms reached
     Resume: pmtu 2044 hops 1 back 1
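
For anyone wanting to check this across many nodes, the final path MTU can
be pulled out of the "Resume:" line. A small sketch using the first run
above as sample input (the function name is only for illustration, and the
regular expression assumes the output format shown here):

```python
import re

# First tracepath run from above, used as sample input.
SAMPLE = """\
 1:  headnode (10.0.1.1)                       0.133ms pmtu 65520
 1:  thumper1-ib (10.0.1.245)                0.161ms reached
     Resume: pmtu 2044 hops 1 back 1
"""


def resume_pmtu(tracepath_output: str):
    """Return the path MTU from the 'Resume:' line, or None if absent."""
    match = re.search(r"Resume:\s+pmtu\s+(\d+)", tracepath_output)
    return int(match.group(1)) if match else None


print(resume_pmtu(SAMPLE))  # 2044
```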

We rebooted the InfiniBand switch, which cleared up the NFS issues for now.
The one thing I noticed after the reboot was that the Solaris storage servers
were back in the multicast group (saquery -m). It's definitely an odd
situation...

Thanks again for your help


From monis at Voltaire.COM  Thu Aug 27 08:52:46 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 27 Aug 2009 18:52:46 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090826180457.GR406@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>	<4A94FB67.6050600@voltaire.com>
	<20090826180457.GR406@obsidianresearch.com>
Message-ID: <4A96ABCE.2030204@Voltaire.COM>

Jason Gunthorpe wrote:
> On Wed, Aug 26, 2009 at 12:07:51PM +0300, Or Gerlitz wrote:
> 
>> isn't Jason's approach enough for the bonding case?! I saw that your  
>> patch ("bonding: clean muticast addresses when device changes type"
> 
> I think working versions of all three patches are required:
>  1) Fix the bonding driver. Otherwise the right groups might not be
>     joined.
>  2) Check the address format, to protect against 'ip maddr add' and
>     other wackiness
>  3) Fix the timeout handling, so mlid exhaustion and other SA side
>     errors are handled elegantly.
> 
> All are bugs..
> 
>> and maybe also in mainline .31-rcX. However, it has the
>> down-side-effect of e.g. losing routes already set for the bond
>> while adding the underlying IPoIB devices, so if Jason's patch is
>> enough
> 
> Is this true? That is pretty ugly, but probably manageable..
> 
Yes, it's true, but I'm not sure it's ugly. Changing the device type is not a
common event and requires a device ops change, which I think is better done
while the device is closed. Unfortunately, losing routes is a side effect of
closing the device, but it might be necessary.


From weiny2 at llnl.gov  Thu Aug 27 09:48:10 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 27 Aug 2009 09:48:10 -0700
Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5]
	infiniband-diags/libibnetdisc: Introduce a context object.)
In-Reply-To: <20090827002420.GT406@obsidianresearch.com>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
	<20090823120609.GG9547@me>
	<20090826164026.8dcce4b2.weiny2@llnl.gov>
	<20090827002420.GT406@obsidianresearch.com>
Message-ID: <20090827094810.6cfe02f5.weiny2@llnl.gov>

On Wed, 26 Aug 2009 18:24:20 -0600
Jason Gunthorpe <jgunthorpe at obsidianresearch.com> wrote:

> On Wed, Aug 26, 2009 at 04:40:26PM -0700, Ira Weiny wrote:
> 
> > Of course!  :-)  But first I would like to mention some numbers from the
> > prototype code I have.  When running on a small fabric the additional overhead
> > of thread creation actually slows down the scan.  :-(
> 
> It seems strange to me to thread something like this (and a lot of hard
> work)..
> 
> FSM multiplexing the recv path usually gives much better performance,
> something like net discovery is quite easy..

Using the original algorithm and data structures lent itself to threading.
Now that I am neck deep in all this I have thought that rewriting it all might
be easier.

> main loop:
>  fill tx queue from next list
>  receive replies and correlate with next list

This would still need additional code (or additional synchronization in the
API to libibnetdisc) if you wanted a user app to be multi-threaded.  Someone
has to be in charge of receiving all replies on that ibmad_port object and
handing them to the proper owner.  Of course one could open multiple
ibmad_port objects but how is the app writer to know to do that?  Digging
through the code to find out that libibnetdisc is consuming all the replies?

This is what got me on this in the first place.  smp_query_via (_do_madrpc) is
not thread safe.  Threading was the easy way to deal with multiple blocking
queries on the fabric.  Changing _do_madrpc to be thread safe allowed a very
quick multithreaded implementation on top of the current algorithm which
blocked on multiple queries.  I did not have to form the queries myself, it
was easy...  (I had that working months ago.)  Given that we don't want to
change libibmad things got more complicated and your algorithm seems much
better... (except [see below])

Also, I feel that someone down the road might fall into the same trap that I
did, thinking that smp_query_via is thread safe, and I would like to fix that.

> 
> each entry:
>  add to next list additional ports
> 
> Repeat until dead.
> 
> Where a 'next list' would be a set of actions along the lines of
> 'query node' or 'query port' the action on a 'query node' completion
> is to generate 'query port' next list items for all the ports, and on
> 'query port' completion is to generate 'query node' items for all
> enabled ports..
> 
> libumad is nonblocking, parallel, etc...

Yes, and libibmad layers on top of it an easier interface to issue common
queries.  Why should we ask the user to re-implement that code?

For example, mad_rpc now handles redirection.  My implementation does not yet.
So now I have to handle that on my own as well...  :-(
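
To make the "next list" loop quoted above concrete, here is a toy sketch of
it. Everything in it is an assumption for illustration: the fabric is a
hard-coded dictionary rather than real MAD traffic, and none of the names
come from libibnetdisc, libibmad or libibumad.

```python
from collections import deque

# Hypothetical toy fabric: node name -> {port number: peer node or None}.
FABRIC = {
    "switch": {1: "hca-a", 2: "hca-b", 3: None},
    "hca-a": {1: "switch"},
    "hca-b": {1: "switch"},
}


def discover(fabric, start, max_outstanding=4):
    """Walk the fabric with a 'next list' of query actions, no threads."""
    next_list = deque([("query_node", start)])
    in_flight = deque()          # stands in for queries posted to the wire
    seen_nodes = {start}
    links = set()

    while next_list or in_flight:
        # Fill the tx queue from the next list.
        while next_list and len(in_flight) < max_outstanding:
            in_flight.append(next_list.popleft())

        # Receive one reply and correlate it with the action that caused it.
        action = in_flight.popleft()
        if action[0] == "query_node":
            _, node = action
            # Node reply: schedule a port query for every port it reports.
            for port in sorted(fabric[node]):
                next_list.append(("query_port", node, port))
        else:
            _, node, port = action
            peer = fabric[node][port]
            if peer is None:
                continue         # port is down or unconnected
            links.add((node, port, peer))
            if peer not in seen_nodes:
                seen_nodes.add(peer)
                next_list.append(("query_node", peer))

    return seen_nodes, links


nodes, links = discover(FABRIC, "switch")
print(sorted(nodes))
print(sorted(links))
```

The max_outstanding window is what bounds the number of queries on the wire
at once; a real implementation would also need timeouts and retries, which
this sketch leaves out.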

Ira

> 
> Jason


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


From jgunthorpe at obsidianresearch.com  Thu Aug 27 11:20:56 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 27 Aug 2009 12:20:56 -0600
Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5]
	infiniband-diags/libibnetdisc: Introduce a context object.)
In-Reply-To: <20090827094810.6cfe02f5.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
	<20090823120609.GG9547@me>
	<20090826164026.8dcce4b2.weiny2@llnl.gov>
	<20090827002420.GT406@obsidianresearch.com>
	<20090827094810.6cfe02f5.weiny2@llnl.gov>
Message-ID: <20090827182056.GV406@obsidianresearch.com>

On Thu, Aug 27, 2009 at 09:48:10AM -0700, Ira Weiny wrote:

> > FSM multiplexing the recv path usually gives much better performance,
> > something like net discovery is quite easy..
> 
> Using the original algorithm and data structures lent itself to
> threading.  Now that I am neck deep in all this I have thought that
> rewriting it all might be easier.

Yah. mayhaps..

> > main loop:
> >  fill tx queue from next list
> >  receive replies and correlate with next list

> This would still need additional code (or additional synchronization in the
> API to libibnetdisc) if you wanted a user app to be multi-threaded.  Someone
> has to be in charge of receiving all replies on that ibmad_port object and
> handing them to the proper owner.  Of course one could open multiple
> ibmad_port objects but how is the app writer to know to do that?  Digging
> through the code to find out that libibnetdisc is consuming all the replies?

What is the use case here? I thought the app would be something like:

main()
{
  foo = libibnetdisc_setup();
  libibnetdisc_discover_all(foo,res);
  // Do interesting things with res.
} 

Where the goal is to have libibnetdisc_discover_all complete
expediently.

As long as the context 'foo' is re-entrant in all ways with all other
libraries and contexts I think useful threaded apps can be created.

> This is what got me on this in the first place.  smp_query_via
> (_do_madrpc) is not thread safe. 

Sure, the entire library is not thread safe around the ibmad_port
context. But who cares? If the caller to libibnetdisc wants to thread
that way they need to open another context.

> Also, I feel that someone down the road might fall into the same
> trap that I did thinking that smp_query_via is thread safe and I
> would like to fix that.

Well.. How can it be threaded? umad_send/umad_recv are inherently
single threaded APIs. You have to layer a TID based threading dispatch
mechanism on top of it. Much better to let the kernel do that and open
multiple umad fds.
 
> > each entry:
> >  add to next list additional ports
> > 
> > Repeat until dead.
> > 
> > Where a 'next list' would be a set of actions along the lines of
> > 'query node' or 'query port' the action on a 'query node' completion
> > is to generate 'query port' next list items for all the ports, and on
> > 'query port' completion is to generate 'query node' items for all
> > enabled ports..
> > 
> > libumad is nonblocking, parallel, etc...
> 
> Yes, and libibmad layers on top of it an easier interface to issue common
> queries.  Why should we ask the user to re-implement that code?

Well, the very best way to do this is to have a FSM engine API at the
core of the MAD library:
  mad_ctx->callback = done_this;
  mad_post(mad,mad_ctx)

done_this(reply):
  ...
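
A toy sketch of what such a callback-driven core could look like, with
replies routed by transaction ID (TID). This is purely illustrative: the
class and method names are invented and do not correspond to any existing
libibmad or libibumad interface, and no MADs are actually sent.

```python
import itertools


class MadEngine:
    """Toy stand-in for a callback-driven MAD engine (not a real API)."""

    def __init__(self):
        self._next_tid = itertools.count(1)
        self._pending = {}       # transaction ID -> callback

    def post(self, request, callback):
        """Register the callback and return the TID the request would carry.

        `request` is unused here; a real engine would send it on the wire.
        """
        tid = next(self._next_tid)
        self._pending[tid] = callback
        return tid

    def deliver(self, tid, reply):
        """Dispatch a received reply to whoever posted the matching request."""
        callback = self._pending.pop(tid, None)
        if callback is None:
            return False         # unknown or duplicate reply: drop it
        callback(reply)
        return True


results = []
engine = MadEngine()
tid_a = engine.post("query node", results.append)
tid_b = engine.post("query port", results.append)
# Replies may arrive in any order; the TID routes each one to its callback.
engine.deliver(tid_b, "port info")
engine.deliver(tid_a, "node info")
print(results)  # ['port info', 'node info']
```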

> For example, mad_rpc now handles redirection.  My implementation
> does not yet.  So now I have to handle that on my own as well...
> :-(

To be honest, I don't like the libibmad/libibumad APIs one bit - I'm
not surprised they don't work for you..

Frankly, we really need a usable MAD library with sane APIs, and very
high level APIs on top of that. You cannot make an IB application
without doing SA queries at a minimum and the current process is
HORRID.

I see nothing of value in libibmad and libibumad to support that :|

Jason


From rdreier at cisco.com  Thu Aug 27 13:34:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 27 Aug 2009 13:34:01 -0700
Subject: [ofa-general] Re: [PATCH v2] mlx4_core: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <4A77A430.2020106@sgi.com> (Arputham Benjamin's message of "Mon, 
	03 Aug 2009 20:00:00 -0700")
References: <4A77A430.2020106@sgi.com>
Message-ID: <adamy5lro7a.fsf@cisco.com>

Thanks, at long last I applied both the mthca and mlx4 versions of these
patches (with some cleanups).

 - R.


From jenos at ncsa.uiuc.edu  Thu Aug 27 15:54:00 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Thu, 27 Aug 2009 17:54:00 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A948262.7030508@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu>
	<4A90FAD8.6000701@mellanox.co.il>	<4A92A0C6.9030501@ncsa.uiuc.edu>
	<4A948262.7030508@ncsa.uiuc.edu>
Message-ID: <4A970E88.2020505@ncsa.uiuc.edu>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090827/665f18aa/attachment.html>

From klakshman03 at hotmail.com  Fri Aug 28 00:55:54 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Fri, 28 Aug 2009 13:25:54 +0530
Subject: [ofa-general] QDR IB cards supports card back to back connectivity
Message-ID: <COL123-W51707D7C7D382D953B48EAB8F50@phx.gbl>


 Dear All,

 I would like to know whether QDR InfiniBand cards support back-to-back connectivity, i.e. without an IB switch, to enable IB communication between the two machines.


Regards
laxman



From sashak at voltaire.com  Fri Aug 28 01:07:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 28 Aug 2009 11:07:56 +0300
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables
	setup flow
In-Reply-To: <20090825190141.GG28379@me>
References: <20090807110811.GA23431@comcast.net>
 <20090825190141.GG28379@me>
Message-ID: <20090828080756.GH28379@me>


Simplify (and unify) forwarding tables setup decision flow.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_mgr.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 629f628..8ba78f8 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 		}
 	}
 
-	set_fwd_tbl_top(p_mgr, p_sw);
-
 	if (p_mgr->p_subn->opt.lmc)
 		free_ports_priv(p_mgr);
 
@@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
 			   p_mgr);
 
-	ucast_mgr_pipeline_fwd_tbl(p_mgr);
-
 	cl_qlist_remove_all(&p_mgr->port_order_list);
 
 	return 0;
@@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
 
 	osm->routing_engine_used = osm_routing_engine_type(r->name);
 
-	if (r->ucast_build_fwd_tables)
-		osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
+	osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
 
 	return 0;
 }
-- 
1.6.4


From sashak at voltaire.com  Fri Aug 28 01:10:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 28 Aug 2009 11:10:02 +0300
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr: better lft setup
In-Reply-To: <20090828080756.GH28379@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
Message-ID: <20090828081002.GI28379@me>


The function set_next_lft_block() is called in a loop with an
incrementing block number, but it also loops internally looking for
the next changed block. Since the caller then calls it again with the
original block number incremented by one, the same internal scan can
be repeated again and again. This patch removes that inefficiency.

Also rename the function to set_lft_block(), since the block number is
now treated as a parameter and only that block (*not* the next changed
one) is processed, and merge some code.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_ucast_mgr.h |    1 +
 opensm/opensm/osm_ucast_mgr.c         |  126 +++++++++++----------------------
 2 files changed, 43 insertions(+), 84 deletions(-)

diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
index 4ef045c..78a88f0 100644
--- a/opensm/include/opensm/osm_ucast_mgr.h
+++ b/opensm/include/opensm/osm_ucast_mgr.h
@@ -95,6 +95,7 @@ typedef struct osm_ucast_mgr {
 	osm_subn_t *p_subn;
 	osm_log_t *p_log;
 	cl_plock_t *p_lock;
+	uint16_t max_lid;
 	cl_qlist_t port_order_list;
 	boolean_t is_dor;
 	boolean_t some_hop_count_set;
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 8ba78f8..a111c10 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -336,6 +336,9 @@ static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw)
 
 	CL_ASSERT(p_node);
 
+	if (p_mgr->max_lid < p_sw->max_lid_ho)
+		p_mgr->max_lid = p_sw->max_lid_ho;
+
 	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0));
 
 	/*
@@ -478,65 +481,13 @@ static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item,
 	set_fwd_tbl_top(p_mgr, p_sw);
 }
 
-static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm,
-				    IN uint8_t * p_block,
-				    IN osm_dr_path_t * p_path,
-				    IN uint16_t block_id_ho,
-				    IN osm_madw_context_t * p_context)
-{
-	ib_api_status_t status;
-	boolean_t sts;
-
-	OSM_LOG_ENTER(p_sm->p_log);
-
-	for (;
-	     (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block));
-	     block_id_ho++) {
-		if (!p_sw->need_update && !p_sm->p_subn->need_update &&
-		    !memcmp(p_block,
-			    p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
-			    IB_SMP_DATA_SIZE))
-			continue;
-
-		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
-			"Writing FT block %u to switch 0x%" PRIx64 "\n",
-			block_id_ho,
-			cl_ntoh64(p_context->lft_context.node_guid));
-
-		status = osm_req_set(p_sm, p_path,
-				     p_sw->new_lft +
-				     block_id_ho * IB_SMP_DATA_SIZE,
-				     IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
-				     cl_hton32(block_id_ho),
-				     CL_DISP_MSGID_NONE, p_context);
-
-		if (status != IB_SUCCESS)
-			OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: "
-				"Sending linear fwd. tbl. block failed (%s)\n",
-				ib_get_err_str(status));
-		break;
-	}
-
-	OSM_LOG_EXIT(p_sm->p_log);
-	return sts;
-}
-
-static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw,
-					 IN osm_ucast_mgr_t *p_mgr,
-					 IN uint16_t block_id_ho)
+static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr,
+			 IN uint16_t block_id_ho)
 {
-	osm_dr_path_t *p_path;
-	osm_madw_context_t context;
 	uint8_t block[IB_SMP_DATA_SIZE];
-	boolean_t status;
-
-	OSM_LOG_ENTER(p_mgr->p_log);
-
-	CL_ASSERT(p_sw && p_sw->p_node);
-
-	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
-		"Processing switch 0x%" PRIx64 "\n",
-		cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
+	osm_madw_context_t context;
+	osm_dr_path_t *p_path;
+	ib_api_status_t status;
 
 	/*
 	   Send linear forwarding table blocks to the switch
@@ -547,8 +498,7 @@ static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw,
 		/* any routing should provide the new_lft */
 		CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
 			  p_mgr->cache_valid && !p_sw->need_update);
-		status = FALSE;
-		goto Exit;
+		return -1;
 	}
 
 	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
@@ -556,12 +506,29 @@ static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw,
 	context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node);
 	context.lft_context.set_method = TRUE;
 
-	status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path,
-				    block_id_ho, &context);
+	if (!osm_switch_get_lft_block(p_sw, block_id_ho, block) ||
+	    (!p_sw->need_update && !p_mgr->p_subn->need_update &&
+	     !memcmp(block, p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
+		     IB_SMP_DATA_SIZE)))
+		return 0;
 
-Exit:
-	OSM_LOG_EXIT(p_mgr->p_log);
-	return status;
+	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+		"Writing FT block %u to switch 0x%" PRIx64 "\n", block_id_ho,
+		cl_ntoh64(context.lft_context.node_guid));
+
+	status = osm_req_set(p_mgr->sm, p_path,
+			     p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
+			     IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
+			     cl_hton32(block_id_ho),
+			     CL_DISP_MSGID_NONE, &context);
+	if (status != IB_SUCCESS) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
+			"Sending linear fwd. tbl. block failed (%s)\n",
+			ib_get_err_str(status));
+		return -1;
+	}
+
+	return 0;
 }
 
 /**********************************************************************
@@ -919,26 +886,15 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m)
 
 static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr)
 {
-	cl_qmap_t *p_sw_tbl;
-	osm_switch_t *p_sw;
-	uint16_t block_id_ho = 0;
-	int sws_notdone;
-	boolean_t sts;
-
-	p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
-	while (1) {
-		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
-		sws_notdone = 0;
-		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
-			sts = pipeline_next_lft_block(p_sw, p_mgr, block_id_ho);
-			if (sts)
-				sws_notdone++;
-			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
-		}
-		if (!sws_notdone)
-			break;
-		block_id_ho++;
-	}
+	cl_qmap_t *tbl;
+	cl_map_item_t *item;
+	unsigned i, max_block = p_mgr->max_lid / 64 + 1;
+
+	tbl = &p_mgr->p_subn->sw_guid_tbl;
+	for (i = 0; i < max_block; i++)
+		for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl);
+		     item = cl_qmap_next(item))
+			set_lft_block((osm_switch_t *)item, p_mgr, i);
 }
 
 static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
@@ -984,6 +940,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
  **********************************************************************/
 void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr)
 {
+	p_mgr->max_lid = 0;
+
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
 			   ucast_mgr_process_top, p_mgr);
 
-- 
1.6.4
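
The block arithmetic behind "max_lid / 64 + 1" in the patch above can
be sketched on its own as follows. This is only an illustration: it
assumes one LFT block covers 64 LIDs (matching the 64-byte SMP data
size used in the patch), and LFT_BLOCK_ENTRIES, lft_block_count() and
lft_block_of() are hypothetical names, not OpenSM functions.

```c
#include <stdint.h>

#define LFT_BLOCK_ENTRIES 64	/* assumed: one LFT block covers 64 LIDs */

/* Number of LFT blocks needed to cover LIDs 0..max_lid inclusive. */
static unsigned lft_block_count(uint16_t max_lid)
{
	return max_lid / LFT_BLOCK_ENTRIES + 1;
}

/* Index of the LFT block that holds the entry for a given LID. */
static unsigned lft_block_of(uint16_t lid)
{
	return lid / LFT_BLOCK_ENTRIES;
}
```

Knowing the block count up front lets the caller run a plain bounded
loop over block numbers, instead of asking every switch whether it has
another block left to send.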


From Lars.Paul.Huse at Sun.COM  Fri Aug 28 02:47:44 2009
From: Lars.Paul.Huse at Sun.COM (Lars Paul Huse)
Date: Fri, 28 Aug 2009 11:47:44 +0200
Subject: [ofa-general] [PATCH] ibdm/ibnl/* ibnl definition files for Sun IB
	QDR products
Message-ID: <4A97A7C0.8010808@Sun.COM>

ibnl definition files for Sun IB QDR products:
- 48 port QNEM
- 36 port Switch
- 72 port Switch
- 648 port Switch

Signed-off-by: Lars Paul Huse <Lars.Paul.Huse at sun.com>

---
diff --git a/ibdm/ibnl/SUNBQNEM48.ibnl b/ibdm/ibnl/SUNBQNEM48.ibnl
new file mode 100644
index 0000000..e722bfa
--- /dev/null
+++ b/ibdm/ibnl/SUNBQNEM48.ibnl
@@ -0,0 +1,117 @@
+SYSTEM LEAF,LEAF:4x,LEAF:4X
+
+NODE SW 36 MT48436 U1
+1  -10G-> P1
+2  -10G-> P2
+3  -10G-> P3
+4  -10G-> P4
+5  -10G-> P5
+6  -10G-> P6
+7  -10G-> P7
+8  -10G-> P8
+9  -10G-> P9
+10 -10G-> P10
+11 -10G-> P11
+12 -10G-> P12
+13 -10G-> P13
+14 -10G-> P14
+15 -10G-> P15
+16 -10G-> P16
+17 -10G-> P17
+18 -10G-> P18
+19 -10G-> P19
+20 -10G-> P20
+21 -10G-> P21
+22 -10G-> P22
+23 -10G-> P23
+24 -10G-> P24
+25 -10G-> P25
+26 -10G-> P26
+27 -10G-> P27
+28 -10G-> P28
+29 -10G-> P29
+30 -10G-> P30
+31 -10G-> P31
+32 -10G-> P32
+33 -10G-> P33
+34 -10G-> P34
+35 -10G-> P35
+36 -10G-> P36
+
+TOPSYSTEM SUNBQNEM48,SUN-QNEM
+
+SUBSYSTEM LEAF SW-A
+   P1 -10G-> C-A0
+   P2 -10G-> C-A1
+   P3 -10G-> C-A2
+   P4 -10G-> C-A3
+   P5 -10G-> C-A4
+   P6 -10G-> C-A5
+   P7 -10G-> C-A6
+   P8 -10G-> C-A7
+   P9 -10G-> C-A8
+   P10 -10G-> C-A9
+   P11 -10G-> C-A10
+   P12 -10G-> C-A11
+   P13 -10G-> C-A12
+   P14 -10G-> C-A13
+   P15 -10G-> C-A14
+   P16 -10G-> P1
+   P17 -10G-> P2
+   P18 -10G-> P3
+   P19 -10G-> P4
+   P20 -10G-> P5
+   P21 -10G-> P6
+   P22 -10G-> P7
+   P23 -10G-> P8
+   P24 -10G-> P9
+   P25 -10G-> P10
+   P26 -10G-> P11
+   P27 -10G-> P12
+   P28 -10G-> SW-B P28
+   P29 -10G-> SW-B P29
+   P30 -10G-> SW-B P30
+   P31 -10G-> SW-B P31
+   P32 -10G-> SW-B P32
+   P33 -10G-> SW-B P33
+   P34 -10G-> SW-B P34
+   P35 -10G-> SW-B P35
+   P36 -10G-> SW-B P36
+
+SUBSYSTEM LEAF SW-B
+   P1 -10G-> C-B0
+   P2 -10G-> C-B1
+   P3 -10G-> C-B2
+   P4 -10G-> C-B3
+   P5 -10G-> C-B4
+   P6 -10G-> C-B5
+   P7 -10G-> C-B6
+   P8 -10G-> C-B7
+   P9 -10G-> C-B8
+   P10 -10G-> C-B9
+   P11 -10G-> C-B10
+   P12 -10G-> C-B11
+   P13 -10G-> C-B12
+   P14 -10G-> C-B13
+   P15 -10G-> C-B14
+   P16 -10G-> P13
+   P17 -10G-> P14
+   P18 -10G-> P15
+   P19 -10G-> P16
+   P20 -10G-> P17
+   P21 -10G-> P18
+   P22 -10G-> P19
+   P23 -10G-> P20
+   P24 -10G-> P21
+   P25 -10G-> P22
+   P26 -10G-> P23
+   P27 -10G-> P24
+   P28 -10G-> SW-A P28
+   P29 -10G-> SW-A P29
+   P30 -10G-> SW-A P30
+   P31 -10G-> SW-A P31
+   P32 -10G-> SW-A P32
+   P33 -10G-> SW-A P33
+   P34 -10G-> SW-A P34
+   P35 -10G-> SW-A P35
+   P36 -10G-> SW-A P36
diff --git a/ibdm/ibnl/SUNDCS36QDR.ibnl b/ibdm/ibnl/SUNDCS36QDR.ibnl
new file mode 100644
index 0000000..aa33d53
--- /dev/null
+++ b/ibdm/ibnl/SUNDCS36QDR.ibnl
@@ -0,0 +1,42 @@
+
+TOPSYSTEM SUNDCS36QDR,NM2-36P
+
+U1=isChaBma
+
+NODE SW 36 SUNDCS36QDR U1
+    1  -> C-17A
+    2  -> C-17B
+    3  -> C-16A
+    4  -> C-16B
+    5  -> C-15A
+    6  -> C-15B
+    7  -> C-14A
+    8  -> C-14B
+    9  -> C-13A
+    10 -> C-13B
+    11 -> C-12A
+    12 -> C-12B
+    13 -> C-9B
+    14 -> C-9A
+    15 -> C-10B
+    16 -> C-10A
+    17 -> C-11B
+    18 -> C-11A
+    19 -> C-0B
+    20 -> C-0A
+    21 -> C-1B
+    22 -> C-1A
+    23 -> C-2B
+    24 -> C-2A
+    25 -> C-3B
+    26 -> C-3A
+    27 -> C-4B
+    28 -> C-4A
+    29 -> C-5B
+    30 -> C-5A
+    31 -> C-8A
+    32 -> C-8B
+    33 -> C-7A
+    34 -> C-7B
+    35 -> C-6A
+    36 -> C-6B
diff --git a/ibdm/ibnl/SUNDCS648QDR.ibnl b/ibdm/ibnl/SUNDCS648QDR.ibnl
new file mode 100644
index 0000000..a8b6558
--- /dev/null
+++ b/ibdm/ibnl/SUNDCS648QDR.ibnl
@@ -0,0 +1,2133 @@
+SYSTEM LEAF,LEAF:4x,LEAF:4X
+
+NODE SW 36 MT48436 U1
+1  -10G-> P1
+2  -10G-> P2
+3  -10G-> P3
+4  -10G-> P4
+5  -10G-> P5
+6  -10G-> P6
+7  -10G-> P7
+8  -10G-> P8
+9  -10G-> P9
+10 -10G-> P10
+11 -10G-> P11
+12 -10G-> P12
+13 -10G-> P13
+14 -10G-> P14
+15 -10G-> P15
+16 -10G-> P16
+17 -10G-> P17
+18 -10G-> P18
+19 -10G-> P19
+20 -10G-> P20
+21 -10G-> P21
+22 -10G-> P22
+23 -10G-> P23
+24 -10G-> P24
+25 -10G-> P25
+26 -10G-> P26
+27 -10G-> P27
+28 -10G-> P28
+29 -10G-> P29
+30 -10G-> P30
+31 -10G-> P31
+32 -10G-> P32
+33 -10G-> P33
+34 -10G-> P34
+35 -10G-> P35
+36 -10G-> P36
+
+SYSTEM SPINE,SPINE:4x,SPINE:4X
+
+NODE SW 36 MT48436 U1
+1  -10G-> P1
+2  -10G-> P2
+3  -10G-> P3
+4  -10G-> P4
+5  -10G-> P5
+6  -10G-> P6
+7  -10G-> P7
+8  -10G-> P8
+9  -10G-> P9
+10 -10G-> P10
+11 -10G-> P11
+12 -10G-> P12
+13 -10G-> P13
+14 -10G-> P14
+15 -10G-> P15
+16 -10G-> P16
+17 -10G-> P17
+18 -10G-> P18
+19 -10G-> P19
+20 -10G-> P20
+21 -10G-> P21
+22 -10G-> P22
+23 -10G-> P23
+24 -10G-> P24
+25 -10G-> P25
+26 -10G-> P26
+27 -10G-> P27
+28 -10G-> P28
+29 -10G-> P29
+30 -10G-> P30
+31 -10G-> P31
+32 -10G-> P32
+33 -10G-> P33
+34 -10G-> P34
+35 -10G-> P35
+36 -10G-> P36
+
+TOPSYSTEM SUNDCS648QDR,SUN-M9-648
+
+SUBSYSTEM SPINE fc1A
+   P1 -10G-> lc1A P13
+   P2 -10G-> lc1B P14
+   P3 -10G-> lc1C P13
+   P4 -10G-> lc1D P14
+   P5 -10G-> lc9A P13
+   P6 -10G-> lc9C P13
+   P7 -10G-> lc9B P14
+   P8 -10G-> lc8A P13
+   P9 -10G-> lc9D P14
+   P10 -10G-> lc8C P13
+   P11 -10G-> lc8B P14
+   P12 -10G-> lc7A P13
+   P13 -10G-> lc6B P14
+   P14 -10G-> lc6A P13
+   P15 -10G-> lc7D P14
+   P16 -10G-> lc7C P13
+   P17 -10G-> lc7B P14
+   P18 -10G-> lc8D P14
+   P19 -10G-> lc2D P14
+   P20 -10G-> lc2C P13
+   P21 -10G-> lc2B P14
+   P22 -10G-> lc2A P13
+   P23 -10G-> lc3D P14
+   P24 -10G-> lc3B P14
+   P25 -10G-> lc3C P13
+   P26 -10G-> lc4D P14
+   P27 -10G-> lc3A P13
+   P28 -10G-> lc4B P14
+   P29 -10G-> lc4C P13
+   P30 -10G-> lc5D P14
+   P31 -10G-> lc6C P13
+   P32 -10G-> lc6D P14
+   P33 -10G-> lc5A P13
+   P34 -10G-> lc5B P14
+   P35 -10G-> lc5C P13
+   P36 -10G-> lc4A P13
+
+SUBSYSTEM SPINE fc1B
+   P1 -10G-> lc8D P13
+   P2 -10G-> lc8A P14
+   P3 -10G-> lc8B P13
+   P4 -10G-> lc8C P14
+   P5 -10G-> lc7D P13
+   P6 -10G-> lc7B P13
+   P7 -10G-> lc7A P14
+   P8 -10G-> lc6D P13
+   P9 -10G-> lc7C P14
+   P10 -10G-> lc6B P13
+   P11 -10G-> lc6A P14
+   P12 -10G-> lc5D P13
+   P13 -10G-> lc4A P14
+   P14 -10G-> lc4D P13
+   P15 -10G-> lc5C P14
+   P16 -10G-> lc5B P13
+   P17 -10G-> lc5A P14
+   P18 -10G-> lc6C P14
+   P19 -10G-> lc9C P14
+   P20 -10G-> lc9B P13
+   P21 -10G-> lc9A P14
+   P22 -10G-> lc9D P13
+   P23 -10G-> lc1C P14
+   P24 -10G-> lc1A P14
+   P25 -10G-> lc1B P13
+   P26 -10G-> lc2C P14
+   P27 -10G-> lc1D P13
+   P28 -10G-> lc2A P14
+   P29 -10G-> lc2B P13
+   P30 -10G-> lc3C P14
+   P31 -10G-> lc4B P13
+   P32 -10G-> lc4C P14
+   P33 -10G-> lc3D P13
+   P34 -10G-> lc3A P14
+   P35 -10G-> lc3B P13
+   P36 -10G-> lc2D P13
+
+SUBSYSTEM SPINE fc2A
+   P1 -10G-> lc1A P15
+   P2 -10G-> lc1B P16
+   P3 -10G-> lc1C P15
+   P4 -10G-> lc1D P16
+   P5 -10G-> lc9A P15
+   P6 -10G-> lc9C P15
+   P7 -10G-> lc9B P16
+   P8 -10G-> lc8A P15
+   P9 -10G-> lc9D P16
+   P10 -10G-> lc8C P15
+   P11 -10G-> lc8B P16
+   P12 -10G-> lc7A P15
+   P13 -10G-> lc6B P16
+   P14 -10G-> lc6A P15
+   P15 -10G-> lc7D P16
+   P16 -10G-> lc7C P15
+   P17 -10G-> lc7B P16
+   P18 -10G-> lc8D P16
+   P19 -10G-> lc2D P16
+   P20 -10G-> lc2C P15
+   P21 -10G-> lc2B P16
+   P22 -10G-> lc2A P15
+   P23 -10G-> lc3D P16
+   P24 -10G-> lc3B P16
+   P25 -10G-> lc3C P15
+   P26 -10G-> lc4D P16
+   P27 -10G-> lc3A P15
+   P28 -10G-> lc4B P16
+   P29 -10G-> lc4C P15
+   P30 -10G-> lc5D P16
+   P31 -10G-> lc6C P15
+   P32 -10G-> lc6D P16
+   P33 -10G-> lc5A P15
+   P34 -10G-> lc5B P16
+   P35 -10G-> lc5C P15
+   P36 -10G-> lc4A P15
+
+SUBSYSTEM SPINE fc2B
+   P1 -10G-> lc8D P15
+   P2 -10G-> lc8A P16
+   P3 -10G-> lc8B P15
+   P4 -10G-> lc8C P16
+   P5 -10G-> lc7D P15
+   P6 -10G-> lc7B P15
+   P7 -10G-> lc7A P16
+   P8 -10G-> lc6D P15
+   P9 -10G-> lc7C P16
+   P10 -10G-> lc6B P15
+   P11 -10G-> lc6A P16
+   P12 -10G-> lc5D P15
+   P13 -10G-> lc4A P16
+   P14 -10G-> lc4D P15
+   P15 -10G-> lc5C P16
+   P16 -10G-> lc5B P15
+   P17 -10G-> lc5A P16
+   P18 -10G-> lc6C P16
+   P19 -10G-> lc9C P16
+   P20 -10G-> lc9B P15
+   P21 -10G-> lc9A P16
+   P22 -10G-> lc9D P15
+   P23 -10G-> lc1C P16
+   P24 -10G-> lc1A P16
+   P25 -10G-> lc1B P15
+   P26 -10G-> lc2C P16
+   P27 -10G-> lc1D P15
+   P28 -10G-> lc2A P16
+   P29 -10G-> lc2B P15
+   P30 -10G-> lc3C P16
+   P31 -10G-> lc4B P15
+   P32 -10G-> lc4C P16
+   P33 -10G-> lc3D P15
+   P34 -10G-> lc3A P16
+   P35 -10G-> lc3B P15
+   P36 -10G-> lc2D P15
+
+SUBSYSTEM SPINE fc3A
+   P1 -10G-> lc1A P17
+   P2 -10G-> lc1B P18
+   P3 -10G-> lc1C P17
+   P4 -10G-> lc1D P18
+   P5 -10G-> lc9A P17
+   P6 -10G-> lc9C P17
+   P7 -10G-> lc9B P18
+   P8 -10G-> lc8A P17
+   P9 -10G-> lc9D P18
+   P10 -10G-> lc8C P17
+   P11 -10G-> lc8B P18
+   P12 -10G-> lc7A P17
+   P13 -10G-> lc6B P18
+   P14 -10G-> lc6A P17
+   P15 -10G-> lc7D P18
+   P16 -10G-> lc7C P17
+   P17 -10G-> lc7B P18
+   P18 -10G-> lc8D P18
+   P19 -10G-> lc2D P18
+   P20 -10G-> lc2C P17
+   P21 -10G-> lc2B P18
+   P22 -10G-> lc2A P17
+   P23 -10G-> lc3D P18
+   P24 -10G-> lc3B P18
+   P25 -10G-> lc3C P17
+   P26 -10G-> lc4D P18
+   P27 -10G-> lc3A P17
+   P28 -10G-> lc4B P18
+   P29 -10G-> lc4C P17
+   P30 -10G-> lc5D P18
+   P31 -10G-> lc6C P17
+   P32 -10G-> lc6D P18
+   P33 -10G-> lc5A P17
+   P34 -10G-> lc5B P18
+   P35 -10G-> lc5C P17
+   P36 -10G-> lc4A P17
+
+SUBSYSTEM SPINE fc3B
+   P1 -10G-> lc8D P17
+   P2 -10G-> lc8A P18
+   P3 -10G-> lc8B P17
+   P4 -10G-> lc8C P18
+   P5 -10G-> lc7D P17
+   P6 -10G-> lc7B P17
+   P7 -10G-> lc7A P18
+   P8 -10G-> lc6D P17
+   P9 -10G-> lc7C P18
+   P10 -10G-> lc6B P17
+   P11 -10G-> lc6A P18
+   P12 -10G-> lc5D P17
+   P13 -10G-> lc4A P18
+   P14 -10G-> lc4D P17
+   P15 -10G-> lc5C P18
+   P16 -10G-> lc5B P17
+   P17 -10G-> lc5A P18
+   P18 -10G-> lc6C P18
+   P19 -10G-> lc9C P18
+   P20 -10G-> lc9B P17
+   P21 -10G-> lc9A P18
+   P22 -10G-> lc9D P17
+   P23 -10G-> lc1C P18
+   P24 -10G-> lc1A P18
+   P25 -10G-> lc1B P17
+   P26 -10G-> lc2C P18
+   P27 -10G-> lc1D P17
+   P28 -10G-> lc2A P18
+   P29 -10G-> lc2B P17
+   P30 -10G-> lc3C P18
+   P31 -10G-> lc4B P17
+   P32 -10G-> lc4C P18
+   P33 -10G-> lc3D P17
+   P34 -10G-> lc3A P18
+   P35 -10G-> lc3B P17
+   P36 -10G-> lc2D P17
+
+SUBSYSTEM SPINE fc4A
+   P1 -10G-> lc1A P12
+   P2 -10G-> lc1B P11
+   P3 -10G-> lc1C P12
+   P4 -10G-> lc1D P11
+   P5 -10G-> lc9A P12
+   P6 -10G-> lc9C P12
+   P7 -10G-> lc9B P11
+   P8 -10G-> lc8A P12
+   P9 -10G-> lc9D P11
+   P10 -10G-> lc8C P12
+   P11 -10G-> lc8B P11
+   P12 -10G-> lc7A P12
+   P13 -10G-> lc6B P11
+   P14 -10G-> lc6A P12
+   P15 -10G-> lc7D P11
+   P16 -10G-> lc7C P12
+   P17 -10G-> lc7B P11
+   P18 -10G-> lc8D P11
+   P19 -10G-> lc2D P11
+   P20 -10G-> lc2C P12
+   P21 -10G-> lc2B P11
+   P22 -10G-> lc2A P12
+   P23 -10G-> lc3D P11
+   P24 -10G-> lc3B P11
+   P25 -10G-> lc3C P12
+   P26 -10G-> lc4D P11
+   P27 -10G-> lc3A P12
+   P28 -10G-> lc4B P11
+   P29 -10G-> lc4C P12
+   P30 -10G-> lc5D P11
+   P31 -10G-> lc6C P12
+   P32 -10G-> lc6D P11
+   P33 -10G-> lc5A P12
+   P34 -10G-> lc5B P11
+   P35 -10G-> lc5C P12
+   P36 -10G-> lc4A P12
+
+SUBSYSTEM SPINE fc4B
+   P1 -10G-> lc8D P12
+   P2 -10G-> lc8A P11
+   P3 -10G-> lc8B P12
+   P4 -10G-> lc8C P11
+   P5 -10G-> lc7D P12
+   P6 -10G-> lc7B P12
+   P7 -10G-> lc7A P11
+   P8 -10G-> lc6D P12
+   P9 -10G-> lc7C P11
+   P10 -10G-> lc6B P12
+   P11 -10G-> lc6A P11
+   P12 -10G-> lc5D P12
+   P13 -10G-> lc4A P11
+   P14 -10G-> lc4D P12
+   P15 -10G-> lc5C P11
+   P16 -10G-> lc5B P12
+   P17 -10G-> lc5A P11
+   P18 -10G-> lc6C P11
+   P19 -10G-> lc9C P11
+   P20 -10G-> lc9B P12
+   P21 -10G-> lc9A P11
+   P22 -10G-> lc9D P12
+   P23 -10G-> lc1C P11
+   P24 -10G-> lc1A P11
+   P25 -10G-> lc1B P12
+   P26 -10G-> lc2C P11
+   P27 -10G-> lc1D P12
+   P28 -10G-> lc2A P11
+   P29 -10G-> lc2B P12
+   P30 -10G-> lc3C P11
+   P31 -10G-> lc4B P12
+   P32 -10G-> lc4C P11
+   P33 -10G-> lc3D P12
+   P34 -10G-> lc3A P11
+   P35 -10G-> lc3B P12
+   P36 -10G-> lc2D P12
+
+SUBSYSTEM SPINE fc5A
+   P1 -10G-> lc1A P10
+   P2 -10G-> lc1B P9
+   P3 -10G-> lc1C P10
+   P4 -10G-> lc1D P9
+   P5 -10G-> lc9A P10
+   P6 -10G-> lc9C P10
+   P7 -10G-> lc9B P9
+   P8 -10G-> lc8A P10
+   P9 -10G-> lc9D P9
+   P10 -10G-> lc8C P10
+   P11 -10G-> lc8B P9
+   P12 -10G-> lc7A P10
+   P13 -10G-> lc6B P9
+   P14 -10G-> lc6A P10
+   P15 -10G-> lc7D P9
+   P16 -10G-> lc7C P10
+   P17 -10G-> lc7B P9
+   P18 -10G-> lc8D P9
+   P19 -10G-> lc2D P9
+   P20 -10G-> lc2C P10
+   P21 -10G-> lc2B P9
+   P22 -10G-> lc2A P10
+   P23 -10G-> lc3D P9
+   P24 -10G-> lc3B P9
+   P25 -10G-> lc3C P10
+   P26 -10G-> lc4D P9
+   P27 -10G-> lc3A P10
+   P28 -10G-> lc4B P9
+   P29 -10G-> lc4C P10
+   P30 -10G-> lc5D P9
+   P31 -10G-> lc6C P10
+   P32 -10G-> lc6D P9
+   P33 -10G-> lc5A P10
+   P34 -10G-> lc5B P9
+   P35 -10G-> lc5C P10
+   P36 -10G-> lc4A P10
+
+SUBSYSTEM SPINE fc5B
+   P1 -10G-> lc8D P10
+   P2 -10G-> lc8A P9
+   P3 -10G-> lc8B P10
+   P4 -10G-> lc8C P9
+   P5 -10G-> lc7D P10
+   P6 -10G-> lc7B P10
+   P7 -10G-> lc7A P9
+   P8 -10G-> lc6D P10
+   P9 -10G-> lc7C P9
+   P10 -10G-> lc6B P10
+   P11 -10G-> lc6A P9
+   P12 -10G-> lc5D P10
+   P13 -10G-> lc4A P9
+   P14 -10G-> lc4D P10
+   P15 -10G-> lc5C P9
+   P16 -10G-> lc5B P10
+   P17 -10G-> lc5A P9
+   P18 -10G-> lc6C P9
+   P19 -10G-> lc9C P9
+   P20 -10G-> lc9B P10
+   P21 -10G-> lc9A P9
+   P22 -10G-> lc9D P10
+   P23 -10G-> lc1C P9
+   P24 -10G-> lc1A P9
+   P25 -10G-> lc1B P10
+   P26 -10G-> lc2C P9
+   P27 -10G-> lc1D P10
+   P28 -10G-> lc2A P9
+   P29 -10G-> lc2B P10
+   P30 -10G-> lc3C P9
+   P31 -10G-> lc4B P10
+   P32 -10G-> lc4C P9
+   P33 -10G-> lc3D P10
+   P34 -10G-> lc3A P9
+   P35 -10G-> lc3B P10
+   P36 -10G-> lc2D P10
+
+SUBSYSTEM SPINE fc6A
+   P1 -10G-> lc1A P8
+   P2 -10G-> lc1B P7
+   P3 -10G-> lc1C P8
+   P4 -10G-> lc1D P7
+   P5 -10G-> lc9A P8
+   P6 -10G-> lc9C P8
+   P7 -10G-> lc9B P7
+   P8 -10G-> lc8A P8
+   P9 -10G-> lc9D P7
+   P10 -10G-> lc8C P8
+   P11 -10G-> lc8B P7
+   P12 -10G-> lc7A P8
+   P13 -10G-> lc6B P7
+   P14 -10G-> lc6A P8
+   P15 -10G-> lc7D P7
+   P16 -10G-> lc7C P8
+   P17 -10G-> lc7B P7
+   P18 -10G-> lc8D P7
+   P19 -10G-> lc2D P7
+   P20 -10G-> lc2C P8
+   P21 -10G-> lc2B P7
+   P22 -10G-> lc2A P8
+   P23 -10G-> lc3D P7
+   P24 -10G-> lc3B P7
+   P25 -10G-> lc3C P8
+   P26 -10G-> lc4D P7
+   P27 -10G-> lc3A P8
+   P28 -10G-> lc4B P7
+   P29 -10G-> lc4C P8
+   P30 -10G-> lc5D P7
+   P31 -10G-> lc6C P8
+   P32 -10G-> lc6D P7
+   P33 -10G-> lc5A P8
+   P34 -10G-> lc5B P7
+   P35 -10G-> lc5C P8
+   P36 -10G-> lc4A P8
+
+SUBSYSTEM SPINE fc6B
+   P1 -10G-> lc8D P8
+   P2 -10G-> lc8A P7
+   P3 -10G-> lc8B P8
+   P4 -10G-> lc8C P7
+   P5 -10G-> lc7D P8
+   P6 -10G-> lc7B P8
+   P7 -10G-> lc7A P7
+   P8 -10G-> lc6D P8
+   P9 -10G-> lc7C P7
+   P10 -10G-> lc6B P8
+   P11 -10G-> lc6A P7
+   P12 -10G-> lc5D P8
+   P13 -10G-> lc4A P7
+   P14 -10G-> lc4D P8
+   P15 -10G-> lc5C P7
+   P16 -10G-> lc5B P8
+   P17 -10G-> lc5A P7
+   P18 -10G-> lc6C P7
+   P19 -10G-> lc9C P7
+   P20 -10G-> lc9B P8
+   P21 -10G-> lc9A P7
+   P22 -10G-> lc9D P8
+   P23 -10G-> lc1C P7
+   P24 -10G-> lc1A P7
+   P25 -10G-> lc1B P8
+   P26 -10G-> lc2C P7
+   P27 -10G-> lc1D P8
+   P28 -10G-> lc2A P7
+   P29 -10G-> lc2B P8
+   P30 -10G-> lc3C P7
+   P31 -10G-> lc4B P8
+   P32 -10G-> lc4C P7
+   P33 -10G-> lc3D P8
+   P34 -10G-> lc3A P7
+   P35 -10G-> lc3B P8
+   P36 -10G-> lc2D P8
+
+SUBSYSTEM SPINE fc7A
+   P1 -10G-> lc1A P6
+   P2 -10G-> lc1B P5
+   P3 -10G-> lc1C P6
+   P4 -10G-> lc1D P5
+   P5 -10G-> lc9A P6
+   P6 -10G-> lc9C P6
+   P7 -10G-> lc9B P5
+   P8 -10G-> lc8A P6
+   P9 -10G-> lc9D P5
+   P10 -10G-> lc8C P6
+   P11 -10G-> lc8B P5
+   P12 -10G-> lc7A P6
+   P13 -10G-> lc6B P5
+   P14 -10G-> lc6A P6
+   P15 -10G-> lc7D P5
+   P16 -10G-> lc7C P6
+   P17 -10G-> lc7B P5
+   P18 -10G-> lc8D P5
+   P19 -10G-> lc2D P5
+   P20 -10G-> lc2C P6
+   P21 -10G-> lc2B P5
+   P22 -10G-> lc2A P6
+   P23 -10G-> lc3D P5
+   P24 -10G-> lc3B P5
+   P25 -10G-> lc3C P6
+   P26 -10G-> lc4D P5
+   P27 -10G-> lc3A P6
+   P28 -10G-> lc4B P5
+   P29 -10G-> lc4C P6
+   P30 -10G-> lc5D P5
+   P31 -10G-> lc6C P6
+   P32 -10G-> lc6D P5
+   P33 -10G-> lc5A P6
+   P34 -10G-> lc5B P5
+   P35 -10G-> lc5C P6
+   P36 -10G-> lc4A P6
+
+SUBSYSTEM SPINE fc7B
+   P1 -10G-> lc8D P6
+   P2 -10G-> lc8A P5
+   P3 -10G-> lc8B P6
+   P4 -10G-> lc8C P5
+   P5 -10G-> lc7D P6
+   P6 -10G-> lc7B P6
+   P7 -10G-> lc7A P5
+   P8 -10G-> lc6D P6
+   P9 -10G-> lc7C P5
+   P10 -10G-> lc6B P6
+   P11 -10G-> lc6A P5
+   P12 -10G-> lc5D P6
+   P13 -10G-> lc4A P5
+   P14 -10G-> lc4D P6
+   P15 -10G-> lc5C P5
+   P16 -10G-> lc5B P6
+   P17 -10G-> lc5A P5
+   P18 -10G-> lc6C P5
+   P19 -10G-> lc9C P5
+   P20 -10G-> lc9B P6
+   P21 -10G-> lc9A P5
+   P22 -10G-> lc9D P6
+   P23 -10G-> lc1C P5
+   P24 -10G-> lc1A P5
+   P25 -10G-> lc1B P6
+   P26 -10G-> lc2C P5
+   P27 -10G-> lc1D P6
+   P28 -10G-> lc2A P5
+   P29 -10G-> lc2B P6
+   P30 -10G-> lc3C P5
+   P31 -10G-> lc4B P6
+   P32 -10G-> lc4C P5
+   P33 -10G-> lc3D P6
+   P34 -10G-> lc3A P5
+   P35 -10G-> lc3B P6
+   P36 -10G-> lc2D P6
+
+SUBSYSTEM SPINE fc8A
+   P1 -10G-> lc1A P4
+   P2 -10G-> lc1B P3
+   P3 -10G-> lc1C P4
+   P4 -10G-> lc1D P3
+   P5 -10G-> lc9A P4
+   P6 -10G-> lc9C P4
+   P7 -10G-> lc9B P3
+   P8 -10G-> lc8A P4
+   P9 -10G-> lc9D P3
+   P10 -10G-> lc8C P4
+   P11 -10G-> lc8B P3
+   P12 -10G-> lc7A P4
+   P13 -10G-> lc6B P3
+   P14 -10G-> lc6A P4
+   P15 -10G-> lc7D P3
+   P16 -10G-> lc7C P4
+   P17 -10G-> lc7B P3
+   P18 -10G-> lc8D P3
+   P19 -10G-> lc2D P3
+   P20 -10G-> lc2C P4
+   P21 -10G-> lc2B P3
+   P22 -10G-> lc2A P4
+   P23 -10G-> lc3D P3
+   P24 -10G-> lc3B P3
+   P25 -10G-> lc3C P4
+   P26 -10G-> lc4D P3
+   P27 -10G-> lc3A P4
+   P28 -10G-> lc4B P3
+   P29 -10G-> lc4C P4
+   P30 -10G-> lc5D P3
+   P31 -10G-> lc6C P4
+   P32 -10G-> lc6D P3
+   P33 -10G-> lc5A P4
+   P34 -10G-> lc5B P3
+   P35 -10G-> lc5C P4
+   P36 -10G-> lc4A P4
+
+SUBSYSTEM SPINE fc8B
+   P1 -10G-> lc8D P4
+   P2 -10G-> lc8A P3
+   P3 -10G-> lc8B P4
+   P4 -10G-> lc8C P3
+   P5 -10G-> lc7D P4
+   P6 -10G-> lc7B P4
+   P7 -10G-> lc7A P3
+   P8 -10G-> lc6D P4
+   P9 -10G-> lc7C P3
+   P10 -10G-> lc6B P4
+   P11 -10G-> lc6A P3
+   P12 -10G-> lc5D P4
+   P13 -10G-> lc4A P3
+   P14 -10G-> lc4D P4
+   P15 -10G-> lc5C P3
+   P16 -10G-> lc5B P4
+   P17 -10G-> lc5A P3
+   P18 -10G-> lc6C P3
+   P19 -10G-> lc9C P3
+   P20 -10G-> lc9B P4
+   P21 -10G-> lc9A P3
+   P22 -10G-> lc9D P4
+   P23 -10G-> lc1C P3
+   P24 -10G-> lc1A P3
+   P25 -10G-> lc1B P4
+   P26 -10G-> lc2C P3
+   P27 -10G-> lc1D P4
+   P28 -10G-> lc2A P3
+   P29 -10G-> lc2B P4
+   P30 -10G-> lc3C P3
+   P31 -10G-> lc4B P4
+   P32 -10G-> lc4C P3
+   P33 -10G-> lc3D P4
+   P34 -10G-> lc3A P3
+   P35 -10G-> lc3B P4
+   P36 -10G-> lc2D P4
+
+SUBSYSTEM SPINE fc9A
+   P1 -10G-> lc1A P2
+   P2 -10G-> lc1B P1
+   P3 -10G-> lc1C P2
+   P4 -10G-> lc1D P1
+   P5 -10G-> lc9A P2
+   P6 -10G-> lc9C P2
+   P7 -10G-> lc9B P1
+   P8 -10G-> lc8A P2
+   P9 -10G-> lc9D P1
+   P10 -10G-> lc8C P2
+   P11 -10G-> lc8B P1
+   P12 -10G-> lc7A P2
+   P13 -10G-> lc6B P1
+   P14 -10G-> lc6A P2
+   P15 -10G-> lc7D P1
+   P16 -10G-> lc7C P2
+   P17 -10G-> lc7B P1
+   P18 -10G-> lc8D P1
+   P19 -10G-> lc2D P1
+   P20 -10G-> lc2C P2
+   P21 -10G-> lc2B P1
+   P22 -10G-> lc2A P2
+   P23 -10G-> lc3D P1
+   P24 -10G-> lc3B P1
+   P25 -10G-> lc3C P2
+   P26 -10G-> lc4D P1
+   P27 -10G-> lc3A P2
+   P28 -10G-> lc4B P1
+   P29 -10G-> lc4C P2
+   P30 -10G-> lc5D P1
+   P31 -10G-> lc6C P2
+   P32 -10G-> lc6D P1
+   P33 -10G-> lc5A P2
+   P34 -10G-> lc5B P1
+   P35 -10G-> lc5C P2
+   P36 -10G-> lc4A P2
+
+SUBSYSTEM SPINE fc9B
+   P1 -10G-> lc8D P2
+   P2 -10G-> lc8A P1
+   P3 -10G-> lc8B P2
+   P4 -10G-> lc8C P1
+   P5 -10G-> lc7D P2
+   P6 -10G-> lc7B P2
+   P7 -10G-> lc7A P1
+   P8 -10G-> lc6D P2
+   P9 -10G-> lc7C P1
+   P10 -10G-> lc6B P2
+   P11 -10G-> lc6A P1
+   P12 -10G-> lc5D P2
+   P13 -10G-> lc4A P1
+   P14 -10G-> lc4D P2
+   P15 -10G-> lc5C P1
+   P16 -10G-> lc5B P2
+   P17 -10G-> lc5A P1
+   P18 -10G-> lc6C P1
+   P19 -10G-> lc9C P1
+   P20 -10G-> lc9B P2
+   P21 -10G-> lc9A P1
+   P22 -10G-> lc9D P2
+   P23 -10G-> lc1C P1
+   P24 -10G-> lc1A P1
+   P25 -10G-> lc1B P2
+   P26 -10G-> lc2C P1
+   P27 -10G-> lc1D P2
+   P28 -10G-> lc2A P1
+   P29 -10G-> lc2B P2
+   P30 -10G-> lc3C P1
+   P31 -10G-> lc4B P2
+   P32 -10G-> lc4C P1
+   P33 -10G-> lc3D P2
+   P34 -10G-> lc3A P1
+   P35 -10G-> lc3B P2
+   P36 -10G-> lc2D P2
+
+SUBSYSTEM LEAF lc1A
+   P1 -10G-> fc9B P24
+   P2 -10G-> fc9A P1
+   P3 -10G-> fc8B P24
+   P4 -10G-> fc8A P1
+   P5 -10G-> fc7B P24
+   P6 -10G-> fc7A P1
+   P7 -10G-> fc6B P24
+   P8 -10G-> fc6A P1
+   P9 -10G-> fc5B P24
+   P10 -10G-> fc5A P1
+   P11 -10G-> fc4B P24
+   P12 -10G-> fc4A P1
+   P13 -10G-> fc1A P1
+   P14 -10G-> fc1B P24
+   P15 -10G-> fc2A P1
+   P16 -10G-> fc2B P24
+   P17 -10G-> fc3A P1
+   P18 -10G-> fc3B P24
+   P19 -10G-> lc1-0A/P3
+   P20 -10G-> lc1-0B/P3
+   P21 -10G-> lc1-0B/P2
+   P22 -10G-> lc1-0B/P1
+   P23 -10G-> lc1-0A/P2
+   P24 -10G-> lc1-0A/P1
+   P25 -10G-> lc1-1A/P3
+   P26 -10G-> lc1-1B/P3
+   P27 -10G-> lc1-1B/P2
+   P28 -10G-> lc1-1B/P1
+   P29 -10G-> lc1-1A/P2
+   P30 -10G-> lc1-1A/P1
+   P31 -10G-> lc1-2A/P1
+   P32 -10G-> lc1-2A/P2
+   P33 -10G-> lc1-2B/P1
+   P34 -10G-> lc1-2B/P2
+   P35 -10G-> lc1-2B/P3
+   P36 -10G-> lc1-2A/P3
+
+SUBSYSTEM LEAF lc1B
+   P1 -10G-> fc9A P2
+   P2 -10G-> fc9B P25
+   P3 -10G-> fc8A P2
+   P4 -10G-> fc8B P25
+   P5 -10G-> fc7A P2
+   P6 -10G-> fc7B P25
+   P7 -10G-> fc6A P2
+   P8 -10G-> fc6B P25
+   P9 -10G-> fc5A P2
+   P10 -10G-> fc5B P25
+   P11 -10G-> fc4A P2
+   P12 -10G-> fc4B P25
+   P13 -10G-> fc1B P25
+   P14 -10G-> fc1A P2
+   P15 -10G-> fc2B P25
+   P16 -10G-> fc2A P2
+   P17 -10G-> fc3B P25
+   P18 -10G-> fc3A P2
+   P19 -10G-> lc1-3A/P3
+   P20 -10G-> lc1-3B/P3
+   P21 -10G-> lc1-3B/P2
+   P22 -10G-> lc1-3B/P1
+   P23 -10G-> lc1-3A/P2
+   P24 -10G-> lc1-3A/P1
+   P25 -10G-> lc1-4A/P3
+   P26 -10G-> lc1-4B/P3
+   P27 -10G-> lc1-4B/P2
+   P28 -10G-> lc1-4B/P1
+   P29 -10G-> lc1-4A/P2
+   P30 -10G-> lc1-4A/P1
+   P31 -10G-> lc1-5A/P1
+   P32 -10G-> lc1-5A/P2
+   P33 -10G-> lc1-5B/P1
+   P34 -10G-> lc1-5B/P2
+   P35 -10G-> lc1-5B/P3
+   P36 -10G-> lc1-5A/P3
+
+SUBSYSTEM LEAF lc1C
+   P1 -10G-> fc9B P23
+   P2 -10G-> fc9A P3
+   P3 -10G-> fc8B P23
+   P4 -10G-> fc8A P3
+   P5 -10G-> fc7B P23
+   P6 -10G-> fc7A P3
+   P7 -10G-> fc6B P23
+   P8 -10G-> fc6A P3
+   P9 -10G-> fc5B P23
+   P10 -10G-> fc5A P3
+   P11 -10G-> fc4B P23
+   P12 -10G-> fc4A P3
+   P13 -10G-> fc1A P3
+   P14 -10G-> fc1B P23
+   P15 -10G-> fc2A P3
+   P16 -10G-> fc2B P23
+   P17 -10G-> fc3A P3
+   P18 -10G-> fc3B P23
+   P19 -10G-> lc1-6A/P3
+   P20 -10G-> lc1-6B/P3
+   P21 -10G-> lc1-6B/P2
+   P22 -10G-> lc1-6B/P1
+   P23 -10G-> lc1-6A/P2
+   P24 -10G-> lc1-6A/P1
+   P25 -10G-> lc1-7A/P3
+   P26 -10G-> lc1-7B/P3
+   P27 -10G-> lc1-7B/P2
+   P28 -10G-> lc1-7B/P1
+   P29 -10G-> lc1-7A/P2
+   P30 -10G-> lc1-7A/P1
+   P31 -10G-> lc1-8A/P1
+   P32 -10G-> lc1-8A/P2
+   P33 -10G-> lc1-8B/P1
+   P34 -10G-> lc1-8B/P2
+   P35 -10G-> lc1-8B/P3
+   P36 -10G-> lc1-8A/P3
+
+SUBSYSTEM LEAF lc1D
+   P1 -10G-> fc9A P4
+   P2 -10G-> fc9B P27
+   P3 -10G-> fc8A P4
+   P4 -10G-> fc8B P27
+   P5 -10G-> fc7A P4
+   P6 -10G-> fc7B P27
+   P7 -10G-> fc6A P4
+   P8 -10G-> fc6B P27
+   P9 -10G-> fc5A P4
+   P10 -10G-> fc5B P27
+   P11 -10G-> fc4A P4
+   P12 -10G-> fc4B P27
+   P13 -10G-> fc1B P27
+   P14 -10G-> fc1A P4
+   P15 -10G-> fc2B P27
+   P16 -10G-> fc2A P4
+   P17 -10G-> fc3B P27
+   P18 -10G-> fc3A P4
+   P19 -10G-> lc1-9A/P3
+   P20 -10G-> lc1-9B/P3
+   P21 -10G-> lc1-9B/P2
+   P22 -10G-> lc1-9B/P1
+   P23 -10G-> lc1-9A/P2
+   P24 -10G-> lc1-9A/P1
+   P25 -10G-> lc1-10A/P3
+   P26 -10G-> lc1-10B/P3
+   P27 -10G-> lc1-10B/P2
+   P28 -10G-> lc1-10B/P1
+   P29 -10G-> lc1-10A/P2
+   P30 -10G-> lc1-10A/P1
+   P31 -10G-> lc1-11A/P1
+   P32 -10G-> lc1-11A/P2
+   P33 -10G-> lc1-11B/P1
+   P34 -10G-> lc1-11B/P2
+   P35 -10G-> lc1-11B/P3
+   P36 -10G-> lc1-11A/P3
+
+SUBSYSTEM LEAF lc2A
+   P1 -10G-> fc9B P28
+   P2 -10G-> fc9A P22
+   P3 -10G-> fc8B P28
+   P4 -10G-> fc8A P22
+   P5 -10G-> fc7B P28
+   P6 -10G-> fc7A P22
+   P7 -10G-> fc6B P28
+   P8 -10G-> fc6A P22
+   P9 -10G-> fc5B P28
+   P10 -10G-> fc5A P22
+   P11 -10G-> fc4B P28
+   P12 -10G-> fc4A P22
+   P13 -10G-> fc1A P22
+   P14 -10G-> fc1B P28
+   P15 -10G-> fc2A P22
+   P16 -10G-> fc2B P28
+   P17 -10G-> fc3A P22
+   P18 -10G-> fc3B P28
+   P19 -10G-> lc2-0A/P3
+   P20 -10G-> lc2-0B/P3
+   P21 -10G-> lc2-0B/P2
+   P22 -10G-> lc2-0B/P1
+   P23 -10G-> lc2-0A/P2
+   P24 -10G-> lc2-0A/P1
+   P25 -10G-> lc2-1A/P3
+   P26 -10G-> lc2-1B/P3
+   P27 -10G-> lc2-1B/P2
+   P28 -10G-> lc2-1B/P1
+   P29 -10G-> lc2-1A/P2
+   P30 -10G-> lc2-1A/P1
+   P31 -10G-> lc2-2A/P1
+   P32 -10G-> lc2-2A/P2
+   P33 -10G-> lc2-2B/P1
+   P34 -10G-> lc2-2B/P2
+   P35 -10G-> lc2-2B/P3
+   P36 -10G-> lc2-2A/P3
+
+SUBSYSTEM LEAF lc2B
+   P1 -10G-> fc9A P21
+   P2 -10G-> fc9B P29
+   P3 -10G-> fc8A P21
+   P4 -10G-> fc8B P29
+   P5 -10G-> fc7A P21
+   P6 -10G-> fc7B P29
+   P7 -10G-> fc6A P21
+   P8 -10G-> fc6B P29
+   P9 -10G-> fc5A P21
+   P10 -10G-> fc5B P29
+   P11 -10G-> fc4A P21
+   P12 -10G-> fc4B P29
+   P13 -10G-> fc1B P29
+   P14 -10G-> fc1A P21
+   P15 -10G-> fc2B P29
+   P16 -10G-> fc2A P21
+   P17 -10G-> fc3B P29
+   P18 -10G-> fc3A P21
+   P19 -10G-> lc2-3A/P3
+   P20 -10G-> lc2-3B/P3
+   P21 -10G-> lc2-3B/P2
+   P22 -10G-> lc2-3B/P1
+   P23 -10G-> lc2-3A/P2
+   P24 -10G-> lc2-3A/P1
+   P25 -10G-> lc2-4A/P3
+   P26 -10G-> lc2-4B/P3
+   P27 -10G-> lc2-4B/P2
+   P28 -10G-> lc2-4B/P1
+   P29 -10G-> lc2-4A/P2
+   P30 -10G-> lc2-4A/P1
+   P31 -10G-> lc2-5A/P1
+   P32 -10G-> lc2-5A/P2
+   P33 -10G-> lc2-5B/P1
+   P34 -10G-> lc2-5B/P2
+   P35 -10G-> lc2-5B/P3
+   P36 -10G-> lc2-5A/P3
+
+SUBSYSTEM LEAF lc2C
+   P1 -10G-> fc9B P26
+   P2 -10G-> fc9A P20
+   P3 -10G-> fc8B P26
+   P4 -10G-> fc8A P20
+   P5 -10G-> fc7B P26
+   P6 -10G-> fc7A P20
+   P7 -10G-> fc6B P26
+   P8 -10G-> fc6A P20
+   P9 -10G-> fc5B P26
+   P10 -10G-> fc5A P20
+   P11 -10G-> fc4B P26
+   P12 -10G-> fc4A P20
+   P13 -10G-> fc1A P20
+   P14 -10G-> fc1B P26
+   P15 -10G-> fc2A P20
+   P16 -10G-> fc2B P26
+   P17 -10G-> fc3A P20
+   P18 -10G-> fc3B P26
+   P19 -10G-> lc2-6A/P3
+   P20 -10G-> lc2-6B/P3
+   P21 -10G-> lc2-6B/P2
+   P22 -10G-> lc2-6B/P1
+   P23 -10G-> lc2-6A/P2
+   P24 -10G-> lc2-6A/P1
+   P25 -10G-> lc2-7A/P3
+   P26 -10G-> lc2-7B/P3
+   P27 -10G-> lc2-7B/P2
+   P28 -10G-> lc2-7B/P1
+   P29 -10G-> lc2-7A/P2
+   P30 -10G-> lc2-7A/P1
+   P31 -10G-> lc2-8A/P1
+   P32 -10G-> lc2-8A/P2
+   P33 -10G-> lc2-8B/P1
+   P34 -10G-> lc2-8B/P2
+   P35 -10G-> lc2-8B/P3
+   P36 -10G-> lc2-8A/P3
+
+SUBSYSTEM LEAF lc2D
+   P1 -10G-> fc9A P19
+   P2 -10G-> fc9B P36
+   P3 -10G-> fc8A P19
+   P4 -10G-> fc8B P36
+   P5 -10G-> fc7A P19
+   P6 -10G-> fc7B P36
+   P7 -10G-> fc6A P19
+   P8 -10G-> fc6B P36
+   P9 -10G-> fc5A P19
+   P10 -10G-> fc5B P36
+   P11 -10G-> fc4A P19
+   P12 -10G-> fc4B P36
+   P13 -10G-> fc1B P36
+   P14 -10G-> fc1A P19
+   P15 -10G-> fc2B P36
+   P16 -10G-> fc2A P19
+   P17 -10G-> fc3B P36
+   P18 -10G-> fc3A P19
+   P19 -10G-> lc2-9A/P3
+   P20 -10G-> lc2-9B/P3
+   P21 -10G-> lc2-9B/P2
+   P22 -10G-> lc2-9B/P1
+   P23 -10G-> lc2-9A/P2
+   P24 -10G-> lc2-9A/P1
+   P25 -10G-> lc2-10A/P3
+   P26 -10G-> lc2-10B/P3
+   P27 -10G-> lc2-10B/P2
+   P28 -10G-> lc2-10B/P1
+   P29 -10G-> lc2-10A/P2
+   P30 -10G-> lc2-10A/P1
+   P31 -10G-> lc2-11A/P1
+   P32 -10G-> lc2-11A/P2
+   P33 -10G-> lc2-11B/P1
+   P34 -10G-> lc2-11B/P2
+   P35 -10G-> lc2-11B/P3
+   P36 -10G-> lc2-11A/P3
+
+SUBSYSTEM LEAF lc3A
+   P1 -10G-> fc9B P34
+   P2 -10G-> fc9A P27
+   P3 -10G-> fc8B P34
+   P4 -10G-> fc8A P27
+   P5 -10G-> fc7B P34
+   P6 -10G-> fc7A P27
+   P7 -10G-> fc6B P34
+   P8 -10G-> fc6A P27
+   P9 -10G-> fc5B P34
+   P10 -10G-> fc5A P27
+   P11 -10G-> fc4B P34
+   P12 -10G-> fc4A P27
+   P13 -10G-> fc1A P27
+   P14 -10G-> fc1B P34
+   P15 -10G-> fc2A P27
+   P16 -10G-> fc2B P34
+   P17 -10G-> fc3A P27
+   P18 -10G-> fc3B P34
+   P19 -10G-> lc3-0A/P3
+   P20 -10G-> lc3-0B/P3
+   P21 -10G-> lc3-0B/P2
+   P22 -10G-> lc3-0B/P1
+   P23 -10G-> lc3-0A/P2
+   P24 -10G-> lc3-0A/P1
+   P25 -10G-> lc3-1A/P3
+   P26 -10G-> lc3-1B/P3
+   P27 -10G-> lc3-1B/P2
+   P28 -10G-> lc3-1B/P1
+   P29 -10G-> lc3-1A/P2
+   P30 -10G-> lc3-1A/P1
+   P31 -10G-> lc3-2A/P1
+   P32 -10G-> lc3-2A/P2
+   P33 -10G-> lc3-2B/P1
+   P34 -10G-> lc3-2B/P2
+   P35 -10G-> lc3-2B/P3
+   P36 -10G-> lc3-2A/P3
+
+SUBSYSTEM LEAF lc3B
+   P1 -10G-> fc9A P24
+   P2 -10G-> fc9B P35
+   P3 -10G-> fc8A P24
+   P4 -10G-> fc8B P35
+   P5 -10G-> fc7A P24
+   P6 -10G-> fc7B P35
+   P7 -10G-> fc6A P24
+   P8 -10G-> fc6B P35
+   P9 -10G-> fc5A P24
+   P10 -10G-> fc5B P35
+   P11 -10G-> fc4A P24
+   P12 -10G-> fc4B P35
+   P13 -10G-> fc1B P35
+   P14 -10G-> fc1A P24
+   P15 -10G-> fc2B P35
+   P16 -10G-> fc2A P24
+   P17 -10G-> fc3B P35
+   P18 -10G-> fc3A P24
+   P19 -10G-> lc3-3A/P3
+   P20 -10G-> lc3-3B/P3
+   P21 -10G-> lc3-3B/P2
+   P22 -10G-> lc3-3B/P1
+   P23 -10G-> lc3-3A/P2
+   P24 -10G-> lc3-3A/P1
+   P25 -10G-> lc3-4A/P3
+   P26 -10G-> lc3-4B/P3
+   P27 -10G-> lc3-4B/P2
+   P28 -10G-> lc3-4B/P1
+   P29 -10G-> lc3-4A/P2
+   P30 -10G-> lc3-4A/P1
+   P31 -10G-> lc3-5A/P1
+   P32 -10G-> lc3-5A/P2
+   P33 -10G-> lc3-5B/P1
+   P34 -10G-> lc3-5B/P2
+   P35 -10G-> lc3-5B/P3
+   P36 -10G-> lc3-5A/P3
+
+SUBSYSTEM LEAF lc3C
+   P1 -10G-> fc9B P30
+   P2 -10G-> fc9A P25
+   P3 -10G-> fc8B P30
+   P4 -10G-> fc8A P25
+   P5 -10G-> fc7B P30
+   P6 -10G-> fc7A P25
+   P7 -10G-> fc6B P30
+   P8 -10G-> fc6A P25
+   P9 -10G-> fc5B P30
+   P10 -10G-> fc5A P25
+   P11 -10G-> fc4B P30
+   P12 -10G-> fc4A P25
+   P13 -10G-> fc1A P25
+   P14 -10G-> fc1B P30
+   P15 -10G-> fc2A P25
+   P16 -10G-> fc2B P30
+   P17 -10G-> fc3A P25
+   P18 -10G-> fc3B P30
+   P19 -10G-> lc3-6A/P3
+   P20 -10G-> lc3-6B/P3
+   P21 -10G-> lc3-6B/P2
+   P22 -10G-> lc3-6B/P1
+   P23 -10G-> lc3-6A/P2
+   P24 -10G-> lc3-6A/P1
+   P25 -10G-> lc3-7A/P3
+   P26 -10G-> lc3-7B/P3
+   P27 -10G-> lc3-7B/P2
+   P28 -10G-> lc3-7B/P1
+   P29 -10G-> lc3-7A/P2
+   P30 -10G-> lc3-7A/P1
+   P31 -10G-> lc3-8A/P1
+   P32 -10G-> lc3-8A/P2
+   P33 -10G-> lc3-8B/P1
+   P34 -10G-> lc3-8B/P2
+   P35 -10G-> lc3-8B/P3
+   P36 -10G-> lc3-8A/P3
+
+SUBSYSTEM LEAF lc3D
+   P1 -10G-> fc9A P23
+   P2 -10G-> fc9B P33
+   P3 -10G-> fc8A P23
+   P4 -10G-> fc8B P33
+   P5 -10G-> fc7A P23
+   P6 -10G-> fc7B P33
+   P7 -10G-> fc6A P23
+   P8 -10G-> fc6B P33
+   P9 -10G-> fc5A P23
+   P10 -10G-> fc5B P33
+   P11 -10G-> fc4A P23
+   P12 -10G-> fc4B P33
+   P13 -10G-> fc1B P33
+   P14 -10G-> fc1A P23
+   P15 -10G-> fc2B P33
+   P16 -10G-> fc2A P23
+   P17 -10G-> fc3B P33
+   P18 -10G-> fc3A P23
+   P19 -10G-> lc3-9A/P3
+   P20 -10G-> lc3-9B/P3
+   P21 -10G-> lc3-9B/P2
+   P22 -10G-> lc3-9B/P1
+   P23 -10G-> lc3-9A/P2
+   P24 -10G-> lc3-9A/P1
+   P25 -10G-> lc3-10A/P3
+   P26 -10G-> lc3-10B/P3
+   P27 -10G-> lc3-10B/P2
+   P28 -10G-> lc3-10B/P1
+   P29 -10G-> lc3-10A/P2
+   P30 -10G-> lc3-10A/P1
+   P31 -10G-> lc3-11A/P1
+   P32 -10G-> lc3-11A/P2
+   P33 -10G-> lc3-11B/P1
+   P34 -10G-> lc3-11B/P2
+   P35 -10G-> lc3-11B/P3
+   P36 -10G-> lc3-11A/P3
+
+SUBSYSTEM LEAF lc4A
+   P1 -10G-> fc9B P13
+   P2 -10G-> fc9A P36
+   P3 -10G-> fc8B P13
+   P4 -10G-> fc8A P36
+   P5 -10G-> fc7B P13
+   P6 -10G-> fc7A P36
+   P7 -10G-> fc6B P13
+   P8 -10G-> fc6A P36
+   P9 -10G-> fc5B P13
+   P10 -10G-> fc5A P36
+   P11 -10G-> fc4B P13
+   P12 -10G-> fc4A P36
+   P13 -10G-> fc1A P36
+   P14 -10G-> fc1B P13
+   P15 -10G-> fc2A P36
+   P16 -10G-> fc2B P13
+   P17 -10G-> fc3A P36
+   P18 -10G-> fc3B P13
+   P19 -10G-> lc4-0A/P3
+   P20 -10G-> lc4-0B/P3
+   P21 -10G-> lc4-0B/P2
+   P22 -10G-> lc4-0B/P1
+   P23 -10G-> lc4-0A/P2
+   P24 -10G-> lc4-0A/P1
+   P25 -10G-> lc4-1A/P3
+   P26 -10G-> lc4-1B/P3
+   P27 -10G-> lc4-1B/P2
+   P28 -10G-> lc4-1B/P1
+   P29 -10G-> lc4-1A/P2
+   P30 -10G-> lc4-1A/P1
+   P31 -10G-> lc4-2A/P1
+   P32 -10G-> lc4-2A/P2
+   P33 -10G-> lc4-2B/P1
+   P34 -10G-> lc4-2B/P2
+   P35 -10G-> lc4-2B/P3
+   P36 -10G-> lc4-2A/P3
+
+SUBSYSTEM LEAF lc4B
+   P1 -10G-> fc9A P28
+   P2 -10G-> fc9B P31
+   P3 -10G-> fc8A P28
+   P4 -10G-> fc8B P31
+   P5 -10G-> fc7A P28
+   P6 -10G-> fc7B P31
+   P7 -10G-> fc6A P28
+   P8 -10G-> fc6B P31
+   P9 -10G-> fc5A P28
+   P10 -10G-> fc5B P31
+   P11 -10G-> fc4A P28
+   P12 -10G-> fc4B P31
+   P13 -10G-> fc1B P31
+   P14 -10G-> fc1A P28
+   P15 -10G-> fc2B P31
+   P16 -10G-> fc2A P28
+   P17 -10G-> fc3B P31
+   P18 -10G-> fc3A P28
+   P19 -10G-> lc4-3A/P3
+   P20 -10G-> lc4-3B/P3
+   P21 -10G-> lc4-3B/P2
+   P22 -10G-> lc4-3B/P1
+   P23 -10G-> lc4-3A/P2
+   P24 -10G-> lc4-3A/P1
+   P25 -10G-> lc4-4A/P3
+   P26 -10G-> lc4-4B/P3
+   P27 -10G-> lc4-4B/P2
+   P28 -10G-> lc4-4B/P1
+   P29 -10G-> lc4-4A/P2
+   P30 -10G-> lc4-4A/P1
+   P31 -10G-> lc4-5A/P1
+   P32 -10G-> lc4-5A/P2
+   P33 -10G-> lc4-5B/P1
+   P34 -10G-> lc4-5B/P2
+   P35 -10G-> lc4-5B/P3
+   P36 -10G-> lc4-5A/P3
+
+SUBSYSTEM LEAF lc4C
+   P1 -10G-> fc9B P32
+   P2 -10G-> fc9A P29
+   P3 -10G-> fc8B P32
+   P4 -10G-> fc8A P29
+   P5 -10G-> fc7B P32
+   P6 -10G-> fc7A P29
+   P7 -10G-> fc6B P32
+   P8 -10G-> fc6A P29
+   P9 -10G-> fc5B P32
+   P10 -10G-> fc5A P29
+   P11 -10G-> fc4B P32
+   P12 -10G-> fc4A P29
+   P13 -10G-> fc1A P29
+   P14 -10G-> fc1B P32
+   P15 -10G-> fc2A P29
+   P16 -10G-> fc2B P32
+   P17 -10G-> fc3A P29
+   P18 -10G-> fc3B P32
+   P19 -10G-> lc4-6A/P3
+   P20 -10G-> lc4-6B/P3
+   P21 -10G-> lc4-6B/P2
+   P22 -10G-> lc4-6B/P1
+   P23 -10G-> lc4-6A/P2
+   P24 -10G-> lc4-6A/P1
+   P25 -10G-> lc4-7A/P3
+   P26 -10G-> lc4-7B/P3
+   P27 -10G-> lc4-7B/P2
+   P28 -10G-> lc4-7B/P1
+   P29 -10G-> lc4-7A/P2
+   P30 -10G-> lc4-7A/P1
+   P31 -10G-> lc4-8A/P1
+   P32 -10G-> lc4-8A/P2
+   P33 -10G-> lc4-8B/P1
+   P34 -10G-> lc4-8B/P2
+   P35 -10G-> lc4-8B/P3
+   P36 -10G-> lc4-8A/P3
+
+SUBSYSTEM LEAF lc4D
+   P1 -10G-> fc9A P26
+   P2 -10G-> fc9B P14
+   P3 -10G-> fc8A P26
+   P4 -10G-> fc8B P14
+   P5 -10G-> fc7A P26
+   P6 -10G-> fc7B P14
+   P7 -10G-> fc6A P26
+   P8 -10G-> fc6B P14
+   P9 -10G-> fc5A P26
+   P10 -10G-> fc5B P14
+   P11 -10G-> fc4A P26
+   P12 -10G-> fc4B P14
+   P13 -10G-> fc1B P14
+   P14 -10G-> fc1A P26
+   P15 -10G-> fc2B P14
+   P16 -10G-> fc2A P26
+   P17 -10G-> fc3B P14
+   P18 -10G-> fc3A P26
+   P19 -10G-> lc4-9A/P3
+   P20 -10G-> lc4-9B/P3
+   P21 -10G-> lc4-9B/P2
+   P22 -10G-> lc4-9B/P1
+   P23 -10G-> lc4-9A/P2
+   P24 -10G-> lc4-9A/P1
+   P25 -10G-> lc4-10A/P3
+   P26 -10G-> lc4-10B/P3
+   P27 -10G-> lc4-10B/P2
+   P28 -10G-> lc4-10B/P1
+   P29 -10G-> lc4-10A/P2
+   P30 -10G-> lc4-10A/P1
+   P31 -10G-> lc4-11A/P1
+   P32 -10G-> lc4-11A/P2
+   P33 -10G-> lc4-11B/P1
+   P34 -10G-> lc4-11B/P2
+   P35 -10G-> lc4-11B/P3
+   P36 -10G-> lc4-11A/P3
+
+SUBSYSTEM LEAF lc5A
+   P1 -10G-> fc9B P17
+   P2 -10G-> fc9A P33
+   P3 -10G-> fc8B P17
+   P4 -10G-> fc8A P33
+   P5 -10G-> fc7B P17
+   P6 -10G-> fc7A P33
+   P7 -10G-> fc6B P17
+   P8 -10G-> fc6A P33
+   P9 -10G-> fc5B P17
+   P10 -10G-> fc5A P33
+   P11 -10G-> fc4B P17
+   P12 -10G-> fc4A P33
+   P13 -10G-> fc1A P33
+   P14 -10G-> fc1B P17
+   P15 -10G-> fc2A P33
+   P16 -10G-> fc2B P17
+   P17 -10G-> fc3A P33
+   P18 -10G-> fc3B P17
+   P19 -10G-> lc5-0A/P3
+   P20 -10G-> lc5-0B/P3
+   P21 -10G-> lc5-0B/P2
+   P22 -10G-> lc5-0B/P1
+   P23 -10G-> lc5-0A/P2
+   P24 -10G-> lc5-0A/P1
+   P25 -10G-> lc5-1A/P3
+   P26 -10G-> lc5-1B/P3
+   P27 -10G-> lc5-1B/P2
+   P28 -10G-> lc5-1B/P1
+   P29 -10G-> lc5-1A/P2
+   P30 -10G-> lc5-1A/P1
+   P31 -10G-> lc5-2A/P1
+   P32 -10G-> lc5-2A/P2
+   P33 -10G-> lc5-2B/P1
+   P34 -10G-> lc5-2B/P2
+   P35 -10G-> lc5-2B/P3
+   P36 -10G-> lc5-2A/P3
+
+SUBSYSTEM LEAF lc5B
+   P1 -10G-> fc9A P34
+   P2 -10G-> fc9B P16
+   P3 -10G-> fc8A P34
+   P4 -10G-> fc8B P16
+   P5 -10G-> fc7A P34
+   P6 -10G-> fc7B P16
+   P7 -10G-> fc6A P34
+   P8 -10G-> fc6B P16
+   P9 -10G-> fc5A P34
+   P10 -10G-> fc5B P16
+   P11 -10G-> fc4A P34
+   P12 -10G-> fc4B P16
+   P13 -10G-> fc1B P16
+   P14 -10G-> fc1A P34
+   P15 -10G-> fc2B P16
+   P16 -10G-> fc2A P34
+   P17 -10G-> fc3B P16
+   P18 -10G-> fc3A P34
+   P19 -10G-> lc5-3A/P3
+   P20 -10G-> lc5-3B/P3
+   P21 -10G-> lc5-3B/P2
+   P22 -10G-> lc5-3B/P1
+   P23 -10G-> lc5-3A/P2
+   P24 -10G-> lc5-3A/P1
+   P25 -10G-> lc5-4A/P3
+   P26 -10G-> lc5-4B/P3
+   P27 -10G-> lc5-4B/P2
+   P28 -10G-> lc5-4B/P1
+   P29 -10G-> lc5-4A/P2
+   P30 -10G-> lc5-4A/P1
+   P31 -10G-> lc5-5A/P1
+   P32 -10G-> lc5-5A/P2
+   P33 -10G-> lc5-5B/P1
+   P34 -10G-> lc5-5B/P2
+   P35 -10G-> lc5-5B/P3
+   P36 -10G-> lc5-5A/P3
+
+SUBSYSTEM LEAF lc5C
+   P1 -10G-> fc9B P15
+   P2 -10G-> fc9A P35
+   P3 -10G-> fc8B P15
+   P4 -10G-> fc8A P35
+   P5 -10G-> fc7B P15
+   P6 -10G-> fc7A P35
+   P7 -10G-> fc6B P15
+   P8 -10G-> fc6A P35
+   P9 -10G-> fc5B P15
+   P10 -10G-> fc5A P35
+   P11 -10G-> fc4B P15
+   P12 -10G-> fc4A P35
+   P13 -10G-> fc1A P35
+   P14 -10G-> fc1B P15
+   P15 -10G-> fc2A P35
+   P16 -10G-> fc2B P15
+   P17 -10G-> fc3A P35
+   P18 -10G-> fc3B P15
+   P19 -10G-> lc5-6A/P3
+   P20 -10G-> lc5-6B/P3
+   P21 -10G-> lc5-6B/P2
+   P22 -10G-> lc5-6B/P1
+   P23 -10G-> lc5-6A/P2
+   P24 -10G-> lc5-6A/P1
+   P25 -10G-> lc5-7A/P3
+   P26 -10G-> lc5-7B/P3
+   P27 -10G-> lc5-7B/P2
+   P28 -10G-> lc5-7B/P1
+   P29 -10G-> lc5-7A/P2
+   P30 -10G-> lc5-7A/P1
+   P31 -10G-> lc5-8A/P1
+   P32 -10G-> lc5-8A/P2
+   P33 -10G-> lc5-8B/P1
+   P34 -10G-> lc5-8B/P2
+   P35 -10G-> lc5-8B/P3
+   P36 -10G-> lc5-8A/P3
+
+SUBSYSTEM LEAF lc5D
+   P1 -10G-> fc9A P30
+   P2 -10G-> fc9B P12
+   P3 -10G-> fc8A P30
+   P4 -10G-> fc8B P12
+   P5 -10G-> fc7A P30
+   P6 -10G-> fc7B P12
+   P7 -10G-> fc6A P30
+   P8 -10G-> fc6B P12
+   P9 -10G-> fc5A P30
+   P10 -10G-> fc5B P12
+   P11 -10G-> fc4A P30
+   P12 -10G-> fc4B P12
+   P13 -10G-> fc1B P12
+   P14 -10G-> fc1A P30
+   P15 -10G-> fc2B P12
+   P16 -10G-> fc2A P30
+   P17 -10G-> fc3B P12
+   P18 -10G-> fc3A P30
+   P19 -10G-> lc5-9A/P3
+   P20 -10G-> lc5-9B/P3
+   P21 -10G-> lc5-9B/P2
+   P22 -10G-> lc5-9B/P1
+   P23 -10G-> lc5-9A/P2
+   P24 -10G-> lc5-9A/P1
+   P25 -10G-> lc5-10A/P3
+   P26 -10G-> lc5-10B/P3
+   P27 -10G-> lc5-10B/P2
+   P28 -10G-> lc5-10B/P1
+   P29 -10G-> lc5-10A/P2
+   P30 -10G-> lc5-10A/P1
+   P31 -10G-> lc5-11A/P1
+   P32 -10G-> lc5-11A/P2
+   P33 -10G-> lc5-11B/P1
+   P34 -10G-> lc5-11B/P2
+   P35 -10G-> lc5-11B/P3
+   P36 -10G-> lc5-11A/P3
+
+SUBSYSTEM LEAF lc6A
+   P1 -10G-> fc9B P11
+   P2 -10G-> fc9A P14
+   P3 -10G-> fc8B P11
+   P4 -10G-> fc8A P14
+   P5 -10G-> fc7B P11
+   P6 -10G-> fc7A P14
+   P7 -10G-> fc6B P11
+   P8 -10G-> fc6A P14
+   P9 -10G-> fc5B P11
+   P10 -10G-> fc5A P14
+   P11 -10G-> fc4B P11
+   P12 -10G-> fc4A P14
+   P13 -10G-> fc1A P14
+   P14 -10G-> fc1B P11
+   P15 -10G-> fc2A P14
+   P16 -10G-> fc2B P11
+   P17 -10G-> fc3A P14
+   P18 -10G-> fc3B P11
+   P19 -10G-> lc6-0A/P3
+   P20 -10G-> lc6-0B/P3
+   P21 -10G-> lc6-0B/P2
+   P22 -10G-> lc6-0B/P1
+   P23 -10G-> lc6-0A/P2
+   P24 -10G-> lc6-0A/P1
+   P25 -10G-> lc6-1A/P3
+   P26 -10G-> lc6-1B/P3
+   P27 -10G-> lc6-1B/P2
+   P28 -10G-> lc6-1B/P1
+   P29 -10G-> lc6-1A/P2
+   P30 -10G-> lc6-1A/P1
+   P31 -10G-> lc6-2A/P1
+   P32 -10G-> lc6-2A/P2
+   P33 -10G-> lc6-2B/P1
+   P34 -10G-> lc6-2B/P2
+   P35 -10G-> lc6-2B/P3
+   P36 -10G-> lc6-2A/P3
+
+SUBSYSTEM LEAF lc6B
+   P1 -10G-> fc9A P13
+   P2 -10G-> fc9B P10
+   P3 -10G-> fc8A P13
+   P4 -10G-> fc8B P10
+   P5 -10G-> fc7A P13
+   P6 -10G-> fc7B P10
+   P7 -10G-> fc6A P13
+   P8 -10G-> fc6B P10
+   P9 -10G-> fc5A P13
+   P10 -10G-> fc5B P10
+   P11 -10G-> fc4A P13
+   P12 -10G-> fc4B P10
+   P13 -10G-> fc1B P10
+   P14 -10G-> fc1A P13
+   P15 -10G-> fc2B P10
+   P16 -10G-> fc2A P13
+   P17 -10G-> fc3B P10
+   P18 -10G-> fc3A P13
+   P19 -10G-> lc6-3A/P3
+   P20 -10G-> lc6-3B/P3
+   P21 -10G-> lc6-3B/P2
+   P22 -10G-> lc6-3B/P1
+   P23 -10G-> lc6-3A/P2
+   P24 -10G-> lc6-3A/P1
+   P25 -10G-> lc6-4A/P3
+   P26 -10G-> lc6-4B/P3
+   P27 -10G-> lc6-4B/P2
+   P28 -10G-> lc6-4B/P1
+   P29 -10G-> lc6-4A/P2
+   P30 -10G-> lc6-4A/P1
+   P31 -10G-> lc6-5A/P1
+   P32 -10G-> lc6-5A/P2
+   P33 -10G-> lc6-5B/P1
+   P34 -10G-> lc6-5B/P2
+   P35 -10G-> lc6-5B/P3
+   P36 -10G-> lc6-5A/P3
+
+SUBSYSTEM LEAF lc6C
+   P1 -10G-> fc9B P18
+   P2 -10G-> fc9A P31
+   P3 -10G-> fc8B P18
+   P4 -10G-> fc8A P31
+   P5 -10G-> fc7B P18
+   P6 -10G-> fc7A P31
+   P7 -10G-> fc6B P18
+   P8 -10G-> fc6A P31
+   P9 -10G-> fc5B P18
+   P10 -10G-> fc5A P31
+   P11 -10G-> fc4B P18
+   P12 -10G-> fc4A P31
+   P13 -10G-> fc1A P31
+   P14 -10G-> fc1B P18
+   P15 -10G-> fc2A P31
+   P16 -10G-> fc2B P18
+   P17 -10G-> fc3A P31
+   P18 -10G-> fc3B P18
+   P19 -10G-> lc6-6A/P3
+   P20 -10G-> lc6-6B/P3
+   P21 -10G-> lc6-6B/P2
+   P22 -10G-> lc6-6B/P1
+   P23 -10G-> lc6-6A/P2
+   P24 -10G-> lc6-6A/P1
+   P25 -10G-> lc6-7A/P3
+   P26 -10G-> lc6-7B/P3
+   P27 -10G-> lc6-7B/P2
+   P28 -10G-> lc6-7B/P1
+   P29 -10G-> lc6-7A/P2
+   P30 -10G-> lc6-7A/P1
+   P31 -10G-> lc6-8A/P1
+   P32 -10G-> lc6-8A/P2
+   P33 -10G-> lc6-8B/P1
+   P34 -10G-> lc6-8B/P2
+   P35 -10G-> lc6-8B/P3
+   P36 -10G-> lc6-8A/P3
+
+SUBSYSTEM LEAF lc6D
+   P1 -10G-> fc9A P32
+   P2 -10G-> fc9B P8
+   P3 -10G-> fc8A P32
+   P4 -10G-> fc8B P8
+   P5 -10G-> fc7A P32
+   P6 -10G-> fc7B P8
+   P7 -10G-> fc6A P32
+   P8 -10G-> fc6B P8
+   P9 -10G-> fc5A P32
+   P10 -10G-> fc5B P8
+   P11 -10G-> fc4A P32
+   P12 -10G-> fc4B P8
+   P13 -10G-> fc1B P8
+   P14 -10G-> fc1A P32
+   P15 -10G-> fc2B P8
+   P16 -10G-> fc2A P32
+   P17 -10G-> fc3B P8
+   P18 -10G-> fc3A P32
+   P19 -10G-> lc6-9A/P3
+   P20 -10G-> lc6-9B/P3
+   P21 -10G-> lc6-9B/P2
+   P22 -10G-> lc6-9B/P1
+   P23 -10G-> lc6-9A/P2
+   P24 -10G-> lc6-9A/P1
+   P25 -10G-> lc6-10A/P3
+   P26 -10G-> lc6-10B/P3
+   P27 -10G-> lc6-10B/P2
+   P28 -10G-> lc6-10B/P1
+   P29 -10G-> lc6-10A/P2
+   P30 -10G-> lc6-10A/P1
+   P31 -10G-> lc6-11A/P1
+   P32 -10G-> lc6-11A/P2
+   P33 -10G-> lc6-11B/P1
+   P34 -10G-> lc6-11B/P2
+   P35 -10G-> lc6-11B/P3
+   P36 -10G-> lc6-11A/P3
+
+SUBSYSTEM LEAF lc7A
+   P1 -10G-> fc9B P7
+   P2 -10G-> fc9A P12
+   P3 -10G-> fc8B P7
+   P4 -10G-> fc8A P12
+   P5 -10G-> fc7B P7
+   P6 -10G-> fc7A P12
+   P7 -10G-> fc6B P7
+   P8 -10G-> fc6A P12
+   P9 -10G-> fc5B P7
+   P10 -10G-> fc5A P12
+   P11 -10G-> fc4B P7
+   P12 -10G-> fc4A P12
+   P13 -10G-> fc1A P12
+   P14 -10G-> fc1B P7
+   P15 -10G-> fc2A P12
+   P16 -10G-> fc2B P7
+   P17 -10G-> fc3A P12
+   P18 -10G-> fc3B P7
+   P19 -10G-> lc7-0A/P3
+   P20 -10G-> lc7-0B/P3
+   P21 -10G-> lc7-0B/P2
+   P22 -10G-> lc7-0B/P1
+   P23 -10G-> lc7-0A/P2
+   P24 -10G-> lc7-0A/P1
+   P25 -10G-> lc7-1A/P3
+   P26 -10G-> lc7-1B/P3
+   P27 -10G-> lc7-1B/P2
+   P28 -10G-> lc7-1B/P1
+   P29 -10G-> lc7-1A/P2
+   P30 -10G-> lc7-1A/P1
+   P31 -10G-> lc7-2A/P1
+   P32 -10G-> lc7-2A/P2
+   P33 -10G-> lc7-2B/P1
+   P34 -10G-> lc7-2B/P2
+   P35 -10G-> lc7-2B/P3
+   P36 -10G-> lc7-2A/P3
+
+SUBSYSTEM LEAF lc7B
+   P1 -10G-> fc9A P17
+   P2 -10G-> fc9B P6
+   P3 -10G-> fc8A P17
+   P4 -10G-> fc8B P6
+   P5 -10G-> fc7A P17
+   P6 -10G-> fc7B P6
+   P7 -10G-> fc6A P17
+   P8 -10G-> fc6B P6
+   P9 -10G-> fc5A P17
+   P10 -10G-> fc5B P6
+   P11 -10G-> fc4A P17
+   P12 -10G-> fc4B P6
+   P13 -10G-> fc1B P6
+   P14 -10G-> fc1A P17
+   P15 -10G-> fc2B P6
+   P16 -10G-> fc2A P17
+   P17 -10G-> fc3B P6
+   P18 -10G-> fc3A P17
+   P19 -10G-> lc7-3A/P3
+   P20 -10G-> lc7-3B/P3
+   P21 -10G-> lc7-3B/P2
+   P22 -10G-> lc7-3B/P1
+   P23 -10G-> lc7-3A/P2
+   P24 -10G-> lc7-3A/P1
+   P25 -10G-> lc7-4A/P3
+   P26 -10G-> lc7-4B/P3
+   P27 -10G-> lc7-4B/P2
+   P28 -10G-> lc7-4B/P1
+   P29 -10G-> lc7-4A/P2
+   P30 -10G-> lc7-4A/P1
+   P31 -10G-> lc7-5A/P1
+   P32 -10G-> lc7-5A/P2
+   P33 -10G-> lc7-5B/P1
+   P34 -10G-> lc7-5B/P2
+   P35 -10G-> lc7-5B/P3
+   P36 -10G-> lc7-5A/P3
+
+SUBSYSTEM LEAF lc7C
+   P1 -10G-> fc9B P9
+   P2 -10G-> fc9A P16
+   P3 -10G-> fc8B P9
+   P4 -10G-> fc8A P16
+   P5 -10G-> fc7B P9
+   P6 -10G-> fc7A P16
+   P7 -10G-> fc6B P9
+   P8 -10G-> fc6A P16
+   P9 -10G-> fc5B P9
+   P10 -10G-> fc5A P16
+   P11 -10G-> fc4B P9
+   P12 -10G-> fc4A P16
+   P13 -10G-> fc1A P16
+   P14 -10G-> fc1B P9
+   P15 -10G-> fc2A P16
+   P16 -10G-> fc2B P9
+   P17 -10G-> fc3A P16
+   P18 -10G-> fc3B P9
+   P19 -10G-> lc7-6A/P3
+   P20 -10G-> lc7-6B/P3
+   P21 -10G-> lc7-6B/P2
+   P22 -10G-> lc7-6B/P1
+   P23 -10G-> lc7-6A/P2
+   P24 -10G-> lc7-6A/P1
+   P25 -10G-> lc7-7A/P3
+   P26 -10G-> lc7-7B/P3
+   P27 -10G-> lc7-7B/P2
+   P28 -10G-> lc7-7B/P1
+   P29 -10G-> lc7-7A/P2
+   P30 -10G-> lc7-7A/P1
+   P31 -10G-> lc7-8A/P1
+   P32 -10G-> lc7-8A/P2
+   P33 -10G-> lc7-8B/P1
+   P34 -10G-> lc7-8B/P2
+   P35 -10G-> lc7-8B/P3
+   P36 -10G-> lc7-8A/P3
+
+SUBSYSTEM LEAF lc7D
+   P1 -10G-> fc9A P15
+   P2 -10G-> fc9B P5
+   P3 -10G-> fc8A P15
+   P4 -10G-> fc8B P5
+   P5 -10G-> fc7A P15
+   P6 -10G-> fc7B P5
+   P7 -10G-> fc6A P15
+   P8 -10G-> fc6B P5
+   P9 -10G-> fc5A P15
+   P10 -10G-> fc5B P5
+   P11 -10G-> fc4A P15
+   P12 -10G-> fc4B P5
+   P13 -10G-> fc1B P5
+   P14 -10G-> fc1A P15
+   P15 -10G-> fc2B P5
+   P16 -10G-> fc2A P15
+   P17 -10G-> fc3B P5
+   P18 -10G-> fc3A P15
+   P19 -10G-> lc7-9A/P3
+   P20 -10G-> lc7-9B/P3
+   P21 -10G-> lc7-9B/P2
+   P22 -10G-> lc7-9B/P1
+   P23 -10G-> lc7-9A/P2
+   P24 -10G-> lc7-9A/P1
+   P25 -10G-> lc7-10A/P3
+   P26 -10G-> lc7-10B/P3
+   P27 -10G-> lc7-10B/P2
+   P28 -10G-> lc7-10B/P1
+   P29 -10G-> lc7-10A/P2
+   P30 -10G-> lc7-10A/P1
+   P31 -10G-> lc7-11A/P1
+   P32 -10G-> lc7-11A/P2
+   P33 -10G-> lc7-11B/P1
+   P34 -10G-> lc7-11B/P2
+   P35 -10G-> lc7-11B/P3
+   P36 -10G-> lc7-11A/P3
+
+SUBSYSTEM LEAF lc8A
+   P1 -10G-> fc9B P2
+   P2 -10G-> fc9A P8
+   P3 -10G-> fc8B P2
+   P4 -10G-> fc8A P8
+   P5 -10G-> fc7B P2
+   P6 -10G-> fc7A P8
+   P7 -10G-> fc6B P2
+   P8 -10G-> fc6A P8
+   P9 -10G-> fc5B P2
+   P10 -10G-> fc5A P8
+   P11 -10G-> fc4B P2
+   P12 -10G-> fc4A P8
+   P13 -10G-> fc1A P8
+   P14 -10G-> fc1B P2
+   P15 -10G-> fc2A P8
+   P16 -10G-> fc2B P2
+   P17 -10G-> fc3A P8
+   P18 -10G-> fc3B P2
+   P19 -10G-> lc8-0A/P3
+   P20 -10G-> lc8-0B/P3
+   P21 -10G-> lc8-0B/P2
+   P22 -10G-> lc8-0B/P1
+   P23 -10G-> lc8-0A/P2
+   P24 -10G-> lc8-0A/P1
+   P25 -10G-> lc8-1A/P3
+   P26 -10G-> lc8-1B/P3
+   P27 -10G-> lc8-1B/P2
+   P28 -10G-> lc8-1B/P1
+   P29 -10G-> lc8-1A/P2
+   P30 -10G-> lc8-1A/P1
+   P31 -10G-> lc8-2A/P1
+   P32 -10G-> lc8-2A/P2
+   P33 -10G-> lc8-2B/P1
+   P34 -10G-> lc8-2B/P2
+   P35 -10G-> lc8-2B/P3
+   P36 -10G-> lc8-2A/P3
+
+SUBSYSTEM LEAF lc8B
+   P1 -10G-> fc9A P11
+   P2 -10G-> fc9B P3
+   P3 -10G-> fc8A P11
+   P4 -10G-> fc8B P3
+   P5 -10G-> fc7A P11
+   P6 -10G-> fc7B P3
+   P7 -10G-> fc6A P11
+   P8 -10G-> fc6B P3
+   P9 -10G-> fc5A P11
+   P10 -10G-> fc5B P3
+   P11 -10G-> fc4A P11
+   P12 -10G-> fc4B P3
+   P13 -10G-> fc1B P3
+   P14 -10G-> fc1A P11
+   P15 -10G-> fc2B P3
+   P16 -10G-> fc2A P11
+   P17 -10G-> fc3B P3
+   P18 -10G-> fc3A P11
+   P19 -10G-> lc8-3A/P3
+   P20 -10G-> lc8-3B/P3
+   P21 -10G-> lc8-3B/P2
+   P22 -10G-> lc8-3B/P1
+   P23 -10G-> lc8-3A/P2
+   P24 -10G-> lc8-3A/P1
+   P25 -10G-> lc8-4A/P3
+   P26 -10G-> lc8-4B/P3
+   P27 -10G-> lc8-4B/P2
+   P28 -10G-> lc8-4B/P1
+   P29 -10G-> lc8-4A/P2
+   P30 -10G-> lc8-4A/P1
+   P31 -10G-> lc8-5A/P1
+   P32 -10G-> lc8-5A/P2
+   P33 -10G-> lc8-5B/P1
+   P34 -10G-> lc8-5B/P2
+   P35 -10G-> lc8-5B/P3
+   P36 -10G-> lc8-5A/P3
+
+SUBSYSTEM LEAF lc8C
+   P1 -10G-> fc9B P4
+   P2 -10G-> fc9A P10
+   P3 -10G-> fc8B P4
+   P4 -10G-> fc8A P10
+   P5 -10G-> fc7B P4
+   P6 -10G-> fc7A P10
+   P7 -10G-> fc6B P4
+   P8 -10G-> fc6A P10
+   P9 -10G-> fc5B P4
+   P10 -10G-> fc5A P10
+   P11 -10G-> fc4B P4
+   P12 -10G-> fc4A P10
+   P13 -10G-> fc1A P10
+   P14 -10G-> fc1B P4
+   P15 -10G-> fc2A P10
+   P16 -10G-> fc2B P4
+   P17 -10G-> fc3A P10
+   P18 -10G-> fc3B P4
+   P19 -10G-> lc8-6A/P3
+   P20 -10G-> lc8-6B/P3
+   P21 -10G-> lc8-6B/P2
+   P22 -10G-> lc8-6B/P1
+   P23 -10G-> lc8-6A/P2
+   P24 -10G-> lc8-6A/P1
+   P25 -10G-> lc8-7A/P3
+   P26 -10G-> lc8-7B/P3
+   P27 -10G-> lc8-7B/P2
+   P28 -10G-> lc8-7B/P1
+   P29 -10G-> lc8-7A/P2
+   P30 -10G-> lc8-7A/P1
+   P31 -10G-> lc8-8A/P1
+   P32 -10G-> lc8-8A/P2
+   P33 -10G-> lc8-8B/P1
+   P34 -10G-> lc8-8B/P2
+   P35 -10G-> lc8-8B/P3
+   P36 -10G-> lc8-8A/P3
+
+SUBSYSTEM LEAF lc8D
+   P1 -10G-> fc9A P18
+   P2 -10G-> fc9B P1
+   P3 -10G-> fc8A P18
+   P4 -10G-> fc8B P1
+   P5 -10G-> fc7A P18
+   P6 -10G-> fc7B P1
+   P7 -10G-> fc6A P18
+   P8 -10G-> fc6B P1
+   P9 -10G-> fc5A P18
+   P10 -10G-> fc5B P1
+   P11 -10G-> fc4A P18
+   P12 -10G-> fc4B P1
+   P13 -10G-> fc1B P1
+   P14 -10G-> fc1A P18
+   P15 -10G-> fc2B P1
+   P16 -10G-> fc2A P18
+   P17 -10G-> fc3B P1
+   P18 -10G-> fc3A P18
+   P19 -10G-> lc8-9A/P3
+   P20 -10G-> lc8-9B/P3
+   P21 -10G-> lc8-9B/P2
+   P22 -10G-> lc8-9B/P1
+   P23 -10G-> lc8-9A/P2
+   P24 -10G-> lc8-9A/P1
+   P25 -10G-> lc8-10A/P3
+   P26 -10G-> lc8-10B/P3
+   P27 -10G-> lc8-10B/P2
+   P28 -10G-> lc8-10B/P1
+   P29 -10G-> lc8-10A/P2
+   P30 -10G-> lc8-10A/P1
+   P31 -10G-> lc8-11A/P1
+   P32 -10G-> lc8-11A/P2
+   P33 -10G-> lc8-11B/P1
+   P34 -10G-> lc8-11B/P2
+   P35 -10G-> lc8-11B/P3
+   P36 -10G-> lc8-11A/P3
+
+SUBSYSTEM LEAF lc9A
+   P1 -10G-> fc9B P21
+   P2 -10G-> fc9A P5
+   P3 -10G-> fc8B P21
+   P4 -10G-> fc8A P5
+   P5 -10G-> fc7B P21
+   P6 -10G-> fc7A P5
+   P7 -10G-> fc6B P21
+   P8 -10G-> fc6A P5
+   P9 -10G-> fc5B P21
+   P10 -10G-> fc5A P5
+   P11 -10G-> fc4B P21
+   P12 -10G-> fc4A P5
+   P13 -10G-> fc1A P5
+   P14 -10G-> fc1B P21
+   P15 -10G-> fc2A P5
+   P16 -10G-> fc2B P21
+   P17 -10G-> fc3A P5
+   P18 -10G-> fc3B P21
+   P19 -10G-> lc9-0A/P3
+   P20 -10G-> lc9-0B/P3
+   P21 -10G-> lc9-0B/P2
+   P22 -10G-> lc9-0B/P1
+   P23 -10G-> lc9-0A/P2
+   P24 -10G-> lc9-0A/P1
+   P25 -10G-> lc9-1A/P3
+   P26 -10G-> lc9-1B/P3
+   P27 -10G-> lc9-1B/P2
+   P28 -10G-> lc9-1B/P1
+   P29 -10G-> lc9-1A/P2
+   P30 -10G-> lc9-1A/P1
+   P31 -10G-> lc9-2A/P1
+   P32 -10G-> lc9-2A/P2
+   P33 -10G-> lc9-2B/P1
+   P34 -10G-> lc9-2B/P2
+   P35 -10G-> lc9-2B/P3
+   P36 -10G-> lc9-2A/P3
+
+SUBSYSTEM LEAF lc9B
+   P1 -10G-> fc9A P7
+   P2 -10G-> fc9B P20
+   P3 -10G-> fc8A P7
+   P4 -10G-> fc8B P20
+   P5 -10G-> fc7A P7
+   P6 -10G-> fc7B P20
+   P7 -10G-> fc6A P7
+   P8 -10G-> fc6B P20
+   P9 -10G-> fc5A P7
+   P10 -10G-> fc5B P20
+   P11 -10G-> fc4A P7
+   P12 -10G-> fc4B P20
+   P13 -10G-> fc1B P20
+   P14 -10G-> fc1A P7
+   P15 -10G-> fc2B P20
+   P16 -10G-> fc2A P7
+   P17 -10G-> fc3B P20
+   P18 -10G-> fc3A P7
+   P19 -10G-> lc9-3A/P3
+   P20 -10G-> lc9-3B/P3
+   P21 -10G-> lc9-3B/P2
+   P22 -10G-> lc9-3B/P1
+   P23 -10G-> lc9-3A/P2
+   P24 -10G-> lc9-3A/P1
+   P25 -10G-> lc9-4A/P3
+   P26 -10G-> lc9-4B/P3
+   P27 -10G-> lc9-4B/P2
+   P28 -10G-> lc9-4B/P1
+   P29 -10G-> lc9-4A/P2
+   P30 -10G-> lc9-4A/P1
+   P31 -10G-> lc9-5A/P1
+   P32 -10G-> lc9-5A/P2
+   P33 -10G-> lc9-5B/P1
+   P34 -10G-> lc9-5B/P2
+   P35 -10G-> lc9-5B/P3
+   P36 -10G-> lc9-5A/P3
+
+SUBSYSTEM LEAF lc9C
+   P1 -10G-> fc9B P19
+   P2 -10G-> fc9A P6
+   P3 -10G-> fc8B P19
+   P4 -10G-> fc8A P6
+   P5 -10G-> fc7B P19
+   P6 -10G-> fc7A P6
+   P7 -10G-> fc6B P19
+   P8 -10G-> fc6A P6
+   P9 -10G-> fc5B P19
+   P10 -10G-> fc5A P6
+   P11 -10G-> fc4B P19
+   P12 -10G-> fc4A P6
+   P13 -10G-> fc1A P6
+   P14 -10G-> fc1B P19
+   P15 -10G-> fc2A P6
+   P16 -10G-> fc2B P19
+   P17 -10G-> fc3A P6
+   P18 -10G-> fc3B P19
+   P19 -10G-> lc9-6A/P3
+   P20 -10G-> lc9-6B/P3
+   P21 -10G-> lc9-6B/P2
+   P22 -10G-> lc9-6B/P1
+   P23 -10G-> lc9-6A/P2
+   P24 -10G-> lc9-6A/P1
+   P25 -10G-> lc9-7A/P3
+   P26 -10G-> lc9-7B/P3
+   P27 -10G-> lc9-7B/P2
+   P28 -10G-> lc9-7B/P1
+   P29 -10G-> lc9-7A/P2
+   P30 -10G-> lc9-7A/P1
+   P31 -10G-> lc9-8A/P1
+   P32 -10G-> lc9-8A/P2
+   P33 -10G-> lc9-8B/P1
+   P34 -10G-> lc9-8B/P2
+   P35 -10G-> lc9-8B/P3
+   P36 -10G-> lc9-8A/P3
+
+SUBSYSTEM LEAF lc9D
+   P1 -10G-> fc9A P9
+   P2 -10G-> fc9B P22
+   P3 -10G-> fc8A P9
+   P4 -10G-> fc8B P22
+   P5 -10G-> fc7A P9
+   P6 -10G-> fc7B P22
+   P7 -10G-> fc6A P9
+   P8 -10G-> fc6B P22
+   P9 -10G-> fc5A P9
+   P10 -10G-> fc5B P22
+   P11 -10G-> fc4A P9
+   P12 -10G-> fc4B P22
+   P13 -10G-> fc1B P22
+   P14 -10G-> fc1A P9
+   P15 -10G-> fc2B P22
+   P16 -10G-> fc2A P9
+   P17 -10G-> fc3B P22
+   P18 -10G-> fc3A P9
+   P19 -10G-> lc9-9A/P3
+   P20 -10G-> lc9-9B/P3
+   P21 -10G-> lc9-9B/P2
+   P22 -10G-> lc9-9B/P1
+   P23 -10G-> lc9-9A/P2
+   P24 -10G-> lc9-9A/P1
+   P25 -10G-> lc9-10A/P3
+   P26 -10G-> lc9-10B/P3
+   P27 -10G-> lc9-10B/P2
+   P28 -10G-> lc9-10B/P1
+   P29 -10G-> lc9-10A/P2
+   P30 -10G-> lc9-10A/P1
+   P31 -10G-> lc9-11A/P1
+   P32 -10G-> lc9-11A/P2
+   P33 -10G-> lc9-11B/P1
+   P34 -10G-> lc9-11B/P2
+   P35 -10G-> lc9-11B/P3
+   P36 -10G-> lc9-11A/P3
diff --git a/ibdm/ibnl/SUNDCS72QDR.ibnl b/ibdm/ibnl/SUNDCS72QDR.ibnl
new file mode 100644
index 0000000..1907ec3
--- /dev/null
+++ b/ibdm/ibnl/SUNDCS72QDR.ibnl
@@ -0,0 +1,311 @@
+SYSTEM LEAF,LEAF:4x,LEAF:4X
+
+NODE SW 36 MT48436 U1
+1  -10G-> P1
+2  -10G-> P2
+3  -10G-> P3
+4  -10G-> P4
+5  -10G-> P5
+6  -10G-> P6
+7  -10G-> P7
+8  -10G-> P8
+9  -10G-> P9
+10 -10G-> P10
+11 -10G-> P11
+12 -10G-> P12
+13 -10G-> P13
+14 -10G-> P14
+15 -10G-> P15
+16 -10G-> P16
+17 -10G-> P17
+18 -10G-> P18
+19 -10G-> P19
+20 -10G-> P20
+21 -10G-> P21
+22 -10G-> P22
+23 -10G-> P23
+24 -10G-> P24
+25 -10G-> P25
+26 -10G-> P26
+27 -10G-> P27
+28 -10G-> P28
+29 -10G-> P29
+30 -10G-> P30
+31 -10G-> P31
+32 -10G-> P32
+33 -10G-> P33
+34 -10G-> P34
+35 -10G-> P35
+36 -10G-> P36
+
+SYSTEM SPINE,SPINE:4x,SPINE:4X
+
+NODE SW 36 MT48436 U1
+1  -10G-> P1
+2  -10G-> P2
+3  -10G-> P3
+4  -10G-> P4
+5  -10G-> P5
+6  -10G-> P6
+7  -10G-> P7
+8  -10G-> P8
+9  -10G-> P9
+10 -10G-> P10
+11 -10G-> P11
+12 -10G-> P12
+13 -10G-> P13
+14 -10G-> P14
+15 -10G-> P15
+16 -10G-> P16
+17 -10G-> P17
+18 -10G-> P18
+19 -10G-> P19
+20 -10G-> P20
+21 -10G-> P21
+22 -10G-> P22
+23 -10G-> P23
+24 -10G-> P24
+25 -10G-> P25
+26 -10G-> P26
+27 -10G-> P27
+28 -10G-> P28
+29 -10G-> P29
+30 -10G-> P30
+31 -10G-> P31
+32 -10G-> P32
+33 -10G-> P33
+34 -10G-> P34
+35 -10G-> P35
+36 -10G-> P36
+
+TOPSYSTEM SUNDCS72QDR,NM2-72P
+
+SUBSYSTEM SPINE SW-F
+   P1 -10G-> SW-C P9
+   P2 -10G-> SW-A P8
+   P4 -10G-> SW-C P6
+   P3 -10G-> SW-A P7
+   P5 -10G-> SW-A P5
+   P6 -10G-> SW-C P4
+   P7 -10G-> SW-A P3
+   P8 -10G-> SW-A P2
+   P9 -10G-> SW-C P1
+   P10 -10G-> SW-B P13
+   P11 -10G-> SW-D P14
+   P12 -10G-> SW-D P15
+   P13 -10G-> SW-B P10
+   P14 -10G-> SW-D P11
+   P15 -10G-> SW-D P12
+   P16 -10G-> SW-B P18
+   P17 -10G-> SW-D P17
+   P18 -10G-> SW-B P16
+   P19 -10G-> SW-A P10
+   P20 -10G-> SW-C P11
+   P21 -10G-> SW-C P12
+   P22 -10G-> SW-A P18
+   P23 -10G-> SW-C P17
+   P24 -10G-> SW-A P16
+   P25 -10G-> SW-C P15
+   P26 -10G-> SW-C P14
+   P27 -10G-> SW-A P13
+   P28 -10G-> SW-D P1
+   P29 -10G-> SW-B P2
+   P30 -10G-> SW-B P3
+   P31 -10G-> SW-D P9
+   P32 -10G-> SW-B P8
+   P33 -10G-> SW-B P7
+   P34 -10G-> SW-D P6
+   P35 -10G-> SW-B P5
+   P36 -10G-> SW-D P4
+
+
+SUBSYSTEM SPINE SW-E
+   P1 -10G-> SW-A P9
+   P2 -10G-> SW-C P8
+   P3 -10G-> SW-C P7
+   P4 -10G-> SW-A P6
+   P5 -10G-> SW-C P5
+   P6 -10G-> SW-A P4
+   P7 -10G-> SW-C P3
+   P8 -10G-> SW-C P2
+   P9 -10G-> SW-A P1
+   P10 -10G-> SW-D P13
+   P11 -10G-> SW-B P14
+   P12 -10G-> SW-B P15
+   P13 -10G-> SW-D P10
+   P14 -10G-> SW-B P11
+   P15 -10G-> SW-B P12
+   P16 -10G-> SW-D P18
+   P17 -10G-> SW-B P17
+   P18 -10G-> SW-D P16
+   P19 -10G-> SW-C P10
+   P20 -10G-> SW-A P11
+   P21 -10G-> SW-A P12
+   P22 -10G-> SW-C P18
+   P23 -10G-> SW-A P17
+   P24 -10G-> SW-C P16
+   P25 -10G-> SW-A P15
+   P26 -10G-> SW-A P14
+   P27 -10G-> SW-C P13
+   P28 -10G-> SW-B P1
+   P29 -10G-> SW-D P2
+   P30 -10G-> SW-D P3
+   P31 -10G-> SW-B P9
+   P32 -10G-> SW-D P8
+   P33 -10G-> SW-D P7
+   P34 -10G-> SW-B P6
+   P35 -10G-> SW-D P5
+   P36 -10G-> SW-B P4
+
+SUBSYSTEM LEAF SW-D
+   P1 -10G-> SW-F P28
+   P2 -10G-> SW-E P29
+   P3 -10G-> SW-E P30
+   P4 -10G-> SW-F P36
+   P5 -10G-> SW-E P35
+   P6 -10G-> SW-F P34
+   P7 -10G-> SW-E P33
+   P8 -10G-> SW-E P32
+   P9 -10G-> SW-F P31
+   P10 -10G-> SW-E P13
+   P11 -10G-> SW-F P14
+   P12 -10G-> SW-F P15
+   P13 -10G-> SW-E P10
+   P14 -10G-> SW-F P11
+   P15 -10G-> SW-F P12
+   P16 -10G-> SW-E P18
+   P17 -10G-> SW-F P17
+   P18 -10G-> SW-E P16
+   P19 -10G-> C-9A/P3
+   P20 -10G-> C-9B/P3
+   P21 -10G-> C-9B/P2
+   P22 -10G-> C-9B/P1
+   P23 -10G-> C-9A/P2
+   P24 -10G-> C-9A/P1
+   P25 -10G-> C-10A/P3
+   P26 -10G-> C-10B/P3
+   P27 -10G-> C-10B/P2
+   P28 -10G-> C-10B/P1
+   P29 -10G-> C-10A/P2
+   P30 -10G-> C-10A/P1
+   P31 -10G-> C-11A/P1
+   P32 -10G-> C-11A/P2
+   P33 -10G-> C-11B/P1
+   P34 -10G-> C-11B/P2
+   P35 -10G-> C-11B/P3
+   P36 -10G-> C-11A/P3
+
+SUBSYSTEM LEAF SW-C
+   P1 -10G-> SW-F P9
+   P2 -10G-> SW-E P8
+   P3 -10G-> SW-E P7
+   P4 -10G-> SW-F P6
+   P5 -10G-> SW-E P5
+   P6 -10G-> SW-F P4
+   P7 -10G-> SW-E P3
+   P8 -10G-> SW-E P2
+   P9 -10G-> SW-F P1
+   P10 -10G-> SW-E P19
+   P11 -10G-> SW-F P20
+   P12 -10G-> SW-F P21
+   P13 -10G-> SW-E P27
+   P14 -10G-> SW-F P26
+   P15 -10G-> SW-F P25
+   P16 -10G-> SW-E P24
+   P17 -10G-> SW-F P23
+   P18 -10G-> SW-E P22
+   P19 -10G-> C-6A/P3
+   P20 -10G-> C-6B/P3
+   P21 -10G-> C-6B/P2
+   P22 -10G-> C-6B/P1
+   P23 -10G-> C-6A/P2
+   P24 -10G-> C-6A/P1
+   P25 -10G-> C-7A/P3
+   P26 -10G-> C-7B/P3
+   P27 -10G-> C-7B/P2
+   P28 -10G-> C-7B/P1
+   P29 -10G-> C-7A/P2
+   P30 -10G-> C-7A/P1
+   P31 -10G-> C-8A/P1
+   P32 -10G-> C-8A/P2
+   P33 -10G-> C-8B/P1
+   P34 -10G-> C-8B/P2
+   P35 -10G-> C-8B/P3
+   P36 -10G-> C-8A/P3
+
+SUBSYSTEM LEAF SW-B
+   P1 -10G-> SW-E P28
+   P2 -10G-> SW-F P29
+   P3 -10G-> SW-F P30
+   P4 -10G-> SW-E P36
+   P5 -10G-> SW-F P35
+   P6 -10G-> SW-E P34
+   P7 -10G-> SW-F P33
+   P8 -10G-> SW-F P32
+   P9 -10G-> SW-E P31
+   P10 -10G-> SW-F P13
+   P11 -10G-> SW-E P14
+   P12 -10G-> SW-E P15
+   P13 -10G-> SW-F P10
+   P14 -10G-> SW-E P11
+   P15 -10G-> SW-E P12
+   P16 -10G-> SW-F P18
+   P17 -10G-> SW-E P17
+   P18 -10G-> SW-F P16
+   P19 -10G-> C-3A/P3
+   P20 -10G-> C-3B/P3
+   P21 -10G-> C-3B/P2
+   P22 -10G-> C-3B/P1
+   P23 -10G-> C-3A/P2
+   P24 -10G-> C-3A/P1
+   P25 -10G-> C-4A/P3
+   P26 -10G-> C-4B/P3
+   P27 -10G-> C-4B/P2
+   P28 -10G-> C-4B/P1
+   P29 -10G-> C-4A/P2
+   P30 -10G-> C-4A/P1
+   P31 -10G-> C-5A/P1
+   P32 -10G-> C-5A/P2
+   P33 -10G-> C-5B/P1
+   P34 -10G-> C-5B/P2
+   P35 -10G-> C-5B/P3
+   P36 -10G-> C-5A/P3
+
+SUBSYSTEM LEAF SW-A
+   P1 -10G-> SW-E P9
+   P2 -10G-> SW-F P8
+   P3 -10G-> SW-F P7
+   P4 -10G-> SW-E P6
+   P5 -10G-> SW-F P5
+   P6 -10G-> SW-E P4
+   P7 -10G-> SW-F P3
+   P8 -10G-> SW-F P2
+   P9 -10G-> SW-E P1
+   P10 -10G-> SW-F P19
+   P11 -10G-> SW-E P20
+   P12 -10G-> SW-E P21
+   P13 -10G-> SW-F P27
+   P14 -10G-> SW-E P26
+   P15 -10G-> SW-E P25
+   P16 -10G-> SW-F P24
+   P17 -10G-> SW-E P23
+   P18 -10G-> SW-F P22
+   P19 -10G-> C-0A/P3
+   P20 -10G-> C-0B/P3
+   P21 -10G-> C-0B/P2
+   P22 -10G-> C-0B/P1
+   P23 -10G-> C-0A/P2
+   P24 -10G-> C-0A/P1
+   P25 -10G-> C-1A/P3
+   P26 -10G-> C-1B/P3
+   P27 -10G-> C-1B/P2
+   P28 -10G-> C-1B/P1
+   P29 -10G-> C-1A/P2
+   P30 -10G-> C-1A/P1
+   P31 -10G-> C-2A/P1
+   P32 -10G-> C-2A/P2
+   P33 -10G-> C-2B/P1
+   P34 -10G-> C-2B/P2
+   P35 -10G-> C-2B/P3
+   P36 -10G-> C-2A/P3
+


From vlad at lists.openfabrics.org  Fri Aug 28 03:06:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 28 Aug 2009 03:06:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090828-0200 daily build status
Message-ID: <20090828100615.54101E61E5D@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From FENKES at de.ibm.com  Fri Aug 28 05:58:49 2009
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Fri, 28 Aug 2009 14:58:49 +0200
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect
	replies from request MAD
In-Reply-To: <f0e08f230908270631j3e159f3fgb0034eb41acdac7b@mail.gmail.com>
References: <200908261337.56128.fenkes@de.ibm.com>	
	<f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>
	<OFF2257478.FF0FEABE-ONC125761F.00344884-C125761F.00358310@de.ibm.com>
	<f0e08f230908270631j3e159f3fgb0034eb41acdac7b@mail.gmail.com>
Message-ID: <OFA25F14CC.CE7042E9-ONC1257620.00453C46-C1257620.00474D70@de.ibm.com>

Hal Rosenstock <hal.rosenstock at gmail.com> wrote on 27.08.2009 15:31:40:

> I don't think it should be hard coded. IMO it would be better to default
> to 18 and somehow able to be adjusted (via a (dynamic) module parameter ?).

I don't see how making this a parameter would benefit any end user, while 
on the other hand it clutters up our parameter list.

Changing RespTimeValue won't influence the IB performance or user-visible 
behavior of our driver in any way, and in fact, all RespTimeValue says is 
"Please use a timeout of one second for all future MADs you send me", only 
there won't be any more MADs in the future because we just redirected the 
client to someone else. So, the RespTimeValue field is a don't care in the 
redirection scenario. Setting it to an arbitrary, but legal value isn't 
much more than a concession towards any broken clients that may be out 
there.

Given that you seem to like the rest of the code and Jason hasn't spoken 
up yet, I think we can have Roland merge this patch. Roland, what do you 
think?

Regards,
  Joachim


From hnrose at comcast.net  Fri Aug 28 06:44:53 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 28 Aug 2009 09:44:53 -0400
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: Add SM priority changed
	into trap 144 description
Message-ID: <20090828134452.GA20014@comcast.net>


Per MgtWG RefID #4503

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 3692474..1b83a9e 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -531,7 +531,7 @@ const char *ib_get_trap_str(ib_net16_t trap_num)
 		return "Flow Control Update watchdog timer expired";
 	case 144:
 		return
-		    "CapabilityMask, NodeDescription, Link [Width|Speed] Enabled changed";
+		    "CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed";
 	case 145:
 		return "System Image GUID changed";
 	case 256:


From halves at linux.vnet.ibm.com  Fri Aug 28 08:37:30 2009
From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves)
Date: Fri, 28 Aug 2009 12:37:30 -0300
Subject: [ofa-general] OFED 1.5-alpha 4 and RHEL 5.3 GA
Message-ID: <1251473851.10055.3.camel@halves-ltc>

Hi Guys, 

I tried to build OFED 1.5 on RHEL 5.3 GA and got an error while building
ofa_kernel. The build log is attached.


Regards, 
-- 
Higor Aparecido Vieira Alves
Software Engineer
Linux Technology Center 
IBM Systems & Technology Group
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofa_kernel.rpmbuild.log
Type: text/x-log
Size: 772137 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090828/a330d88e/attachment.bin>

From hal.rosenstock at gmail.com  Fri Aug 28 09:03:47 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 28 Aug 2009 12:03:47 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables
	setup flow
In-Reply-To: <20090828080756.GH28379@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
Message-ID: <f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>

On 8/28/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>
> Simplify (and unify) forwarding tables setup decision flow.


Seems to work for all engines but I got a failure for a test case where lash
fell back to min hop:

lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8) with
starting lane (0)
ucast_mgr_route: lash: cannot build fwd tables.
osm_ucast_mgr_process: minhop tables configured on all switches
ERR 331D: LFT of switch 0xguid is not up to date.

Prior to this change, the LFTs were pushed for this fallback case (and no
ERR 331D occurred).

-- Hal

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
> opensm/opensm/osm_ucast_mgr.c |    7 +------
> 1 files changed, 1 insertions(+), 6 deletions(-)
>
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 629f628..8ba78f8 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
>                }
>        }
>
> -       set_fwd_tbl_top(p_mgr, p_sw);
> -
>        if (p_mgr->p_subn->opt.lmc)
>                free_ports_priv(p_mgr);
>
> @@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
>        cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
> ucast_mgr_process_tbl,
>                           p_mgr);
>
> -       ucast_mgr_pipeline_fwd_tbl(p_mgr);
> -
>        cl_qlist_remove_all(&p_mgr->port_order_list);
>
>        return 0;
> @@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine
> *r, osm_opensm_t * osm)
>
>        osm->routing_engine_used = osm_routing_engine_type(r->name);
>
> -       if (r->ucast_build_fwd_tables)
> -               osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
> +       osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
>
>        return 0;
> }
> --
> 1.6.4
>

From rdreier at cisco.com  Fri Aug 28 09:27:35 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Aug 2009 09:27:35 -0700
Subject: [ofa-general] QDR IB cards supports card back to back connectivity
In-Reply-To: <COL123-W51707D7C7D382D953B48EAB8F50@phx.gbl> (lakshmana swamy's
	message of "Fri, 28 Aug 2009 13:25:54 +0530")
References: <COL123-W51707D7C7D382D953B48EAB8F50@phx.gbl>
Message-ID: <adafxbbsy2w.fsf@cisco.com>


 >  I would like to know whether QDR InfiniBand cards support back-to-back
 >  connectivity, i.e. without an IB switch, to enable IB communication
 >  between the two machines.

Yes, any IB port should be able to connect to any other IB port.  You do
need a subnet manager (SM) on every IB fabric, so in your case of two
HCAs connected back-to-back, an SM must be running on one of the HCA ports.


From rdreier at cisco.com  Fri Aug 28 09:28:29 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Aug 2009 09:28:29 -0700
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect
	replies from request MAD
In-Reply-To: <OFA25F14CC.CE7042E9-ONC1257620.00453C46-C1257620.00474D70@de.ibm.com>
	(Joachim Fenkes's message of "Fri, 28 Aug 2009 14:58:49 +0200")
References: <200908261337.56128.fenkes@de.ibm.com>
	<f0e08f230908260815g70de3002pfd0b34f1b17abd6@mail.gmail.com>
	<OFF2257478.FF0FEABE-ONC125761F.00344884-C125761F.00358310@de.ibm.com>
	<f0e08f230908270631j3e159f3fgb0034eb41acdac7b@mail.gmail.com>
	<OFA25F14CC.CE7042E9-ONC1257620.00453C46-C1257620.00474D70@de.ibm.com>
Message-ID: <adabplzsy1e.fsf@cisco.com>


 > Given that you seem to like the rest of the code and Jason hasn't spoken 
 > up yet, I think we can have Roland merge this patch. Roland, what do you 
 > think?

I don't see any problem with the idea and this does sound like a step
forward, so I am planning on merging this (pending review).


From rdreier at cisco.com  Fri Aug 28 10:30:36 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Aug 2009 10:30:36 -0700
Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing
	list to vger?
In-Reply-To: <20090820.160800.50693597.davem@davemloft.net> (David Miller's
	message of "Thu, 20 Aug 2009 16:08:00 -0700 (PDT)")
References: <adavdkiw3gx.fsf@cisco.com>
	<20090820.160800.50693597.davem@davemloft.net>
Message-ID: <ada7hwnsv5v.fsf@cisco.com>

It seems we only had positive responses to moving from general@ to a new
linux-rdma at vger.kernel.org list, so I'll work on a transition plan.

For now, please continue to use general at lists.openfabrics.org.  However,
you may want to subscribe to the vger list to be ready for the
transition; for information on that, see http://vger.kernel.org.

 - Roland


From rdreier at cisco.com  Fri Aug 28 10:55:16 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Aug 2009 10:55:16 -0700
Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open
	while device is being removed
In-Reply-To: <200908111021.01612.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 11 Aug 2009 10:21:01 +0300")
References: <200908111021.01612.jackm@dev.mellanox.co.il>
Message-ID: <adaljl3rfgb.fsf@cisco.com>

checkpatch output:

WARNING: suspect code indent for conditional statements (8, 12)
#88: FILE: drivers/infiniband/hw/mlx4/main.c:345:
+	if (!dev->ib_active)
+	    return ERR_PTR(-EAGAIN);

ERROR: code indent should use tabs where possible
#107: FILE: drivers/infiniband/hw/mlx4/main.c:737:
+    ^I^Iibdev->ib_active = 0;$

total: 1 errors, 1 warnings, 31 lines checked

not great for a patch this small.  Please clean up.


From rdreier at cisco.com  Fri Aug 28 10:58:43 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Aug 2009 10:58:43 -0700
Subject: [ofa-general] Re: [PATCH] mthca: Do not allow ib userspace open
	following device internal error
In-Reply-To: <200908121215.46221.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Wed, 12 Aug 2009 12:15:46 +0300")
References: <200908121215.46221.jackm@dev.mellanox.co.il>
Message-ID: <adahbvrrfak.fsf@cisco.com>

thanks, applied (and thanks for the detailed changelog, that really
makes things easier)


From hal.rosenstock at gmail.com  Fri Aug 28 11:34:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 28 Aug 2009 14:34:36 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr: better lft setup
In-Reply-To: <20090828081002.GI28379@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me> <20090828081002.GI28379@me>
Message-ID: <f0e08f230908281134u79467923k4b3c1b1a7fe9c2b3@mail.gmail.com>

On 8/28/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>
> The function set_next_lft_block() is called in a loop with the block
> number incremented. Inside, it loops by itself looking for a changed
> block; the caller then calls it again with the original block number
> incremented, so this internal loop can be repeated again and again.
> This patch removes this inefficiency.
>
> Also rename it to set_lft_block(), since the block number is treated as
> a parameter and it is *not* the next block that is processed, and merge
> some code.
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>


Acked-by: Hal Rosenstock <hal.rosenstock at gmail.com>


> ---
> opensm/include/opensm/osm_ucast_mgr.h |    1 +
> opensm/opensm/osm_ucast_mgr.c         |  126
> +++++++++++----------------------
> 2 files changed, 43 insertions(+), 84 deletions(-)
>
> diff --git a/opensm/include/opensm/osm_ucast_mgr.h
> b/opensm/include/opensm/osm_ucast_mgr.h
> index 4ef045c..78a88f0 100644
> --- a/opensm/include/opensm/osm_ucast_mgr.h
> +++ b/opensm/include/opensm/osm_ucast_mgr.h
> @@ -95,6 +95,7 @@ typedef struct osm_ucast_mgr {
>        osm_subn_t *p_subn;
>        osm_log_t *p_log;
>        cl_plock_t *p_lock;
> +       uint16_t max_lid;
>        cl_qlist_t port_order_list;
>        boolean_t is_dor;
>        boolean_t some_hop_count_set;
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 8ba78f8..a111c10 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -336,6 +336,9 @@ static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> IN osm_switch_t * p_sw)
>
>        CL_ASSERT(p_node);
>
> +       if (p_mgr->max_lid < p_sw->max_lid_ho)
> +               p_mgr->max_lid = p_sw->max_lid_ho;
> +
>        p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node,
> 0));
>
>        /*
> @@ -478,65 +481,13 @@ static void ucast_mgr_process_top(IN cl_map_item_t *
> p_map_item,
>        set_fwd_tbl_top(p_mgr, p_sw);
> }
>
> -static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t *
> p_sm,
> -                                   IN uint8_t * p_block,
> -                                   IN osm_dr_path_t * p_path,
> -                                   IN uint16_t block_id_ho,
> -                                   IN osm_madw_context_t * p_context)
> -{
> -       ib_api_status_t status;
> -       boolean_t sts;
> -
> -       OSM_LOG_ENTER(p_sm->p_log);
> -
> -       for (;
> -            (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block));
> -            block_id_ho++) {
> -               if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> -                   !memcmp(p_block,
> -                           p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> -                           IB_SMP_DATA_SIZE))
> -                       continue;
> -
> -               OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> -                       "Writing FT block %u to switch 0x%" PRIx64 "\n",
> -                       block_id_ho,
> -                       cl_ntoh64(p_context->lft_context.node_guid));
> -
> -               status = osm_req_set(p_sm, p_path,
> -                                    p_sw->new_lft +
> -                                    block_id_ho * IB_SMP_DATA_SIZE,
> -                                    IB_SMP_DATA_SIZE,
> IB_MAD_ATTR_LIN_FWD_TBL,
> -                                    cl_hton32(block_id_ho),
> -                                    CL_DISP_MSGID_NONE, p_context);
> -
> -               if (status != IB_SUCCESS)
> -                       OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> -                               "Sending linear fwd. tbl. block failed
> (%s)\n",
> -                               ib_get_err_str(status));
> -               break;
> -       }
> -
> -       OSM_LOG_EXIT(p_sm->p_log);
> -       return sts;
> -}
> -
> -static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw,
> -                                        IN osm_ucast_mgr_t *p_mgr,
> -                                        IN uint16_t block_id_ho)
> +static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr,
> +                        IN uint16_t block_id_ho)
> {
> -       osm_dr_path_t *p_path;
> -       osm_madw_context_t context;
>        uint8_t block[IB_SMP_DATA_SIZE];
> -       boolean_t status;
> -
> -       OSM_LOG_ENTER(p_mgr->p_log);
> -
> -       CL_ASSERT(p_sw && p_sw->p_node);
> -
> -       OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> -               "Processing switch 0x%" PRIx64 "\n",
> -               cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> +       osm_madw_context_t context;
> +       osm_dr_path_t *p_path;
> +       ib_api_status_t status;
>
>        /*
>           Send linear forwarding table blocks to the switch
> @@ -547,8 +498,7 @@ static boolean_t pipeline_next_lft_block(IN
> osm_switch_t *p_sw,
>                /* any routing should provide the new_lft */
>                CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
>                          p_mgr->cache_valid && !p_sw->need_update);
> -               status = FALSE;
> -               goto Exit;
> +               return -1;
>        }
>
>        p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> @@ -556,12 +506,29 @@ static boolean_t pipeline_next_lft_block(IN
> osm_switch_t *p_sw,
>        context.lft_context.node_guid =
> osm_node_get_node_guid(p_sw->p_node);
>        context.lft_context.set_method = TRUE;
>
> -       status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path,
> -                                   block_id_ho, &context);
> +       if (!osm_switch_get_lft_block(p_sw, block_id_ho, block) ||
> +           (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> +            !memcmp(block, p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> +                    IB_SMP_DATA_SIZE)))
> +               return 0;
>
> -Exit:
> -       OSM_LOG_EXIT(p_mgr->p_log);
> -       return status;
> +       OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> +               "Writing FT block %u to switch 0x%" PRIx64 "\n",
> block_id_ho,
> +               cl_ntoh64(context.lft_context.node_guid));
> +
> +       status = osm_req_set(p_mgr->sm, p_path,
> +                            p_sw->new_lft + block_id_ho *
> IB_SMP_DATA_SIZE,
> +                            IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
> +                            cl_hton32(block_id_ho),
> +                            CL_DISP_MSGID_NONE, &context);
> +       if (status != IB_SUCCESS) {
> +               OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> +                       "Sending linear fwd. tbl. block failed (%s)\n",
> +                       ib_get_err_str(status));
> +               return -1;
> +       }
> +
> +       return 0;
> }
>
> /**********************************************************************
> @@ -919,26 +886,15 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t
> * m)
>
> static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr)
> {
> -       cl_qmap_t *p_sw_tbl;
> -       osm_switch_t *p_sw;
> -       uint16_t block_id_ho = 0;
> -       int sws_notdone;
> -       boolean_t sts;
> -
> -       p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> -       while (1) {
> -               p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> -               sws_notdone = 0;
> -               while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> -                       sts = pipeline_next_lft_block(p_sw, p_mgr,
> block_id_ho);
> -                       if (sts)
> -                               sws_notdone++;
> -                       p_sw = (osm_switch_t *)
> cl_qmap_next(&p_sw->map_item);
> -               }
> -               if (!sws_notdone)
> -                       break;
> -               block_id_ho++;
> -       }
> +       cl_qmap_t *tbl;
> +       cl_map_item_t *item;
> +       unsigned i, max_block = p_mgr->max_lid / 64 + 1;
> +
> +       tbl = &p_mgr->p_subn->sw_guid_tbl;
> +       for (i = 0; i < max_block; i++)
> +               for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl);
> +                    item = cl_qmap_next(item))
> +                       set_lft_block((osm_switch_t *)item, p_mgr, i);
> }
>
> static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> @@ -984,6 +940,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
> **********************************************************************/
> void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr)
> {
> +       p_mgr->max_lid = 0;
> +
>        cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
>                           ucast_mgr_process_top, p_mgr);
>
> --
> 1.6.4
>

From jgunthorpe at obsidianresearch.com  Fri Aug 28 12:02:51 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 28 Aug 2009 13:02:51 -0600
Subject: [ofa-general] [PATCH] Remove duplicated umad_get_mad.3 from
	Makefile.am
Message-ID: <20090828190251.GA8633@obsidianresearch.com>

Fixes builds on FC11.

Signed-off-by: Jason Gunthorpe <jgunthorpe at obsidianresearch.com>
---
 libibumad/Makefile.am |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am
index 50222df..27c6ff2 100644
--- a/libibumad/Makefile.am
+++ b/libibumad/Makefile.am
@@ -9,7 +9,7 @@ man_MANS = man/umad_debug.3 man/umad_get_ca.3 \
 	   man/umad_open_port.3 man/umad_close_port.3 man/umad_size.3 \
 	   man/umad_status.3 man/umad_alloc.3 man/umad_free.3 \
 	   man/umad_dump.3 man/umad_addr_dump.3 man/umad_get_fd.3 \
-	   man/umad_get_mad.3 man/umad_get_mad_addr.3 \
+	   man/umad_get_mad_addr.3 \
 	   man/umad_set_grh_net.3 man/umad_set_grh.3 \
 	   man/umad_set_addr_net.3 man/umad_set_addr.3 man/umad_set_pkey.3 \
 	   man/umad_get_pkey.3 \
-- 
1.6.0.4


From jaschut at sandia.gov  Fri Aug 28 12:08:14 2009
From: jaschut at sandia.gov (Jim Schutt)
Date: Fri, 28 Aug 2009 13:08:14 -0600
Subject: [ofa-general] [PATCH 0/2] opensm: release references to persistent
	routing engine private data
Message-ID: <1251486496-24812-1-git-send-email-jaschut@sandia.gov>


Hi,

LASH uses osm_switch_t:priv to reference private data that persists between
calls to the routing engine.  The first patch fixes a use-after-free bug
that occurs due to this reference, when a switch is removed from a fabric 
that LASH is routing.

The second patch applies the same methodology to osm_port_t:priv.  Even
though no routing engine currently uses it to hold references to persistent
private data, it seems appropriate to make the priv member for 
osm_switch_t and osm_port_t have the same behavior.

-- Jim


From jaschut at sandia.gov  Fri Aug 28 12:08:15 2009
From: jaschut at sandia.gov (Jim Schutt)
Date: Fri, 28 Aug 2009 13:08:15 -0600
Subject: [ofa-general] [PATCH 1/2] opensm: avoid LASH use-after-free when
	switch is deleted from fabric.
In-Reply-To: <1251486496-24812-1-git-send-email-jaschut@sandia.gov>
References: <1251486496-24812-1-git-send-email-jaschut@sandia.gov>
Message-ID: <1251486496-24812-2-git-send-email-jaschut@sandia.gov>

When LASH is run against ibsim, valgrind reports the following
(on x86_64) after a switch is removed from the fabric:

==15699== Invalid write of size 8
==15699==    at 0x45FD8A: switch_delete (osm_ucast_lash.c:648)
==15699==    by 0x461483: lash_cleanup (osm_ucast_lash.c:1123)
==15699==    by 0x461848: lash_process (osm_ucast_lash.c:1230)
==15699==    by 0x45C043: ucast_mgr_route (osm_ucast_mgr.c:1016)
==15699==    by 0x45C1A0: osm_ucast_mgr_process (osm_ucast_mgr.c:1057)
==15699==    by 0x44F11B: do_sweep (osm_state_mgr.c:1283)
==15699==    by 0x44F539: osm_state_mgr_process (osm_state_mgr.c:1398)
==15699==    by 0x447296: sm_process (osm_sm.c:90)
==15699==    by 0x4473FE: sm_sweeper (osm_sm.c:130)
==15699==    by 0x5023505: __cl_thread_wrapper (cl_thread.c:57)
==15699==    by 0x37AC006366: start_thread (in /lib64/libpthread-2.5.so)
==15699==    by 0x37AB4D30AC: clone (in /lib64/libc-2.5.so)
==15699==  Address 0x9B28198 is 152 bytes inside a block of size 160 free'd
==15699==    at 0x4A0541E: free (vg_replace_malloc.c:233)
==15699==    by 0x453866: osm_switch_delete (osm_switch.c:97)
==15699==    by 0x4116AA: drop_mgr_remove_switch (osm_drop_mgr.c:290)
==15699==    by 0x411820: drop_mgr_process_node (osm_drop_mgr.c:339)
==15699==    by 0x411D0C: osm_drop_mgr_process (osm_drop_mgr.c:465)
==15699==    by 0x44EF97: do_sweep (osm_state_mgr.c:1231)
==15699==    by 0x44F539: osm_state_mgr_process (osm_state_mgr.c:1398)
==15699==    by 0x447296: sm_process (osm_sm.c:90)
==15699==    by 0x4473FE: sm_sweeper (osm_sm.c:130)
==15699==    by 0x5023505: __cl_thread_wrapper (cl_thread.c:57)
==15699==    by 0x37AC006366: start_thread (in /lib64/libpthread-2.5.so)
==15699==    by 0x37AB4D30AC: clone (in /lib64/libc-2.5.so)

The root cause is that in order to perform SL lookup for path record
queries, LASH needs to keep persistent data between calls to the
routing engine.

LASH uses the osm_switch_t:priv member to speed lookup of the LASH
switch_t objects it needs to perform SL lookup, and has a corresponding
switch_t:p_sw member to point to the corresponding osm_switch_t object.

When a switch is deleted from the fabric, the switch_t:p_sw value becomes
invalid, but LASH's switch_delete() uses it to clear the corresponding
osm_switch_t:priv value.

Solve this problem by adding a priv_release function pointer that
is set when osm_switch_t:priv is set.  This allows the opensm core to
clean up after any routing engine that is using priv to access
persistent data (LASH seems to be the only one so far), without
knowing the details of how to do so.

When multiple routing engines are configured, it also allows a routing
engine using osm_switch_t:priv to clean up if some other routing engine
using priv fails in an unexpected way.
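
As a minimal sketch of the mechanism described above (not the patch itself),
the following uses hypothetical stand-in types: demo_core_sw plays the role of
osm_switch_t and demo_engine_sw the role of LASH's switch_t. All demo_* names
are assumptions made for illustration. The point is that the core only calls
the callback before freeing its object, and the engine decides what
"release" means, here by clearing its back-pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for osm_switch_t: only the two members discussed. */
struct demo_core_sw {
	void *priv;
	void (*priv_release)(struct demo_core_sw *sw);
};

/* Hypothetical stand-in for the engine's persistent per-switch object. */
struct demo_engine_sw {
	struct demo_core_sw *p_sw;	/* back-pointer into a core-owned object */
};

/* Engine-supplied callback: drop both directions of the link. */
static void demo_engine_release(struct demo_core_sw *sw)
{
	struct demo_engine_sw *esw = sw->priv;

	sw->priv_release = NULL;
	sw->priv = NULL;
	if (esw && esw->p_sw == sw)
		esw->p_sw = NULL;
}

/* Engine side: link the two objects, releasing any earlier user of priv. */
static void demo_link(struct demo_core_sw *sw, struct demo_engine_sw *esw)
{
	assert(sw != NULL && esw != NULL);
	if (sw->priv_release)
		sw->priv_release(sw);
	esw->p_sw = sw;
	sw->priv = esw;
	sw->priv_release = demo_engine_release;
}

/* Core side: let the engine unlink before the memory goes away. */
static void demo_core_delete(struct demo_core_sw **pp_sw)
{
	assert(pp_sw != NULL && *pp_sw != NULL);
	if ((*pp_sw)->priv_release)
		(*pp_sw)->priv_release(*pp_sw);
	free(*pp_sw);
	*pp_sw = NULL;
}
```

After demo_core_delete() runs, the engine object's p_sw is NULL, so a later
engine-side cleanup has nothing stale to dereference.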

With this addition, the rules for using osm_switch_t:priv become:
1) Never assign to priv without also assigning to priv_release.
2) Always use priv_release() before assigning to priv; this
   prevents memory issues due to unexpected errors in a
   routing engine using priv.
3) Always use priv_release() to clean up after a use of priv.

Since updn uses osm_switch_t:priv, fix it up to follow the above
rules as well, for consistency.
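
The three rules can be sketched for an engine whose priv data is a plain heap
allocation owned by the switch (roughly the updn case). This is an
illustration under assumptions: demo_switch is a hypothetical stand-in for
osm_switch_t, and the demo_* helpers do not exist in opensm.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for osm_switch_t. */
struct demo_switch {
	void *priv;
	void (*priv_release)(struct demo_switch *sw);
};

/* Release callback for priv data that is owned by the switch. */
static void demo_priv_release(struct demo_switch *sw)
{
	free(sw->priv);		/* free(NULL) is a no-op */
	sw->priv_release = NULL;
	sw->priv = NULL;
}

/* Rules 1 and 2: release any previous user, then set priv and
 * priv_release together. Returns 0 on success, -1 on allocation failure. */
static int demo_attach_priv(struct demo_switch *sw, size_t size)
{
	void *data;

	assert(sw != NULL);
	if (sw->priv_release)
		sw->priv_release(sw);
	data = calloc(1, size);
	if (!data)
		return -1;
	sw->priv = data;
	sw->priv_release = demo_priv_release;
	return 0;
}

/* Rule 3: clean up only through priv_release. */
static void demo_detach_priv(struct demo_switch *sw)
{
	assert(sw != NULL);
	if (sw->priv_release)
		sw->priv_release(sw);
}
```

Because the callback clears itself, calling demo_detach_priv() twice, or
attaching again without detaching first, is harmless in this sketch.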

Signed-off-by: Jim Schutt <jaschut at sandia.gov>
---
 opensm/include/opensm/osm_switch.h |    1 +
 opensm/opensm/osm_switch.c         |    2 ++
 opensm/opensm/osm_ucast_lash.c     |   24 ++++++++++++++++++++----
 opensm/opensm/osm_ucast_updn.c     |   15 +++++++++++----
 4 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 7ce28c5..d48f8c6 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -106,6 +106,7 @@ typedef struct osm_switch {
 	unsigned endport_links;
 	unsigned need_update;
 	void *priv;
+	void (*priv_release)(struct osm_switch *p_sw);
 } osm_switch_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index ce1ca63..fbf3973 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -94,6 +94,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw)
 				free(p_sw->hops[i]);
 		free(p_sw->hops);
 	}
+	if (p_sw->priv_release)
+		p_sw->priv_release(p_sw);
 	free(*pp_sw);
 	*pp_sw = NULL;
 }
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 0a567b3..ceae7d8 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -603,6 +603,17 @@ static int balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
 	return 0;
 }
 
+static void lash_switch_priv_release(osm_switch_t *osm_sw)
+{
+	switch_t *sw = osm_sw->priv;
+
+	osm_sw->priv_release = NULL;
+	osm_sw->priv = NULL;
+
+	if (sw && sw->p_sw == osm_sw)
+		sw->p_sw = NULL;
+}
+
 static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw)
 {
 	unsigned num_switches = p_lash->num_switches;
@@ -628,8 +639,12 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 	}
 
 	sw->p_sw = p_sw;
-	if (p_sw)
+	if (p_sw) {
+		if (p_sw->priv_release)
+			p_sw->priv_release(p_sw);
 		p_sw->priv = sw;
+		p_sw->priv_release = lash_switch_priv_release;
+	}
 
 	if (osm_mesh_node_create(p_lash, sw)) {
 		free(sw->dij_channels);
@@ -644,8 +659,8 @@ static void switch_delete(lash_t *p_lash, switch_t * sw)
 {
 	if (sw->dij_channels)
 		free(sw->dij_channels);
-	if (sw->p_sw)
-		sw->p_sw->priv = NULL;
+	if (sw->p_sw && sw->p_sw->priv_release)
+		sw->p_sw->priv_release(sw->p_sw);
 	free(sw);
 }
 
@@ -1113,7 +1128,8 @@ static void lash_cleanup(lash_t * p_lash)
 	while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) {
 		p_sw = p_next_sw;
 		p_next_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
-		p_sw->priv = NULL;
+		if (p_sw->priv_release)
+			p_sw->priv_release(p_sw);
 	}
 
 	if (p_lash->switches) {
diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
index bb9ccda..dc5f459 100644
--- a/opensm/opensm/osm_ucast_updn.c
+++ b/opensm/opensm/osm_ucast_updn.c
@@ -404,10 +404,13 @@ static struct updn_node *create_updn_node(osm_switch_t * sw)
 	return u;
 }
 
-static void delete_updn_node(struct updn_node *u)
+static void updn_sw_priv_release(osm_switch_t *sw)
 {
-	u->sw->priv = NULL;
-	free(u);
+	if (sw->priv)
+		free(sw->priv);
+
+	sw->priv_release = NULL;
+	sw->priv = NULL;
 }
 
 /**********************************************************************
@@ -589,6 +592,8 @@ static int updn_lid_matrices(void *ctx)
 	     item != cl_qmap_end(&p_updn->p_osm->subn.sw_guid_tbl);
 	     item = cl_qmap_next(item)) {
 		p_sw = (osm_switch_t *)item;
+		if (p_sw->priv_release)
+			p_sw->priv_release(p_sw);
 		p_sw->priv = create_updn_node(p_sw);
 		if (!p_sw->priv) {
 			OSM_LOG(&(p_updn->p_osm->log), OSM_LOG_ERROR, "ERR AA0C: "
@@ -596,6 +601,7 @@ static int updn_lid_matrices(void *ctx)
 			OSM_LOG_EXIT(&p_updn->p_osm->log);
 			return -1;
 		}
+		p_sw->priv_release = updn_sw_priv_release;
 	}
 
 	/* First setup root nodes */
@@ -653,7 +659,8 @@ static int updn_lid_matrices(void *ctx)
 	     item != cl_qmap_end(&p_updn->p_osm->subn.sw_guid_tbl);
 	     item = cl_qmap_next(item)) {
 		p_sw = (osm_switch_t *) item;
-		delete_updn_node(p_sw->priv);
+		if (p_sw->priv_release)
+			p_sw->priv_release(p_sw);
 	}
 
 	OSM_LOG_EXIT(&p_updn->p_osm->log);
-- 
1.5.6.GIT


From jaschut at sandia.gov  Fri Aug 28 12:08:16 2009
From: jaschut at sandia.gov (Jim Schutt)
Date: Fri, 28 Aug 2009 13:08:16 -0600
Subject: [ofa-general] [PATCH 2/2] opensm: Add priv_release() function
	pointer member to osm_port_t.
In-Reply-To: <1251486496-24812-1-git-send-email-jaschut@sandia.gov>
References: <1251486496-24812-1-git-send-email-jaschut@sandia.gov>
Message-ID: <1251486496-24812-3-git-send-email-jaschut@sandia.gov>

Although no routing engine currently uses osm_port_t:priv to reference
routing engine data that is persistent between calls to the engine,
one may be added in the future.

Since this type of bug was just fixed for osm_switch_t:priv, fix up
osm_port_t to use the same mechanism.

Signed-off-by: Jim Schutt <jaschut at sandia.gov>
---
 opensm/include/opensm/osm_port.h |    1 +
 opensm/opensm/osm_port.c         |    2 ++
 opensm/opensm/osm_ucast_mgr.c    |   19 ++++++++++++++-----
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index 7079e74..21379b2 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -1162,6 +1162,7 @@ typedef struct osm_port {
 	cl_qlist_t mcm_list;
 	int flag;
 	void *priv;
+	void (*priv_release)(struct osm_port *p_pt);
 } osm_port_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 751c0f0..519d8bd 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -132,6 +132,8 @@ void osm_physp_init(IN osm_physp_t * p_physp, IN const ib_net64_t port_guid,
  **********************************************************************/
 void osm_port_delete(IN OUT osm_port_t ** pp_port)
 {
+	if ((*pp_port)->priv_release)
+		(*pp_port)->priv_release(*pp_port);
 	/* cleanup all mcm recs attached */
 	osm_port_remove_all_mgrp(*pp_port);
 	free(*pp_port);
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 629f628..1bf367d 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -385,6 +385,15 @@ static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw)
 
 /**********************************************************************
  **********************************************************************/
+static void minhop_port_priv_release(osm_port_t *port)
+{
+	if (port->priv)
+		free(port->priv);
+
+	port->priv_release = NULL;
+	port->priv = NULL;
+}
+
 static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
 {
 	cl_qmap_t *port_tbl = &mgr->p_subn->port_guid_tbl;
@@ -396,6 +405,8 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
 	for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl);
 	     item = cl_qmap_next(item)) {
 		port = (osm_port_t *) item;
+		if (port->priv_release)
+			port->priv_release(port);
 		lmc = ib_port_info_get_lmc(&port->p_physp->port_info);
 		if (!lmc)
 			continue;
@@ -404,11 +415,11 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
 			OSM_LOG(mgr->p_log, OSM_LOG_ERROR, "ERR 3A09: "
 				"cannot allocate memory to track remote"
 				" systems for lmc > 0\n");
-			port->priv = NULL;
 			continue;
 		}
 		memset(r, 0, sizeof(*r) + sizeof(r->guids[0]) * (1 << lmc));
 		port->priv = r;
+		port->priv_release = minhop_port_priv_release;
 	}
 }
 
@@ -420,10 +431,8 @@ static void free_ports_priv(osm_ucast_mgr_t * mgr)
 	for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl);
 	     item = cl_qmap_next(item)) {
 		port = (osm_port_t *) item;
-		if (port->priv) {
-			free(port->priv);
-			port->priv = NULL;
-		}
+		if (port->priv_release)
+			port->priv_release(port);
 	}
 }
 
-- 
1.5.6.GIT


From vlad at lists.openfabrics.org  Sat Aug 29 03:11:53 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 29 Aug 2009 03:11:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090829-0200 daily build status
Message-ID: <20090829101154.174E6E2820F@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hnrose at comcast.net  Sat Aug 29 08:30:10 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sat, 29 Aug 2009 11:30:10 -0400
Subject: [ofa-general] [PATCH] opensm: Reduce heap consumption by unicast
	routing tables (LFTs)
Message-ID: <20090829153010.GA4272@comcast.net>


Heap memory consumption by the unicast and multicast routing tables can be
reduced.

Using valgrind --tool=massif (for heap profiling), there are a couple of
places that consume most of the heap memory:
->38.75% (11,206,656B) 0x43267E: osm_switch_new (osm_switch.c:134)
->12.89% (3,728,256B) 0x40F8C9: osm_mcast_tbl_init (osm_mcast_tbl.c:96)

osm_switch_new (osm_switch.c:108):
       p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1);

>From ib_types.h
 #define IB_LID_UCAST_END_HO 0xBFFF

The LFT can be allocated in smaller chunks. If there is a LID that
exceeds the current LFT size, the LFT is reallocated with an increased size.
This reduces performance and increases memory fragmentation, so this
tradeoff is made optional based on new build and config options (see below).

Using a 4K chunk as the minimal LFT block reduces the memory used
by the LFTs by a factor of 12. For a larger (than 4K) fabric, 4K is added each
time the existing LFT size is insufficient.
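
The growth scheme described above can be sketched as follows. This is a
simplified illustration, not the code in the patch: demo_lft and
demo_lft_set are hypothetical names, DEMO_NO_PATH is assumed to be 0xFF, and
the table is rounded up to the chunk multiple that covers the LID (which may
add more than one chunk at once) rather than adding exactly one chunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NO_PATH 0xFF	/* assumed "no route" marker */

/* Hypothetical stand-in for the lft/lft_size pair in osm_switch_t. */
struct demo_lft {
	uint8_t *tbl;
	size_t size;		/* entries currently allocated */
};

/* Set one entry, growing the table in whole chunks when lid is out of
 * range. Returns 0 on success, -1 if realloc fails (old table intact). */
static int demo_lft_set(struct demo_lft *lft, uint16_t lid, uint8_t port,
			size_t chunk)
{
	assert(lft != NULL && chunk > 0);

	if ((size_t)lid >= lft->size) {
		size_t new_size = ((size_t)lid / chunk + 1) * chunk;
		uint8_t *p = realloc(lft->tbl, new_size);

		if (!p)
			return -1;
		/* Only the newly added tail needs initializing. */
		memset(p + lft->size, DEMO_NO_PATH, new_size - lft->size);
		lft->tbl = p;
		lft->size = new_size;
	}
	lft->tbl[lid] = port;
	return 0;
}
```

Assigning the realloc result to a temporary first keeps the old table usable
if the allocation fails; what the caller does next (exit, as stated in this
mail, or something softer) is a separate policy decision.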

So it looks like for a cluster of 2-4K with an LMC of 0, about 40% (!!!) of
the heap memory can be saved:

 - 39% used by LFTs, each with 48K entries - SM can allocate 4K entries instead.
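
The factor of 12 follows from the table sizes alone, and can be checked with
a small helper. The helper name is hypothetical, and the switch count of 228
is only an inference: it is what the massif figure of 11,206,656 bytes works
out to if every one of those bytes is a full-size LFT of
IB_LID_UCAST_END_HO + 1 entries.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LID_UCAST_END_HO 0xBFFF	/* value quoted from ib_types.h */

/* Total LFT bytes for a fabric: one byte per LID entry, one table per
 * switch. */
static size_t demo_lft_heap_bytes(size_t num_switches, size_t entries_per_lft)
{
	assert(entries_per_lft > 0);
	return num_switches * entries_per_lft;
}
```

With these inputs a full table is 49,152 bytes per switch against 4,096
bytes for one 4K chunk, which is the 12x ratio.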

There is a new build option to specify whether to include the FT heap
optimization code. It defaults to off, in which case the new code is not
included (basically just the code that exists today).

A new config option specifies whether to optimize FT allocation and
defaults to off.
Another new config option will specify the LFT allocation chunk and
defaults to 4K.
These chunks will be used as the initial minimum allocation and increased
in increments of the chunk using realloc.

LFTs are only increased in size and are never reduced. If a realloc for an
LFT fails, it results in an exit.

A similar subsequent change will do this for MFTs.

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4
index c24930b..1c7c8a2 100644
--- a/opensm/config/osmvsel.m4
+++ b/opensm/config/osmvsel.m4
@@ -232,6 +232,25 @@ fi
 # --- END OPENIB_OSM_PERF_MGR_SEL ---
 ]) dnl OPENIB_OSM_PERF_MGR_SEL
 
+dnl Check if they want the FT heap optimization 
+AC_DEFUN([OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL], [
+# --- BEGIN OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL ---
+
+dnl enable the FT heap optimization
+AC_ARG_ENABLE(ft-heap-optimize,
+[  --enable-ft-heap-optimize Enable FT heap optimization (default no)],
+   [case $enableval in
+     yes) ft_heap_optimize=yes ;;
+     no)  ft_heap_optimize=no ;;
+   esac],
+   ft_heap_optimize=no)
+if test $ft_heap_optimize = yes; then
+  AC_DEFINE(ENABLE_OSM_FT_HEAP_OPTIMIZATION,
+	1,
+	[Define as 1 if you want to enable the FT heap optimization])
+fi
+# --- END OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL ---
+]) dnl OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL
 
 dnl Check if they want the event plugin
 AC_DEFUN([OPENIB_OSM_DEFAULT_EVENT_PLUGIN_SEL], [
diff --git a/opensm/configure.in b/opensm/configure.in
index 8a6b4c0..9b5ec00 100644
--- a/opensm/configure.in
+++ b/opensm/configure.in
@@ -87,6 +87,9 @@ OPENIB_OSM_CONSOLE_SOCKET_SEL
 dnl select performance manager or not
 OPENIB_OSM_PERF_MGR_SEL
 
+dnl select FT heap optimization or not
+OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL
+
 dnl resolve <sysconfdir> config dir.
 conf_dir_tmp1="`eval echo ${sysconfdir} | sed 's/^NONE/$ac_default_prefix/'`"
 SYS_CONFIG_DIR="`eval echo $conf_dir_tmp1`"
diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
index 0537002..89b125c 100644
--- a/opensm/include/opensm/osm_base.h
+++ b/opensm/include/opensm/osm_base.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  *
@@ -449,6 +449,18 @@ BEGIN_C_DECLS
 */
 #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
 /***********/
+/****d* OpenSM: Base/OSM_DEFAULT_LFT_CHUNKS
+* NAME
+*	OSM_DEFAULT_LFT_CHUNKS
+*
+* DESCRIPTION
+*	Specifies the default number of 64 entry (byte) chunks in LFT
+*	related memory (re)allocation. Default is 64 (4K bytes).
+*
+* SYNOPSIS
+*/
+#define OSM_DEFAULT_LFT_CHUNKS 64
+/***********/
 /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
 * NAME
 *	OSM_SM_DEFAULT_QP0_RCV_SIZE
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 6c20de8..be90ce4 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
@@ -218,6 +218,10 @@ typedef struct osm_subn_opt {
 	uint32_t perfmgr_max_outstanding_queries;
 	char *event_db_dump_file;
 #endif				/* ENABLE_OSM_PERF_MGR */
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	boolean_t ft_heap_optimization;
+	uint32_t lft_chunks;
+#endif				/* ENABLE_OSM_FT_HEAP_OPTIMIZATION */
 	char *event_plugin_name;
 	char *node_name_map_name;
 	char *prefix_routes_file;
@@ -437,6 +441,12 @@ typedef struct osm_subn_opt {
 *	perfmgr_sweep_time_s
 *		Define the period (in seconds) of PerfMgr sweeps
 *
+*	ft_heap_optimization
+*		Enable or disable forwarding table (FT) heap optimization
+*
+*	lft_chunks
+*		Number of 64 entry (byte) chunks used in LFT (re)allocation
+*
 *       event_db_dump_file
 *               File to dump the event database to
 *
diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 7ce28c5..2c60fb6 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -102,6 +102,8 @@ typedef struct osm_switch {
 	osm_port_profile_t *p_prof;
 	uint8_t *lft;
 	uint8_t *new_lft;
+	uint16_t lft_size;
+	uint16_t new_lft_size;
 	osm_mcast_tbl_t mcast_tbl;
 	unsigned endport_links;
 	unsigned need_update;
@@ -219,7 +221,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw);
 * SYNOPSIS
 */
 osm_switch_t *osm_switch_new(IN osm_node_t * const p_node,
-			     IN const osm_madw_t * const p_madw);
+			     IN const osm_madw_t * const p_madw,
+			     IN osm_subn_t * const p_subn);
 /*
 * PARAMETERS
 *	p_node
@@ -227,7 +230,10 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node,
 *
 *	p_madw
 *		[in] Pointer to the MAD Wrapper containing the switch's
-*		SwitchInfo attribute.
+*		SwitchInfo attribute
+*
+*	p_subn
+*		[in] Pointer to the subnet object
 *
 * RETURN VALUES
 *	Pointer to the new initialized switch object.
@@ -408,7 +414,7 @@ static inline uint8_t
 osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw,
 			   IN const uint16_t lid_ho)
 {
-	if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO)
+	if (lid_ho == 0 || lid_ho >= p_sw->lft_size)
 		return OSM_NO_PATH;
 	return p_sw->lft[lid_ho];
 }
@@ -575,6 +581,44 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw)
 *	Switch object
 *********/
 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+/****f* OpenSM: Switch/osm_switch_set_new_lft_entry
+* NAME
+*	osm_switch_set_new_lft_entry
+*
+* DESCRIPTION
+*	Set a LID entry in the switch's new_lft.
+*
+* SYNOPSIS
+*
+*/
+boolean_t
+osm_switch_set_new_lft_entry(IN osm_switch_t * const p_sw,
+			     IN uint16_t lid, IN uint8_t port,
+			     IN const osm_subn_t * const p_subn);
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to an osm_switch_t object.
+*
+*	lid
+*		[in] LID.
+*
+*	port
+*		[in] port number.
+*
+*	p_subn
+*		[in] Pointer to an osm_subn_t object.
+*
+* RETURN VALUES
+*	TRUE if success and FALSE if failure.
+*
+* NOTES
+*
+* SEE ALSO
+*********/
+#endif
+
 /****f* OpenSM: Switch/osm_switch_get_lft_block
 * NAME
 *	osm_switch_get_lft_block
@@ -586,6 +630,7 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw)
 */
 boolean_t
 osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
+			 IN const osm_subn_t * const p_subn,
 			 IN const uint16_t block_id,
 			 OUT uint8_t * const p_block);
 /*
@@ -593,6 +638,9 @@ osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
 *	p_sw
 *		[in] Pointer to an osm_switch_t object.
 *
+*	p_subn
+*		[in] Pointer to an osm_subn_t object.
+*
 *	block_ID
 *		[in] The block_id to retrieve.
 *
@@ -714,16 +762,40 @@ osm_switch_count_path(IN osm_switch_t * const p_sw, IN const uint8_t port)
 static inline ib_api_status_t
 osm_switch_set_lft_block(IN osm_switch_t * const p_sw,
 			 IN const uint8_t * const p_block,
-			 IN const uint32_t block_num)
+			 IN const uint32_t block_num,
+			 IN osm_subn_t * const p_subn)
 {
 	uint16_t lid_start =
 		(uint16_t) (block_num * IB_SMP_DATA_SIZE);
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	uint8_t *lft;
+	size_t size;
+#endif
+
 	CL_ASSERT(p_sw);
 
 	if (lid_start + IB_SMP_DATA_SIZE > IB_LID_UCAST_END_HO)
 		return IB_INVALID_PARAMETER;
 
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 	memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE);
+#else
+	if (!p_subn->opt.ft_heap_optimization)
+		memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE);
+	else {
+		if (lid_start + IB_SMP_DATA_SIZE > p_sw->lft_size) {
+			size = (lid_start + (1 + p_subn->opt.lft_chunks) * IB_SMP_DATA_SIZE - 1) / IB_SMP_DATA_SIZE * IB_SMP_DATA_SIZE;
+			lft = realloc(p_sw->lft, size);
+			if (!lft)
+				return IB_INSUFFICIENT_MEMORY;
+			memset(lft + p_sw->lft_size, OSM_NO_PATH,
+			       size - p_sw->lft_size);
+			p_sw->lft = lft;
+			p_sw->lft_size = size;
+		}
+		memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE);
+	}
+#endif
 	return IB_SUCCESS;
 }
 /*
@@ -735,7 +807,10 @@ osm_switch_set_lft_block(IN osm_switch_t * const p_sw,
 *		[in] Pointer to the forwarding table block.
 *
 *	block_num
-*		[in] Block number for this block
+*		[in] Block number for this block.
+*
+*	p_subn
+*		[in] Pointer to the subnet object.
 *
 * RETURN VALUE
 *	None.
diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index c541804..e1fb073 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -21,6 +21,13 @@
 %define _disable_event_plugin --disable-default-event-plugin
 %endif
 
+%if %{?_with_ft_heap_optimize:1}%{!?_with_ft_heap_optimize:0}
+%define _enable_ft_heap_optimize --enable-ft-heap-optimize
+%endif
+%if %{?_without_ft_heap_optimize:1}%{!?_without_ft_heap_optimize:0}
+%define _disable_ft_heap_optimize --disable-ft-heap-optimize
+%endif
+
 %if %{?_with_node_name_map:1}%{!?_with_node_name_map:0}
 %define _enable_node_name_map --with-node-name-map%{?_with_node_name_map}
 %endif
@@ -83,6 +90,8 @@ Static version of the opensm libraries
         %{?_disable_console_socket} \
         %{?_enable_perf_mgr} \
         %{?_disable_perf_mgr} \
+        %{?_enable_ft_heap_optimize} \
+        %{?_disable_ft_heap_optimize} \
         %{?_enable_event_plugin} \
         %{?_disable_event_plugin} \
         %{?_enable_node_name_map}
diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c
index ae40b0d..6f05bd7 100644
--- a/opensm/opensm/osm_lin_fwd_rcv.c
+++ b/opensm/opensm/osm_lin_fwd_rcv.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -87,7 +87,8 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
 			"LFT received for nonexistent node "
 			"0x%" PRIx64 "\n", cl_ntoh64(node_guid));
 	} else {
-		status = osm_switch_set_lft_block(p_sw, p_block, block_num);
+		status = osm_switch_set_lft_block(p_sw, p_block, block_num,
+						  sm->p_subn);
 		if (status != IB_SUCCESS) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0402: "
 				"Setting forwarding table block failed (%s)"
diff --git a/opensm/opensm/osm_sa_lft_record.c b/opensm/opensm/osm_sa_lft_record.c
index d092129..b84bf6c 100644
--- a/opensm/opensm/osm_sa_lft_record.c
+++ b/opensm/opensm/osm_sa_lft_record.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -99,7 +99,7 @@ static ib_api_status_t lftr_rcv_new_lftr(IN osm_sa_t * sa,
 	p_rec_item->rec.block_num = cl_hton16(block);
 
 	/* copy the lft block */
-	osm_switch_get_lft_block(p_sw, block, p_rec_item->rec.lft);
+	osm_switch_get_lft_block(p_sw, sa->p_subn, block, p_rec_item->rec.lft);
 
 	cl_qlist_insert_tail(p_list, &p_rec_item->list_item);
 
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 185c700..1423c11 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
@@ -1011,7 +1011,8 @@ static void cleanup_switch(cl_map_item_t * item, void *log)
 	if (!sw->new_lft)
 		return;
 
-	if (memcmp(sw->lft, sw->new_lft, IB_LID_UCAST_END_HO + 1))
+	if (sw->new_lft_size != sw->lft_size ||
+	    memcmp(sw->lft, sw->new_lft, sw->lft_size))
 		osm_log(log, OSM_LOG_ERROR, "ERR 331D: "
 			"LFT of switch 0x%016" PRIx64 " is not up to date.\n",
 			cl_ntoh64(sw->p_node->node_info.node_guid));
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 8d63a75..5189229 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
@@ -359,6 +359,10 @@ static const opt_rec_t opt_tbl[] = {
 	{ "perfmgr_max_outstanding_queries", OPT_OFFSET(perfmgr_max_outstanding_queries), opts_parse_uint32, NULL, 0 },
 	{ "event_db_dump_file", OPT_OFFSET(event_db_dump_file), opts_parse_charp, NULL, 0 },
 #endif				/* ENABLE_OSM_PERF_MGR */
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	{ "ft_heap_optimization", OPT_OFFSET(ft_heap_optimization), opts_parse_boolean, NULL, 0 },
+	{ "lft_chunks", OPT_OFFSET(lft_chunks), opts_parse_uint32, NULL, 1 },
+#endif				/* ENABLE_OSM_FT_HEAP_OPTIMIZATION */
 	{ "event_plugin_name", OPT_OFFSET(event_plugin_name), opts_parse_charp, NULL, 0 },
 	{ "node_name_map_name", OPT_OFFSET(node_name_map_name), opts_parse_charp, NULL, 0 },
 	{ "qos_max_vls", OPT_OFFSET(qos_options.max_vls), opts_parse_uint32, NULL, 1 },
@@ -723,6 +727,10 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	    OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES;
 	p_opt->event_db_dump_file = NULL; /* use default */
 #endif				/* ENABLE_OSM_PERF_MGR */
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	p_opt->ft_heap_optimization = FALSE;
+	p_opt->lft_chunks = OSM_DEFAULT_LFT_CHUNKS;
+#endif				/* ENABLE_OSM_FT_HEAP_OPTIMIZATION */
 
 	p_opt->event_plugin_name = NULL;
 	p_opt->node_name_map_name = NULL;
@@ -1141,6 +1149,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
 	}
 #endif
 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	if (p_opts->ft_heap_optimization) {
+		if (p_opts->lft_chunks < 1 || p_opts->lft_chunks > 768) {
+			log_report(" Invalid Cached Option Value:"
+				   "lft_chunks = %u"
+				   " Using Default:%u\n",
+				   p_opts->lft_chunks, OSM_DEFAULT_LFT_CHUNKS);
+			p_opts->lft_chunks = OSM_DEFAULT_LFT_CHUNKS;
+		}
+	}
+#endif
+
 	return 0;
 }
 
@@ -1465,6 +1485,21 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# SA database file name\nsa_db_file %s\n\n",
 		p_opts->sa_db_file ? p_opts->sa_db_file : null_str);
 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	fprintf(out,
+		"# Forwarding table (LFT) heap optimization\n"
+		"ft_heap_optimization %s\n\n",
+		p_opts->ft_heap_optimization ? "TRUE" : "FALSE");
+
+	fprintf(out,
+		"# Number of 64 entry (byte) chunks used when (re)allocating "
+		"LFTs\n"
+		"# Values go from 1 (highest granularity) to 768 "
+		"(allocate all the LFT in a single chunk)\n"
+		"lft_chunks %d\n\n",
+		p_opts->lft_chunks);
+#endif				/* ENABLE_OSM_FT_HEAP_OPTIMIZATION */
+
 	fprintf(out,
 		"#\n# HANDOVER - MULTIPLE SMs OPTIONS\n#\n"
 		"# SM priority used for deciding who is the master\n"
diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index c335263..9861525 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -211,7 +211,7 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * p_node,
 
 	osm_dump_switch_info(sm->p_log, p_si, OSM_LOG_DEBUG);
 
-	p_sw = osm_switch_new(p_node, p_madw);
+	p_sw = osm_switch_new(p_node, p_madw, sm->p_subn);
 	if (p_sw == NULL) {
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3608: "
 			"Unable to allocate new switch object\n");
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index ce1ca63..0d725e8 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
@@ -51,6 +51,11 @@
 #include <iba/ib_types.h>
 #include <opensm/osm_switch.h>
 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+static uint8_t no_path_block[IB_SMP_DATA_SIZE] =
+				{[0 ... IB_SMP_DATA_SIZE-1] = OSM_NO_PATH };
+#endif
+
 /**********************************************************************
  **********************************************************************/
 cl_status_t
@@ -101,7 +106,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw)
 /**********************************************************************
  **********************************************************************/
 osm_switch_t *osm_switch_new(IN osm_node_t * const p_node,
-			     IN const osm_madw_t * const p_madw)
+			     IN const osm_madw_t * const p_madw,
+			     IN osm_subn_t * const p_subn)
 {
 	osm_switch_t *p_sw;
 	ib_switch_info_t *p_si;
@@ -132,11 +138,22 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node,
 	p_sw->num_ports = num_ports;
 	p_sw->need_update = 2;
 
-	p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1);
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+	p_sw->lft_size = IB_LID_UCAST_END_HO + 1;
+#else
+	if (!p_subn->opt.ft_heap_optimization)
+		p_sw->lft_size = IB_LID_UCAST_END_HO + 1;
+	else
+		p_sw->lft_size = p_subn->opt.lft_chunks * IB_SMP_DATA_SIZE;
+#endif
+
+	p_sw->lft = malloc(p_sw->lft_size);
 	if (!p_sw->lft)
 		goto err;
 
-	memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+	memset(p_sw->lft, OSM_NO_PATH, p_sw->lft_size);
+
+	p_sw->new_lft_size = p_sw->lft_size;
 
 	p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports);
 	if (!p_sw->p_prof)
@@ -158,10 +175,43 @@ err:
 	return NULL;
 }
 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION
+/**********************************************************************
+ **********************************************************************/
+boolean_t
+osm_switch_set_new_lft_entry(IN osm_switch_t * const p_sw,
+			     IN uint16_t lid, IN uint8_t port,
+			     IN const osm_subn_t * const p_subn)
+{
+	size_t size;
+	uint8_t *new_lft;
+
+	if (!p_subn->opt.ft_heap_optimization)
+		p_sw->new_lft[lid] = port;
+	else {
+		if (lid >= p_sw->new_lft_size) {
+			size = (lid / IB_SMP_DATA_SIZE + p_subn->opt.lft_chunks) * IB_SMP_DATA_SIZE;
+			if (size == p_sw->new_lft_size)
+				size += p_subn->opt.lft_chunks * IB_SMP_DATA_SIZE;
+			new_lft = realloc(p_sw->new_lft, size);
+			if (!new_lft)
+				return FALSE;
+			memset(new_lft + p_sw->new_lft_size, OSM_NO_PATH,
+			       size - p_sw->new_lft_size);
+			p_sw->new_lft = new_lft;
+			p_sw->new_lft_size = size;
+		}
+		p_sw->new_lft[lid] = port;
+	}
+	return TRUE;
+}
+#endif
+
 /**********************************************************************
  **********************************************************************/
 boolean_t
 osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
+			 IN const osm_subn_t * const p_subn,
 			 IN const uint16_t block_id,
 			 OUT uint8_t * const p_block)
 {
@@ -174,7 +224,19 @@ osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
 		return FALSE;
 
 	CL_ASSERT(base_lid_ho + IB_SMP_DATA_SIZE <= IB_LID_UCAST_END_HO);
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 	memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE);
+#else
+	if (!p_subn->opt.ft_heap_optimization)
+		memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE);
+	else {
+		if (base_lid_ho + IB_SMP_DATA_SIZE > p_sw->lft_size)
+			memcpy(p_block, &no_path_block[0], IB_SMP_DATA_SIZE);
+		else
+			memcpy(p_block, &(p_sw->lft[base_lid_ho]),
+			       IB_SMP_DATA_SIZE);
+	}
+#endif
 	return TRUE;
 }
 
@@ -517,10 +579,10 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids)
 	osm_switch_clear_hops(p_sw);
 
 	if (!p_sw->new_lft &&
-	    !(p_sw->new_lft = malloc(IB_LID_UCAST_END_HO + 1)))
+	    !(p_sw->new_lft = malloc(p_sw->new_lft_size)))
 		return IB_INSUFFICIENT_MEMORY;
 
-	memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+	memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size);
 
 	if (!p_sw->hops) {
 		hops = malloc((max_lids + 1) * sizeof(hops[0]));
diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
index 30a3c1d..73f1ce0 100644
--- a/opensm/opensm/osm_ucast_cache.c
+++ b/opensm/opensm/osm_ucast_cache.c
@@ -73,6 +73,7 @@ typedef struct cache_switch {
 	uint16_t num_hops;
 	uint8_t **hops;
 	uint8_t *lft;
+	uint16_t lft_size;
 	uint8_t num_ports;
 	cache_port_t ports[0];
 } cache_switch_t;
@@ -349,6 +350,7 @@ cache_restore_ucast_info(osm_ucast_mgr_t * p_mgr,
 	if (p_sw->new_lft)
 		free(p_sw->new_lft);
 	p_sw->new_lft = p_cache_sw->lft;
+	p_sw->new_lft_size = p_cache_sw->lft_size;
 	p_cache_sw->lft = NULL;
 
 	p_sw->num_hops = p_cache_sw->num_hops;
@@ -1023,10 +1025,12 @@ void osm_ucast_cache_add_node(osm_ucast_mgr_t * p_mgr, osm_node_t * p_node)
 			/* LFT buffer exists - we use it, because
 			   it is more updated than the switch's LFT */
 			p_cache_sw->lft = p_node->sw->new_lft;
+			p_cache_sw->lft_size = p_node->sw->new_lft_size;
 			p_node->sw->new_lft = NULL;
 		} else {
 			/* no LFT buffer, so we use the switch's LFT */
 			p_cache_sw->lft = p_node->sw->lft;
+			p_cache_sw->lft_size = p_node->sw->lft_size;
 			p_node->sw->lft = NULL;
 		}
 		p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho;
@@ -1079,10 +1083,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr)
 			/* no new routing was recently calculated for this
 			   switch, but the LFT needs to be updated anyway */
 			p_sw->new_lft = p_sw->lft;
-			p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1);
+			p_sw->new_lft_size = p_sw->lft_size;
+			p_sw->lft = malloc(p_sw->lft_size);
 			if (!p_sw->lft)
 				return IB_INSUFFICIENT_MEMORY;
-			memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+			memset(p_sw->lft, OSM_NO_PATH, p_sw->lft_size);
 		}
 
 	}
diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c
index 5b73ca5..136e0de 100644
--- a/opensm/opensm/osm_ucast_file.c
+++ b/opensm/opensm/osm_ucast_file.c
@@ -92,7 +92,18 @@ static void add_path(osm_opensm_t * p_osm,
 			new_lid);
 	}
 
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 	p_sw->new_lft[new_lid] = port_num;
+#else
+	if (!osm_switch_set_new_lft_entry(p_sw, new_lid, port_num,
+					  &p_osm->subn)) {
+		OSM_LOG(&p_osm->log, OSM_LOG_SYS,
+			"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+		OSM_LOG(&p_osm->log, OSM_LOG_ERROR, "ERR 630F: "
+			"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+		exit(1);
+	}
+#endif
 	if (!(p_osm->subn.opt.port_profile_switch_nodes && port_guid &&
 	      osm_get_switch_by_guid(&p_osm->subn, port_guid)))
 		osm_switch_count_path(p_sw, port_num);
@@ -193,8 +204,7 @@ static int do_ucast_file_load(void *context)
 					cl_ntoh64(sw_guid));
 				continue;
 			}
-			memset(p_sw->new_lft, OSM_NO_PATH,
-			       IB_LID_UCAST_END_HO + 1);
+			memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size);
 		} else if (p_sw && !strncmp(p, "0x", 2)) {
 			p += 2;
 			lid = (uint16_t) strtoul(p, &q, 16);
diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 6ec6bc7..e37abb4 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -566,7 +566,7 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree,
 		return NULL;
 
 	/* initialize lft buffer */
-	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+	memset(p_osm_sw->new_lft, OSM_NO_PATH, p_osm_sw->new_lft_size);
 	p_sw->hops = malloc((p_osm_sw->max_lid_ho + 1) * sizeof(*(p_sw->hops)));
 	if (p_sw->hops == NULL)
 		return NULL;
@@ -2236,8 +2236,22 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 
 		/* setting fwd tbl port only if this is real LID */
 		if (is_real_lid) {
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 			p_remote_sw->p_osm_sw->new_lft[target_lid] =
 			    p_min_port->remote_port_num;
+#else
+			if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw,
+							  target_lid,
+							  p_min_port->remote_port_num,
+							  &p_ftree->p_osm->subn)) {
+
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+					"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+					"ERR AB15: osm_switch_set_new_lft_entry realloc failed - exiting\n");
+				exit(1);
+			}
+#endif
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to CA LID %u through port %u\n",
 				tuple_to_str(p_remote_sw->tuple),
@@ -2459,6 +2473,7 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 			/* We update the LFT only if this LID isn't already present. */
 
 			/* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 			if ((p_remote_sw->p_osm_sw->new_lft[target_lid] ==
 			     OSM_NO_PATH)
 			    ||
@@ -2470,6 +2485,28 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 
 				p_remote_sw->p_osm_sw->new_lft[target_lid] =
 				    p_min_port->remote_port_num;
+#else
+			if (target_lid >= p_remote_sw->p_osm_sw->new_lft_size ||
+			    ((p_remote_sw->p_osm_sw->new_lft[target_lid] ==
+			      OSM_NO_PATH) ||
+			     ((p_remote_sw->p_osm_sw->new_lft[target_lid] !=
+			       OSM_NO_PATH) &&
+			      (current_hops + 1 <
+			       sw_get_least_hops(p_remote_sw, target_lid))))) {
+
+				if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw,
+								  target_lid,
+								  p_min_port->remote_port_num,
+								  &p_ftree->p_osm->subn)) {
+					OSM_LOG(&p_ftree->p_osm->log,
+						OSM_LOG_SYS,
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					OSM_LOG(&p_ftree->p_osm->log,
+						OSM_LOG_ERROR, "ERR AB16: "
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					exit(1);
+				}
+#endif
 				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 					"Switch %s: set path to CA LID %u through port %u\n",
 					tuple_to_str(p_remote_sw->tuple),
@@ -2540,7 +2577,12 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
 
 		/* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 		if (p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH)
+#else
+		if (target_lid < p_remote_sw->p_osm_sw->new_lft_size &&
+		    p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH)
+#endif
 			if (current_hops + 1 >=
 			    sw_get_least_hops(p_remote_sw, target_lid))
 				continue;
@@ -2576,8 +2618,21 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 
 		p_port = p_min_port;
 		//cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port);
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 		p_remote_sw->p_osm_sw->new_lft[target_lid] =
 		    p_port->remote_port_num;
+#else
+		if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw,
+						  target_lid,
+						  p_port->remote_port_num,
+						  &p_ftree->p_osm->subn)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+				"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+				"ERR AB17: osm_switch_set_new_lft_entry realloc failed - exiting\n");
+			exit(1);
+		}
+#endif
 
 		/* On the remote switch that is pointed by the p_group,
 		   set hops for ALL the ports in the remote group. */
@@ -2609,7 +2664,12 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
 
 		/* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 		if (p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH)
+#else
+		if (target_lid < p_remote_sw->p_osm_sw->new_lft_size &&
+		    p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH)
+#endif
 			if (current_hops + 1 >=
 			    sw_get_least_hops(p_remote_sw, target_lid))
 				continue;
@@ -2645,9 +2705,21 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 
 		p_port = p_min_port;
 		//cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port);
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 		p_remote_sw->p_osm_sw->new_lft[target_lid] =
 		    p_port->remote_port_num;
-
+#else
+		if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw,
+						  target_lid,
+						  p_port->remote_port_num,
+						  &p_ftree->p_osm->subn)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+				"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+				"ERR AB18: osm_switch_set_new_lft_entry realloc failed - exiting\n");
+			exit(1);
+		}
+#endif
 		/* On the remote switch that is pointed by the p_group,
 		   set hops for ALL the ports in the remote group. */
 
@@ -2771,7 +2843,20 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
 			/* set local LFT(LID) to the port that is connected to HCA */
 			cl_ptr_vector_at(&p_leaf_port_group->ports, 0,
 					 (void *)&p_port);
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 			p_sw->p_osm_sw->new_lft[hca_lid] = p_port->port_num;
+#else
+			if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw,
+							  hca_lid,
+							  p_port->port_num,
+							  &p_ftree->p_osm->subn)) {
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+					"osm_switch_set_new_lft_entry realloc error - exiting\n");
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+					"ERR AB19: osm_switch_set_new_lft_entry realloc error - exiting\n");
+				exit(1);
+			}
+#endif
 
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to CN LID %u through port %u\n",
@@ -2883,7 +2968,20 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
 			cl_ptr_vector_at(&p_hca_port_group->ports, 0,
 					 (void *)&p_hca_port);
 			port_num_on_switch = p_hca_port->remote_port_num;
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 			p_sw->p_osm_sw->new_lft[hca_lid] = port_num_on_switch;
+#else
+			if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw,
+							  hca_lid,
+							  port_num_on_switch,
+							  &p_ftree->p_osm->subn)) {
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+					"osm_switch_set_new_lft_entry realloc error - exiting\n");
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+					"ERR AB1A: osm_switch_set_new_lft_entry realloc error - exiting\n");
+				exit(1);
+			}
+#endif
 
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to non-CN HCA LID %u through port %u\n",
@@ -2941,7 +3039,19 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
 		p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item);
 
 		/* set local LFT(LID) to 0 (route to itself) */
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 		p_sw->p_osm_sw->new_lft[p_sw->base_lid] = 0;
+#else
+		if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw,
+						  p_sw->base_lid, 0,
+						  &p_ftree->p_osm->subn)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS,
+				"osm_switch_set_new_lft_entry realloc error - exiting\n");
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+				"ERR AB1B: osm_switch_set_new_lft_entry realloc error - exiting\n");
+			exit(1);
+		}
+#endif
 
 		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 			"Switch %s (LID %u): routing switch-to-switch paths\n",
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 0a567b3..18088a9 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -1009,7 +1009,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
 		current_guid = p_sw->p_node->node_info.port_guid;
 		sw = p_sw->priv;
 
-		memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+		memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size);
 
 		for (lid = 1; lid <= max_lid_ho; lid++) {
 			port = cl_ptr_vector_get(&p_subn->port_lid_tbl, lid);
@@ -1020,7 +1020,20 @@ static void populate_fwd_tbls(lash_t * p_lash)
 			if (p_dst_sw == p_sw) {
 				uint8_t egress_port = port->p_node->sw ? 0 :
 					port->p_physp->p_remote_physp->port_num;
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 				p_sw->new_lft[lid] = egress_port;
+#else
+				if (!osm_switch_set_new_lft_entry(p_sw, lid,
+								  egress_port,
+								  p_subn)) {
+					OSM_LOG(p_log, OSM_LOG_SYS,
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D05: "
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					exit(1);
+				}
+#endif
+
 				OSM_LOG(p_log, OSM_LOG_VERBOSE,
 					"LASH fwd MY SRC SRC GUID 0x%016" PRIx64
 					" src lash id (%d), src lid no (%u) src lash port (%d) "
@@ -1038,7 +1051,19 @@ static void populate_fwd_tbls(lash_t * p_lash)
 				uint8_t physical_egress_port =
 					get_next_port(sw, lash_egress_port);
 
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 				p_sw->new_lft[lid] = physical_egress_port;
+#else
+				if (!osm_switch_set_new_lft_entry(p_sw, lid,
+								  physical_egress_port,
+								  p_subn)) {
+					OSM_LOG(p_log, OSM_LOG_SYS,
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06: "
+						"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+					exit(1);
+				}
+#endif
 				OSM_LOG(p_log, OSM_LOG_VERBOSE,
 					"LASH fwd SRC GUID 0x%016" PRIx64
 					" src lash id (%d), "
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 629f628..0a34e8f 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -298,7 +298,17 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr,
 	   We have selected the port for this LID.
 	   Write it to the forwarding tables.
 	 */
+#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION
 	p_sw->new_lft[lid_ho] = port;
+#else
+	if (!osm_switch_set_new_lft_entry(p_sw, lid_ho, port, p_mgr->p_subn)) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_SYS,
+			"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0F: "
+			"osm_switch_set_new_lft_entry realloc failed - exiting\n");
+		exit(1);
+	}
+#endif
 	if (!is_ignored_by_port_prof) {
 		struct osm_remote_node *rem_node_used;
 		osm_switch_count_path(p_sw, port);
@@ -443,7 +453,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 		cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
 
 	/* Initialize LIDs in buffer to invalid port number. */
-	memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
+	memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size);
 
 	if (p_mgr->p_subn->opt.lmc)
 		alloc_ports_priv(p_mgr);
@@ -492,7 +502,8 @@ static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm,
 	OSM_LOG_ENTER(p_sm->p_log);
 
 	for (;
-	     (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block));
+	     (sts = osm_switch_get_lft_block(p_sw, p_sm->p_subn, block_id_ho,
+					     p_block));
 	     block_id_ho++) {
 		if (!p_sw->need_update && !p_sm->p_subn->need_update &&
 		    !memcmp(p_block,
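
The growth step used by osm_switch_set_new_lft_entry() can be looked at in
isolation. What follows is a minimal standalone sketch, not the OpenSM code:
lft_buf_t, lft_buf_set(), BLOCK_SIZE and NO_PATH are hypothetical stand-ins
for the switch object, the setter, IB_SMP_DATA_SIZE and OSM_NO_PATH. As a
simplification it rounds the new size up to a whole number of chunks, so the
new size is always strictly greater than the LID being written, and it fills
the newly added tail with the "no path" marker.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 64	/* stand-in for IB_SMP_DATA_SIZE (assumption) */
#define NO_PATH 0xFF	/* stand-in for OSM_NO_PATH (assumption) */

typedef struct lft_buf {
	uint8_t *tbl;	/* forwarding table, indexed by LID */
	size_t size;	/* allocated bytes, a multiple of BLOCK_SIZE */
} lft_buf_t;

/*
 * Set tbl[lid] = port, growing the table in steps of `chunks` blocks.
 * Returns 0 on success, -1 if realloc fails (the table is left unchanged).
 */
static int lft_buf_set(lft_buf_t *b, uint16_t lid, uint8_t port,
		       unsigned chunks)
{
	assert(chunks >= 1);

	if (lid >= b->size) {
		size_t step = (size_t)chunks * BLOCK_SIZE;
		/* smallest multiple of step that is strictly greater than lid */
		size_t size = ((size_t)lid / step + 1) * step;
		uint8_t *p = realloc(b->tbl, size);

		if (!p)
			return -1;
		/* only the newly added tail needs initializing */
		memset(p + b->size, NO_PATH, size - b->size);
		b->tbl = p;
		b->size = size;
	}
	b->tbl[lid] = port;
	return 0;
}
```

A caller would start from an empty lft_buf_t ({ NULL, 0 }) and free tbl when
done. The case worth checking by hand is a LID that is an exact multiple of
BLOCK_SIZE, since that entry needs one more block than the LID value alone
suggests.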


From sashak at voltaire.com  Sat Aug 29 09:25:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 19:25:55 +0300
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd
	tables  setup flow
In-Reply-To: <f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
	<f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
Message-ID: <20090829162555.GA21238@me>

On 12:03 Fri 28 Aug     , Hal Rosenstock wrote:
> 
> lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8) with
> starting lane (0)
> ucast_mgr_route: lash: cannot build fwd tables.
> osm_ucast_mgr_process: minhop tables configured on all switches
> ERR 331D: LFT of switch 0xguid is not up to date.
> 
> Prior to this change, the LFTs were pushed for this fallback case (and no
> ERR 331D occurred).

Nice catch.

An addition like this is needed for the fallback to work properly:

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index b7e3893..39d825c 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr)
 		/* If configured routing algorithm failed, use default MinHop */
 		osm_ucast_mgr_build_lid_matrices(p_mgr);
 		ucast_mgr_build_lfts(p_mgr);
+		osm_ucast_mgr_set_fwd_tables(p_mgr);
 		p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP;
 	}

Sasha
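
For readers following the thread, the reason a missing push shows up as
ERR 331D can be sketched with a toy model. This is an illustration under
stated assumptions, not OpenSM code: toy_switch_t and the toy_* functions
are hypothetical, and the push is modeled as a direct copy, whereas the real
SM sends the blocks to the switch. The routing step only fills new_lft; the
cleanup check compares lft against new_lft, so building tables without
pushing them leaves the two out of sync.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TBL_SIZE 8	/* toy table size, far smaller than a real LFT */

typedef struct toy_switch {
	uint8_t lft[TBL_SIZE];		/* table the switch is believed to hold */
	uint8_t new_lft[TBL_SIZE];	/* table computed by the routing step */
} toy_switch_t;

/* Routing step: only fills new_lft (here every LID goes out one port). */
static void toy_build_lfts(toy_switch_t *sw, uint8_t port)
{
	assert(sw != NULL);
	memset(sw->new_lft, port, TBL_SIZE);
}

/* Push step, modeled as a direct copy of new_lft into lft. */
static void toy_set_fwd_tables(toy_switch_t *sw)
{
	memcpy(sw->lft, sw->new_lft, TBL_SIZE);
}

/* The kind of comparison cleanup_switch() makes: nonzero means stale. */
static int toy_lft_is_stale(const toy_switch_t *sw)
{
	return memcmp(sw->lft, sw->new_lft, TBL_SIZE) != 0;
}
```

In this model, calling toy_build_lfts() alone leaves toy_lft_is_stale()
returning nonzero; calling toy_set_fwd_tables() afterwards clears it.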


From sashak at voltaire.com  Sat Aug 29 09:28:52 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 19:28:52 +0300
Subject: [ofa-general] [PATCH v2] opensm/osm_ucast_mgr.c: simplify fwd tables
	setup flow
In-Reply-To: <20090829162555.GA21238@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
	<f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
	<20090829162555.GA21238@me>
Message-ID: <20090829162852.GB21238@me>


Simplify (and unify) forwarding tables setup decision flow.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_mgr.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 629f628..45a4a7e 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 		}
 	}
 
-	set_fwd_tbl_top(p_mgr, p_sw);
-
 	if (p_mgr->p_subn->opt.lmc)
 		free_ports_priv(p_mgr);
 
@@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
 			   p_mgr);
 
-	ucast_mgr_pipeline_fwd_tbl(p_mgr);
-
 	cl_qlist_remove_all(&p_mgr->port_order_list);
 
 	return 0;
@@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
 
 	osm->routing_engine_used = osm_routing_engine_type(r->name);
 
-	if (r->ucast_build_fwd_tables)
-		osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
+	osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
 
 	return 0;
 }
@@ -1063,6 +1058,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr)
 		/* If configured routing algorithm failed, use default MinHop */
 		osm_ucast_mgr_build_lid_matrices(p_mgr);
 		ucast_mgr_build_lfts(p_mgr);
+		osm_ucast_mgr_set_fwd_tables(p_mgr);
 		p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP;
 	}
 
-- 
1.6.4


From sashak at voltaire.com  Sat Aug 29 09:35:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 19:35:45 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Add SM priority
 changed into trap 144 description
In-Reply-To: <20090828134452.GA20014@comcast.net>
References: <20090828134452.GA20014@comcast.net>
Message-ID: <20090829163545.GC21238@me>

On 09:44 Fri 28 Aug     , Hal Rosenstock wrote:
> 
> Per MgtWG RefID #4503
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From roel.kluin at gmail.com  Sat Aug 29 13:25:38 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Sat, 29 Aug 2009 22:25:38 +0200
Subject: [ofa-general] [PATCH] IB: dereference of dev->ibdev.iwcm in
	c2_register_device()
Message-ID: <4A998EC2.70500@gmail.com>

The dev->ibdev.iwcm allocation may fail; check for NULL to prevent a dereference.

Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
---
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index f1948fa..0f90fe6 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -851,6 +851,10 @@ int c2_register_device(struct c2_dev *dev)
 	dev->ibdev.post_recv = c2_post_receive;
 
 	dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL);
+	if (dev->ibdev.iwcm == NULL) {
+		ret = -ENOMEM;
+		goto out1;
+	}
 	dev->ibdev.iwcm->add_ref = c2_add_ref;
 	dev->ibdev.iwcm->rem_ref = c2_rem_ref;
 	dev->ibdev.iwcm->get_qp = c2_get_qp;
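
The check-then-unwind pattern used by the patch can be tried outside the
kernel. The sketch below is a userspace analogue under stated assumptions:
toy_dev, toy_ops and toy_register() are hypothetical names, and the
allocator is passed in as a function pointer only so that the failure path
can be exercised. It is not the amso1100 driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_ops {
	int (*get)(void);
};

struct toy_dev {
	struct toy_ops *ops;
};

static int toy_get(void)
{
	return 42;
}

/* Allocator that always fails, for exercising the error path. */
static void *toy_failing_alloc(size_t n)
{
	(void)n;
	return NULL;
}

/* Returns 0 on success or -ENOMEM if the ops table cannot be allocated. */
static int toy_register(struct toy_dev *dev, void *(*alloc)(size_t))
{
	int ret;

	assert(dev != NULL);

	dev->ops = alloc(sizeof(*dev->ops));
	if (dev->ops == NULL) {
		/* bail out before the first dereference of dev->ops */
		ret = -ENOMEM;
		goto out;
	}
	dev->ops->get = toy_get;
	ret = 0;
out:
	return ret;
}
```

Passing malloc exercises the normal path and toy_failing_alloc the error
path; in the second case dev->ops stays NULL and is never dereferenced.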


From sashak at voltaire.com  Sat Aug 29 13:41:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 23:41:18 +0300
Subject: [ofa-general] Re: [PATCH] Duplicated file man/umad_get_mad.3 in
 libibumad/Makefile.am
In-Reply-To: <4A952478.7060407@bull.net>
References: <4A952478.7060407@bull.net>
Message-ID: <20090829204118.GD21238@me>

On 14:03 Wed 26 Aug     , Vincent Ficet wrote:
> 
> Hello,
> 
> the file man/umad_get_mad.3 was listed twice in libibumad/Makefile.am resulting in the following error:
> 
> /usr/bin/install: will not overwrite just-created `/home/vficet/work/infiniband/I686/usr/share/man/man3/umad_get_mad.3' with `man/umad_get_mad.3'
> 
> This patch removes the duplicated entry.
> 
> Cheers,
> 
> Vincent
> 
> 
> Signed-off-by: Jean-Vincent Ficet <jean-vincent.ficet at bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Aug 29 13:44:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 23:44:38 +0300
Subject: [ofa-general] Re: [PATCH] Remove duplicated umad_get_mad.3 from
	Makefile.am
In-Reply-To: <20090828190251.GA8633@obsidianresearch.com>
References: <20090828190251.GA8633@obsidianresearch.com>
Message-ID: <20090829204438.GE21238@me>

On 13:02 Fri 28 Aug     , Jason Gunthorpe wrote:
> Fixes builds on FC11.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe at obsidianresearch.com>

Thanks, Jason. A similar patch was already posted by Jean-Vincent Ficet,
so the fix is already applied.

Sasha


From sashak at voltaire.com  Sat Aug 29 13:44:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 23:44:57 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Only change method
 when > rather than >=
In-Reply-To: <20090825232024.GA17650@comcast.net>
References: <20090825232024.GA17650@comcast.net>
Message-ID: <20090829204457.GG21238@me>

On 19:20 Tue 25 Aug     , Hal Rosenstock wrote:
> 
> Also, cosmetic formatting change to combine lines like:
> 	uint16_t host_attr;
> 	host_attr = cl_ntoh16(attr);
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Aug 29 13:45:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 29 Aug 2009 23:45:08 +0300
Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question
In-Reply-To: <f0e08f230908071138m10e9a574g8d84623a99f89527@mail.gmail.com>
References: <f0e08f230908071138m10e9a574g8d84623a99f89527@mail.gmail.com>
Message-ID: <20090829204508.GH21238@me>

Hi Hal,

On 14:38 Fri 07 Aug     , Hal Rosenstock wrote:
> 
> osm_link_mgr.c:link_mgr_get_smsl has the following:
> 
>         /* Find osm_port of the source = p_physp */
>         slid = osm_physp_get_base_lid(p_physp);
>         p_src_port =
>             cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, cl_ntoh16(slid));
> 
>         /* Call lash to find proper SL */
>         sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port);
> 
> It may be that this code is invoked prior to the LID being assigned

How is this possible? In the code I can see that link_mgr_process() is
always executed after the lid_mgr run.

> so
> getting the p_src_port based on the LID yields NULL and then calling
> osm_get_lash_sl causes a seg fault.
> 
> I can see two ways to fix this:
> 1. Replace with port GUID search
> 2. Have osm_get_lash_sl handle NULL for p_src_port
> Maybe you see other ways to deal with this.
> 
> Do you have a preferred approach ?

Hmm, SMSL will be irrelevant for a port where LID was not assigned,
right?

If so, then it is probably enough to add in link_mgr_get_smsl():

	if (!p_src_port)
		return;

But it would be much better to understand the source of the error before
deciding on a proper solution.
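
For illustration only, the guard can be sketched outside of OpenSM as
below. This is a minimal sketch, not the OpenSM code: struct port,
port_by_lid(), get_smsl(), TBL_SIZE and the default-SL parameter are all
hypothetical stand-ins, and returning a caller-supplied default when the
lookup fails is an assumption about what "irrelevant" should mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TBL_SIZE 8	/* hypothetical size of the LID-indexed table */

struct port {
	uint8_t sl;	/* hypothetical: SL that would be computed for this port */
};

/* Hypothetical stand-in for the port_lid_tbl lookup: returns NULL when
 * no port is registered under the given LID (for example LID 0). */
static struct port *port_by_lid(struct port *tbl[], uint16_t lid)
{
	assert(tbl != NULL);
	return lid < TBL_SIZE ? tbl[lid] : NULL;
}

/* Returns the SL of the source port, or dflt_sl when the LID lookup
 * yields no port, instead of dereferencing a NULL pointer. */
static uint8_t get_smsl(struct port *tbl[], uint16_t slid, uint8_t dflt_sl)
{
	struct port *p_src_port = port_by_lid(tbl, slid);

	if (!p_src_port)	/* LID not assigned: nothing to compute */
		return dflt_sl;
	return p_src_port->sl;
}
```

With such a guard an unassigned LID falls back to the default instead
of crashing, but it would also hide whatever left the LID unassigned.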

Sasha


From hal.rosenstock at gmail.com  Sat Aug 29 15:59:14 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 29 Aug 2009 18:59:14 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables
	setup flow
In-Reply-To: <20090829162555.GA21238@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
	<f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
	<20090829162555.GA21238@me>
Message-ID: <f0e08f230908291559p1c14a7edkfe642a05c458bd46@mail.gmail.com>

On 8/29/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 12:03 Fri 28 Aug     , Hal Rosenstock wrote:
> >
> > lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8)
> with
> > starting lane (0)
> > ucast_mgr_route: lash: cannot build fwd tables.
> > osm_ucast_mgr_process: minhop tables configured on all switches
> > ERR 331D: LFT of switch 0xguid is not up to date.
> >
> > Prior to this change, the LFTs were pushed for this fallback case (and no
> > ERR 331D occured).
>
> Nice catch.
>
> Such addition is needed to make a fallback to work properly:
>
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index b7e3893..39d825c 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr)
>                /* If configured routing algorithm failed, use default
> MinHop */
>                osm_ucast_mgr_build_lid_matrices(p_mgr);
>                ucast_mgr_build_lfts(p_mgr);
> +               osm_ucast_mgr_set_fwd_tables(p_mgr);


Shouldn't this be osm_ucast_mgr_set_fwd_table ?

-- Hal

>                p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP;
>        }
>
> Sasha
>

From hal.rosenstock at gmail.com  Sat Aug 29 16:05:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 29 Aug 2009 19:05:38 -0400
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_ucast_mgr.c: simplify fwd
	tables setup flow
In-Reply-To: <20090829162852.GB21238@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
	<f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
	<20090829162555.GA21238@me> <20090829162852.GB21238@me>
Message-ID: <f0e08f230908291605g47114ef3o125af7b2386b7e0c@mail.gmail.com>

On 8/29/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>
> Simplify (and unify) forwarding tables setup decision flow.
>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
> opensm/opensm/osm_ucast_mgr.c |    8 ++------
> 1 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 629f628..45a4a7e 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
>                }
>        }
>
> -       set_fwd_tbl_top(p_mgr, p_sw);
> -
>        if (p_mgr->p_subn->opt.lmc)
>                free_ports_priv(p_mgr);
>
> @@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
>        cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
> ucast_mgr_process_tbl,
>                           p_mgr);
>
> -       ucast_mgr_pipeline_fwd_tbl(p_mgr);
> -
>        cl_qlist_remove_all(&p_mgr->port_order_list);
>
>        return 0;
> @@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine
> *r, osm_opensm_t * osm)
>
>        osm->routing_engine_used = osm_routing_engine_type(r->name);
>
> -       if (r->ucast_build_fwd_tables)
> -               osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
> +       osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
>
>        return 0;
> }
> @@ -1063,6 +1058,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr)
>                /* If configured routing algorithm failed, use default
> MinHop */
>                osm_ucast_mgr_build_lid_matrices(p_mgr);
>                ucast_mgr_build_lfts(p_mgr);
> +               osm_ucast_mgr_set_fwd_tables(p_mgr);


                     osm_ucast_mgr_set_fwd_table(p_mgr); ?


>                p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP;
>        }
>
> --
> 1.6.4
>
>

From tziporet at dev.mellanox.co.il  Sat Aug 29 23:59:03 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 30 Aug 2009 09:59:03 +0300
Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support
In-Reply-To: <20090824121307.GA3919@mtls03>
References: <20090819171935.GA14411@mtls03> <20090824121307.GA3919@mtls03>
Message-ID: <4A9A2337.3030500@mellanox.co.il>

Eli Cohen wrote:
> Roland,
>
> what about this series of patches? Would you like me to re-create them
> over your xrc branch or would you rather take them before xrc?
>   
>
>   
Hi Roland
We are waiting for your input on how to proceed.

Thanks
Tziporet


From vlad at lists.openfabrics.org  Sun Aug 30 03:05:46 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 30 Aug 2009 03:05:46 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090830-0200 daily build status
Message-ID: <20090830100547.4013EF20436@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From sashak at voltaire.com  Sun Aug 30 03:02:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 13:02:53 +0300
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd
	tables setup flow
In-Reply-To: <f0e08f230908291559p1c14a7edkfe642a05c458bd46@mail.gmail.com>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me>
	<20090828080756.GH28379@me>
	<f0e08f230908280903ke375d0aoa2a22e13da8b52b8@mail.gmail.com>
	<20090829162555.GA21238@me>
	<f0e08f230908291559p1c14a7edkfe642a05c458bd46@mail.gmail.com>
Message-ID: <20090830100253.GA21909@me>

On 18:59 Sat 29 Aug     , Hal Rosenstock wrote:
> >
> > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> > index b7e3893..39d825c 100644
> > --- a/opensm/opensm/osm_ucast_mgr.c
> > +++ b/opensm/opensm/osm_ucast_mgr.c
> > @@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr)
> >                /* If configured routing algorithm failed, use default
> > MinHop */
> >                osm_ucast_mgr_build_lid_matrices(p_mgr);
> >                ucast_mgr_build_lfts(p_mgr);
> > +               osm_ucast_mgr_set_fwd_tables(p_mgr);
> 
> 
> 
> Shouldn't this be osm_ucast_mgr_set_fwd_table ?

Yes, it should be, but I renamed the function later; it is now
osm_ucast_mgr_set_fwd_tables() (since it sets all tables rather than one
per switch as before). By mistake I pushed this last patch before the
rename, so we have a broken patch in the history (something I normally
try to avoid).

Sasha


From sashak at voltaire.com  Sun Aug 30 03:08:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 13:08:26 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: In lash_core,
 return status -1 for all errors
In-Reply-To: <20090806182315.GB21698@comcast.net>
References: <20090806182315.GB21698@comcast.net>
Message-ID: <20090830100826.GB21909@me>

On 14:23 Thu 06 Aug     , Hal Rosenstock wrote:
> 
> In lash_process, rename variable from return_status to status
> Also, status is not really IB_SUCCESS or not (although that works)
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From jackm at dev.mellanox.co.il  Sun Aug 30 03:31:51 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Aug 2009 13:31:51 +0300
Subject: [ofa-general] [PATCH V3] mlx4: Do not allow ib userspace open
	following a fatal event
Message-ID: <200908301331.51212.jackm@dev.mellanox.co.il>

Userspace apps are supposed to release all ib device resources if
they receive a fatal async event (IBV_EVENT_DEVICE_FATAL).  However,
the app has no way of knowing when the device has come back up, except
to repeatedly attempt ibv_open_device() until it succeeds.

However, currently there is no protection against open succeeding when
the device is in the midst of the removal following the fatal event.
In this case, the open will succeed, but as a result the device waits
in the middle of its removal until the new app releases its ib resources
 -- and the new app will not do so, since the open succeeded at a point
following the fatal event generation.

This patch adds an "active" flag to the device. The active flag is set to
false (in the fatal event flow) before the "fatal" event is generated,
so any subsequent ibv_open_device() call to the device will fail until the
device comes back up, thus preventing the above deadlock.
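
The idea can be sketched in plain userspace C as follows. This is only
an illustration of the flag ordering described above, not the driver
code: struct fake_dev and the fake_*() helpers are hypothetical names,
and the sketch ignores locking and memory ordering entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device model: one flag plus a count of open contexts. */
struct fake_dev {
	int ib_active;
	int open_contexts;
};

/* Device registration finished: opens are allowed from now on. */
static void fake_add(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->open_contexts = 0;
	dev->ib_active = 1;
}

/* Fatal event flow: clear the flag first, then (not shown) deliver the
 * event, so that no open can succeed after the event was generated. */
static void fake_fatal_event(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->ib_active = 0;
}

/* Open path: refuse with -EAGAIN while the device is not active. */
static int fake_open(struct fake_dev *dev)
{
	assert(dev != NULL);
	if (!dev->ib_active)
		return -EAGAIN;
	dev->open_contexts++;
	return 0;
}
```

An application retrying its open would then keep seeing -EAGAIN until
the device has been added again, rather than grabbing a context in the
middle of the removal.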

V2: move active flag from net to hw/mlx4, and use only for fatal event flow.
(per feedback from Roland).

V3: fixed checkpatch.pl warnings.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

---
Roland,
Sorry about the checkpatch.pl oversight.  No excuse, but that day I was particularly
rushed -- I left for the airport that evening with my family to go on vacation for 2 weeks.
I guess I cut some corners, and shouldn't have.

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..4effc19 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -342,6 +342,9 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev,
 	struct mlx4_ib_alloc_ucontext_resp resp;
 	int err;
 
+	if (!dev->ib_active)
+		return ERR_PTR(-EAGAIN);
+
 	resp.qp_tab_size      = dev->dev->caps.num_qps;
 	resp.bf_reg_size      = dev->dev->caps.bf_reg_size;
 	resp.bf_regs_per_page = dev->dev->caps.bf_regs_per_page;
@@ -673,6 +676,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 			goto err_reg;
 	}
 
+	ibdev->ib_active = 1;
+
 	return ibdev;
 
 err_reg:
@@ -729,6 +734,7 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 		break;
 
 	case MLX4_DEV_EVENT_CATASTROPHIC_ERROR:
+		ibdev->ib_active = 0;
 		ibev.event = IB_EVENT_DEVICE_FATAL;
 		break;
 
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 8a7dd67..b22df97 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -175,6 +175,7 @@ struct mlx4_ib_dev {
 	spinlock_t		sm_lock;
 
 	struct mutex		cap_mask_mutex;
+	int			ib_active;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)


From sashak at voltaire.com  Sun Aug 30 03:36:15 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 13:36:15 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Remove edges in lash
	matrix
In-Reply-To: <20090806223417.GA2997@comcast.net>
References: <20090806223417.GA2997@comcast.net>
Message-ID: <20090830103615.GC21909@me>

Hi Hal,

On 18:34 Thu 06 Aug     , Hal Rosenstock wrote:
> 
> @@ -773,6 +838,7 @@ static void seed_axes(lash_t *p_lash, int sw)
>  	mesh_node_t *node = p_lash->switches[sw]->node;
>  	int n = node->num_links;
>  	int i, j, c;
> +	char buf[256], *p;
>  
>  	OSM_LOG_ENTER(p_log);
>  	if (!node->matrix || !node->dimension)
> @@ -805,6 +871,12 @@ static void seed_axes(lash_t *p_lash, int sw)
>  		}
>  	}
>  
> +	for (i = 0; i < n; i++) {
> +		p = buf;
> +		print_axis(p_lash, p, sw, i);
> +		OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
> +	}
> +

As far as I can see these are only debug prints, so why OSM_LOG_INFO here?
Also, please move the whole chunk under:

	if (osm_log_is_active(p_log, OSM_LOG_DEBUG)) {
		char buf[256], *p;
		....
	}
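
Spelled out a bit more, the pattern would look roughly like the sketch
below. It does not use the OpenSM log API: struct log, log_is_active(),
LOG_DEBUG and dump_axes() are hypothetical stand-ins, and the point is
only that the buffer and the formatting loop live inside the level
check, so nothing is formatted when debug logging is off.

```c
#include <assert.h>
#include <stdio.h>

#define LOG_DEBUG 0x08	/* hypothetical level bit */

struct log {
	unsigned level;	/* bitmask of enabled levels */
};

static int log_is_active(const struct log *l, unsigned level)
{
	return (l->level & level) != 0;
}

/* Prints one line per axis, but only when debug logging is enabled.
 * Returns the number of lines emitted. */
static unsigned dump_axes(const struct log *l, const int *axes, int n)
{
	unsigned emitted = 0;

	assert(l != NULL && axes != NULL);
	if (log_is_active(l, LOG_DEBUG)) {
		char buf[256];	/* only needed on the debug path */
		int i;

		for (i = 0; i < n; i++) {
			snprintf(buf, sizeof(buf), "axis[%d] = %d", i, axes[i]);
			fprintf(stderr, "%s\n", buf);
			emitted++;
		}
	}
	return emitted;
}
```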

>  done:
>  	OSM_LOG_EXIT(p_log);
>  }
> @@ -878,6 +950,12 @@ static void make_geometry(lash_t *p_lash, int sw)
>  			n = s1->node->num_links;
>  
>  			/*
> +			 * ignore chain fragments
> +			 */
> +			if (n < seed->node->num_links && n <= 2)
> +				continue;
> +
> +			/*
>  			 * only process 'mesh' switches
>  			 */
>  			if (!s1->node->matrix)
> @@ -908,7 +986,8 @@ static void make_geometry(lash_t *p_lash, int sw)
>  					if (j == i)
>  						continue;
>  
> -					if (s1->node->matrix[i][j] != 2) {
> +					if (s1->node->matrix[i][j] != 2 &&
> +						s1->node->matrix[i][j] <= 4) {

What does this ' <= 4' check?

Sasha

>  						if (s1->node->axes[j]) {
>  							if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) {
>  								OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n");
> 


From sashak at voltaire.com  Sun Aug 30 04:26:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 14:26:24 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Parallelize (Stripe) MFT sets
	across switches
In-Reply-To: <20090807164127.GA795@comcast.net>
References: <20090807164127.GA795@comcast.net>
Message-ID: <20090830112624.GD21909@me>

Hi Hal,

On 12:41 Fri 07 Aug     , Hal Rosenstock wrote:
> 
> Similar to previous patch to "Parallelize (Stripe) LFT sets across switches".
> Currently, MADs are pipelined to a single switch first which effectively
> serializes these requests. This patch pipelines the MFT set MADs across
> switches first (before cycling to the next MFT block) so that multiple
> switches can be responding concurrently. Speedup is dependent on number
> of MFT blocks in use (number of MLIDs) which is dependent on the number
> of multicast groups.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index 7ce28c5..e281842 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -103,6 +103,8 @@ typedef struct osm_switch {
>  	uint8_t *lft;
>  	uint8_t *new_lft;
>  	osm_mcast_tbl_t mcast_tbl;
> +	uint32_t mft_block_num;
> +	uint32_t mft_position;
>  	unsigned endport_links;
>  	unsigned need_update;
>  	void *priv;
> diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
> index 4dbbaa0..f91c6b6 100644
> --- a/opensm/opensm/osm_mcast_mgr.c
> +++ b/opensm/opensm/osm_mcast_mgr.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
>   *
> @@ -325,15 +325,12 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
>  {
>  	osm_node_t *p_node;
>  	osm_dr_path_t *p_path;
> -	osm_madw_context_t mad_context;
> +	osm_madw_context_t context;
>  	ib_api_status_t status;
> -	uint32_t block_id_ho = 0;
> -	int16_t block_num = 0;
> -	uint32_t position = 0;
> -	uint32_t max_position;
> +	uint32_t block_id_ho;
>  	osm_mcast_tbl_t *p_tbl;
>  	ib_net16_t block[IB_MCAST_BLOCK_SIZE];
> -	int ret = 0;
> +	int ret = -1;
>  
>  	CL_ASSERT(sm);
>  
> @@ -353,36 +350,34 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
>  	   configuration.
>  	 */
>  
> -	mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node);
> -	mad_context.mft_context.set_method = TRUE;
> +	context.mft_context.node_guid = osm_node_get_node_guid(p_node);
> +	context.mft_context.set_method = TRUE;
>  
>  	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
> -	max_position = p_tbl->max_position;
>  
> -	while (osm_mcast_tbl_get_block(p_tbl, block_num,
> -				       (uint8_t) position, block)) {
> -		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> -			"Writing MFT block 0x%X\n", block_id_ho);
> +	if (p_sw->mft_position <= p_tbl->max_position &&
> +	    osm_mcast_tbl_get_block(p_tbl, p_sw->mft_block_num,
> +				    (uint8_t) p_sw->mft_position, block)) {
> +
> +		block_id_ho = p_sw->mft_block_num + (p_sw->mft_position << 28);
>  
> -		block_id_ho = block_num + (position << 28);
> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> +			"Writing MFT block %u position %u to switch 0x%" PRIx64 "\n",
> +			p_sw->mft_block_num, p_sw->mft_position,
> +			cl_ntoh64(context.lft_context.node_guid));
>  
>  		status = osm_req_set(sm, p_path, (void *)block, sizeof(block),
>  				     IB_MAD_ATTR_MCAST_FWD_TBL,
>  				     cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
> -				     &mad_context);
> +				     &context);
>  
> -		if (status != IB_SUCCESS) {
> +		if (status != IB_SUCCESS)
>  			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: "
> -				"Sending multicast fwd. tbl. block failed (%s)\n",
> +				"Sending MFT block failed (%s)\n",
>  				ib_get_err_str(status));
> -			ret = -1;
> -		}
>  
> -		if (++position > max_position) {
> -			position = 0;
> -			block_num++;
> -		}
> -	}
> +	} else
> +		ret = 0;
>  
>  	OSM_LOG_EXIT(sm->p_log);
>  	return ret;
> @@ -1077,7 +1072,8 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
>  	cl_qmap_t *p_sw_tbl;
>  	cl_qlist_t *p_list = &sm->mgrp_list;
>  	osm_mgrp_t *p_mgrp;
> -	int i, ret = 0;
> +	osm_mcast_tbl_t *p_tbl;
> +	int sws_notdone, i, ret = 0;
>  
>  	OSM_LOG_ENTER(sm->p_log);
>  
> @@ -1114,11 +1110,30 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
>  	 */
>  	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
>  	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> -		if (mcast_mgr_set_tbl(sm, p_sw))
> -			ret = -1;
> +		p_sw->mft_block_num = 0;
> +		p_sw->mft_position = 0;
>  		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
>  	}
>  
> +	while (1) {
> +		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> +		sws_notdone = 0;
> +		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> +			if (mcast_mgr_set_tbl(sm, p_sw))
> +				sws_notdone++;
> +			p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
> +			if (++p_sw->mft_position > p_tbl->max_position) {
> +				p_sw->mft_position = 0;
> +				p_sw->mft_block_num++;
> +			}
> +			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
> +		}
> +		if (!sws_notdone) {
> +			ret = -1;
> +			break;
> +		}

So osm_mcast_mgr_process() will always return -1? Why?

> +	}
> +
>  	while (!cl_is_qlist_empty(p_list)) {
>  		cl_list_item_t *p = cl_qlist_remove_head(p_list);
>  		free(p);
> @@ -1142,9 +1157,10 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
>  	osm_switch_t *p_sw;
>  	cl_qmap_t *p_sw_tbl;
>  	osm_mgrp_t *p_mgrp;
> +	osm_mcast_tbl_t *p_tbl;
>  	ib_net16_t mlid;
>  	osm_mcast_mgr_ctxt_t *ctx;
> -	int ret = 0;
> +	int sws_notdone, ret = 0;
>  
>  	OSM_LOG_ENTER(sm->p_log);
>  
> @@ -1195,11 +1211,30 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
>  	p_sw_tbl = &sm->p_subn->sw_guid_tbl;
>  	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
>  	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> -		if (mcast_mgr_set_tbl(sm, p_sw))
> -			ret = -1;
> +		p_sw->mft_block_num = 0;
> +		p_sw->mft_position = 0;
>  		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
>  	}
>  
> +	while (1) {
> +		p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> +		sws_notdone = 0;
> +		while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> +			if (mcast_mgr_set_tbl(sm, p_sw))
> +				sws_notdone++;
> +			p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
> +			if (++p_sw->mft_position > p_tbl->max_position) {
> +				p_sw->mft_position = 0;
> +				p_sw->mft_block_num++;
> +			}
> +			p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
> +		}
> +		if (!sws_notdone) {
> +			ret = -1;
> +			break;

Ditto.

> +		}
> +	}
> +

Could you consolidate this code, which is equivalent to the one in
osm_mcast_mgr_process(), into a single function, say
mcast_mgr_set_mftables()?

Also, similar to the LFTs case, it would be nice to simplify this table
setup loop.
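
Something along these lines is what I have in mind, shown here as a
standalone sketch rather than OpenSM code. struct sw, sw_send_next()
and set_tables_striped() are hypothetical names, the per-block position
handling is left out, and "sending" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-switch state for the striped table setup. */
struct sw {
	unsigned nblocks;	/* blocks this switch needs to receive */
	unsigned next_block;	/* cursor: next block to send */
	unsigned sent;		/* blocks sent so far */
};

/* Sends the next pending block of one switch.
 * Returns 1 if a block was sent, 0 if the switch is already done. */
static int sw_send_next(struct sw *s)
{
	if (s->next_block >= s->nblocks)
		return 0;
	s->next_block++;
	s->sent++;
	return 1;
}

/* One block per switch per pass, until no switch has anything left.
 * Returns the number of passes in which at least one block was sent;
 * "all done" is reported by leaving the loop, not by an error code. */
static unsigned set_tables_striped(struct sw *sws, size_t n)
{
	unsigned passes = 0;
	size_t i;
	int pending;

	assert(sws != NULL || n == 0);
	for (i = 0; i < n; i++) {
		sws[i].next_block = 0;
		sws[i].sent = 0;
	}
	do {
		pending = 0;
		for (i = 0; i < n; i++)
			pending += sw_send_next(&sws[i]);
		if (pending)
			passes++;
	} while (pending);
	return passes;
}
```

Both callers would then reset nothing themselves and just call the one
helper, keeping completion separate from the error return value.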

Sasha

>  	osm_dump_mcast_routes(sm->p_subn->p_osm);
>  
>  exit:
> 


From hal.rosenstock at gmail.com  Sun Aug 30 04:32:41 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 30 Aug 2009 07:32:41 -0400
Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question
In-Reply-To: <20090829204508.GH21238@me>
References: <f0e08f230908071138m10e9a574g8d84623a99f89527@mail.gmail.com>
	<20090829204508.GH21238@me>
Message-ID: <f0e08f230908300432x3a51db2er6c787957702a8c4f@mail.gmail.com>

Hi Sasha,

On 8/29/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Hal,
>
> On 14:38 Fri 07 Aug     , Hal Rosenstock wrote:
> >
> > osm_link_mgr.c:link_mgr_get_smsl has the following:
> >
> >         /* Find osm_port of the source = p_physp */
> >         slid = osm_physp_get_base_lid(p_physp);
> >         p_src_port =
> >             cl_ptr_vector_get(&sm->p_subn->port_lid_tbl,
> cl_ntoh16(slid));
> >
> >         /* Call lash to find proper SL */
> >         sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port);
> >
> > It may be that this code is invoked prior to the LID being assigned
>
> How is it possible? In the code I can see that link_mgr_process() is
> always executed after lid_mgr run.


 When nodes use gPXE, the LID is not passed from the gPXE to the Linux
environment.


> > so
> > getting the p_src_port based on the LID yields NULL and then calling
> > osm_get_lash_sl causes a seg fault.
> >
> > I can see two ways to fix this:
> > 1. Replace with port GUID search
> > 2. Have osm_get_lash_sl handle NULL for p_src_port
> > Maybe you see other ways to deal with this.
> >
> > Do you have a preferred approach ?
>
> Hmm, SMSL will be irrelevant for a port where LID was not assigned,
> right?


Of course.


> If so than it is probably just enough to add in link_mgr_get_smsl():
>
>        if (!p_src_port)
>                return;


OK.

-- Hal

> But it would be really better to understand an error source before deciding
> about proper solution.


> Sasha
>

From sashak at voltaire.com  Sun Aug 30 04:36:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 14:36:30 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for MulticastFDBTop
In-Reply-To: <20090826140202.GA19158@comcast.net>
References: <20090826140202.GA19158@comcast.net>
Message-ID: <20090830113630.GE21909@me>

On 10:02 Wed 26 Aug     , Hal Rosenstock wrote:
> 
> Add support for SwitchInfo:MulticastFDBTop and
> PortInfo:CapabilityMask.IsMulticastFDBTopSupported
> 
> Added by MgtWG errata #4505-4508
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 30 04:53:16 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 14:53:16 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for
 MulticastFDBTop
In-Reply-To: <20090826140350.GB19158@comcast.net>
References: <20090826140350.GB19158@comcast.net>
Message-ID: <20090830115316.GF21909@me>

On 10:03 Wed 26 Aug     , Hal Rosenstock wrote:
> 
> Add support for SwitchInfo:MulticastFDBTop
> Added by MgtWG errata #4505-4508 and #4640
> 
> If MulticastFDBTop is set to other than 0, only fetch MulticastForwardingTable
> blocks up through MulticastFDBTop rather than MulticastFDBCap
> 
> If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640)
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
> index 106c934..f3ebe56 100644
> --- a/infiniband-diags/src/ibroute.c
> +++ b/infiniband-diags/src/ibroute.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
> + * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
>  	char *s;
>  	uint64_t nodeguid;
>  	uint32_t mod;
> -	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
> +	unsigned block, i, j, e, nports, cap, top, chunks,
> +		 startblock, lastblock;
>  	int n = 0;
>  
>  	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
>  		return s;
>  
>  	mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
> +	mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
>  
>  	if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
>  		endlid = IB_MIN_MCAST_LID + cap - 1;
> +	if (!dump_all && top && top < endlid) {
> +		if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff)

I don't understand what this "top == 0xffff" check is for.

Shouldn't it be something like

	(top > IB_MIN_MCAST_LID + cap - 1 && top != 0xbfff)

instead?

> +			IBWARN("illegal top mlid %x", top);
> +		else
> +			endlid = top;
> +	}

And where is the "no entries" case (top = 0xbfff) handled (as
declared in the change log)?
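
To make the question concrete, here is one possible reading of the
change log as a standalone sketch. It is an assumption, not the ibroute
code: last_mlid(), MIN_MCAST_LID and MCAST_TOP_EMPTY are illustrative
names, cap is assumed to be non-zero, and treating any other
out-of-range top as illegal (falling back to the cap-based limit) is my
interpretation.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_MCAST_LID   0xC000u	/* first multicast LID */
#define MCAST_TOP_EMPTY 0xBFFFu	/* "no entries", per the change log */

/* Returns the last MLID worth dumping, or 0 when the table is declared
 * empty. *illegal is set when top is outside the range allowed by cap.
 * Assumes cap >= 1. */
static unsigned last_mlid(unsigned cap, unsigned top, int *illegal)
{
	unsigned cap_end = MIN_MCAST_LID + cap - 1;

	assert(illegal != NULL);
	*illegal = 0;

	if (top == 0)			/* top not set: use the capability */
		return cap_end;
	if (top == MCAST_TOP_EMPTY)	/* explicit "no entries" */
		return 0;
	if (top < MIN_MCAST_LID || top > cap_end) {
		*illegal = 1;		/* warn and fall back to cap */
		return cap_end;
	}
	return top;
}
```

In this reading the 0xbfff case gets its own branch, and 0xffff is just
one of the values that fall outside the cap range.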

Sasha

>  
>  	if (!startlid)
>  		startlid = IB_MIN_MCAST_LID;
> @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
>  		printf(" MLid\n");
>  	}
>  	if (ibverbose)
> -		printf("Switch multicast mlid capability is %d\n", cap);
> +		printf("Switch multicast mlid capability is %d top is %d\n",
> +		       cap, top);
>  
>  	chunks = ALIGN(nports + 1, 16) / 16;
>  
> 


From sashak at voltaire.com  Sun Aug 30 05:00:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:00:11 +0300
Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question
In-Reply-To: <f0e08f230908300432x3a51db2er6c787957702a8c4f@mail.gmail.com>
References: <f0e08f230908071138m10e9a574g8d84623a99f89527@mail.gmail.com>
	<20090829204508.GH21238@me>
	<f0e08f230908300432x3a51db2er6c787957702a8c4f@mail.gmail.com>
Message-ID: <20090830120011.GG21909@me>

On 07:32 Sun 30 Aug     , Hal Rosenstock wrote:
> > >
> > > osm_link_mgr.c:link_mgr_get_smsl has the following:
> > >
> > >         /* Find osm_port of the source = p_physp */
> > >         slid = osm_physp_get_base_lid(p_physp);
> > >         p_src_port =
> > >             cl_ptr_vector_get(&sm->p_subn->port_lid_tbl,
> > cl_ntoh16(slid));
> > >
> > >         /* Call lash to find proper SL */
> > >         sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port);
> > >
> > > It may be that this code is invoked prior to the LID being assigned
> >
> > How is it possible? In the code I can see that link_mgr_process() is
> > always executed after lid_mgr run.
> 
>  When nodes use gPXE, the LID is not passed from the gPXE to the Linux
> environment.

How is it related to gPXE?

OpenSM's lid manager runs and assigns lids to all available endports;
only after this does the link manager run and deal with SMSL - at this point
all lids should be in place and p_subn->port_lid_tbl should be fine.

Am I missing something?

Sasha


From sashak at voltaire.com  Sun Aug 30 05:04:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:04:55 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support for
	MulticastFDBTop
In-Reply-To: <20090826140450.GC19158@comcast.net>
References: <20090826140450.GC19158@comcast.net>
Message-ID: <20090830120455.GH21909@me>

On 10:04 Wed 26 Aug     , Hal Rosenstock wrote:
> @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info {
>  	ib_net16_t lids_per_port;
>  	ib_net16_t enforce_cap;
>  	uint8_t flags;
> +	uint8_t resvd;
> +	ib_net16_t mcast_top;
>  } PACK_SUFFIX ib_switch_info_t;
>  #include <complib/cl_packoff.h>
>  /************/
> @@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record {
>  	ib_net16_t lid;
>  	uint16_t resv0;
>  	ib_switch_info_t switch_info;
> -	uint8_t pad[3];
> +	uint8_t pad[1];

Why should it be pad[1] here? In struct switch_info you are adding three
bytes (resvd - 1 and mcast_top - 2), no?
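
To illustrate the size arithmetic, here is a minimal sketch with stand-in
types rather than the real ib_types.h definitions. The sketch_* names and
the fields not shown in the quoted hunk (lin_top, def_port,
def_mcast_pri_port) are assumptions. Under this layout the switch info
grows from 17 to 20 bytes, so a record that was 2 + 2 + 17 + 3 = 24 bytes
stays at 24 only if the whole pad[3] goes away:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ib_net16_t; the real typedef lives in ib_types.h. */
typedef uint16_t net16_t;

/* Hypothetical mirror of the packed SwitchInfo layout after the patch.
 * Fields not visible in the quoted hunk are assumed for illustration. */
#pragma pack(push, 1)
typedef struct sketch_switch_info {
	net16_t lin_cap;
	net16_t rand_cap;
	net16_t mcast_cap;
	net16_t lin_top;		/* assumed */
	uint8_t def_port;		/* assumed */
	uint8_t def_mcast_pri_port;	/* assumed */
	uint8_t def_mcast_not_port;
	uint8_t life_state;
	net16_t lids_per_port;
	net16_t enforce_cap;
	uint8_t flags;
	uint8_t resvd;			/* new: 1 byte */
	net16_t mcast_top;		/* new: 2 bytes */
} sketch_switch_info_t;

typedef struct sketch_switch_info_record {
	net16_t lid;
	uint16_t resv0;
	sketch_switch_info_t switch_info;
	/* no pad[]: the three new bytes above take its place */
} sketch_switch_info_record_t;
#pragma pack(pop)

/* Pad bytes still needed after the embedded struct grows. */
size_t sketch_pad_needed(size_t old_info, size_t old_pad, size_t new_info)
{
	size_t growth = new_info - old_info;

	return growth >= old_pad ? 0 : old_pad - growth;
}

/* Compile-time checks of the arithmetic discussed above. */
static_assert(sizeof(sketch_switch_info_t) == 20, "17 + 1 + 2 bytes");
static_assert(sizeof(sketch_switch_info_record_t) == 24, "size unchanged");
static_assert(offsetof(sketch_switch_info_t, mcast_top) % 2 == 0,
	      "mcast_top is 16-bit aligned");
```

With these numbers, sketch_pad_needed(17, 3, 20) gives 0, not 1.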

Sasha


From sashak at voltaire.com  Sun Aug 30 05:11:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:11:57 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags: Fix IB network discovery
 from switch node.
In-Reply-To: <4A9548AA.4020900@gmail.com>
References: <4A9548AA.4020900@gmail.com>
Message-ID: <20090830121157.GI21909@me>

On 17:37 Wed 26 Aug     , Eli Dorfman (Voltaire) wrote:
> Subject: [PATCH] Fix IB network discovery from switch node.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Next time, please add a descriptive change log to your patches.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:19:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:19:05 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Add CounterSelect2
 field to PortCounters attribute
In-Reply-To: <20090826155447.GA25235@comcast.net>
References: <20090826155447.GA25235@comcast.net>
Message-ID: <20090830121905.GJ21909@me>

On 11:54 Wed 26 Aug     , Hal Rosenstock wrote:
> 
> Per MgtWG RefID #4527
> 
> Also, cosmetic commentary change
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Next time, could you add a more descriptive change log to your patches?
"RefID #4527" by itself doesn't say a lot (and the RefID text is available
only in the member area of the IBTA site).

Sasha


From hal.rosenstock at gmail.com  Sun Aug 30 05:20:33 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 30 Aug 2009 08:20:33 -0400
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Add CounterSelect2 
	field to PortCounters attribute
In-Reply-To: <20090830121905.GJ21909@me>
References: <20090826155447.GA25235@comcast.net> <20090830121905.GJ21909@me>
Message-ID: <f0e08f230908300520o7a8fb7c3p60e7a3802dd59988@mail.gmail.com>

On 8/30/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 11:54 Wed 26 Aug     , Hal Rosenstock wrote:
> >
> > Per MgtWG RefID #4527
> >
> > Also, cosmetic commentary change
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>
> Applied. Thanks.
>
> Next time could you add more descriptive change log to your patches -
> "RefID #4527" by itself doesn't say a lot (and RefID texts is available
> only in member area of IBTA site).


There is a public version now.

-- Hal

> Sasha
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com  Sun Aug 30 05:22:09 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 30 Aug 2009 08:22:09 -0400
Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support for 
	MulticastFDBTop
In-Reply-To: <20090830120455.GH21909@me>
References: <20090826140450.GC19158@comcast.net> <20090830120455.GH21909@me>
Message-ID: <f0e08f230908300522m37187fbciee62dace5d767962@mail.gmail.com>

On 8/30/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 10:04 Wed 26 Aug     , Hal Rosenstock wrote:
> > @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info {
> >       ib_net16_t lids_per_port;
> >       ib_net16_t enforce_cap;
> >       uint8_t flags;
> > +     uint8_t resvd;
> > +     ib_net16_t mcast_top;
> >  } PACK_SUFFIX ib_switch_info_t;
> >  #include <complib/cl_packoff.h>
> >  /************/
> > @@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record {
> >       ib_net16_t lid;
> >       uint16_t resv0;
> >       ib_switch_info_t switch_info;
> > -     uint8_t pad[3];
> > +     uint8_t pad[1];
>
> Why should be pad[1] here? In struct switch_info you are adding three
> bytes (resvd - 1 and mcast_top - 2), no?


Good catch. It was due to an initial version which didn't have the 16 bit
MFTTop alignment. Do you want a v2 patch for this ?

-- Hal

> Sasha

From sashak at voltaire.com  Sun Aug 30 05:26:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:26:54 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags: Fix IB network discovery
 from switch node.
In-Reply-To: <20090830121157.GI21909@me>
References: <4A9548AA.4020900@gmail.com>
 <20090830121157.GI21909@me>
Message-ID: <20090830122654.GK21909@me>

On 15:11 Sun 30 Aug     , Sasha Khapyorsky wrote:
> On 17:37 Wed 26 Aug     , Eli Dorfman (Voltaire) wrote:
> > Subject: [PATCH] Fix IB network discovery from switch node.
> > 
> > Signed-off-by: Eli Dorfman <elid at voltaire.com>
> 
> Applied. Thanks.

BTW, I needed to rebase the patch against master.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:30:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:30:21 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery.c: Indicate
 whether PortXmitWait counter is supported
In-Reply-To: <20090826161223.GA30257@comcast.net>
References: <20090826161223.GA30257@comcast.net>
Message-ID: <20090830123021.GL21909@me>

On 12:12 Wed 26 Aug     , Hal Rosenstock wrote:
> 
> Indicate extended v. (normal) port counters in output
> Also, some cosmetic formatting changes and commentary typo fixed
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:31:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:31:05 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/libibnetdisc: add missing
 '\n' to error message
In-Reply-To: <20090826102957.bed66987.weiny2@llnl.gov>
References: <20090826102957.bed66987.weiny2@llnl.gov>
Message-ID: <20090830123105.GM21909@me>

On 10:29 Wed 26 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Fri, 21 Aug 2009 15:01:00 -0700
> Subject: [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:32:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:32:24 +0300
Subject: [ofa-general] Re: [PATCH] libibnetdisc: add retract_dpath function
In-Reply-To: <20090826103142.660ac83b.weiny2@llnl.gov>
References: <20090826103142.660ac83b.weiny2@llnl.gov>
Message-ID: <20090830123224.GN21909@me>

On 10:31 Wed 26 Aug     , Ira Weiny wrote:
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 26 Aug 2009 09:25:00 -0700
> Subject: [PATCH] libibnetdisc: add retract_dpath function
> 
> 	When using combined routing some switches do not handle Hop Count of 0
> 	well.  Detect when the drpath count is 0 and return to lid based
> 	routing in this case.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:35:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:35:25 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support
	for MulticastFDBTop
In-Reply-To: <f0e08f230908300522m37187fbciee62dace5d767962@mail.gmail.com>
References: <20090826140450.GC19158@comcast.net> <20090830120455.GH21909@me>
	<f0e08f230908300522m37187fbciee62dace5d767962@mail.gmail.com>
Message-ID: <20090830123525.GO21909@me>

On 08:22 Sun 30 Aug     , Hal Rosenstock wrote:
> >
> > Why should be pad[1] here? In struct switch_info you are adding three
> > bytes (resvd - 1 and mcast_top - 2), no?
> 
> 
> Good catch. It was due to an initial version which didn't have the 16 bit
> MFTTop alignment. Do you want a v2 patch for this ?

Yes please.

Sasha


From sashak at voltaire.com  Sun Aug 30 05:43:16 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 15:43:16 +0300
Subject: [ofa-general] [PATCH] libibnetdisc: fix compilation warning
In-Reply-To: <20090826103142.660ac83b.weiny2@llnl.gov>
References: <20090826103142.660ac83b.weiny2@llnl.gov>
Message-ID: <20090830124316.GP21909@me>


The newly introduced retract_dpath() was declared as int but never
returned a value, which resulted in this warning:

src/ibnetdisc.c: In function ‘retract_dpath’:
src/ibnetdisc.c:186: warning: control reaches end of non-void function

Fixing this by declaring retract_dpath() as void.
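
For illustration only, a minimal sketch of the pattern, using hypothetical
stand-in types (sketch_portid, sketch_drpath and the use_lid_routing flag
are not the libibmad definitions). The function only updates the path in
place, so it has nothing to return and void is the natural signature:

```c
#include <assert.h>

/* Hypothetical stand-ins for ib_dr_path_t / ib_portid_t. */
struct sketch_drpath {
	int cnt;		/* number of direct-route hops */
};

struct sketch_portid {
	int lid;		/* 0 when no LID is known */
	struct sketch_drpath drpath;
	int use_lid_routing;	/* illustrative flag, not a real field */
};

/* Drops the last hop and, when no hops remain and a LID is known,
 * falls back to LID-based routing. Declared void because no value is
 * produced; an int signature without a return statement triggers
 * "control reaches end of non-void function". */
void sketch_retract_dpath(struct sketch_portid *path)
{
	path->drpath.cnt--;	/* restore path */
	if (path->drpath.cnt == 0 && path->lid)
		path->use_lid_routing = 1;
}
```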

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 0f6fc55..97e369c 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -175,7 +175,7 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
-static int retract_dpath(ib_portid_t * path)
+static void retract_dpath(ib_portid_t * path)
 {
 	path->drpath.cnt--;	/* restore path */
 	if (path->drpath.cnt == 0 && path->lid) {
-- 
1.6.4.1


From hal.rosenstock at gmail.com  Sun Aug 30 05:42:00 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 30 Aug 2009 08:42:00 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support 
	for MulticastFDBTop
In-Reply-To: <20090830115316.GF21909@me>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me>
Message-ID: <f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>

On 8/30/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 10:03 Wed 26 Aug     , Hal Rosenstock wrote:
> >
> > Add support for SwitchInfo:MulticastFDBTop
> > Added by MgtWG errata #4505-4508 and #4640
> >
> > If MulticastFDBTop is set to other than 0, only fetch MulticastForwardingTable
> > blocks up through MulticastFDBTop rather than MulticastFDBCap
> >
> > If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640)
> >
> > Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> > ---
> > diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
> > index 106c934..f3ebe56 100644
> > --- a/infiniband-diags/src/ibroute.c
> > +++ b/infiniband-diags/src/ibroute.c
> > @@ -1,5 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
> > + * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
> >       char *s;
> >       uint64_t nodeguid;
> >       uint32_t mod;
> > -     unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
> > +     unsigned block, i, j, e, nports, cap, top, chunks,
> > +              startblock, lastblock;
> >       int n = 0;
> >
> >       if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
> >               return s;
> >
> >       mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
> > +     mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
> >
> >       if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
> >               endlid = IB_MIN_MCAST_LID + cap - 1;
> > +     if (!dump_all && top && top < endlid) {
> > +             if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff)
>
> I don't understand what does this "top == 0xffff" check?


MFTTop is only allowed up to 0xfffe, so 0xffff is above the max, but I now
see that this gets checked later, where endlid > IB_MAX_MCAST_LID.


> Shouldn't be something like
>
>        (top > IB_MIN_MCAST_LID + cap - 1 && top != 0xbfff)
>
> instead?


Yes.


> > +                     IBWARN("illegal top mlid %x", top);
> > +             else
> > +                     endlid = top;
> > +     }
>
> And where is the case of "no entries" (top = 0xbfff) handled (as
> declared in change log)?


 This is handled by the block loop inside of dump_multicast_tables.
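
As a rough sketch of one reading of the rule discussed above (this is not
the ibroute code; the sketch_* and top_status names are hypothetical, and
SKETCH_MIN_MCAST_LID just mirrors the 0xc000 value of IB_MIN_MCAST_LID):
top == 0 means the field is not in use, 0xbfff means no entries, anything
else must fall within [0xc000, 0xc000 + cap - 1]:

```c
#include <assert.h>

#define SKETCH_MIN_MCAST_LID 0xc000u	/* same value as IB_MIN_MCAST_LID */

enum top_status { TOP_UNSET, TOP_EMPTY, TOP_VALID, TOP_ILLEGAL };

/* Classify a MulticastFDBTop value against MulticastFDBCap. */
enum top_status sketch_classify_top(unsigned top, unsigned cap)
{
	if (top == 0)
		return TOP_UNSET;	/* fall back to cap */
	if (top == SKETCH_MIN_MCAST_LID - 1)
		return TOP_EMPTY;	/* 0xbfff: no entries */
	if (top < SKETCH_MIN_MCAST_LID ||
	    top > SKETCH_MIN_MCAST_LID + cap - 1)
		return TOP_ILLEGAL;
	return TOP_VALID;
}

/* Return the last MLID worth fetching: top when it is usable and lower
 * than the current endlid, otherwise endlid unchanged. */
unsigned sketch_limit_endlid(unsigned endlid, unsigned top, unsigned cap)
{
	enum top_status st = sketch_classify_top(top, cap);

	if ((st == TOP_VALID || st == TOP_EMPTY) && top < endlid)
		return top;
	return endlid;
}
```

With top == 0xbfff the returned endlid ends up below the first multicast
LID, so a loop over [startlid, endlid] has nothing to fetch.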

-- Hal


> Sasha
>
> >
> >       if (!startlid)
> >               startlid = IB_MIN_MCAST_LID;
> > @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
> >               printf(" MLid\n");
> >       }
> >       if (ibverbose)
> > -             printf("Switch multicast mlid capability is %d\n", cap);
> > +             printf("Switch multicast mlid capability is %d top is %d\n",
> > +                    cap, top);
> >
> >       chunks = ALIGN(nports + 1, 16) / 16;
> >
> >

From hnrose at comcast.net  Sun Aug 30 05:51:50 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 30 Aug 2009 08:51:50 -0400
Subject: [ofa-general] [PATCHv2] opensm: Add infrastructure support for
	MulticastFDBTop
Message-ID: <20090830125150.GA2079@comcast.net>


Add support for SwitchInfo:MulticastFDBTop
Added by MgtWG errata #4505-4508

Add OpenSM infrastructure support to ib_types.h and osm_helper.c

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Removed erroneous pad byte left remaining in ib_switch_info_record_t

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index fe3f051..9e38a6d 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
@@ -4492,7 +4492,7 @@ typedef struct _ib_port_info {
 #define IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL (CL_HTON32(0x08000000))
 #define IB_PORT_CAP_RESV28        (CL_HTON32(0x10000000))
 #define IB_PORT_CAP_RESV29        (CL_HTON32(0x20000000))
-#define IB_PORT_CAP_RESV30        (CL_HTON32(0x40000000))
+#define IB_PORT_CAP_HAS_MCAST_FDB_TOP (CL_HTON32(0x40000000))
 #define IB_PORT_CAP_RESV31        (CL_HTON32(0x80000000))
 
 /****f* IBA Base: Types/ib_port_info_get_port_state
@@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info {
 	ib_net16_t lids_per_port;
 	ib_net16_t enforce_cap;
 	uint8_t flags;
+	uint8_t resvd;
+	ib_net16_t mcast_top;
 } PACK_SUFFIX ib_switch_info_t;
 #include <complib/cl_packoff.h>
 /************/
@@ -5908,7 +5910,6 @@ typedef struct _ib_switch_info_record {
 	ib_net16_t lid;
 	uint16_t resv0;
 	ib_switch_info_t switch_info;
-	uint8_t pad[3];
 } PACK_SUFFIX ib_switch_info_record_t;
 #include <complib/cl_packoff.h>
 
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 3692474..b5a29c2 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
@@ -764,9 +764,9 @@ static void dbg_get_capabilities_str(IN char *p_buf, IN const uint32_t buf_size,
 				&total_len) != IB_SUCCESS)
 			return;
 	}
-	if (p_pi->capability_mask & IB_PORT_CAP_RESV30) {
+	if (p_pi->capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
 		if (dbg_do_line(&p_local, buf_size, p_prefix_str,
-				"IB_PORT_CAP_RESV30\n",
+				"IB_PORT_CAP_HAS_MCAST_FDB_TOP\n",
 				&total_len) != IB_SUCCESS)
 			return;
 	}
@@ -1512,7 +1512,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log,
 			"\t\t\t\tlife_state..............0x%X\n"
 			"\t\t\t\tlids_per_port...........%u\n"
 			"\t\t\t\tpartition_enf_cap.......0x%X\n"
-			"\t\t\t\tflags...................0x%X\n",
+			"\t\t\t\tflags...................0x%X\n"
+			"\t\t\t\tmcast_top...............0x%X\n",
 			cl_ntoh16(p_si->lin_cap),
 			cl_ntoh16(p_si->rand_cap),
 			cl_ntoh16(p_si->mcast_cap),
@@ -1522,7 +1523,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log,
 			p_si->def_mcast_not_port,
 			p_si->life_state,
 			cl_ntoh16(p_si->lids_per_port),
-			cl_ntoh16(p_si->enforce_cap), p_si->flags);
+			cl_ntoh16(p_si->enforce_cap), p_si->flags,
+			cl_ntoh16(p_si->mcast_top));
 	}
 }
 

From hnrose at comcast.net  Sun Aug 30 06:25:49 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 30 Aug 2009 09:25:49 -0400
Subject: [ofa-general] [PATCHv2] infiniband-diags/ibroute: Add support for
	MulticastFDBTop
Message-ID: <20090830132549.GA13950@comcast.net>


Add support for SwitchInfo:MulticastFDBTop
Added by MgtWG errata #4505-4508 and 4640

If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable
blocks up through MulticastFDBTop rather than MulticastFDBCap

If MulticastFDBTop set to 0xbfff, this means no entries (per 4640)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Fixed top range check

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 106c934..1112b87 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 	char *s;
 	uint64_t nodeguid;
 	uint32_t mod;
-	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
+	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock,
+		 top;
 	int n = 0;
 
 	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
 		return s;
 
 	mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
+	mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
 
 	if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
 		endlid = IB_MIN_MCAST_LID + cap - 1;
+	if (!dump_all && top && top < endlid) {
+		if (top < IB_MIN_MCAST_LID - 1 || top > IB_MIN_MCAST_LID + cap - 1)
+			IBWARN("illegal top mlid %x", top);
+		else
+			endlid = top;
+	}
 
 	if (!startlid)
 		startlid = IB_MIN_MCAST_LID;
@@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 		printf(" MLid\n");
 	}
 	if (ibverbose)
-		printf("Switch multicast mlid capability is %d\n", cap);
+		printf("Switch multicast mlid capability is %d top is 0x%x\n",
+		       cap, top);
 
 	chunks = ALIGN(nports + 1, 16) / 16;
 

From sashak at voltaire.com  Sun Aug 30 07:16:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 17:16:48 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Add infrastructure support for
 MulticastFDBTop
In-Reply-To: <20090830125150.GA2079@comcast.net>
References: <20090830125150.GA2079@comcast.net>
Message-ID: <20090830141648.GA15546@me>

On 08:51 Sun 30 Aug     , Hal Rosenstock wrote:
> 
> Add support for SwitchInfo:MulticastFDBTop
> Added by MgtWG errata #4505-4508
> 
> Add OpenSM infrastructure support to ib_types.h and osm_helper.c
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From jackm at dev.mellanox.co.il  Sun Aug 30 08:35:13 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Aug 2009 18:35:13 +0300
Subject: [ofa-general] Fwd: OFED-1.5-alpha4 installation problem
In-Reply-To: <fde1733a0908260317p3f754642jfd28077f93c15bd4@mail.gmail.com>
References: <fde1733a0908260234s4fff91f4oe16e3186b708d5dc@mail.gmail.com>
	<fde1733a0908260317p3f754642jfd28077f93c15bd4@mail.gmail.com>
Message-ID: <200908301835.13593.jackm@dev.mellanox.co.il>

On Wednesday 26 August 2009 13:17, Sneha Mistry wrote:
> Hi,
> 
> I am new be to Infiniband and trying to install OFED-1.5-alpha4 on
> opensuse 10.3 .
> Kernel version is  2.6.26-2-686 .
1. OFED 1.5 is not supported on OpenSuse 10.3 -- it is supported on OpenSuse 11.
2. You are correct in that the release notes indicate that 10.3 is supported -- this was an
   oversight, which will be corrected in the next OFED 1.5 release candidate (the notes will
   then indicate support for OpenSuse 11, not 10.3).
3. The kernel you are running is evidently 2.6.22.5-31 (from the log below), not 2.6.26-2-686.
   This is indeed the OpenSuse 10.3 kernel.
> But it gives me error  message.
> 
> Failed to build ofa_kernel RPM
> See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log
> 
> I checked release note it says suse 10.3 is supported.
> 
> Output of uname -a is
> Linux linux-ljhr 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC
> i686 i686 i386 GNU/Linux
> 
> Last few line of log is as given.
> 
> make[1]: Entering directory `/usr/src/linux-2.6.22.5-31-obj/i386/default'
> make -C ../../../linux-2.6.22.5-31
> O=../linux-2.6.22.5-31-obj/i386/default modules
> make -C /usr/src/linux-2.6.22.5-31-obj/i386/default \
> 	KBUILD_SRC=/usr/src/linux-2.6.22.5-31 \
> 	KBUILD_EXTMOD="/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5" -f
> /usr/src/linux-2.6.22.5-31/Makefile modules
> test -e include/linux/autoconf.h -a -e include/config/auto.conf || (		\
> 	echo;								\
> 	echo "  ERROR: Kernel configuration is invalid.";		\
> 	echo "         include/linux/autoconf.h or include/config/auto.conf
> are missing.";	\
> 	echo "         Run 'make oldconfig && make prepare' on kernel src to
> fix it.";	\
> 	echo;								\
> 	/bin/false)
> mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions
> rm -f /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/*
> make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
> obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5
> make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
> obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband
> make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build
> obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
>   gcc -m32 -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d
>  -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.2.1/include
> -D__KERNEL__ \
> -D__OFED_BUILD__ \
> -include include/linux/autoconf.h \
> -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \
> -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.22_suse10_3/include/
> \
>  \
>  \
> -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \
> -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \
> -I/usr/local/include/scst \
> -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \
> -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \
> -Iinclude \
> -Iinclude2 -I/usr/src/linux-2.6.22.5-31/include \
> -I/usr/src/linux-2.6.22.5-31/arch//include \
>    -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
> -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs
> -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common
> -Os -pipe -msoft-float -mregparm=3 -freg-struct-return
> -mpreferred-stack-boundary=2 -march=i586 -mtune=generic -ffreestanding
> -maccumulate-outgoing-args -DCONFIG_AS_CFI=1
> -DCONFIG_AS_CFI_SIGNAL_FRAME=1
> -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-generic
> -Iinclude/asm-i386/mach-generic
> -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-default
> -Iinclude/asm-i386/mach-default -fomit-frame-pointer -g
> -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign
> -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)"
> -D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.tmp_addr.o
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c
> In file included from
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_addr.h:41,
>                  from
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:46:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
> function ‘ib_dma_mapping_error’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677:
> warning: passing argument 1 of ‘dma_mapping_error’ makes integer from
> pointer without a cast
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677:
> error: too many arguments to function ‘dma_mapping_error’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716:
> warning: ‘struct dma_attrs’ declared inside parameter list
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716:
> warning: its scope is only this definition or declaration, which is
> probably not what you want
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
> function ‘ib_dma_map_single_attrs’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1718:
> error: implicit declaration of function ‘dma_map_single_attrs’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1725:
> warning: ‘struct dma_attrs’ declared inside parameter list
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
> function ‘ib_dma_unmap_single_attrs’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1727:
> error: implicit declaration of function ‘dma_unmap_single_attrs’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1728:
> warning: ‘return’ with a value, in function returning void
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1803:
> warning: ‘struct dma_attrs’ declared inside parameter list
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
> function ‘ib_dma_map_sg_attrs’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1805:
> error: implicit declaration of function ‘dma_map_sg_attrs’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1811:
> warning: ‘struct dma_attrs’ declared inside parameter list
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In
> function ‘ib_dma_unmap_sg_attrs’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1813:
> error: implicit declaration of function ‘dma_unmap_sg_attrs’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
> In function ‘rdma_translate_ip’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
> error: ‘init_net’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
> error: (Each undeclared identifier is reported only once
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122:
> error: for each function it appears in.)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:123:
> error: too many arguments to function ‘ip_dev_find’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:33:
> error: macro "for_each_netdev" passed 2 arguments, but takes just 1
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:
> error: ‘for_each_netdev’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:
> error: expected ‘;’ before ‘{’ token
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
> In function ‘addr_send_arp’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
> error: ‘init_net’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
> warning: passing argument 2 of ‘ip_route_output_key’ from incompatible
> pointer type
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191:
> error: too many arguments to function ‘ip_route_output_key’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:206:
> error: too many arguments to function ‘ip6_route_output’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
> In function ‘addr4_resolve_remote’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
> error: ‘init_net’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
> warning: passing argument 2 of ‘ip_route_output_key’ from incompatible
> pointer type
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232:
> error: too many arguments to function ‘ip_route_output_key’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
> In function ‘addr6_resolve_remote’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281:
> error: ‘init_net’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281:
> error: too many arguments to function ‘ip6_route_output’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:
> In function ‘addr_resolve_local’:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368:
> error: ‘init_net’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368:
> error: too many arguments to function ‘ip_dev_find’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:372:
> error: implicit declaration of function ‘ipv4_is_zeronet’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:376:
> error: implicit declaration of function ‘ipv4_is_loopback’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:33:
> error: macro "for_each_netdev" passed 2 arguments, but takes just 1
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:
> error: ‘for_each_netdev’ undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:395:
> error: expected ‘;’ before ‘if’
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:410:
> error: implicit declaration of function ‘ipv6_addr_loopback’
> make[6]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o]
> Error 1
> make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core]
> Error 2
> make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband]
> Error 2
> make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2
> make[2]: *** [modules] Error 2
> make[1]: *** [modules] Error 2
> make[1]: Leaving directory `/usr/src/linux-2.6.22.5-31-obj/i386/default'
> make: *** [kernel] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.64786 (%build)
> 
> 
> RPM build errors:
>     user vlad does not exist - using root
>     group vlad does not exist - using root
>     user vlad does not exist - using root
>     group vlad does not exist - using root
>     Bad exit status from /var/tmp/rpm-tmp.64786 (%build)
> 
> Regards,
> sgm
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From sashak at voltaire.com  Sun Aug 30 08:36:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 18:36:19 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add
	support for MulticastFDBTop
In-Reply-To: <f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me>
	<f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>
Message-ID: <20090830153619.GB15546@me>

On 08:42 Sun 30 Aug     , Hal Rosenstock wrote:
> 
>  This is handled by the block loop inside of dump_multicast_tables.

Where? I don't see this. Shouldn't it show nothing ("no entries")
when top = 0xbfff and dump_all is not set?

Sasha


From sashak at voltaire.com  Sun Aug 30 08:40:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Aug 2009 18:40:38 +0300
Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/ibroute: Add support
 for MulticastFDBTop
In-Reply-To: <20090830132549.GA13950@comcast.net>
References: <20090830132549.GA13950@comcast.net>
Message-ID: <20090830154038.GC15546@me>

On 09:25 Sun 30 Aug     , Hal Rosenstock wrote:
> 
> Add support for SwitchInfo:MulticastFDBTop
> Added by MgtWG errata #4505-4508 and 4640
> 
> If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable
> blocks up through MulticastFDBTop rather than MulticastFDBCap
> 
> If MulticastFDBTop set to 0xbfff, this means no entries (per 4640)
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Fixed top range check
> 
> diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
> index 106c934..1112b87 100644
> --- a/infiniband-diags/src/ibroute.c
> +++ b/infiniband-diags/src/ibroute.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
> + * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
>  	char *s;
>  	uint64_t nodeguid;
>  	uint32_t mod;
> -	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
> +	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock,
> +		 top;
>  	int n = 0;
>  
>  	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
>  		return s;
>  
>  	mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
> +	mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
>  
>  	if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
>  		endlid = IB_MIN_MCAST_LID + cap - 1;
> +	if (!dump_all && top && top < endlid) {
> +		if (top < IB_MIN_MCAST_LID - 1 || top > IB_MIN_MCAST_LID + cap - 1)

Looking at this more, it seems to me that the test
'top > IB_MIN_MCAST_LID + cap - 1' will never be true (and is actually not
needed): it is evaluated only when top < endlid, and endlid was clamped
one line earlier to be '<= IB_MIN_MCAST_LID + cap - 1'.

Sasha

> +			IBWARN("illegal top mlid %x", top);
> +		else
> +			endlid = top;
> +	}
>  
>  	if (!startlid)
>  		startlid = IB_MIN_MCAST_LID;
> @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
>  		printf(" MLid\n");
>  	}
>  	if (ibverbose)
> -		printf("Switch multicast mlid capability is %d\n", cap);
> +		printf("Switch multicast mlid capability is %d top is 0x%x\n",
> +		       cap, top);
>  
>  	chunks = ALIGN(nports + 1, 16) / 16;
>  
> 
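
For readers following the review comment above, here is a minimal sketch of
the range logic as a standalone function. The helper name effective_endlid()
and the fprintf() warning are illustrative assumptions, not part of ibroute;
IB_MIN_MCAST_LID is given the value 0xc000, the first multicast LID. The
assert() marks the invariant that makes the upper-bound test redundant.

```c
#include <assert.h>
#include <stdio.h>

#define IB_MIN_MCAST_LID 0xc000	/* first multicast LID */

/*
 * Hypothetical helper mirroring the range logic in the patch: returns the
 * last MLID to dump, given the switch capacity and MulticastFDBTop.
 */
static unsigned effective_endlid(unsigned endlid, unsigned cap, unsigned top,
				 int dump_all)
{
	unsigned max_lid = IB_MIN_MCAST_LID + cap - 1;

	/* Clamp the requested end to what the switch can hold. */
	if (!endlid || endlid > max_lid)
		endlid = max_lid;

	if (!dump_all && top && top < endlid) {
		/*
		 * Here top < endlid <= max_lid, so "top > max_lid" cannot
		 * hold; only the lower bound needs checking.
		 */
		assert(top <= max_lid);
		if (top < IB_MIN_MCAST_LID - 1)
			fprintf(stderr, "illegal top mlid %x\n", top);
		else
			endlid = top;
	}
	return endlid;
}
```

With cap = 1024, a top of 0xc00f shortens the dump to 0xc00f, while a top
above the capacity leaves the clamped value 0xc3ff untouched because the
outer condition is already false.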


From jackm at dev.mellanox.co.il  Sun Aug 30 08:45:38 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Aug 2009 18:45:38 +0300
Subject: [ofa-general] Number of devices returned by ibv_get_device_list()
In-Reply-To: <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com>
References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
	<122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com>
Message-ID: <200908301845.38642.jackm@dev.mellanox.co.il>

On Wednesday 26 August 2009 02:03, MANIKANTAN KALAIYA wrote:
> Resending to the mailing list...
> 
> We have Ofed1.3.1 installed, one of the sub packages is libibverbs version 1.1.1. We have a small program that lists the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.

libibverbs reads the number of devices ONCE, at calling process startup (as part of its initialization).
To get a new device count, you need to restart your program.

- Jack

> The system has two IB cards, the value returned by ibv_get_device_list() in 'num_devices' is two, as expected.
> 
> However, when we disable one of the cards using the modprobe command, the program continues to return two cards present (monitoring is continuous in a while loop).
> Killing and restarting the sample test process results in reporting correct number of IB cards available (returns one after it is restarted). One of the prior versions was known to report the correct number of IB cards without requiring to restart the program.
> 
> We would like to determine the number of cards present without having to go through a restart. Any inputs on this behavior is appreciated.
> 
> modprobe command - "sudo modprobe -r ib_mthca"
> 
> Test program:
> =================================================
> #include <stdio.h>
> #include <unistd.h>	/* sleep() */
> #include <infiniband/verbs.h>
> 
> int main(int argc, char **argv)
> {
>     int ret, num_devices;
>     struct ibv_device      **dev_list;
> 
>     while(1) {
> 
>         dev_list = ibv_get_device_list(&num_devices);
> 
>         if (num_devices != 0) {
>             printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
>         }
>         else {
>             printf("IB ADAPTER UNAVAILABLE\n");
>         }
>         sleep(2);
>         ibv_free_device_list(dev_list);
>     }
> 
>     return(0);
> }
> =================================================
> 
> Thanks,
> Mani.
> 
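
One possible workaround, not proposed in the thread, is to count devices
straight from sysfs instead of going through the cached libibverbs list. The
sketch below assumes sysfs is mounted at /sys and that the kernel lists one
entry per IB device under /sys/class/infiniband; the helper name
count_ib_devices() is hypothetical.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count the entries in a sysfs class directory such as
 * /sys/class/infiniband. Returns -1 if the directory cannot be opened
 * (for example, when no IB core module is loaded).
 */
static int count_ib_devices(const char *class_dir)
{
	DIR *dir = opendir(class_dir);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		/* Skip "." and ".."; every other entry counts as one device. */
		if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
			continue;
		count++;
	}
	closedir(dir);
	return count;
}
```

Calling count_ib_devices("/sys/class/infiniband") inside the monitoring loop
would re-read the directory on every pass, so a driver unloaded with
modprobe -r should show up without restarting the process.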


From jackm at dev.mellanox.co.il  Sun Aug 30 08:47:58 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Aug 2009 18:47:58 +0300
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A90FAD8.6000701@mellanox.co.il>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il>
Message-ID: <200908301847.59143.jackm@dev.mellanox.co.il>

On Sunday 23 August 2009 11:16, Tziporet Koren wrote:
> Jeremy Enos wrote:
> > Coming up on a year of Fedora 10 GA...  Fedora 9 no longer maintained. 
> > No OFED support for FC10 yet creates a tough spot if trying to stay
> > secure.  Is there *any* version (1.5, etc) that will even build on FC10? 
> > thx-
> >
> >     Jeremy
> >
> >
> >   
> 
> I think OFED 1.5 might work on it but not sure. Which kernel version 
> FC10 use?
> In general OFED 1.5 supports FC11
Actually, it supports FC12 (kernel 2.6.29).
- Jack

> Tziporet
> 
> 


From jackm at dev.mellanox.co.il  Sun Aug 30 08:56:33 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Aug 2009 18:56:33 +0300
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <200908301847.59143.jackm@dev.mellanox.co.il>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il>
	<200908301847.59143.jackm@dev.mellanox.co.il>
Message-ID: <200908301856.33259.jackm@dev.mellanox.co.il>

On Sunday 30 August 2009 18:47, Jack Morgenstein wrote:
> On Sunday 23 August 2009 11:16, Tziporet Koren wrote:
> > Jeremy Enos wrote:
> > > Coming up on a year of Fedora 10 GA...  Fedora 9 no longer maintained. 
> > > No OFED support for FC10 yet creates a tough spot if trying to stay
> > > secure.  Is there *any* version (1.5, etc) that will even build on FC10? 
> > > thx-
> > >
> > >     Jeremy
> > >
> > >
> > >   
> > 
> > I think OFED 1.5 might work on it but not sure. Which kernel version 
> > FC10 use?
> > In general OFED 1.5 supports FC11
> Actually, it supports FC12 (kernel 2.6.29).
We had originally planned to support FC11 -- however, in the interim, FC12 was
released -- based on kernel 2.6.29, which is supported -- so we decided to support
FC12 instead.

-Jack

> - Jack
> 
> > Tziporet
> > 
> > 
> 


From hal.rosenstock at gmail.com  Sun Aug 30 09:35:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 30 Aug 2009 12:35:54 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support 
	for MulticastFDBTop
In-Reply-To: <20090830153619.GB15546@me>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me>
	<f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>
	<20090830153619.GB15546@me>
Message-ID: <f0e08f230908300935k58e9e82dp543bc0a3810d20c3@mail.gmail.com>

On 8/30/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 08:42 Sun 30 Aug     , Hal Rosenstock wrote:
> >
> >  This is handled by the block loop inside of dump_multicast_tables.
>
> Where? I don't see this. Should not it to show nothing ("no entries")
> when top = 0xbfff and dump_all is not set?


Doesn't the loop:
for (block = startblock; block <= lastblock; block++)
terminate without reading any blocks? So it shows no entries. Do you mean it
should print "no entries"?

-- Hal

> Sasha
>
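
The point about the loop can be checked with a small sketch. It assumes the
block indices are obtained by dividing the LIDs by the number of MLIDs per
MulticastForwardingTable block (32); the helper name blocks_to_read() and the
MLIDS_PER_BLOCK macro are illustrative, not taken from ibroute.

```c
#define IB_MIN_MCAST_LID 0xc000	/* first multicast LID */
#define MLIDS_PER_BLOCK 32	/* assumed: 32 MLIDs per MFT block */

/* Hypothetical helper: number of MFT blocks the dump loop would fetch. */
static unsigned blocks_to_read(unsigned startlid, unsigned endlid)
{
	unsigned startblock = startlid / MLIDS_PER_BLOCK;
	unsigned lastblock = endlid / MLIDS_PER_BLOCK;
	unsigned block, n = 0;

	/* Same loop shape as quoted above; the body just counts iterations. */
	for (block = startblock; block <= lastblock; block++)
		n++;
	return n;
}
```

With startlid = 0xc000 and endlid = 0xbfff, startblock is 1536 and lastblock
is 1535, so the loop body never runs and no blocks are read.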

From hnrose at comcast.net  Sun Aug 30 09:32:07 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Sun, 30 Aug 2009 12:32:07 -0400
Subject: [ofa-general] [PATCHv3] infiniband-diags/ibroute: Add support for
	MulticastFDBTop
Message-ID: <20090830163207.GA17406@comcast.net>


Add support for SwitchInfo:MulticastFDBTop
Added by MgtWG errata #4505-4508 and 4640

If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable
blocks up through MulticastFDBTop rather than MulticastFDBCap

If MulticastFDBTop set to 0xbfff, this means no entries (per 4640)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v2:
Removed redundant clause in top range check

Changes since v1:
Fixed top range check

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 106c934..00df1ec 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2009 Mellanox Technologies LTD.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 	char *s;
 	uint64_t nodeguid;
 	uint32_t mod;
-	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock;
+	unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock,
+		 top;
 	int n = 0;
 
 	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
 		return s;
 
 	mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap);
+	mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top);
 
 	if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1)
 		endlid = IB_MIN_MCAST_LID + cap - 1;
+	if (!dump_all && top && top < endlid) {
+		if (top < IB_MIN_MCAST_LID - 1)
+			IBWARN("illegal top mlid %x", top);
+		else
+			endlid = top;
+	}
 
 	if (!startlid)
 		startlid = IB_MIN_MCAST_LID;
@@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid,
 		printf(" MLid\n");
 	}
 	if (ibverbose)
-		printf("Switch multicast mlid capability is %d\n", cap);
+		printf("Switch multicast mlid capability is %d top is 0x%x\n",
+		       cap, top);
 
 	chunks = ALIGN(nports + 1, 16) / 16;
 

From jenos at ncsa.uiuc.edu  Sun Aug 30 10:41:01 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Sun, 30 Aug 2009 12:41:01 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <200908301856.33259.jackm@dev.mellanox.co.il>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il>
	<200908301847.59143.jackm@dev.mellanox.co.il>
	<200908301856.33259.jackm@dev.mellanox.co.il>
Message-ID: <4A9AB9AD.80803@ncsa.uiuc.edu>

Is it supposed to support FC10 as well then, or just fc12?  Actually- it 
wouldn't matter if I couldn't use 1.5.  I just want *some* version that 
supports FC10.  Is there one?
thx-

    Jeremy

Jack Morgenstein wrote:
> On Sunday 30 August 2009 18:47, Jack Morgenstein wrote:
>   
>> On Sunday 23 August 2009 11:16, Tziporet Koren wrote:
>>     
>>> Jeremy Enos wrote:
>>>       
>>>> Coming up on a year of Fedora 10 GA...  Fedora 9 no longer maintained. 
>>>> No OFED support for FC10 yet creates a tough spot if trying to stay
>>>> secure.  Is there *any* version (1.5, etc) that will even build on FC10? 
>>>> thx-
>>>>
>>>>     Jeremy
>>>>
>>>>
>>>>   
>>>>         
>>> I think OFED 1.5 might work on it but not sure. Which kernel version 
>>> FC10 use?
>>> In general OFED 1.5 supports FC11
>>>       
>> Actually, it supports FC12 (kernel 2.6.29).
>>     
> We had originally planned to support FC11 -- however, in the interim, FC12 was
> released -- based on kernel 2.6.29, which is supported -- so we decided to support
> FC12 instead.
>
> -Jack
>
>   
>> - Jack
>>
>>     
>>> Tziporet
>>>
>>>
>>>       
>
>   


From gopalakk at cse.ohio-state.edu  Sun Aug 30 19:38:10 2009
From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan)
Date: Sun, 30 Aug 2009 22:38:10 -0400
Subject: [ofa-general] Number of devices returned by ibv_get_device_list()
In-Reply-To: <200908301845.38642.jackm@dev.mellanox.co.il>
References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
	<122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com>
	<200908301845.38642.jackm@dev.mellanox.co.il>
Message-ID: <92eddfb50908301938w533df6e9vb4579a538209d97@mail.gmail.com>

Hi Jack.

On Sun, Aug 30, 2009 at 11:45 AM, Jack Morgenstein <jackm at dev.mellanox.co.il> wrote:

> On Wednesday 26 August 2009 02:03, MANIKANTAN KALAIYA wrote:
> > Resending to the mailing list...
> >
> > We have Ofed1.3.1 installed, one of the sub packages is libibverbs version 1.1.1. We have a small program that lists the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.
>
> libibverbs reads the number of devices ONCE, at calling process startup (as part of its initialization).
> To get a new device count, you need to restart your program.
>
Does this mean PCI Hotplug is not supported for Infiniband Adapters?


>
> - Jack
>
> > The system has two IB cards, the value returned by ibv_get_device_list() in 'num_devices' is two, as expected.
> >
> > However, when we disable one of the cards using the modprobe command, the program continues to return two cards present (monitoring is continuous in a while loop).
> > Killing and restarting the sample test process results in reporting correct number of IB cards available (returns one after it is restarted). One of the prior versions was known to report the correct number of IB cards without requiring to restart the program.
> >
> > We would like to determine the number of cards present without having to go through a restart. Any inputs on this behavior is appreciated.
> >
> > modprobe command - "sudo modprobe -r ib_mthca"
> >
> > Test program:
> > =================================================
> > #include <stdio.h>
> > #include <unistd.h>	/* sleep() */
> > #include <infiniband/verbs.h>
> >
> > int main(int argc, char **argv)
> > {
> >     int ret, num_devices;
> >     struct ibv_device      **dev_list;
> >
> >     while(1) {
> >
> >         dev_list = ibv_get_device_list(&num_devices);
> >
> >         if (num_devices != 0) {
> >             printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
> >         }
> >         else {
> >             printf("IB ADAPTER UNAVAILABLE\n");
> >         }
> >         sleep(2);
> >         ibv_free_device_list(dev_list);
> >     }
> >
> >     return(0);
> > }
> > =================================================
> >
> > Thanks,
> > Mani.
> >

From ogerlitz at voltaire.com  Mon Aug 31 00:49:54 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 31 Aug 2009 10:49:54 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <4A96ABCE.2030204@Voltaire.COM>
References: <20090821000431.GA5713@obsidianresearch.com>	<4A94FB67.6050600@voltaire.com>
	<20090826180457.GR406@obsidianresearch.com>
	<4A96ABCE.2030204@Voltaire.COM>
Message-ID: <4A9B80A2.5010602@voltaire.com>

Moni Shoua wrote:
> Jason Gunthorpe wrote:
>> Is this true? That is pretty ugly, but probably manageable..
> Unfortunately, losing routes is a side effect of closing the device
Moni, I tend to agree with Jason that this is on the one hand ugly but on
the other hand manageable. Maybe you can send a patch to the kernel bonding
document stating that non-trivial routes must be re-set for ipoib bonds after
their initial establishment (it will save you some support cases...)

Or.


From ogerlitz at voltaire.com  Mon Aug 31 00:52:38 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 31 Aug 2009 10:52:38 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR	query
	to	QoS level with pkey
In-Reply-To: <4A94EBB5.7050107@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>	<4A90DC04.3020906@voltaire.com>
	<4A910609.3040305@dev.mellanox.co.il>	<4A94DE99.5050308@voltaire.com>
	<4A94EBB5.7050107@dev.mellanox.co.il>
Message-ID: <4A9B8146.7080800@voltaire.com>

Yevgeny Kliteynik wrote:
> Nope, just the other way around.

Yevgeny, we want to do some testing/validation to better understand
what's going on here; we will get back to you soon.

Or.


From jackm at dev.mellanox.co.il  Mon Aug 31 02:17:43 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 31 Aug 2009 12:17:43 +0300
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A9AB9AD.80803@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu>
	<200908301856.33259.jackm@dev.mellanox.co.il>
	<4A9AB9AD.80803@ncsa.uiuc.edu>
Message-ID: <200908311217.43954.jackm@dev.mellanox.co.il>

> >>> I think OFED 1.5 might work on it but not sure. Which kernel version 
> >>> FC10 use?
> >>> In general OFED 1.5 supports FC11
> >>>       
> >> Actually, it supports FC12 (kernel 2.6.29).
> >>     
> > We had originally planned to support FC11 -- however, in the interim, FC12 was
> > released -- based on kernel 2.6.29, which is supported -- so we decided to support
> > FC12 instead.
> >
> > -Jack
 Actually, Tziporet is correct.  FC11 is built on kernel 2.6.29.4-167.
 OFED 1.5 supports FC11 (I confused this with OpenSuse) -- No FC12 as yet.

 There is no support for FC10.

sorry about the mistake.
-Jack


From vlad at lists.openfabrics.org  Mon Aug 31 03:03:41 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 31 Aug 2009 03:03:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090831-0200 daily build status
Message-ID: <20090831100341.8430AE30149@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From hnrose at comcast.net  Mon Aug 31 06:39:34 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 31 Aug 2009 09:39:34 -0400
Subject: [ofa-general] [PATCHv2] opensm: Parallelize (Stripe) MFT sets across
	switches
Message-ID: <20090831133934.GA10155@comcast.net>


Similar to previous patch to "Parallelize (Stripe) LFT sets across switches".
Currently, MADs are pipelined to a single switch first which effectively
serializes these requests. This patch pipelines the MFT set MADs across
switches first (before cycling to the next MFT block) so that multiple
switches can be responding concurrently. Speedup is dependent on number
of MFT blocks in use (number of MLIDs) which is dependent on the number
of multicast groups.
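
The resulting send order can be illustrated with a simplified, standalone
simulation. This is only a sketch under stated assumptions: struct sw_state,
struct send and stripe_mft() are hypothetical stand-ins for osm_switch_t and
the MAD send path, and a "send" is merely recorded in an array.

```c
struct sw_state {		/* hypothetical stand-in for osm_switch_t */
	int max_block;		/* highest MFT block in use on this switch */
	unsigned max_position;	/* highest port-mask position */
	int block_num;		/* next block to send */
	unsigned position;	/* next position to send */
};

/* Record of one simulated "set MFT block" request. */
struct send {
	int sw;
	int block;
	unsigned position;
};

/* Returns the number of sends recorded in out[] (at most max_out). */
static int stripe_mft(struct sw_state *sw, int nsw, struct send *out,
		      int max_out)
{
	int i, block, notdone, n = 0, max_block = -1;

	/* Reset per-switch cursors and find the highest block in use. */
	for (i = 0; i < nsw; i++) {
		sw[i].block_num = 0;
		sw[i].position = 0;
		if (sw[i].max_block > max_block)
			max_block = sw[i].max_block;
	}

	/* For each block, cycle over all switches before moving on. */
	for (block = 0; block <= max_block; block++) {
		do {
			notdone = 0;
			for (i = 0; i < nsw; i++) {
				if (sw[i].block_num != block)
					continue;
				notdone = 1;
				/* Only blocks in use on this switch are sent. */
				if (block <= sw[i].max_block && n < max_out) {
					out[n].sw = i;
					out[n].block = block;
					out[n].position = sw[i].position;
					n++;
				}
				if (++sw[i].position > sw[i].max_position) {
					sw[i].position = 0;
					sw[i].block_num++;
				}
			}
		} while (notdone);
	}
	return n;
}
```

For two switches, one using blocks 0-1 at a single position and one using
block 0 at two positions, the recorded order alternates between the switches
for block 0 before any block 1 request is issued.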

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
Changes since v1:
Fixed loop which stripes MFT block across switches
Changed routine name from mcast_mgr_set_tbl to mcast_mgr_set_mft_block
and added block_num and position parameters
Consolidate code into mcast_mgr_set_mftables

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 7ce28c5..e281842 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -103,6 +103,8 @@ typedef struct osm_switch {
 	uint8_t *lft;
 	uint8_t *new_lft;
 	osm_mcast_tbl_t mcast_tbl;
+	uint32_t mft_block_num;
+	uint32_t mft_position;
 	unsigned endport_links;
 	unsigned need_update;
 	void *priv;
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index 4dbbaa0..708d837 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  *
@@ -321,16 +321,14 @@ static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm,
 
 /**********************************************************************
  **********************************************************************/
-static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
+static int mcast_mgr_set_mft_block(osm_sm_t * sm, IN osm_switch_t * p_sw,
+				   uint32_t block_num, uint32_t position)
 {
 	osm_node_t *p_node;
 	osm_dr_path_t *p_path;
-	osm_madw_context_t mad_context;
+	osm_madw_context_t context;
 	ib_api_status_t status;
-	uint32_t block_id_ho = 0;
-	int16_t block_num = 0;
-	uint32_t position = 0;
-	uint32_t max_position;
+	uint32_t block_id_ho;
 	osm_mcast_tbl_t *p_tbl;
 	ib_net16_t block[IB_MCAST_BLOCK_SIZE];
 	int ret = 0;
@@ -353,23 +351,25 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
 	   configuration.
 	 */
 
-	mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node);
-	mad_context.mft_context.set_method = TRUE;
+	context.mft_context.node_guid = osm_node_get_node_guid(p_node);
+	context.mft_context.set_method = TRUE;
 
 	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
-	max_position = p_tbl->max_position;
 
-	while (osm_mcast_tbl_get_block(p_tbl, block_num,
-				       (uint8_t) position, block)) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-			"Writing MFT block 0x%X\n", block_id_ho);
+	if (osm_mcast_tbl_get_block(p_tbl, block_num,
+				    (uint8_t) position, block)) {
 
 		block_id_ho = block_num + (position << 28);
 
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Writing MFT block %u position %u to switch 0x%" PRIx64 "\n",
+			block_num, position,
+			cl_ntoh64(context.lft_context.node_guid));
+
 		status = osm_req_set(sm, p_path, (void *)block, sizeof(block),
 				     IB_MAD_ATTR_MCAST_FWD_TBL,
 				     cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
-				     &mad_context);
+				     &context);
 
 		if (status != IB_SUCCESS) {
 			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: "
@@ -377,11 +377,6 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw)
 				ib_get_err_str(status));
 			ret = -1;
 		}
-
-		if (++position > max_position) {
-			position = 0;
-			block_num++;
-		}
 	}
 
 	OSM_LOG_EXIT(sm->p_log);
@@ -1071,9 +1066,55 @@ Exit:
 
 /**********************************************************************
  **********************************************************************/
-int osm_mcast_mgr_process(osm_sm_t * sm)
+static int mcast_mgr_set_mftables(osm_sm_t * sm)
 {
+	cl_qmap_t *p_sw_tbl = &sm->p_subn->sw_guid_tbl;
 	osm_switch_t *p_sw;
+	osm_mcast_tbl_t *p_tbl;
+	int block_notdone, ret = 0;
+	int16_t block_num, max_block = -1;
+
+	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
+		p_sw->mft_block_num = 0;
+		p_sw->mft_position = 0;
+		p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
+		if (osm_mcast_tbl_get_max_block_in_use(p_tbl) > max_block)
+			max_block = osm_mcast_tbl_get_max_block_in_use(p_tbl);
+		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
+	}
+
+	/* Stripe the MFT blocks across the switches */
+	for (block_num = 0; block_num <= max_block; block_num++) {
+		block_notdone = 1;
+		while (block_notdone) {		
+			block_notdone = 0;
+			p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+			while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
+				if (p_sw->mft_block_num == block_num) {
+					block_notdone = 1;
+					if (mcast_mgr_set_mft_block(sm, p_sw,
+								    p_sw->mft_block_num,
+								    p_sw->mft_position))
+						ret = -1;
+					p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
+					if (++p_sw->mft_position > p_tbl->max_position) {
+						p_sw->mft_position = 0;
+						p_sw->mft_block_num++;
+					}
+				}
+				p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
+			}
+		}
+	}
+
+	return ret; 
+}
+
+/**********************************************************************
+ **********************************************************************/
+int osm_mcast_mgr_process(osm_sm_t * sm)
+{
 	cl_qmap_t *p_sw_tbl;
 	cl_qlist_t *p_list = &sm->mgrp_list;
 	osm_mgrp_t *p_mgrp;
@@ -1112,12 +1153,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 	/*
 	   Walk the switches and download the tables for each.
 	 */
-	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
-	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
-		if (mcast_mgr_set_tbl(sm, p_sw))
-			ret = -1;
-		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
-	}
+	ret = mcast_mgr_set_mftables(sm);
 
 	while (!cl_is_qlist_empty(p_list)) {
 		cl_list_item_t *p = cl_qlist_remove_head(p_list);
@@ -1139,8 +1175,6 @@ exit:
 int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 {
 	cl_qlist_t *p_list = &sm->mgrp_list;
-	osm_switch_t *p_sw;
-	cl_qmap_t *p_sw_tbl;
 	osm_mgrp_t *p_mgrp;
 	ib_net16_t mlid;
 	osm_mcast_mgr_ctxt_t *ctx;
@@ -1192,13 +1226,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 	/*
 	   Walk the switches and download the tables for each.
 	 */
-	p_sw_tbl = &sm->p_subn->sw_guid_tbl;
-	p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
-	while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
-		if (mcast_mgr_set_tbl(sm, p_sw))
-			ret = -1;
-		p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
-	}
+	ret = mcast_mgr_set_mftables(sm);
 
 	osm_dump_mcast_routes(sm->p_subn->p_osm);
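
The striping order used by mcast_mgr_set_mftables() above can be illustrated with a small standalone sketch. This is not OpenSM code: struct toy_switch, struct mft_write and stripe_mft_blocks() are hypothetical stand-ins, and "sending" a block is reduced to recording a (switch, block, position) triple.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-switch state touched by the patch. */
struct toy_switch {
	int max_block;		/* highest MFT block in use on this switch */
	int max_position;	/* highest port-mask position on this switch */
	int block_num;		/* cursor: next block to send */
	int position;		/* cursor: next position to send */
};

/* One recorded "MFT set" request. */
struct mft_write {
	size_t sw;
	int block;
	int position;
};

/*
 * Visit the switches round-robin so that each switch receives one
 * block/position write per pass, rather than one switch receiving its
 * whole table before the next switch is touched.
 * Returns the number of writes recorded in out[] (at most out_len).
 */
static size_t stripe_mft_blocks(struct toy_switch *sw, size_t n_sw,
				struct mft_write *out, size_t out_len)
{
	size_t i, n = 0;
	int block, pending, max_block = -1;

	/* Reset the cursors and find the highest block used anywhere. */
	for (i = 0; i < n_sw; i++) {
		sw[i].block_num = 0;
		sw[i].position = 0;
		if (sw[i].max_block > max_block)
			max_block = sw[i].max_block;
	}

	for (block = 0; block <= max_block; block++) {
		do {
			pending = 0;
			for (i = 0; i < n_sw; i++) {
				if (sw[i].block_num != block)
					continue;
				pending = 1;
				/* Only record blocks this switch really has. */
				if (block <= sw[i].max_block && n < out_len) {
					out[n].sw = i;
					out[n].block = block;
					out[n].position = sw[i].position;
					n++;
				}
				/* Advance the cursor exactly as the patch does. */
				if (++sw[i].position > sw[i].max_position) {
					sw[i].position = 0;
					sw[i].block_num++;
				}
			}
		} while (pending);
	}
	return n;
}
```

With two toy switches, one using a single block with two positions and one using two blocks with a single position, the recorded writes alternate between the switches instead of finishing one switch first.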
 

From sashak at voltaire.com  Mon Aug 31 09:44:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 31 Aug 2009 19:44:56 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add
	support for MulticastFDBTop
In-Reply-To: <f0e08f230908300935k58e9e82dp543bc0a3810d20c3@mail.gmail.com>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me>
	<f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>
	<20090830153619.GB15546@me>
	<f0e08f230908300935k58e9e82dp543bc0a3810d20c3@mail.gmail.com>
Message-ID: <20090831164456.GA24631@me>

On 12:35 Sun 30 Aug     , Hal Rosenstock wrote:
> 
> Doesn't the loop:
> for (block = startblock; block <= lastblock; block++)
> terminate without reading any blocks? So it shows no entries.

Sorry, I still don't understand. Let's suppose that top = 0xbfff,
cap = 1024, startlid = 0xc000, endlid = 0xc030 and dump_all = 0.
What will prevent the MFT entries from being printed? Won't this ignore
the value of 'top', or am I missing something?

> Do you mean to
> print "no entries" ?

No, of course not that :)

Sasha


From sashak at voltaire.com  Mon Aug 31 09:45:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 31 Aug 2009 19:45:20 +0300
Subject: [ofa-general] Re: [PATCHv3] infiniband-diags/ibroute: Add support
 for MulticastFDBTop
In-Reply-To: <20090830163207.GA17406@comcast.net>
References: <20090830163207.GA17406@comcast.net>
Message-ID: <20090831164520.GB24631@me>

On 12:32 Sun 30 Aug     , Hal Rosenstock wrote:
> 
> Add support for SwitchInfo:MulticastFDBTop
> Added by MgtWG errata #4505-4508 and 4640
> 
> If MulticastFDBTop is set to a value other than 0, only fetch
> MulticastForwardingTable blocks up through MulticastFDBTop rather than
> MulticastFDBCap.
> 
> If MulticastFDBTop is set to 0xbfff, this means no entries (per 4640).
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Mon Aug 31 10:42:44 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 31 Aug 2009 13:42:44 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support 
	for MulticastFDBTop
In-Reply-To: <20090831164456.GA24631@me>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me>
	<f0e08f230908300542p3b4f652eh63aa20a9d571ca82@mail.gmail.com>
	<20090830153619.GB15546@me>
	<f0e08f230908300935k58e9e82dp543bc0a3810d20c3@mail.gmail.com>
	<20090831164456.GA24631@me>
Message-ID: <f0e08f230908311042xee772den3a5ae0f8dbc4ae0a@mail.gmail.com>

On 8/31/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 12:35 Sun 30 Aug     , Hal Rosenstock wrote:
> >
> > Doesn't the loop:
> > for (block = startblock; block <= lastblock; block++)
> > terminate without reading any blocks? So it shows no entries.
>
> Sorry, I still don't understand. Let's suppose that top = 0xbfff,
> cap = 1024, startlid = 0xc000, endlid = 0xc030 and dump_all = 0.
> What will prevent the MFT entries from being printed? Won't this ignore
> the value of 'top', or am I missing something?


Wouldn't endlid be set to top in this case (since top < endlid)? It is
endlid that gets ignored here, not top.
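
The range clamping under discussion can be sketched roughly as follows. This is a minimal illustration, not the ibroute source: mft_blocks_to_read() and both macros are hypothetical names, and it assumes that multicast LIDs start at 0xc000 and that each MFT block covers 32 MLIDs.

```c
#define MIN_MCAST_LID	0xc000u	/* assumed first multicast LID */
#define MFT_BLOCK_SIZE	32u	/* assumed MLIDs per MFT block */

/*
 * Hypothetical helper: work out which MFT blocks to fetch for the
 * requested LID range. Returns the number of blocks to read and
 * stores the first block number in *first_block; returns 0 when the
 * range is empty after clamping.
 */
static unsigned mft_blocks_to_read(unsigned startlid, unsigned endlid,
				   unsigned cap, unsigned top, int dump_all,
				   unsigned *first_block)
{
	unsigned max_lid = MIN_MCAST_LID + cap - 1;

	/* Never look past what the switch can hold (MulticastFDBCap). */
	if (!endlid || endlid > max_lid)
		endlid = max_lid;

	/* A nonzero MulticastFDBTop lowers the end of the range further. */
	if (!dump_all && top && top < endlid)
		endlid = top;

	if (startlid < MIN_MCAST_LID)
		startlid = MIN_MCAST_LID;

	/* top = 0xbfff ends up here: nothing is left to read. */
	if (startlid > endlid)
		return 0;

	*first_block = (startlid - MIN_MCAST_LID) / MFT_BLOCK_SIZE;
	return (endlid - MIN_MCAST_LID) / MFT_BLOCK_SIZE - *first_block + 1;
}
```

With the values from this thread (top = 0xbfff, cap = 1024, startlid = 0xc000, endlid = 0xc030, dump_all = 0) the sketch clamps endlid down to 0xbfff and reads no blocks; with top = 0 it reads two blocks.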

-- Hal

> > Do you mean to
> > print "no entries" ?
>
> No, of course not that :)
>
> Sasha
>

From hnrose at comcast.net  Mon Aug 31 12:21:34 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 31 Aug 2009 15:21:34 -0400
Subject: [ofa-general] [PATCH] osmtest: Add SA get PathRecord stress test
Message-ID: <20090831192134.GA12094@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/man/osmtest.8 b/opensm/man/osmtest.8
index fa0cd52..f0d6323 100644
--- a/opensm/man/osmtest.8
+++ b/opensm/man/osmtest.8
@@ -1,4 +1,4 @@
-.TH OSMTEST 8 "August 11, 2008" "OpenIB" "OpenIB Management"
+.TH OSMTEST 8 "August 31, 2009" "OpenIB" "OpenIB Management"
 
 .SH NAME
 osmtest \- InfiniBand subnet manager and administration (SM/SA) test program
@@ -108,9 +108,10 @@ Stress test options are as follows:
 
  OPT    Description
  ---    -----------------
- -s1  - Single-MAD response SA queries
+ -s1  - Single-MAD (RMPP) response SA queries
  -s2  - Multi-MAD (RMPP) response SA queries
  -s3  - Multi-MAD (RMPP) Path Record SA queries
+ -s4  - Single-MAD (non RMPP) get Path Record SA queries 
 
 Without -s, stress testing is not performed
 .TP
diff --git a/opensm/osmtest/include/osmtest_base.h b/opensm/osmtest/include/osmtest_base.h
index 7c33da3..cda3a31 100644
--- a/opensm/osmtest/include/osmtest_base.h
+++ b/opensm/osmtest/include/osmtest_base.h
@@ -56,11 +56,12 @@
 
 #define STRESS_SMALL_RMPP_THR 100000
 /*
-    Take long times when quering big clusters (over 40 nodes) , an average of : 0.25 sec for query
+    Take long times when querying big clusters (over 40 nodes), an average of : 0.25 sec for query
     each query receives 1000 records
 */
 #define STRESS_LARGE_RMPP_THR 4000
 #define STRESS_LARGE_PR_RMPP_THR 20000
+#define STRESS_GET_PR 100000
 
 extern const char *const p_file;
 
diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
index bb2d6bc..4bb9f82 100644
--- a/opensm/osmtest/main.c
+++ b/opensm/osmtest/main.c
@@ -143,9 +143,10 @@ void show_usage()
 	       "          Stress test options are as follows:\n"
 	       "          OPT    Description\n"
 	       "          ---    -----------------\n"
-	       "          -s1  - Single-MAD response SA queries\n"
+	       "          -s1  - Single-MAD (RMPP) response SA queries\n"
 	       "          -s2  - Multi-MAD (RMPP) response SA queries\n"
 	       "          -s3  - Multi-MAD (RMPP) Path Record SA queries\n"
+	       "          -s4  - Single-MAD (non RMPP) get Path Record SA queries\n"
 	       "          Without -s, stress testing is not performed\n\n");
 	printf("-M\n"
 	       "--Multicast_Mode\n"
@@ -499,6 +500,9 @@ int main(int argc, char *argv[])
 			case 3:
 				printf("Large Path Record SA queries\n");
 				break;
+			case 4:
+				printf("SA Get Path Record queries\n");
+				break;
 			default:
 				printf("Unknown value %u (ignored)\n",
 				       opt.stress);
diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
index 986a8d2..8357d90 100644
--- a/opensm/osmtest/osmtest.c
+++ b/opensm/osmtest/osmtest.c
@@ -2882,6 +2882,151 @@ Exit:
 
 /**********************************************************************
  **********************************************************************/
+ib_api_status_t
+osmtest_stress_path_recs_by_lid(IN osmtest_t * const p_osmt,
+				IN int mode,
+				OUT uint32_t * const p_num_recs,
+				OUT uint32_t * const p_num_queries)
+{
+	osmtest_req_context_t context;
+	ib_path_rec_t *p_rec;
+	cl_status_t status;
+	ib_net16_t dlid, slid;
+	int num_recs, i;
+
+	OSM_LOG_ENTER(&p_osmt->log);
+
+	memset(&context, 0, sizeof(context));
+
+	slid = cl_ntoh16(p_osmt->local_port.lid);
+	if (!mode)
+		dlid = cl_ntoh16(p_osmt->local_port.sm_lid);
+	else
+		dlid = cl_ntoh16(p_osmt->local_port.lid);
+
+	/*
+	 * Do a blocking query for the PathRecord.
+	 */
+	status = osmtest_get_path_rec_by_lid_pair(p_osmt, slid, dlid, &context);
+	if (status != IB_SUCCESS) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 000A: "
+			"osmtest_get_path_rec_by_lid_pair failed (%s)\n",
+			ib_get_err_str(status));
+		goto Exit;
+	}
+
+	/*
+	 * Populate the database with the received records.
+	 */
+	num_recs = context.result.result_cnt;
+	*p_num_recs += num_recs;
+	++*p_num_queries;
+
+	if (osm_log_is_active(&p_osmt->log, OSM_LOG_VERBOSE)) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_VERBOSE,
+			"Received %u records\n", num_recs);
+
+		for (i = 0; i < num_recs; i++) {
+			p_rec = osmv_get_query_path_rec(context.result.p_result_madw, i);
+			osm_dump_path_record(&p_osmt->log, p_rec, OSM_LOG_VERBOSE);
+		}
+	}
+
+Exit:
+	/*
+	 * Return the IB query MAD to the pool as necessary.
+	 */
+	if (context.result.p_result_madw != NULL) {
+		osm_mad_pool_put(&p_osmt->mad_pool,
+				 context.result.p_result_madw);
+		context.result.p_result_madw = NULL;
+	}
+
+	OSM_LOG_EXIT(&p_osmt->log);
+	return (status);
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_api_status_t osmtest_stress_get_pr(IN osmtest_t * const p_osmt,
+					     IN int mode)
+{
+	ib_api_status_t status = IB_SUCCESS;
+	uint64_t num_recs = 0;
+	uint64_t num_queries = 0;
+	uint32_t delta_recs;
+	uint32_t delta_queries;
+	uint32_t print_freq = 0;
+	int num_timeouts = 0;
+	struct timeval start_tv, end_tv;
+	long sec_diff, usec_diff;
+
+	OSM_LOG_ENTER(&p_osmt->log);
+	gettimeofday(&start_tv, NULL);
+	printf("-I- Start time is : %09ld:%06ld [sec:usec]\n",
+	       start_tv.tv_sec, (long)start_tv.tv_usec);
+
+	while ((num_queries < STRESS_GET_PR) && (num_timeouts < 100)) {
+		delta_recs = 0;
+		delta_queries = 0;
+
+		status = osmtest_stress_path_recs_by_lid(p_osmt, mode,
+							 &delta_recs,
+							 &delta_queries);
+		if (status != IB_SUCCESS)
+			goto Exit;
+
+		num_recs += delta_recs;
+		num_queries += delta_queries;
+
+		print_freq += delta_recs;
+		if (print_freq > 5000) {
+			gettimeofday(&end_tv, NULL);
+			printf("%" PRIu64 " records, %" PRIu64 " queries\n",
+			       num_recs, num_queries);
+			if (end_tv.tv_usec >= start_tv.tv_usec) {
+				sec_diff = end_tv.tv_sec - start_tv.tv_sec;
+				usec_diff = end_tv.tv_usec - start_tv.tv_usec;
+			} else {
+				sec_diff = end_tv.tv_sec - start_tv.tv_sec - 1;
+				usec_diff =
+				    1000000 - (start_tv.tv_usec -
+					       end_tv.tv_usec);
+			}
+			printf("-I- End time is : %09ld:%06ld [sec:usec]\n",
+			       end_tv.tv_sec, (long)end_tv.tv_usec);
+			printf("-I- Querying %" PRIu64
+			       " path_rec queries took %04ld:%06ld [sec:usec]\n",
+			       num_queries, sec_diff, usec_diff);
+			print_freq = 0;
+		}
+	}
+
+Exit:
+	gettimeofday(&end_tv, NULL);
+	printf("-I- End time is : %09ld:%06ld [sec:usec]\n",
+	       end_tv.tv_sec, (long)end_tv.tv_usec);
+	if (end_tv.tv_usec >= start_tv.tv_usec) {
+		sec_diff = end_tv.tv_sec - start_tv.tv_sec;
+		usec_diff = end_tv.tv_usec - start_tv.tv_usec;
+	} else {
+		sec_diff = end_tv.tv_sec - start_tv.tv_sec - 1;
+		usec_diff = 1000000 - (start_tv.tv_usec - end_tv.tv_usec);
+	}
+
+	printf("-I- Querying %" PRIu64
+	       " path_rec queries took %04ld:%06ld [sec:usec]\n",
+	       num_queries, sec_diff, usec_diff);
+	if (num_timeouts > 50) {
+		status = IB_TIMEOUT;
+	}
+	/* Exit: */
+	OSM_LOG_EXIT(&p_osmt->log);
+	return (status);
+}
+
+/**********************************************************************
+ **********************************************************************/
 static void
 osmtest_prepare_db_generic(IN osmtest_t * const p_osmt,
 			   IN cl_qmap_t * const p_tbl)
@@ -7247,6 +7392,16 @@ ib_api_status_t osmtest_run(IN osmtest_t * const p_osmt)
 					goto Exit;
 				}
 				break;
+			case 4: /* SA Get PR to SA LID */
+				status = osmtest_stress_get_pr(p_osmt, 0);
+				if (status != IB_SUCCESS) {
+					OSM_LOG(&p_osmt->log, OSM_LOG_ERROR,
+						"ERR 014B: "
+						"SA Get PR stress test failed (%s)\n",
+						ib_get_err_str(status));
+					goto Exit;
+				}
+				break;
 			default:
 				OSM_LOG(&p_osmt->log, OSM_LOG_ERROR,
 					"ERR 0144: "

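The elapsed-time arithmetic that osmtest_stress_get_pr() performs in two places could be factored into one helper. A minimal sketch, assuming only POSIX struct timeval; elapsed_time() is a hypothetical name and not part of osmtest:

```c
#include <sys/time.h>

/*
 * Hypothetical helper: time elapsed from *start to *end, normalized so
 * that 0 <= tv_usec < 1000000. Borrowing a second only when the
 * microsecond difference is negative also gives the right answer when
 * both microsecond fields are equal.
 */
static struct timeval elapsed_time(const struct timeval *start,
				   const struct timeval *end)
{
	struct timeval diff;

	diff.tv_sec = end->tv_sec - start->tv_sec;
	diff.tv_usec = end->tv_usec - start->tv_usec;
	if (diff.tv_usec < 0) {
		diff.tv_sec -= 1;
		diff.tv_usec += 1000000;
	}
	return diff;
}
```

The result can be printed with the same "%04ld:%06ld [sec:usec]" format used in the patch after casting both fields to long.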

From donald.j.meyer at intel.com  Mon Aug 31 12:29:36 2009
From: donald.j.meyer at intel.com (Meyer, Donald J)
Date: Mon, 31 Aug 2009 12:29:36 -0700
Subject: [ofa-general] question about partitioning IB networks
Message-ID: <6203933669E90E4AB42B5BC4EDE38D350C7D048C32@orsmsx510.amr.corp.intel.com>

I am trying to partition my IB network but I don't seem to be able to understand the opensm man page.

First it says "The default partition has P_Key value 0x7fff. OpenSM's port will have full membership in default partition. All other end ports will have partial membership." but I don't see the difference between full and partial membership defined anywhere.  Is it possible the reference was to full and limited membership instead?  Does this partition have to exist on all CAs so the SM can "talk" to them?  Also it says the default partition will be created "unconditionally even when partition configuration file does not exist or cannot be accessed."  Will it also be created if the partition configuration file exists but does not have a default partition defined?

Second, I see that CAs can be members of multiple partitions (have multiple P_Keys).  If a CA is in multiple partitions (has multiple P_Keys assigned to it), which partition does it "send" on when it has packets to send and more than one partition can reach the destination CA?  Also, do switches (or any non-CAs) have to have P_Keys assigned for any reason?

Just as a sanity check, my interpretation so far is that my network should have a partition configuration file similar to the following.  Can anyone tell me if I have this correct?  In this example configuration, I am trying to create two partitions.  One with rack one and two, the other with rack three and four:

#Default partition (for SM control of the CA's)
Default=0x7fff,ipoib,rate=7:ALL=limited;
#rack1
rack1=0x111,ipoib,rate=7,defmember=full:<GUID_list>;
#rack2
rack2=0x111,ipoib,rate=7,defmember=full:<GUID_list>;
#rack3
rack3=0x112,ipoib,rate=7,defmember=full:<GUID_list>;
#rack4
rack4=0x112,ipoib,rate=7,defmember=full:<GUID_list>;
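
For reference, the distinction between full and limited membership can be sketched in a few lines of C. This assumes the InfiniBand convention that the top bit of the 16-bit P_Key marks full membership and the low 15 bits identify the partition; the macro and function names are hypothetical and not taken from OpenSM.

```c
#include <stdint.h>

#define PKEY_FULL_MEMBER	0x8000u	/* top bit: 1 = full, 0 = limited */
#define PKEY_BASE_MASK		0x7fffu	/* low 15 bits: partition number */

/* Nonzero when the P_Key carries full membership. */
static int pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/*
 * Two ports can talk over a partition when the 15-bit bases match and
 * at least one side is a full member; two limited members of the same
 * partition cannot talk to each other.
 */
static int pkeys_can_communicate(uint16_t a, uint16_t b)
{
	if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
		return 0;
	return pkey_is_full(a) || pkey_is_full(b);
}
```

Under this convention the default partition appears as 0xffff on a full member and as 0x7fff on a limited member, and a partition written as 0x111 in the file would appear as 0x8111 on ports configured with full membership.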

Thanks,
Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others


From rdreier at cisco.com  Mon Aug 31 14:08:45 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:08:45 -0700
Subject: [ofa-general] Re: [PATCH V3] mlx4: Do not allow ib userspace open
	following a fatal event
In-Reply-To: <200908301331.51212.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sun, 30 Aug 2009 13:31:51 +0300")
References: <200908301331.51212.jackm@dev.mellanox.co.il>
Message-ID: <adamy5fpu76.fsf@cisco.com>

Applied, thanks for redoing this.


From rdreier at cisco.com  Mon Aug 31 14:10:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:10:44 -0700
Subject: [ofa-general] [PATCH] IB: dereference of dev->ibdev.iwcm in
	c2_register_device()
In-Reply-To: <4A998EC2.70500@gmail.com> (Roel Kluin's message of "Sat, 29 Aug
	2009 22:25:38 +0200")
References: <4A998EC2.70500@gmail.com>
Message-ID: <adaiqg3pu3v.fsf@cisco.com>


 > --- a/drivers/infiniband/hw/amso1100/c2_provider.c
 > +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
 > @@ -851,6 +851,10 @@ int c2_register_device(struct c2_dev *dev)
 >  	dev->ibdev.post_recv = c2_post_receive;
 >  
 >  	dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL);
 > +	if (dev->ibdev.iwcm == NULL) {
 > +		ret = -ENOMEM;
 > +		goto out1;
 > +	}
 >  	dev->ibdev.iwcm->add_ref = c2_add_ref;
 >  	dev->ibdev.iwcm->rem_ref = c2_rem_ref;
 >  	dev->ibdev.iwcm->get_qp = c2_get_qp;

Looks like a real fix to me -- but then don't we need to kfree() this
memory if any of the later initialization fails (to avoid a leak)?
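
The cleanup being asked about follows the usual goto-unwind pattern. A userspace sketch of the idea, with malloc()/free() standing in for kmalloc()/kfree(); every toy_* name is hypothetical and none of this is the amso1100 driver code:

```c
#include <errno.h>
#include <stdlib.h>

struct toy_cm {
	int dummy;
};

struct toy_dev {
	struct toy_cm *iwcm;
};

/* Hypothetical later initialization step; fail != 0 simulates an error. */
static int toy_register(struct toy_dev *dev, int fail)
{
	(void)dev;
	return fail ? -EIO : 0;
}

static int toy_register_device(struct toy_dev *dev, int fail)
{
	int ret;

	dev->iwcm = malloc(sizeof(*dev->iwcm));
	if (dev->iwcm == NULL)
		return -ENOMEM;

	ret = toy_register(dev, fail);
	if (ret)
		goto out_free;

	return 0;

out_free:
	/* Undo the allocation so a later failure does not leak it. */
	free(dev->iwcm);
	dev->iwcm = NULL;
	return ret;
}
```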


From rdreier at cisco.com  Mon Aug 31 14:25:59 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:25:59 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: Construct MAD redirect replies
	from request MAD
In-Reply-To: <200908261337.56128.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Wed, 26 Aug 2009 13:37:55 +0200")
References: <200908261337.56128.fenkes@de.ibm.com>
Message-ID: <adaeiqrpteg.fsf@cisco.com>

this seems reasonable to me, applied, thanks.


From worleys at gmail.com  Mon Aug 31 16:04:05 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 31 Aug 2009 17:04:05 -0600
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and 
	eventually hangs
In-Reply-To: <e2e108260908112315y3e902d7ay775d65c16d8e561e@mail.gmail.com>
References: <f3177b9e0908091009x23813cbdq4fbd9ebe6d8e174f@mail.gmail.com>
	<e2e108260908100340p71efed9u72cf996be0843edd@mail.gmail.com>
	<f3177b9e0908111452k3d531657tcff3d2cfee030196@mail.gmail.com>
	<e2e108260908112315y3e902d7ay775d65c16d8e561e@mail.gmail.com>
Message-ID: <f3177b9e0908311604k4dc460b8n8745aac516e759bb@mail.gmail.com>

On Wed, Aug 12, 2009 at 12:15 AM, Bart Van
Assche<bart.vanassche at gmail.com> wrote:
> On Tue, Aug 11, 2009 at 11:52 PM, Chris Worley<worleys at gmail.com> wrote:
>> I set up my target exactly as you prescribe... but my initiator is
>> still Windows (version of WinOF at top): performance as reported by
>> IOMeter starts high and the average slowly decreases.  Watching the
>> instantaneous throughput, there seem to be longer and longer lags of
>> poor performance between moments of good performance.  I need to run
>> this against a Linux initiator to see if the problems are w/ WinOF.
>>
>> Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the
>> performance was steady and getting close to acceptable.  In a 15 hour
>> test that cycles through sequential and random LBA's and R/W mixes
>> from block sizes from 1MB to 512B, it worked well and got decent
>> performance until it hit 1KB sequential reads which hung IOMeter; no
>> messages on the Linux side (all looked okay).  IBSRP on the Windows
>> side just said "a reset to device was issued" every 15 to 30 seconds
>> after the problem started. I reloaded the IB stack on the Linux side,
>> and was able to get it restarted.
>>
>> Still a lot of combinations to test.
>
> Which trace settings are you using on the target ? Enabling the proper
> trace settings via /proc/scsi_tgt/trace_level might reveal whether you
> are e.g. hitting the QUEUE_FULL condition. See also scst/README.

I've found a good kernel/scst mix to easily repeat this; I can get it
to repeatedly hang w/ 8K block transfers running Ubuntu 9.04 w/ the
2.6.27-14-server kernel on _both_ target and initiator (i.e. no WinOF
or OFED at all) and SCST rev 1062 on the target using one drive
(performance is >600MB/s, >80K IOPS, on the 8KB block sizes being
used).

Although the problem doesn't occur in Windows until blocks are <2KB
and the RHEL5.2/OFED configuration does not repeat the issue using a
Linux initiator, it seems like a very similar hang, so I'm hoping it's
the same issue.

To repeat the issue, I run 8KB block random reads w/ 64 threads,
running AIO calls w/ a depth of 64 (using "fio" on the initiator):

# fio --rw=randrw --bs=8k --rwmixread=100 --numjobs=64 --iodepth=64
--sync=0 --direct=1 --randrepeat=0 --ioengine=libaio
--filename=/dev/sdn --name=test --loops=10000 --size=16091503001

The "size" represents 10% of the drive.  It doesn't seem to ever
happen on writes, but I've seen it happen on mixed reads/writes.

With tracing set to "default", there was still nothing in the target
logs at the time of the hang.

With tracing set thusly on the target:

echo "all" >/proc/scsi_tgt/trace_level
echo "all" >/proc/scsi_tgt/vdisk/trace_level

The last few lines of dmesg look like:

[255354.313411]    0: 28 00 01 84 54 90 00 00 10 00 00 00 00 00 00 00
 (...T...........
[255354.313420] [0]: scst: scst_cmd_init_done:214:tag=62, lun=0, CDB
len=16, queue_type=1 (cmd ffff880102b4a568)
[255354.313443] [26358]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880102b4a3a0), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
[255354.313420] [0]: scst_cmd_init_done:216:Recieving CDB:
[255354.313452] [8602]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880102b49e48 (sg_cnt 0, sg ffff880132579f60, sg[0].page
ffffe200042b7180)
[255354.313457] [8604]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880102b4a010 (sg_cnt 0, sg ffff8802e9806f60, sg[0].page
ffffe2000bc129c0)
[255354.313426]  (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
[255354.313426]    0: 28 00 01 bc 5d 10 00 00 10 00 00 00 00 00 00 00
 (...]...........
[255354.313468] [26358]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880102b4a568), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
[255354.313484] [8602]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880102b4a1d8 (sg_cnt 0, sg ffff8802e98064c0, sg[0].page
ffffe2000bc633c0)
[255354.313551] [8604]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880102b4a3a0 (sg_cnt 0, sg ffff88018a877060, sg[0].page
ffffe20004300200)
[255354.313556] [8602]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880102b4a568 (sg_cnt 0, sg ffff880142581100, sg[0].page
ffffe20004066d40)

... and there's a section like:

[255354.310177]    0: 28 00 01 25 df 50 00 00 10 00 00 00 00 00 00 00
 (..%.P..........
[255354.310177] [0]: scst: scst_cmd_init_done:214:tag=57, lun=0, CDB
len=16, queue_type=1 (cmd ffff8801642e2730)
[255354.310177] [0]: scst_cmd_init_done:216:Recieving CDB:
[255354.310177]  (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
[255354.310177]    0: 28 00 01 5e 22 c0 00 00 10 00 00 00 00 00 00 00
 (..^"...........
[255354.310966] [26369]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880168a9e3a0), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
[255354.310973] [26361]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880168a9e010), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
[255354.310980] [26365]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880168a9e1d8), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
[255354.310986] [26359]: scst: scst_pre_parse:417:op_name <READ(10)>
(cmd ffff880168a9de48), direction=2 (expected 2, set yes),
transfer_len=16 (expected len 8192), flags=1
...
[255354.311221] [8604]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880168a9e1d8 (sg_cnt 0, sg ffff880173ca8060, sg[0].page
ffffe20004325d00)
[255354.311226] [8602]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880168a9ee50 (sg_cnt 0, sg ffff880173ca8c40, sg[0].page
ffffe20005847ec0)
[255354.311233] [8604]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880168a9dc80 (sg_cnt 0, sg ffff8802f0143c40, sg[0].page
ffffe2000bc04880)
[255354.311238] [8602]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880168a9e568 (sg_cnt 0, sg ffff8802f08361a0, sg[0].page
ffffe2000bbf2400)
[255354.311242] [8604]: scst: scst_xmit_response:3004:Xmitting data
for cmd ffff880168a9d560 (sg_cnt 0, sg ffff88010acd74c0, sg[0].page
ffffe200047e7280)

... but, prior to that, messages are unreadably garbled, as in:

Aug 31 22:37:00 nameme kernel: t]9l ft48 r(09 ,83_5p  s20 sg:303
_00s3]c_=cs  _00ad0000e_003a6_0031_4(ea5 9arg )_2As_05s_8[7:c8[f3 _178
087gff0 .R nt]9i0tmpd1:ft st06s68 5i9[301602_106)o6 _001e4 0<s0 3>)0
.3E3_28a9102 pft0>e_o[.eo[<_2n05 98_0f8_i xpe1f0 D<98s np8one:21_0
30f3006=e_ ax R8gs=h62]= 2.pd_ pad555mlf
1_]f8=.05lf i7gxs_ac3 m_0c0:]5i3087[_ 5e sg,00[dc3e,_ 0[ ( 1<[t]F]
..eb 4t_ ah1,_1_]10.h45_]2,5__12C5o 37 d_.)b_g4f850s, t1e c80.ite.8pE
ue2.4f[.ft0 5c5_1effft 5530 f len=16, 5v03,em_cs4e 05fc78.5r5. n
,45ft45ff<if_:4fnd5c<ts54c078f9]_0c0a0efee04f[,1n 0 __5deff588=f82
.t)m9.8)9.8077=s  _C 3 i8 .tlsf5_[0s0 (2u fu 4
5fco5fnr.n0a05_34f__4fd_4n Bs60fn4pB.tor7=s
_i8s7=0_.tl:c>l3e0.51_654.30350en.m C30 C3 e f.dtm0=2_1e0n]6qe  d.>_
76 d=f _esr_tp 9_50.tnf50[cs.,
Aug 31 22:37:00 nameme kernel: e .0 5 B , 45 0<s382 3_
Aug 31 22:37:00 nameme kernel:  c2< s0< cm38cf58.[f10 002< c3De
_)088m8 9c5299pected__F
Aug 31 22:37:00 nameme kernel: tran50 pt48)=8]=s59etl5pe4e6d)0c6
ei_2(e_<3cc_ ea51es_0_sras A >cmdtesafe4 3[m 3.rer7:[ 1b00s5
Aug 31 22:37:00 nameme kernel: ] 2a015ffs.35fff  B__ a
6cmd9spre3se9_2e3806(3_csA_  1 ns38ge0sre0
Aug 31 22:37:00 nameme kernel: <g data  sf9_ _ 6d  0se5245f_26._2
.,76.9<g fe t_]t6:(E...:s5D.s0_<Rte46>0330B005]08s3 __ r40r._5x,<Re08
:2ec_ :06cs1_0ti1d l:253064enfe7]0 abd5 0f>196.t b 7.(008ni]
0s09.r650t, <24]__ s1=in03 s0p c2>>[4ein.1:ooD..ps210a>[25534_r6,:t
n4.]4(8 e2 .r c 2n1g9360]10>(  00 00 00 00[fd[2
[2g_re53  le_6c_md8t_ftc883tf03c  m_0 :8r8fmd63m3:0] 25 c6>[2n_e:fa2e84_0
Aug 31 22:37:00 nameme kernel: c,
Aug 31 22:37:00 nameme kernel: .=0>5f=1s5=1d6_(de:d
2l_25:0edg25fm>ff40 l440 e,AFg l)AF0 0o[1088. 1aggB
0n=d9(16a.5oeX6csf00s0: ._, (=10es_(1 7 5c___oR5st_42p3d 7
C9d=5_:(3__7mD4_ 0m4_ed
04,5.,[s55.d4c,,25=,c8__q,[(meet9303_mr0ue9m0u_032__fy2se
Aug 31 22:37:00 nameme kernel: >  y>i

... so other suggestions on trace settings would be appreciated.

Thanks,

Chris
>
> Bart.
>


From weiny2 at llnl.gov  Mon Aug 31 17:01:44 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 31 Aug 2009 17:01:44 -0700
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc:
 Introduce a context object.
In-Reply-To: <20090823120609.GG9547@me>
References: <20090813204306.dffc3237.weiny2@llnl.gov>
	<20090816110200.GS25501@me>
	<20090817083023.da17378b.weiny2@llnl.gov>
	<20090823120609.GG9547@me>
Message-ID: <20090831170144.da0e7185.weiny2@llnl.gov>

Hey Sasha,

On Sun, 23 Aug 2009 15:06:09 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 08:30 Mon 17 Aug     , Ira Weiny wrote:
> > 
> > The immediate benefit is coming with the multi-threaded implementation where
> > I plan on adding the following function.[*]

The discussion on the list has digressed from this patch.  I still think this
patch is valid and adds a level of flexibility which is needed regardless of
what is decided about libibmad.  Do you agree?

Also, the last patch in the series ([PATCH 5/5] infiniband-diags/libibnetdisc:
remove members of the fabric struct which are used in the scan only) cleans up
some stuff from the external interface.  If you really don't want to introduce
a context object, then I can regenerate that final patch without the context.

Ira

> 
> Ok, but could we first discuss how the multithreading architecture will be
> implemented in libibnetdisc: goals (in particular, is it support for
> multithreaded apps or just a multithreaded discovery function), interaction
> with the caller application, etc.?
> 
> One desired feature I can think of would be to keep the API
> simple for single-threaded use.
> 
> Sasha


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov