From vlad at lists.openfabrics.org  Sun Feb  1 03:11:43 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  1 Feb 2009 03:11:43 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090201-0200 daily build status
Message-ID: <20090201111144.1E366E60EFD@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From Jie.Cai at cs.anu.edu.au  Sun Feb  1 23:14:35 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Mon, 02 Feb 2009 18:14:35 +1100
Subject: [ofa-general] Multiports single HCA uDAPL program problem
In-Reply-To: <E3280858FA94444CA49D2BA02341C983381AE85E@orsmsx506.amr.corp.intel.com>
References: <20090129200005.20863E61234@openfabrics.org>
	<4982A3D8.5030503@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C983381AE85E@orsmsx506.amr.corp.intel.com>
Message-ID: <49869D5B.9020004@cs.anu.edu.au>

One more problem occurred when trying to establish one connection per
rail, as illustrated in the diagram below.

          node0                    node1
rail0: psp0 <----------------> ep0         (port 0 on hca)
rail1: psp1 <----------------> ep1         (port 1 on hca)

rail0 is connected first, and its connection is always stable and correct.
However, rail1 sometimes connects properly and sometimes does not.
The error message is as follows:

11836 Waiting for connect response
11836 Error unexpected conn event : DAT_CONNECTION_EVENT_NON_PEER_REJECTED
11836 Error connect_ep: DAT_ABORT

The program establishes the connection for both rails in exactly the same way.
What might cause this?

Regards,

-- 
Jie Cai


Davis, Arlin R wrote:
> This looks like an ARP issue across your IPoIB interfaces. 
>
> Please see section 6 of the uDAPL OFED BKM.
>
> http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf
>  
> 6. Multi IB port configuration, IPoIB arp reply issues
>
> When two interfaces are running, one interface may reply to an ARP
> directed to the other interface on the system. The following
> configuration will cause the interfaces to ignore ARP requests if
> not specifically for their IP address.
>
> Add the following lines to /etc/sysctl.conf
> net.ipv4.conf.all.arp_ignore=1
> net.ipv4.conf.ib0.arp_ignore=1
> net.ipv4.conf.ib1.arp_ignore=1
>
> or use sysctl:
> sysctl -w net.ipv4.conf.all.arp_ignore=1
> sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>
> -arlin
>
>   
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org 
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>> Sent: Thursday, January 29, 2009 10:53 PM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Multiports single HCA uDAPL program problem
>>
>> Hi All,
>>
>> I am kind of a noob at IB and uDAPL programming. Currently, I am trying to
>> write a multirail program that uses 2 ports on a single Mellanox
>> ConnectX HCA on both nodes.
>>
>> OFED1.3 has been installed on a SUSE 10.3 linux system.
>>
>> The current problem is that IB connections via uDAPL are very unstable,
>> and sometimes the connection can't be established.
>> The error message usually looks like:
>>
>> 20350 Server waiting for connect request on port 45248
>> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>> 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>> 20350 Error connect_ep: DAT_INTERNAL_ERROR
>>
>> Both ports are active:
>> hca_id:    mlx4_0
>>    fw_ver:                2.3.000
>>    node_guid:            0003:ba00:0100:702c
>>    sys_image_guid:            0003:ba00:0100:702f
>>    vendor_id:            0x02c9
>>    vendor_part_id:            25418
>>    hw_ver:                0xA0
>>    board_id:            SUN0070000001
>>    phys_port_cnt:            2
>>        port:    1
>>            state:            PORT_ACTIVE (4)
>>            max_mtu:        2048 (4)
>>            active_mtu:        2048 (4)
>>            sm_lid:            10
>>            port_lid:        8
>>            port_lmc:        0x00
>>
>>        port:    2
>>            state:            PORT_ACTIVE (4)
>>            max_mtu:        2048 (4)
>>            active_mtu:        2048 (4)
>>            sm_lid:            10
>>            port_lid:        9
>>            port_lmc:        0x00
>>
>>
>> I haven't done any specific configuration for multi-port. I assume that
>> OFED1.3 can do it automatically.
>>
>> Could anyone please help me with this?
>>
>> Regards,
>> Jie
>>
>> --
>> Jie Cai


From vlad at lists.openfabrics.org  Mon Feb  2 03:11:52 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon,  2 Feb 2009 03:11:52 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090202-0200 daily build status
Message-ID: <20090202111152.CADA8E60F0E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From halr at obsidianresearch.com  Mon Feb  2 08:18:03 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 09:18:03 -0700
Subject: [ofa-general] [PATCH] libibmad/(mad.h fields.c): Add support for
	PerfMgt ClassPortInfo
Message-ID: <1233591483.8992.368.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is a patch to add support for PerfMgt ClassPortInfo attribute
into libibmad.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libibmad-mad.h-fields.c-Add-support-for-PerfMgt-C.patch
Type: application/mbox
Size: 5498 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/850a8337/attachment.mbox>

From halr at obsidianresearch.com  Mon Feb  2 08:18:41 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 09:18:41 -0700
Subject: [ofa-general] [PATCH] ibsim/sim_mad.c: Add sim support for PerfMgt
	ClassPortInfo
Message-ID: <1233591521.8992.369.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is a patch to add simulator support for PerfMgt ClassPortInfo
(subsequent to previous libibmad patch).

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ibsim-sim_mad.c-Add-sim-support-for-PerfMgt-ClassPo.patch
Type: application/mbox
Size: 2157 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/c23db161/attachment.mbox>

From swise at opengridcomputing.com  Mon Feb  2 08:25:14 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 02 Feb 2009 10:25:14 -0600
Subject: [ofa-general] dapl attribute bug
Message-ID: <49871E6A.9000901@opengridcomputing.com>

Hey Arlin,

We've uncovered a problem with the DAPL attribute mappings to the linux 
rdma device attributes.

The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code 
maps this to the linux ib_device_attr->max_mr_size which is u64.

This causes dapltest to fail in some cases when running over chelsio 
which sets max_mr_size to 0x100000000 (4GB).  The dapl code truncates 
the value to 0. See dapl/openib_cma/dapl_ib_util.c.

I'm not sure what the fix should be, but maybe the dapl code should set 
anything over 32 bits to 0xffffffff?
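
A minimal sketch of the clamping idea suggested above. The helper names
clamp_u64_to_u32() and map_max_mr_size() are hypothetical and are not taken
from the dapl sources; the sketch only shows saturating a 64-bit device
limit into a 32-bit attribute instead of truncating it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of dapl): saturate a 64-bit value so it
 * fits in 32 bits instead of silently dropping the upper bits. */
static uint32_t clamp_u64_to_u32(uint64_t value)
{
	return value > UINT32_MAX ? UINT32_MAX : (uint32_t)value;
}

/* Hypothetical stand-in for the attribute mapping: takes the 64-bit
 * max_mr_size reported by the device and returns the value that would be
 * stored in a 32-bit field such as max_lmr_block_size. */
static uint32_t map_max_mr_size(uint64_t max_mr_size)
{
	uint32_t block_size = clamp_u64_to_u32(max_mr_size);

	/* A plain cast could turn a non-zero size into 0; the clamp cannot. */
	assert(max_mr_size == 0 || block_size != 0);
	return block_size;
}
```

With a 4GB max_mr_size (0x100000000) a plain cast to 32 bits gives 0, while
the clamp gives 0xffffffff.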


Steve.


From rdreier at cisco.com  Mon Feb  2 09:00:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Feb 2009 09:00:53 -0800
Subject: [ofa-general] Re: [PATCH v2] RDMA/nes: Account for freed pbl after
	hw operation
In-Reply-To: <20090123212445.GA6248@ctung-MOBL> (Chien Tung's message of "Fri, 
	23 Jan 2009 15:24:45 -0600")
References: <20090123212445.GA6248@ctung-MOBL>
Message-ID: <adahc3cpyd6.fsf@cisco.com>

 > @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 >  	nesmr->ibmw.rkey = ibfmr->rkey;
 >  	nesmr->ibmw.uobject = NULL;
 >  
 > +	rc = nes_dealloc_mw(&nesmr->ibmw);
 > +
 >  	if (nesfmr->nesmr.pbls_used != 0) {
 >  		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
 >  		if (nesfmr->nesmr.pbl_4k) {

Can this be right?  nes_dealloc_mw() fails, so the HW still thinks it
owns the resources, and then the function just continues and releases
the PBLs before returning?

[And same issue seems to be there for the change to nes_dereg_mr]

 - R.


From arlin.r.davis at intel.com  Mon Feb  2 10:01:32 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 2 Feb 2009 10:01:32 -0800
Subject: [ofa-general] Multiports single HCA uDAPL program problem
In-Reply-To: <49869D5B.9020004@cs.anu.edu.au>
References: <20090129200005.20863E61234@openfabrics.org>
	<4982A3D8.5030503@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C983381AE85E@orsmsx506.amr.corp.intel.com>
	<49869D5B.9020004@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C983381AF3D4@orsmsx506.amr.corp.intel.com>

 
>One more problem occurred when trying to establish one connection per
>rail, as illustrated in the diagram below.
>
>          node0                    node1
>rail0: psp0 <----------------> ep0         (port 0 on hca)
>rail1: psp1 <----------------> ep1         (port 1 on hca)
>
>rail0 is connected first, and its connection is always stable and correct.
>However, rail1 sometimes connects properly and sometimes does not.
>The error message is as follows:
>
>11836 Waiting for connect response
>11836 Error unexpected conn event : DAT_CONNECTION_EVENT_NON_PEER_REJECTED
>11836 Error connect_ep: DAT_ABORT
>
>The program establishes the connection for both rails in exactly the same way.
>What might cause this?

rdma_cm is rejecting the connect request. Turn on warnings for more information:

 export DAPL_DBG_TYPE=0x0003

-arlin


From halr at obsidianresearch.com  Mon Feb  2 10:58:35 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 11:58:35 -0700
Subject: [ofa-general] [PATCHv2] libibmad/(mad.h fields.c): Add support for
	PerfMgt ClassPortInfo
Message-ID: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is v2 of a patch to add support for PerfMgt ClassPortInfo attribute
into libibmad.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libibmad-mad.h-fields.c-Add-support-for-PerfMgt-C.patch
Type: application/mbox
Size: 5505 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/d517e411/attachment.mbox>

From halr at obsidianresearch.com  Mon Feb  2 10:58:46 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 11:58:46 -0700
Subject: [ofa-general] [PATCHv2] ibsim/sim_mad.c: Add sim support for PerfMgt
	ClassPortInfo
Message-ID: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is v2 of a patch to add simulator support for PerfMgt ClassPortInfo
(subsequent to previous libibmad patch).

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ibsim-sim_mad.c-Add-sim-support-for-PerfMgt-ClassPo.patch
Type: application/mbox
Size: 2164 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/ca560e8e/attachment.mbox>

From halr at obsidianresearch.com  Mon Feb  2 11:06:50 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 12:06:50 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_perfmgr_db.h: Remove
	unused typedef
Message-ID: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to remove an unused typedef in perfmgr.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-opensm-osm_perfmgr_db.h-Remove-unused-typedef.patch
Type: application/mbox
Size: 1006 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/342e8275/attachment.mbox>

From halr at obsidianresearch.com  Mon Feb  2 11:07:01 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 12:07:01 -0700
Subject: [ofa-general] [PATCH][MINOR] opensm/osm_perfmgr.c: Eliminate memory
	leak on error
Message-ID: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>

Sasha,

Minor patch to osm_perfmgr.c to eliminate a memory leak on error in
osm_perfmgr_init.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-osm_perfmgr.c-In-osm_perfmgr_init-eliminate.patch
Type: application/mbox
Size: 1378 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090202/5e597eac/attachment.mbox>

From sashak at voltaire.com  Mon Feb  2 12:29:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 2 Feb 2009 22:29:04 +0200
Subject: [ofa-general] [PATCH 3/4] opensm/osm_log.c save log_max_size
	in subnet opt in MB
In-Reply-To: <497DC9B6.5010200@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9B6.5010200@gmail.com>
Message-ID: <20090202202904.GD5910@sashak.voltaire.com>

Hi Eli,

On 16:33 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>  save log_max_size in subnet opt in MB
>  the max_size in the log object is converted to bytes.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/opensm/main.c    |    5 ++---
>  opensm/opensm/osm_log.c |    2 +-
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 0f7b822..de38056 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -778,9 +778,8 @@ int main(int argc, char *argv[])
>  			break;
>  
>  		case 'L':
> -			opt.log_max_size =
> -			    strtoul(optarg, NULL, 0) * (1024 * 1024);
> -			printf(" Log file max size is %lu bytes\n",
> +			opt.log_max_size = strtoul(optarg, NULL, 0);
> +			printf(" Log file max size is %lu MBytes\n",
>  			       opt.log_max_size);
>  			break;
>  
> diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
> index 88633ab..d5e1af6 100644
> --- a/opensm/opensm/osm_log.c
> +++ b/opensm/opensm/osm_log.c
> @@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * const p_log,
>  	p_log->level = log_flags;
>  	p_log->flush = flush;
>  	p_log->count = 0;
> -	p_log->max_size = max_size;
> +	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
>  	p_log->accum_log_file = accum_log_file;
>  	p_log->log_file_name = (char *)log_file;

This change is obviously not sufficient. If you decide to store the max log
file size in MB in the options structure, then all places where it is
parsed/dumped should be changed. Something like this:


diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index f786192..6f0d85e 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -777,9 +777,8 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'L':
-			opt.log_max_size =
-			    strtoul(optarg, NULL, 0) * (1024 * 1024);
-			printf(" Log file max size is %lu bytes\n",
+			opt.log_max_size = strtoul(optarg, NULL, 0);
+			printf(" Log file max size is %lu MBytes\n",
 			       opt.log_max_size);
 			break;
 
diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
index 88633ab..d5e1af6 100644
--- a/opensm/opensm/osm_log.c
+++ b/opensm/opensm/osm_log.c
@@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * const p_log,
 	p_log->level = log_flags;
 	p_log->flush = flush;
 	p_log->count = 0;
-	p_log->max_size = max_size;
+	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
 	p_log->accum_log_file = accum_log_file;
 	p_log->log_file_name = (char *)log_file;
 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 94b6332..2141899 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1141,7 +1141,6 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 
 		opts_unpack_uint32("log_max_size", p_key, p_val,
 				   (void *) & p_opts->log_max_size);
-		p_opts->log_max_size *= 1024 * 1024; /* convert to MB */
 
 		opts_unpack_charp("partition_config_file",
 				  p_key, p_val, &p_opts->partition_config_file);
@@ -1620,7 +1619,7 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		p_opts->log_flags,
 		p_opts->force_log_flush ? "TRUE" : "FALSE",
 		p_opts->log_file,
-		p_opts->log_max_size/1024/1024,
+		p_opts->log_max_size,
 		p_opts->accum_log_file ? "TRUE" : "FALSE",
 		p_opts->dump_files_dir,
 		p_opts->enable_quirks ? "TRUE" : "FALSE",


I'm committing this with the change above.
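
The unit convention in the diff above (options keep the size in MB, the log
object keeps it in bytes, and the conversion happens in exactly one place)
can be sketched as follows. The struct and function names are hypothetical
and are not the OpenSM types, and the saturation check is an addition for
illustration; the patch itself uses a plain `max_size << 20`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the options structure and the log object. */
struct log_opts {
	unsigned long log_max_size;	/* in MB, as parsed and as dumped */
};

struct log_state {
	unsigned long max_size;		/* in bytes, used when writing */
};

/* Convert MB to bytes, saturating instead of wrapping when the shift
 * would overflow unsigned long (assumption: saturation is acceptable). */
static unsigned long log_mb_to_bytes(unsigned long size_mb)
{
	if (size_mb > (ULONG_MAX >> 20))
		return ULONG_MAX;
	return size_mb << 20;
}

/* The only place where the unit changes from MB to bytes. */
static void log_init(struct log_state *log, const struct log_opts *opts)
{
	assert(log != NULL && opts != NULL);
	log->max_size = log_mb_to_bytes(opts->log_max_size);
}
```

Because the options value is never rescaled, parsing the config file and
dumping it back write the same number of MB.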

Sasha


From sashak at voltaire.com  Mon Feb  2 12:59:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 2 Feb 2009 22:59:31 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <497DC96F.3000902@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
Message-ID: <20090202205924.GF5910@sashak.voltaire.com>

On 16:32 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>  rescan subnet configuration after SIGHUP
>  call osm_subn_rescan_conf_files() after SIGHUP.
>  this is important when priority is changed and SM is in standby.
>  in that case it will not send capability mask trap and will not become master.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/opensm/main.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index f786192..0f7b822 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>  			osm_hup_flag = 0;
>  			/* a HUP signal should only start a new heavy sweep */
>  			p_osm->subn.force_heavy_sweep = TRUE;
> +			osm_subn_rescan_conf_files(&p_osm->subn);

Is it synchronized with the sweep? What happens if a regular (timer-scheduled)
sweep starts in the middle of osm_subn_rescan_conf_files() (when QoS
parameters are being freed, etc.)? I think it is not synchronized.

Sasha

>  			osm_opensm_sweep(p_osm);
>  		}
>  	}
> -- 
> 1.5.5
> 


From sean.hefty at intel.com  Mon Feb  2 14:11:45 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 14:11:45 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing
	from osm_vendor_t ?
In-Reply-To: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
Message-ID: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>

Forwarding to general list and copying Sasha.

>Hello,
>  The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in
>opensm\user\include\vendor\osm_vendor_al.h) is missing
>the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
>saquery.c expects to find ca_names in osm_vendor_t.
>
>A couple of observations:
>1) Windows currently supports a much older version of opensm than what OFED 1.4
>tools expect.
>
>2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad
>interfaces.
>   If libibmad were used from saquery, then OpenSM would not need libibmad
>references for Windows.
>
>3) How bad is it to create libibmad dependencies for OpenSM?
>
>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
>the rest use
>   libibmad.
>
>Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'.
>This cmd-line input is resolved thru
>libibmad interfaces.
>Saquery is no exception in that it expects to match the '-C ca_name' against
>osm_vendor_t.ca_names[]. 'ibstat -l' lists
>CA names.
>
>The question becomes how best to resolve the missing ca_names?
>
>1) modify saquery to call libibmad interface to get CA names; leaves
>osm_vendor_t unmodified.
>   So far, saquery is the only diag pgm which uses OSM mad interfaces;
>expecting ca_names
>   in osm_vendor_t.
>
>2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate
>ca_names
>   from OpenSM code? Creates libibmad dependencies for opensm.
>
>Comments?
>
>Thanks,
>
>Stan.


From sean.hefty at intel.com  Mon Feb  2 14:51:56 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 14:51:56 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names
	missing	from osm_vendor_t ?
In-Reply-To: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
Message-ID: <CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>

>>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
>>the rest use libibmad.

Looking briefly at the saquery code, I don't understand the benefit to using the
opensm vendor interfaces, versus using libibmad or even libibumad directly, and
switching to libibumad looks doable.  (It's not clear to me that there are
benefits to using libibmad over libibumad for saquery.)

- osm_bind_handle_t looks like it could map to a libibumad port_id (int).
- osmv_query_sa() could map to umad_send(), followed by umad_recv() to
  obtain the result.  (Replace osmv_query_sa with a new function.)
- There are a couple other calls that are used to loop through all returned
  attributes in a response MAD.  We could use the MAD attribute offset
  directly.  (Update loops where osmv_get_query_* is called.)

Are there technical reasons why the opensm vendor library was chosen for
saquery?  Would there be any objection to changing saquery to use libibumad
directly?  
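
As a rough sketch of the last bullet (walking the returned attributes by
the MAD attribute offset), assuming a fully reassembled SA response in one
flat buffer, for example the pointer returned by umad_get_mad() after
umad_recv(). The byte positions 44 (AttributeOffset, big-endian, in 8-byte
units) and 56 (start of SA data) are my reading of the SA MAD layout and
should be checked against the spec; sa_for_each_record() and the callback
type are hypothetical names, not existing library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SA_ATTR_OFFSET_POS 44	/* assumed byte position of AttributeOffset */
#define SA_DATA_POS        56	/* assumed byte position of the first record */

typedef void (*sa_record_cb)(const uint8_t *record, size_t len, void *ctx);

/* Walk every complete record in a reassembled SA response and return the
 * number of records visited; cb may be NULL to only count them. */
static size_t sa_for_each_record(const uint8_t *mad, size_t mad_len,
				 sa_record_cb cb, void *ctx)
{
	size_t stride, pos, count = 0;

	assert(mad != NULL);
	if (mad_len < SA_DATA_POS)
		return 0;

	/* AttributeOffset: 16 bits, big-endian, counted in 8-byte units. */
	stride = (((size_t)mad[SA_ATTR_OFFSET_POS] << 8) |
		  mad[SA_ATTR_OFFSET_POS + 1]) * 8;
	if (stride == 0)
		return 0;

	for (pos = SA_DATA_POS; pos + stride <= mad_len; pos += stride) {
		if (cb)
			cb(mad + pos, stride, ctx);
		count++;
	}
	return count;
}
```

A loop that currently calls osmv_get_query_* per result could instead pass
a callback that decodes one record at a time.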

- Sean


From weiny2 at llnl.gov  Mon Feb  2 15:06:58 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 2 Feb 2009 15:06:58 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names
	missing	from osm_vendor_t ?
In-Reply-To: <CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
	<CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
Message-ID: <20090202150658.0af72134.weiny2@llnl.gov>

On Mon, 2 Feb 2009 14:51:56 -0800
"Sean Hefty" <sean.hefty at intel.com> wrote:

> >>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
> >>the rest use libibmad.
> 
> Looking briefly at the saquery code, I don't understand the benefit to using the
> opensm vendor interfaces, versus using libibmad or even libibumad directly, and
> switching to libibumad looks doable.  (It's not clear to me that there are
> benefits to using libibmad over libibumad for saquery.)
> 
> - osm_bind_handle_t looks like it could map to a libibumad port_id (int).
> - osmv_query_sa() could map to umad_send(), followed by umad_recv() to
>   obtain the result.  (Replace osmv_query_sa with a new function.)
> - There are a couple other calls that are used to loop through all returned
>   attributes in a response MAD.  We could use the MAD attribute offset
>   directly.  (Update loops where osmv_get_query_* is called.)
> 
> Are there technical reasons why the opensm vendor library was chosen for
> saquery?  Would there be any objection to changing saquery to use libibumad
> directly?  

I don't remember the exact details but at the time saquery was first written, 
ibmad/ibumad did not have all the functionality I needed and the OpenSM vendor
library did.  That may no longer be the case and if not then I would support
converting to using those other libraries.

Ira

> 
> - Sean
> 
> 


From donald.e.wood at intel.com  Mon Feb  2 15:07:56 2009
From: donald.e.wood at intel.com (Wood, Donald E)
Date: Mon, 2 Feb 2009 16:07:56 -0700
Subject: [ofa-general] RE: [PATCH v2] RDMA/nes: Account for freed pbl after
	hw operation
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383032085A8F4@azsmsx501.amr.corp.intel.com>
References: <60BEFF3FBD4C6047B0F13F205CAFA383032085A8F4@azsmsx501.amr.corp.intel.com>
Message-ID: <588992150B702C48B3312184F1B810AD03A516FC3D@azsmsx501.amr.corp.intel.com>


> > @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
> >  	nesmr->ibmw.rkey = ibfmr->rkey;
> >  	nesmr->ibmw.uobject = NULL;
> >  
> > +	rc = nes_dealloc_mw(&nesmr->ibmw);
> > +
> >  	if (nesfmr->nesmr.pbls_used != 0) {
> >  		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
> >  		if (nesfmr->nesmr.pbl_4k) {
>
> Can this be right?  nes_dealloc_mw() fails, so the HW still thinks it
> owns the resources, and then the function just continues and releases
> the PBLs before returning?

You are right, the code in nes_dealloc_fmr is missing a check 
of the return code.  This will be updated in a patch to follow.

> [And same issue seems to be there for the change to nes_dereg_mr]

I believe that nes_dereg_mr is correctly checking return codes 
and does not need to be changed.  Please let me know if you 
still see a problem here.

Don Wood


From chien.tin.tung at intel.com  Mon Feb  2 15:15:21 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Mon, 2 Feb 2009 17:15:21 -0600
Subject: [ofa-general] [PATCH v3] RDMA/nes: Account for freed pbl after hw
	operation
Message-ID: <20090202231521.GA6220@ctung-MOBL>

From: Don Wood <donald.e.wood at intel.com>

Fix occurrences where the software pbl counts were changed
before the hardware was updated.  This bug allowed another thread
to overallocate the hardware resources.

Add proper pbl accounting in case nes_reg_mr failed.

Signed-off-by: Don Wood <donald.e.wood at intel.com>
---
V3 change:

In nes_dealloc_fmr(), check return code from nes_dealloc_mw before
pbl accounting.


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 4cfb4d9..b42b17a 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -551,6 +551,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 	struct nes_device *nesdev = nesvnic->nesdev;
 	struct nes_adapter *nesadapter = nesdev->nesadapter;
 	int i = 0;
+	int rc;
 
 	/* free the resources */
 	if (nesfmr->leaf_pbl_cnt == 0) {
@@ -572,7 +573,9 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 	nesmr->ibmw.rkey = ibfmr->rkey;
 	nesmr->ibmw.uobject = NULL;
 
-	if (nesfmr->nesmr.pbls_used != 0) {
+	rc = nes_dealloc_mw(&nesmr->ibmw);
+
+	if ((rc == 0) && (nesfmr->nesmr.pbls_used != 0)) {
 		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
 		if (nesfmr->nesmr.pbl_4k) {
 			nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used;
@@ -584,7 +587,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 	}
 
-	return nes_dealloc_mw(&nesmr->ibmw);
+	return rc;
 }
 
 
@@ -1993,7 +1996,16 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 			stag, ret, cqp_request->major_code, cqp_request->minor_code);
 	major_code = cqp_request->major_code;
 	nes_put_cqp_request(nesdev, cqp_request);
-
+	if ((!ret || major_code) && pbl_count != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (pbl_count > 1)
+			nesadapter->free_4kpbl += pbl_count+1;
+		else if (residual_page_count > 32)
+			nesadapter->free_4kpbl += pbl_count;
+		else
+			nesadapter->free_256pbl += pbl_count;
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
 	if (!ret)
 		return -ETIME;
 	else if (major_code)
@@ -2607,24 +2619,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 	cqp_request->waiting = 1;
 	cqp_wqe = &cqp_request->cqp_wqe;
 
-	spin_lock_irqsave(&nesadapter->pbl_lock, flags);
-	if (nesmr->pbls_used != 0) {
-		if (nesmr->pbl_4k) {
-			nesadapter->free_4kpbl += nesmr->pbls_used;
-			if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) {
-				printk(KERN_ERR PFX "free 4KB PBLs(%u) has exceeded the max(%u)\n",
-						nesadapter->free_4kpbl, nesadapter->max_4kpbl);
-			}
-		} else {
-			nesadapter->free_256pbl += nesmr->pbls_used;
-			if (nesadapter->free_256pbl > nesadapter->max_256pbl) {
-				printk(KERN_ERR PFX "free 256B PBLs(%u) has exceeded the max(%u)\n",
-						nesadapter->free_256pbl, nesadapter->max_256pbl);
-			}
-		}
-	}
-
-	spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 	nes_fill_init_cqp_wqe(cqp_wqe, nesdev);
 	set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX,
 			NES_CQP_DEALLOCATE_STAG | NES_CQP_STAG_VA_TO |
@@ -2642,11 +2636,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 			" CQP Major:Minor codes = 0x%04X:0x%04X\n",
 			ib_mr->rkey, ret, cqp_request->major_code, cqp_request->minor_code);
 
-	nes_free_resource(nesadapter, nesadapter->allocated_mrs,
-			(ib_mr->rkey & 0x0fffff00) >> 8);
-
-	kfree(nesmr);
-
 	major_code = cqp_request->major_code;
 	minor_code = cqp_request->minor_code;
 
@@ -2662,8 +2651,33 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 				" to destroy STag, ib_mr=%p, rkey = 0x%08X\n",
 				major_code, minor_code, ib_mr, ib_mr->rkey);
 		return -EIO;
-	} else
-		return 0;
+	}
+
+	if (nesmr->pbls_used != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (nesmr->pbl_4k) {
+			nesadapter->free_4kpbl += nesmr->pbls_used;
+			if (nesadapter->free_4kpbl > nesadapter->max_4kpbl)
+				printk(KERN_ERR PFX "free 4KB PBLs(%u) has "
+					"exceeded the max(%u)\n",
+					nesadapter->free_4kpbl,
+					nesadapter->max_4kpbl);
+		} else {
+			nesadapter->free_256pbl += nesmr->pbls_used;
+			if (nesadapter->free_256pbl > nesadapter->max_256pbl)
+				printk(KERN_ERR PFX "free 256B PBLs(%u) has "
+					"exceeded the max(%u)\n",
+					nesadapter->free_256pbl,
+					nesadapter->max_256pbl);
+		}
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
+	nes_free_resource(nesadapter, nesadapter->allocated_mrs,
+			(ib_mr->rkey & 0x0fffff00) >> 8);
+
+	kfree(nesmr);
+
+	return 0;
 }
 
 
-- 
1.5.3.3


From sean.hefty at intel.com  Mon Feb  2 15:19:31 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 15:19:31 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names
	missing	from osm_vendor_t ?
In-Reply-To: <20090202150658.0af72134.weiny2@llnl.gov>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>	<CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
	<20090202150658.0af72134.weiny2@llnl.gov>
Message-ID: <9632920386E943489C39D8637052F404@amr.corp.intel.com>

>I don't remember the exact details but at the time saquery was first written,
>ibmad/ibumad did not have all the functionality I needed and the OpenSM vendor
>library did.  That may no longer be the case and if not then I would support
>converting to using those other libraries.

libibumad does require the user to provide the address to the SA.  Providing a
libibumad helper function to fill out ib_mad_addr_t for the local SA seems
reasonable.  I guess we can look at what it would take to convert it in detail
to see if anything is still missing from the lower libraries.

- Sean


From weiny2 at llnl.gov  Mon Feb  2 18:54:25 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 2 Feb 2009 18:54:25 -0800
Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs for
 cleaner function interfaces
Message-ID: <20090202185425.729a80b3.weiny2@llnl.gov>

Beginning to clean up the libibmad interface.

Ira


>From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at llnl.gov>
Date: Mon, 2 Feb 2009 10:21:18 -0800
Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces


Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>
---
 libibmad/include/infiniband/mad.h |   38 ++++++++++++++++++------------------
 libibmad/src/fields.c             |   22 ++++++++++----------
 libibmad/src/resolve.c            |   10 ++++----
 3 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 9ff4a3e..f235ab0 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -203,7 +203,7 @@ typedef struct ib_field {
 	ib_mad_dump_fn *def_dump_fn;
 } ib_field_t;
 
-enum MAD_FIELDS {
+typedef enum MAD_FIELDS {
 	IB_NO_FIELD,
 
 	IB_GID_PREFIX_F,
@@ -525,7 +525,7 @@ enum MAD_FIELDS {
 	IB_GUID_GUID0_F,
 
 	IB_FIELD_LAST_		/* must be last */
-};
+} mad_field_t;
 
 /*
  * SA RMPP section
@@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
 #define MAD_DEF_RETRIES		3
 #define MAD_DEF_TIMEOUT_MS	1000
 
-enum {
+typedef enum {
 	IB_DEST_LID,
 	IB_DEST_DRPATH,
 	IB_DEST_GUID,
 	IB_DEST_DRSLID,
-};
+} mad_dest_t;
 
-enum {
+typedef enum {
 	IB_NODE_CA = 1,
 	IB_NODE_SWITCH,
 	IB_NODE_ROUTER,
 	NODE_RNIC,
 
 	IB_NODE_MAX = NODE_RNIC
-};
+} mad_node_type_t;
 
 /******************************************************************************/
 
@@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey)
 }
 
 /* fields.c */
-MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field,
+MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field);
+MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field,
 			      uint32_t val);
 /* field must be byte aligned */
-MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field,
+MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field);
+MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field,
 				uint64_t val);
-MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT int mad_print_field(int field, const char *name, void *val);
-MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val);
-MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val);
+MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val);
+MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val);
+MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val);
+MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val);
+MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val);
+MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val);
+MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val);
 
 /* mad.c */
 MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath,
@@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
 			       ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
-				     int dest_type, ib_portid_t * sm_id);
+				     mad_dest_t dest, ib_portid_t * sm_id);
 MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
 			       ibmad_gid_t * gid);
 
@@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 			ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      mad_dest_t dest, ib_portid_t * sm_id,
 			      const void *srcport);
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 			const void *srcport);
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index d5a1eb4..d435a2f 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f,
 	memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8);
 }
 
-uint32_t mad_get_field(void *buf, int base_offs, int field)
+uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field)
 {
 	return _get_field(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field(void *buf, int base_offs, int field, uint32_t val)
+void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val)
 {
 	_set_field(buf, base_offs, ib_mad_f + field, val);
 }
 
-uint64_t mad_get_field64(void *buf, int base_offs, int field)
+uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field)
 {
 	return _get_field64(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field64(void *buf, int base_offs, int field, uint64_t val)
+void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val)
 {
 	_set_field64(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_set_array(void *buf, int base_offs, int field, void *val)
+void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val)
 {
 	_set_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_get_array(void *buf, int base_offs, int field, void *val)
+void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val)
 {
 	_get_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_decode_field(uint8_t * buf, int field, void *val)
+void mad_decode_field(uint8_t * buf, mad_field_t field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val)
 	_get_array(buf, 0, f, val);
 }
 
-void mad_encode_field(uint8_t * buf, int field, void *val)
+void mad_encode_field(uint8_t * buf, mad_field_t field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val,
 			 valsz ? valsz : ALIGN(f->bitlen, 8) / 8);
 }
 
-int mad_print_field(int field, const char *name, void *val)
+int mad_print_field(mad_field_t field, const char *name, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return -1;
 	return _mad_print_field(ib_mad_f + field, name, val, 0);
 }
 
-char *mad_dump_field(int field, char *buf, int bufsz, void *val)
+char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
 	return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val);
 }
 
-char *mad_dump_val(int field, char *buf, int bufsz, void *val)
+char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index b62360b..faac1f9 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 }
 
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      mad_dest_t dest, ib_portid_t * sm_id,
 			      const void *srcport)
 {
 	uint64_t guid;
@@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 	ib_portid_t selfportid = { 0 };
 	int selfport = 0;
 
-	switch (dest_type) {
+	switch (dest) {
 	case IB_DEST_LID:
 		lid = strtol(addr_str, 0, 0);
 		if (!IB_LID_VALID(lid))
@@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 		return 0;
 
 	default:
-		IBWARN("bad dest_type %d", dest_type);
+		IBWARN("bad dest %d", dest);
 	}
 
 	return -1;
 }
 
-int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
+int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest,
 			  ib_portid_t * sm_id)
 {
-	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
+	return ib_resolve_portid_str_via(portid, addr_str, dest,
 					 sm_id, NULL);
 }
 
-- 
1.5.4.5


From krkumar2 at in.ibm.com  Mon Feb  2 19:25:14 2009
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Tue, 3 Feb 2009 08:55:14 +0530
Subject: [ofa-general] Support for CXGB3 RNIC on P6
Message-ID: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>


Hi,

My colleague (at a different site) is trying to get a couple of Chelsio RNIC
adapters working on p6 systems, but for some reason the cards aren't
recognized on bootup. The same cards work on my xseries systems, and the
following are the messages I get (there are no messages on his p6 systems):

Feb  1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version 1.1.1-ko
Feb  1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Feb  1 11:42:49 localhost kernel: input: Power Button (FF) as /class/input/input1
Feb  1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF]
Feb  1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: Port 0 using 4 queue sets.
Feb  1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4) PCI Express x8 MSI-X
Feb  1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX, S/N: PT49070050

Is this revision of cxgb3 (rev4) not supported on p6? Or are we missing
something to get it to work?

thanks,

- KK


From sean.hefty at intel.com  Mon Feb  2 21:29:16 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 21:29:16 -0800
Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs
	for cleaner function interfaces
In-Reply-To: <20090202185425.729a80b3.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
Message-ID: <475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com>

>@@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
> #define MAD_DEF_RETRIES                3
> #define MAD_DEF_TIMEOUT_MS     1000
>
>-enum {
>+typedef enum {
>        IB_DEST_LID,
>        IB_DEST_DRPATH,
>        IB_DEST_GUID,
>        IB_DEST_DRSLID,
>-};
>+} mad_dest_t;
>
>-enum {
>+typedef enum {
>        IB_NODE_CA = 1,
>        IB_NODE_SWITCH,
>        IB_NODE_ROUTER,
>        NODE_RNIC,
>
>        IB_NODE_MAX = NODE_RNIC
>-};
>+} mad_node_type_t;

For consistency, should these be named enums?  (MAD_DEST and MAD_NODE_TYPE)
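
To make the question concrete, here is a minimal sketch of the two typedefs
with enum tags added, following the `typedef enum MAD_FIELDS { ... }
mad_field_t;` form used earlier in the same patch. The tags MAD_DEST and
MAD_NODE_TYPE are only the names suggested above, not part of the posted
patch, and mad_dest_str() is a hypothetical helper that exists solely to show
the typedef in a function interface.

```c
#include <stdio.h>

/* Tagged variants of the enums from the patch. The tags MAD_DEST and
 * MAD_NODE_TYPE are the names proposed in this reply (an assumption,
 * not merged code). */
typedef enum MAD_DEST {
	IB_DEST_LID,
	IB_DEST_DRPATH,
	IB_DEST_GUID,
	IB_DEST_DRSLID,
} mad_dest_t;

typedef enum MAD_NODE_TYPE {
	IB_NODE_CA = 1,
	IB_NODE_SWITCH,
	IB_NODE_ROUTER,
	NODE_RNIC,

	IB_NODE_MAX = NODE_RNIC
} mad_node_type_t;

/* Hypothetical helper, for illustration only: maps a destination type
 * to a short name. */
static const char *mad_dest_str(mad_dest_t dest)
{
	switch (dest) {
	case IB_DEST_LID:
		return "lid";
	case IB_DEST_DRPATH:
		return "drpath";
	case IB_DEST_GUID:
		return "guid";
	case IB_DEST_DRSLID:
		return "drslid";
	}
	return "unknown";
}
```

With a tag, the type can also be referred to as `enum MAD_DEST`, which is the
only practical difference from the anonymous form in the patch.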

- Sean


From devesh28 at gmail.com  Mon Feb  2 23:49:26 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 13:19:26 +0530
Subject: ***SPAM*** Re: ***SPAM*** [ofa-general] compiling OFED-1.2 with
	RHEL5.1
In-Reply-To: <309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com>
References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com>
	<4958CB6A.3090306@mellanox.co.il>
	<309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com>
Message-ID: <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com>

Hello list,

I have successfully ported OFED-1.2 to RHEL5.1. Should I post the patch?

On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma <devesh28 at gmail.com> wrote:

> hello Tziporet, thanks for replying, I will try to do this, how many
> changes do you think I will have to made, are they many?
> If there are some problems I will contact to you for further help
>
> -Devesh
>
> On Mon, Dec 29, 2008 at 6:36 PM, Tziporet Koren <
> tziporet at dev.mellanox.co.il> wrote:
>
>>  Devesh Sharma wrote:
>>
>>> Hello all,
>>>  I am trying to compile OFED-1.2 with RHEL5.1 I know that this OS is not
>>> supported by this
>>> distribution, is there any work around other than switing to OFED-1.2.5
>>> or OFED-1.3?
>>>
>>>
>> I don't think there is a workaround
>> You can try to take RHEL 5.1 backports from 1.2.5 and use them on 1.2 but
>> I guess you will have to change them
>>
>> Tziporet
>>
>>
>

From tziporet at dev.mellanox.co.il  Tue Feb  3 00:22:37 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 03 Feb 2009 10:22:37 +0200
Subject: ***SPAM*** [ofa-general] compiling OFED-1.2 with RHEL5.1
In-Reply-To: <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com>
References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com>	
	<4958CB6A.3090306@mellanox.co.il>	
	<309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com>
	<309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com>
Message-ID: <4987FECD.6000409@mellanox.co.il>

Devesh Sharma wrote:
> Hello list,
>
> I have successfully ported ofed-1.2 for RHEL5.1. should I post the patch?
>
> On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma <devesh28 at gmail.com 
> <mailto:devesh28 at gmail.com>> wrote:
>
>     hello Tziporet, thanks for replying, I will try to do this, how
>     many changes do you think I will have to made, are they many?
>     If there are some problems I will contact to you for further help
>

Why not - maybe someone will make use of it too

Tziporet


From devesh28 at gmail.com  Tue Feb  3 00:35:36 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 14:05:36 +0530
Subject: ***SPAM*** Re: ***SPAM*** [ofa-general] compiling OFED-1.2 with
	RHEL5.1
In-Reply-To: <4987FECD.6000409@mellanox.co.il>
References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com>
	<4958CB6A.3090306@mellanox.co.il>
	<309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com>
	<309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com>
	<4987FECD.6000409@mellanox.co.il>
Message-ID: <309a667c0902030035i1873124au367a05b35fc8eed9@mail.gmail.com>

I am in the process of developing a script to add the backport kernel_addons
taken from OFED-1.3 to OFED-1.2. Once that is complete, I will post the patch
and script to the list...:)

On Tue, Feb 3, 2009 at 1:52 PM, Tziporet Koren
<tziporet at dev.mellanox.co.il>wrote:

> Devesh Sharma wrote:
>
>> Hello list,
>>
>> I have successfully ported ofed-1.2 for RHEL5.1. should I post the patch?
>>
>> On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma <devesh28 at gmail.com<mailto:
>> devesh28 at gmail.com>> wrote:
>>
>>    hello Tziporet, thanks for replying, I will try to do this, how
>>    many changes do you think I will have to made, are they many?
>>    If there are some problems I will contact to you for further help
>>
>>
> Why not - maybe someone will make use of it too
>
> Tziporet
>

From ogerlitz at voltaire.com  Tue Feb  3 00:36:21 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Feb 2009 10:36:21 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h:
 Definition of Congestion Control MADs
In-Reply-To: <4868B928.4070307@dev.mellanox.co.il>
References: <4868B928.4070307@dev.mellanox.co.il>
Message-ID: <49880205.7070605@voltaire.com>

Yevgeny Kliteynik wrote:
> Adding definition of all the Congestion Control (CC) MADs to ib_types.h.
> V2 - fixed comment typo
>   
Hi Yevgeny, Sasha

I wonder where this patch stands, any reason not to merge it?

Or.


From kliteyn at dev.mellanox.co.il  Tue Feb  3 01:09:54 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 03 Feb 2009 11:09:54 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h:
 Definition of Congestion Control MADs
In-Reply-To: <49880205.7070605@voltaire.com>
References: <4868B928.4070307@dev.mellanox.co.il>
	<49880205.7070605@voltaire.com>
Message-ID: <498809E2.1050306@dev.mellanox.co.il>

Hi Or,

Or Gerlitz wrote:
> Yevgeny Kliteynik wrote:
>> Adding definition of all the Congestion Control (CC) MADs to ib_types.h.
>> V2 - fixed comment typo
>>   
> Hi Yevgeny, Sasha
> 
> I wonder where this patch stands, any reason not to merge it?

The updated CC Annex, which will contain many packet
format changes, hasn't been published yet.

-- Yevgeny


> Or.
> 
> 
> 


From ogerlitz at voltaire.com  Tue Feb  3 01:21:23 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 3 Feb 2009 11:21:23 +0200 (IST)
Subject: [ofa-general] impossibility to bind a device/port with the rdma-cm
 when the port is down
Message-ID: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>

Hi Sean,

It turns out that with the IPOIB port-space, it's impossible to bind a device/port
through the rdma-cm when the port is down. This is due to the following call sequence:
cma_acquire_dev --> cma_set_qkey/ps=IPOIB --> ib_sa_get_mcmember_rec, where the
latter returns EADDRNOTAVAIL because the core multicast code flushed its database
when the port went down. I see that the qkey is actually used by the rdma-cm when
the user attempts to join a multicast group, when "connecting" a UD QP, and when
creating a UD QP. I assume there must be a way to defer this resolution to a later
stage such that binding would be possible when the port is down. Thoughts?
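
As a rough illustration of the deferral idea, here is a small user-space
model, not kernel code. Every name in it (cm_id_model, lookup_qkey,
model_bind, model_get_qkey) and the qkey value are hypothetical; the only
thing taken from the description above is that the lookup fails with
EADDRNOTAVAIL while the port is down.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an rdma-cm id; not the kernel's own structure. */
struct cm_id_model {
	bool bound;
	bool qkey_valid;
	uint32_t qkey;
};

/* Stand-in for the SA multicast lookup: fails while the port is down. */
static int lookup_qkey(bool port_up, uint32_t *qkey)
{
	if (!port_up)
		return -EADDRNOTAVAIL;
	*qkey = 0x1234;	/* placeholder value, for illustration only */
	return 0;
}

/* Bind does not touch the qkey, so it succeeds with the port down. */
static int model_bind(struct cm_id_model *id)
{
	id->bound = true;
	return 0;
}

/* Called only from the paths that need the qkey (multicast join,
 * UD connect, UD QP creation); resolves it on first use. */
static int model_get_qkey(struct cm_id_model *id, bool port_up, uint32_t *qkey)
{
	int ret;

	if (!id->qkey_valid) {
		ret = lookup_qkey(port_up, &id->qkey);
		if (ret)
			return ret;
		id->qkey_valid = true;
	}
	*qkey = id->qkey;
	return 0;
}
```

In this model the error simply moves from bind time to the first operation
that needs the qkey. Whether a cached qkey should be invalidated again when
the port goes down is left open here.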

Or.

$ udaddy -b 10.10.5.157  -p 2
udaddy: starting server
udaddy: bind address failed: -1
test complete
return status -1

$ strace udaddy -b 10.10.5.157  -p 2
[...]
write(5, "\2\0\0\0(\0\0\0\0\0\0\0\0\0\0\0\2\0\6\34\n\n\5\235\0\0\0\0\0\0\0\0"..., 48) = -1 EADDRNOTAVAIL (Cannot assign requested address)
write(1, "udaddy: bind address failed: -1\n", 32udaddy: bind address failed: -1) = 32


From dorfman.eli at gmail.com  Tue Feb  3 01:28:45 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 11:28:45 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <20090202205924.GF5910@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
Message-ID: <49880E4D.2090107@gmail.com>

Sasha Khapyorsky wrote:
> On 16:32 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>>  rescan subnet configuration after SIGHUP
>>  call osm_subn_rescan_conf_files() after SIGHUP.
>>  this is important when priority is changed and SM is in standby.
>>  in that case it will not send capability mask trap and will not become master.
>>
>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>> ---
>>  opensm/opensm/main.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>> index f786192..0f7b822 100644
>> --- a/opensm/opensm/main.c
>> +++ b/opensm/opensm/main.c
>> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>>  			osm_hup_flag = 0;
>>  			/* a HUP signal should only start a new heavy sweep */
>>  			p_osm->subn.force_heavy_sweep = TRUE;
>> +			osm_subn_rescan_conf_files(&p_osm->subn);
> 
> Is it synchronized with sweep? If regular (scheduled by timer) sweep
> starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters
> are freed..., etc.). I think it is not.
> 
I assume it is not.
What about the following (though it uses yet another flag...)?

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 8863e47..88c977d 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -169,6 +169,7 @@ typedef struct osm_subn_opt {
 	uint32_t polling_retry_number;
 	uint32_t max_msg_fifo_timeout;
 	boolean_t force_heavy_sweep;
+	boolean_t rescan_conf_file;
 	uint8_t log_flags;
 	char *dump_files_dir;
 	char *log_file;
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index de38056..f2d7846 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
 			osm_hup_flag = 0;
 			/* a HUP signal should only start a new heavy sweep */
 			p_osm->subn.force_heavy_sweep = TRUE;
-			osm_subn_rescan_conf_files(&p_osm->subn);
+			p_osm->subn.rescan_conf_file  = TRUE;
 			osm_opensm_sweep(p_osm);
 		}
 	}
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index fc7ceb9..87a5db9 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm)
 	ib_api_status_t status;
 	osm_remote_sm_t *p_remote_sm;
 
+	if (sm->p_subn->rescan_conf_file) {
+		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
+			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
+				"osm_subn_rescan_conf_file failed\n");
+		sm->p_subn->rescan_conf_file = FALSE;
+	}
+
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
 	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
 		return;
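
For readers skimming the diff, the pattern it follows can be modelled in a
few lines of standalone C. This is only a sketch under the assumption that
all sweeps run on a single thread; subn_model, handle_hup and do_sweep_model
are hypothetical names, and the counter stands in for the real call to
osm_subn_rescan_conf_files().

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the subnet object; only the flags matter. */
struct subn_model {
	bool force_heavy_sweep;
	bool rescan_conf_file;
	int conf_generation;	/* counts completed rescans */
};

static volatile sig_atomic_t hup_flag;

/* Signal handler: only records that SIGHUP arrived. */
static void on_sighup(int signo)
{
	(void)signo;
	hup_flag = 1;
}

/* Main-loop side: turn the signal into requests, never rescan here. */
static void handle_hup(struct subn_model *subn)
{
	if (!hup_flag)
		return;
	hup_flag = 0;
	subn->force_heavy_sweep = true;
	subn->rescan_conf_file = true;
}

/* Sweep side: the rescan happens at the top of the sweep, so under the
 * single-thread assumption it cannot run in the middle of one. */
static void do_sweep_model(struct subn_model *subn)
{
	if (subn->rescan_conf_file) {
		subn->conf_generation++;	/* stands in for the real rescan */
		subn->rescan_conf_file = false;
	}
	/* ... the sweep itself would follow here ... */
}
```

The extra flag costs one field, but it keeps the config reload and the sweep
on the same thread instead of adding a lock around the QoS data.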


From ogerlitz at Voltaire.com  Tue Feb  3 01:43:31 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Tue, 03 Feb 2009 11:43:31 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h:
 Definition of Congestion Control MADs
In-Reply-To: <498809E2.1050306@dev.mellanox.co.il>
References: <4868B928.4070307@dev.mellanox.co.il>
	<49880205.7070605@voltaire.com>
	<498809E2.1050306@dev.mellanox.co.il>
Message-ID: <498811C3.1020005@Voltaire.com>

Yevgeny Kliteynik wrote:
> The updated CC Annex that will contain many packets
> format changes hasn't been published yet.

OK, got it.

Or.


From o.w.saastad at usit.uio.no  Tue Feb  3 01:44:02 2009
From: o.w.saastad at usit.uio.no (Ole Widar Saastad)
Date: Tue, 03 Feb 2009 10:44:02 +0100
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
Message-ID: <1233654242.1364.39.camel@pyren.uio.no>


I have problems using the OFED 1.4 software on the Sun x4600 nodes.
Need help to get this to work. We plan to run GPFS over IB on these
nodes in addition to MPI.

Sun 4600 nodes with 8 quad core cpus,
Quad-Core AMD Opteron(tm) Processor 8380

OS is Rocks release 4.
centos-release-4-4.2/x86_64/

Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8
11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


Needless to say, our 300+ nodes (Sun x2200 with quad core) run fine with
OFED 1.4 (and 1.3). They have almost the same kernel:
Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20
EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
The only difference is ELsmp instead of ELlargesmp.

More information:

dmesg prints the following error messages:

Losing some ticks... checking if CPU frequency changed.
modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp 0000007fbfffcfd8 error 6
mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:02:00.0
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:02:00.0 to 64
mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1, reducing to 1.
MSI INIT SUCCESS
mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:02:00.0 failed with error -5

The following software is installed:

Select Option [1-5]:3
kernel-ib
libibverbs
libibverbs-devel
libibverbs-utils
libmthca
libmlx4
libcxgb3
libnes
libipathverbs
libibcommon
libibcommon-devel
libibumad
libibumad-devel
ofed-docs
ofed-scripts
ibvexdmtools
qlgc_vnic_daemon


Just to be sure the card is present, lspci returns:
02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)


-- 
Ole W. Saastad, dr. scient.
Scientific Computing Group, USIT, University of Oslo
http://hpc.uio.no


From vlad at lists.openfabrics.org  Tue Feb  3 03:18:10 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue,  3 Feb 2009 03:18:10 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090203-0200 daily build status
Message-ID: <20090203111810.EC436E6114B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Tue Feb  3 04:24:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 14:24:50 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <497DC9FC.2050907@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
Message-ID: <20090203122450.GB11874@sashak.voltaire.com>

On 16:34 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>  support subnet configuration rescan and update
>   subnet configuration parameters are rescanned every heavy sweep.
>   every parameter is parsed by parse function according to its type.
>   some params require special post update function to setup them.
>   every parameter has also a flag that specifies whether it
>   can be updated during runtime.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

I'm applying this with several changes:

- disable the update option and setup function for all string parameters.
  As I commented originally, opts_parse_charp() will leak memory, and this
  cannot be ignored if the config file is rescanned. The exception is the QoS
  string parameters, where the memory leak is handled.
- small fixes I mentioned in the original review.
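
For context, the leak arises when a string option is overwritten with a fresh
copy on every rescan and the previous copy is never freed. A minimal sketch
of a leak-free update follows; opt_set_str() is a hypothetical helper, not an
existing OpenSM function, and it assumes every stored value was obtained from
malloc(), so a default that points at a string literal must not be passed
through it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: replace a heap-allocated string option without
 * leaking the previous value. Returns 0 on success and -1 on allocation
 * failure, in which case the old value is kept. */
static int opt_set_str(char **p_opt, const char *new_val)
{
	size_t len;
	char *copy;

	if (*p_opt && strcmp(*p_opt, new_val) == 0)
		return 0;	/* unchanged: nothing to allocate or free */

	len = strlen(new_val) + 1;
	copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, new_val, len);

	free(*p_opt);	/* free(NULL) is a no-op */
	*p_opt = copy;
	return 0;
}
```

Freeing is only safe if no other part of the program still holds the old
pointer, which is a second condition any runtime update of string options
would have to meet.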

Sasha


From sashak at voltaire.com  Tue Feb  3 04:32:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 14:32:49 +0200
Subject: [ofa-general] [PATCH 1/4] opensm/osm_opensm.[ch] make setup
	and destroy routing engines fucntions global
In-Reply-To: <497DC937.7020102@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC937.7020102@gmail.com>
Message-ID: <20090203123249.GC11874@sashak.voltaire.com>

On 16:31 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>  make setup and destroy routing engines fucntions global.
>  change setup_routing_engines() and destroy_routing_engines()
>  declaration

Below is a comment about this patch.

I'm not applying this yet and will comment separately about its usage.

> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/include/opensm/osm_opensm.h |   53 ++++++++++++++++++++++++++++++++++++
>  opensm/opensm/osm_opensm.c         |    5 ++-
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h
> index c121be4..5b0a1dd 100644
> --- a/opensm/include/opensm/osm_opensm.h
> +++ b/opensm/include/opensm/osm_opensm.h
> @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm,
>  * SEE ALSO
>  *********/
>  
> +/****f* OpenSM: OpenSM/setup_routing_engines
> +* NAME
> +*	setup_routing_engines
> +*
> +* DESCRIPTION
> +*	This function constructs an routing engines.
> +*
> +* SYNOPSIS
> +*/
> +void setup_routing_engines(osm_opensm_t *osm, const char *name);
> +/*
> +* PARAMETERS
> +*	p_osm
> +*		[in] Pointer to a OpenSM object to construct.
> +*
> +*	name
> +*		[in] Routing engine names.
> +*
> +* RETURN VALUE
> +*	This function does not return a value.
> +*
> +* NOTES
> +*	Setup of routing engines
> +*
> +* SEE ALSO
> +*	destroy_routing_engines
> +*********/
> +
> +/****f* OpenSM: OpenSM/destroy_routing_engines
> +* NAME
> +*	destroy_routing_engines
> +*
> +* DESCRIPTION
> +*	This function destroys the routing engines.
> +*
> +* SYNOPSIS
> +*/
> +void destroy_routing_engines(osm_opensm_t *osm);

For public names we use the 'osm_' prefix in OpenSM.

Sasha

> +/*
> +* PARAMETERS
> +*	p_osm
> +*		[in] Pointer to a OpenSM object to construct.
> +*
> +* RETURN VALUE
> +*	This function does not return a value.
> +*
> +* NOTES
> +*	Setup of routing engines
> +*
> +* SEE ALSO
> +*	setup_routing_engines
> +*********/
> +
>  /****f* OpenSM: OpenSM/osm_routing_engine_type_str
>  * NAME
>  *	osm_routing_engine_type_str
> diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c
> index 7de2e5b..8ecb942 100644
> --- a/opensm/opensm/osm_opensm.c
> +++ b/opensm/opensm/osm_opensm.c
> @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name)
>  		"cannot find or setup routing engine \'%s\'", name);
>  }
>  
> -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names)
> +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names)
>  {
>  	char *name, *str, *p;
>  
> @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm)
>  
>  /**********************************************************************
>   **********************************************************************/
> -static void destroy_routing_engines(osm_opensm_t *osm)
> +void destroy_routing_engines(osm_opensm_t *osm)
>  {
>  	struct osm_routing_engine *r, *next;
>  
> @@ -236,6 +236,7 @@ static void destroy_routing_engines(osm_opensm_t *osm)
>  			r->delete(r->context);
>  		free(r);
>  	}
> +	osm->routing_engine_list = NULL;
>  }
>  
>  /**********************************************************************
> -- 
> 1.5.5
> 


From sashak at voltaire.com  Tue Feb  3 04:37:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 14:37:06 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <497DC9FC.2050907@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
Message-ID: <20090203123706.GD11874@sashak.voltaire.com>

On 16:34 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:

[snip...]
> +
> +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val)
> +{
> +	char *routing_engine_names = (char *) p_val;
> +
> +	destroy_routing_engines(p_subn->p_osm);
> +	setup_routing_engines(p_subn->p_osm, routing_engine_names);
> +}

This can probably work with updn and minhop, but it will certainly be
destructive when the LASH routing engine is used. LASH stores internal data
between sweep cycles, which is used to return the correct SL value in SA
PathRecord queries. So I think the routing engine "switch" should be a bit
smarter.

Sasha


From sashak at voltaire.com  Tue Feb  3 04:44:07 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 14:44:07 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <49880E4D.2090107@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
Message-ID: <20090203124407.GE11874@sashak.voltaire.com>

On 11:28 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
> Sasha Khapyorsky wrote:
> > On 16:32 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
> >>  rescan subnet configuration after SIGHUP
> >>  call osm_subn_rescan_conf_files() after SIGHUP.
> >>  this is important when priority is changed and SM is in standby.
> >>  in that case it will not send capability mask trap and will not become master.
> >>
> >> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> >> ---
> >>  opensm/opensm/main.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> >> index f786192..0f7b822 100644
> >> --- a/opensm/opensm/main.c
> >> +++ b/opensm/opensm/main.c
> >> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
> >>  			osm_hup_flag = 0;
> >>  			/* a HUP signal should only start a new heavy sweep */
> >>  			p_osm->subn.force_heavy_sweep = TRUE;
> >> +			osm_subn_rescan_conf_files(&p_osm->subn);
> > 
> > Is it synchronized with sweep? If regular (scheduled by timer) sweep
> > starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters
> > are freed..., etc.). I think it is not.
> > 
> i assume it is not.
> what about the following (though it uses yet another flag...)
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index 8863e47..88c977d 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -169,6 +169,7 @@ typedef struct osm_subn_opt {
>  	uint32_t polling_retry_number;
>  	uint32_t max_msg_fifo_timeout;
>  	boolean_t force_heavy_sweep;
> +	boolean_t rescan_conf_file;
>  	uint8_t log_flags;
>  	char *dump_files_dir;
>  	char *log_file;
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index de38056..f2d7846 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>  			osm_hup_flag = 0;
>  			/* a HUP signal should only start a new heavy sweep */
>  			p_osm->subn.force_heavy_sweep = TRUE;
> -			osm_subn_rescan_conf_files(&p_osm->subn);
> +			p_osm->subn.rescan_conf_file  = TRUE;
>  			osm_opensm_sweep(p_osm);
>  		}
>  	}
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index fc7ceb9..87a5db9 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm)
>  	ib_api_status_t status;
>  	osm_remote_sm_t *p_remote_sm;
>  
> +	if (sm->p_subn->rescan_conf_file) {
> +		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
> +			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> +				"osm_subn_rescan_conf_file failed\n");
> +		sm->p_subn->rescan_conf_file = FALSE;
> +	}
> +

What would be wrong with using the existing 'force_heavy_sweep' flag?

Another issue with this patch - the config file will be rescanned again
later (during the heavy sweep). It would be really nice to avoid such
obviously unneeded double parsing.

Sasha

>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
>  	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
>  		return;


From dorfman.eli at gmail.com  Tue Feb  3 05:40:50 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 15:40:50 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <20090203124407.GE11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
Message-ID: <49884962.5070601@gmail.com>

Sasha Khapyorsky wrote:
> On 11:28 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
>> Sasha Khapyorsky wrote:
>>> On 16:32 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
>>>>  rescan subnet configuration after SIGHUP
>>>>  call osm_subn_rescan_conf_files() after SIGHUP.
>>>>  this is important when priority is changed and SM is in standby.
>>>>  in that case it will not send capability mask trap and will not become master.
>>>>
>>>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>>>> ---
>>>>  opensm/opensm/main.c |    1 +
>>>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>>>> index f786192..0f7b822 100644
>>>> --- a/opensm/opensm/main.c
>>>> +++ b/opensm/opensm/main.c
>>>> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>>>>  			osm_hup_flag = 0;
>>>>  			/* a HUP signal should only start a new heavy sweep */
>>>>  			p_osm->subn.force_heavy_sweep = TRUE;
>>>> +			osm_subn_rescan_conf_files(&p_osm->subn);
>>> Is it synchronized with sweep? If regular (scheduled by timer) sweep
>>> starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters
>>> are freed..., etc.). I think it is not.
>>>
>> i assume it is not.
>> what about the following (though it uses yet another flag...)
>>
>> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
>> index 8863e47..88c977d 100644
>> --- a/opensm/include/opensm/osm_subnet.h
>> +++ b/opensm/include/opensm/osm_subnet.h
>> @@ -169,6 +169,7 @@ typedef struct osm_subn_opt {
>>  	uint32_t polling_retry_number;
>>  	uint32_t max_msg_fifo_timeout;
>>  	boolean_t force_heavy_sweep;
>> +	boolean_t rescan_conf_file;
>>  	uint8_t log_flags;
>>  	char *dump_files_dir;
>>  	char *log_file;
>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>> index de38056..f2d7846 100644
>> --- a/opensm/opensm/main.c
>> +++ b/opensm/opensm/main.c
>> @@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>>  			osm_hup_flag = 0;
>>  			/* a HUP signal should only start a new heavy sweep */
>>  			p_osm->subn.force_heavy_sweep = TRUE;
>> -			osm_subn_rescan_conf_files(&p_osm->subn);
>> +			p_osm->subn.rescan_conf_file  = TRUE;
>>  			osm_opensm_sweep(p_osm);
>>  		}
>>  	}
>> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
>> index fc7ceb9..87a5db9 100644
>> --- a/opensm/opensm/osm_state_mgr.c
>> +++ b/opensm/opensm/osm_state_mgr.c
>> @@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm)
>>  	ib_api_status_t status;
>>  	osm_remote_sm_t *p_remote_sm;
>>  
>> +	if (sm->p_subn->rescan_conf_file) {
>> +		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
>> +			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
>> +				"osm_subn_rescan_conf_file failed\n");
>> +		sm->p_subn->rescan_conf_file = FALSE;
>> +	}
>> +
> 
> What would be wrong with using existing 'force_heavy_sweep' flag?
> 
The 'force_heavy_sweep' flag is set on other occasions as well.

> Another issue with this patch - config file will be rescanned later
> again (during heavy sweep). It would be really nice to avoid such
> obviously unneeded double parsing.
>
That is correct, but we need a special flag to handle the priority change when
the SM is in standby.
In that case a rescan at the beginning of do_sweep() is a must; otherwise
do_sweep() will simply return without doing anything.
What was the reason for not putting the rescan at the beginning of do_sweep()?
If there is none, then we can simply rescan as the first step.
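
The ordering argument can be sketched with a small standalone model. This is
not OpenSM code: struct subn_model, rescan_conf() and sweep_model() are
hypothetical stand-ins for the subnet object, osm_subn_rescan_conf_files()
and do_sweep(), and the counters exist only to make the behaviour visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sm_state { SM_STANDBY, SM_DISCOVERING, SM_MASTER };

/* Hypothetical model of the fields discussed in this thread. */
struct subn_model {
	enum sm_state state;
	bool force_heavy_sweep;
	int rescan_count;	/* times the config was re-read */
	int sweep_count;	/* times the sweep body actually ran */
};

/* Stand-in for the config rescan; only counts calls. */
int rescan_conf(struct subn_model *s)
{
	s->rescan_count++;
	return 0;
}

void sweep_model(struct subn_model *s)
{
	assert(s != NULL);

	/* Rescan first, so a standby SM still picks up a new priority. */
	if (s->force_heavy_sweep)
		rescan_conf(s);

	/* A standby SM returns here; a rescan placed below never runs. */
	if (s->state != SM_MASTER && s->state != SM_DISCOVERING)
		return;

	s->sweep_count++;
	s->force_heavy_sweep = false;
}
```

In this model a standby SM with the flag set rescans once without sweeping,
while a master rescans once per forced heavy sweep, so there is no second
parse later in the same sweep.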

Eli

> Sasha
> 
>>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
>>  	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
>>  		return;


From dorfman.eli at gmail.com  Tue Feb  3 05:43:21 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 15:43:21 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <20090203123706.GD11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
	<20090203123706.GD11874@sashak.voltaire.com>
Message-ID: <498849F9.1030700@gmail.com>

Sasha Khapyorsky wrote:
> On 16:34 Mon 26 Jan     , Eli Dorfman (Voltaire) wrote:
> 
> [snip...]
>> +
>> +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val)
>> +{
>> +	char *routing_engine_names = (char *) p_val;
>> +
>> +	destroy_routing_engines(p_subn->p_osm);
>> +	setup_routing_engines(p_subn->p_osm, routing_engine_names);
>> +}
> 
> This probably can work with updn and minhops, but it certainly will be
> destructive when LASH routing engine is used. LASH stores internal data
> between sweep cycles, it is used to answer correct SL value in SA
> PathRecord queries. So I think routing engine "switch" should  be a bit
> smarter.
> 

that means that destroy and setup routing engine functions should be improved.
what do you suggest in the meantime? limit this to minhop/updn?

Eli


From sashak at voltaire.com  Tue Feb  3 05:42:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:42:04 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_perfmgr_db.h: Remove
	unused typedef
In-Reply-To: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>
References: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090203134204.GG11874@sashak.voltaire.com>

On 12:06 Mon 02 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Trivial patch to remove an unused typedef in perfmgr.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Feb  3 05:42:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:42:23 +0200
Subject: [ofa-general] Re: [PATCH][MINOR] opensm/osm_perfmgr.c: Eliminate
	memory leak on error
In-Reply-To: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>
References: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090203134223.GH11874@sashak.voltaire.com>

On 12:07 Mon 02 Feb     , Hal Rosenstock wrote:
> 
> Minor patch to osm_perfmgr.c to eliminate a memory leak on error in
> osm_perfmgr_init.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue Feb  3 05:48:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:48:31 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <49884962.5070601@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
Message-ID: <20090203134831.GI11874@sashak.voltaire.com>

On 15:40 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
> >> --- a/opensm/opensm/osm_state_mgr.c
> >> +++ b/opensm/opensm/osm_state_mgr.c
> >> @@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm)
> >>  	ib_api_status_t status;
> >>  	osm_remote_sm_t *p_remote_sm;
> >>  
> >> +	if (sm->p_subn->rescan_conf_file) {
> >> +		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
> >> +			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> >> +				"osm_subn_rescan_conf_file failed\n");
> >> +		sm->p_subn->rescan_conf_file = FALSE;
> >> +	}
> >> +
> > 
> > What would be wrong with using existing 'force_heavy_sweep' flag?
> > 
> 'force_heavy_sweep' flag is set in other occasions as well

Yes. And file is rescanned on heavy sweep (later) anyway :)

> 
> > Another issue with this patch - config file will be rescanned later
> > again (during heavy sweep). It would be really nice to avoid such
> > obviously unneeded double parsing.
> >
> that is correct, but we need a special flag to handle the priority change when SM
> is in standby.
> In that case a rescan at the beginning of do_sweep is a must, otherwise it will 
> simply return without doing anything.
> what was the reason of putting rescan not in the beginning of do_sweep().

I don't remember many details, but originally it was used for updating
only selected parameters. Also (I guess) for eliminating rescanning on
light sweeps.

> If none then we can simply rescan as first step.

After a quick thought, it could be an option; just verify that it does
not break existing functionality.

Sasha


From sashak at voltaire.com  Tue Feb  3 05:53:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:53:50 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <498849F9.1030700@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
	<20090203123706.GD11874@sashak.voltaire.com>
	<498849F9.1030700@gmail.com>
Message-ID: <20090203135350.GJ11874@sashak.voltaire.com>

On 15:43 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
> > 
> > This probably can work with updn and minhops, but it certainly will be
> > destructive when LASH routing engine is used. LASH stores internal data
> > between sweep cycles, it is used to answer correct SL value in SA
> > PathRecord queries. So I think routing engine "switch" should  be a bit
> > smarter.
> > 
> 
> that means that destroy and setup routing engine functions should be improved.
> what do you suggest in the meantime?

I meant that instead of the destroy/setup pair we need an update function
which will carefully compare the current routing engine list against the
requested one and in any case will not destroy a routing engine
which is in use.
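
A minimal sketch of such an update function, under stated assumptions:
struct engine, its in_use flag, engine_new(), engine_find() and
update_engines() are all hypothetical and do not mirror the real
osm_routing_engine list or its callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a routing engine list entry. */
struct engine {
	char name[16];
	bool in_use;		/* currently used for routing */
	struct engine *next;
};

struct engine *engine_new(const char *name, bool in_use)
{
	struct engine *e = calloc(1, sizeof(*e));

	if (e) {
		strncpy(e->name, name, sizeof(e->name) - 1);
		e->in_use = in_use;
	}
	return e;
}

struct engine *engine_find(struct engine *list, const char *name)
{
	for (; list; list = list->next)
		if (strcmp(list->name, name) == 0)
			return list;
	return NULL;
}

static bool is_wanted(const char *name, const char *const *wanted, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(name, wanted[i]) == 0)
			return true;
	return false;
}

/* Returns the new list head; never frees an engine that is in use. */
struct engine *update_engines(struct engine *list,
			      const char *const *wanted, size_t n)
{
	struct engine **pp = &list;

	/* Pass 1: drop engines that are neither requested nor in use. */
	while (*pp) {
		struct engine *e = *pp;

		if (!is_wanted(e->name, wanted, n) && !e->in_use) {
			*pp = e->next;
			free(e);
		} else {
			pp = &e->next;
		}
	}

	/* Pass 2: append requested engines that are not on the list yet. */
	for (size_t i = 0; i < n; i++) {
		struct engine *e;

		if (engine_find(list, wanted[i]))
			continue;
		e = engine_new(wanted[i], false);
		if (!e)
			break;
		*pp = e;
		pp = &e->next;
	}
	return list;
}
```

The sketch keeps an in-use engine on the list even when it is no longer
requested, and it does not reorder entries to match the requested order;
releasing the old engine once routing has switched is not shown.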

> limit this to minhop/updn?

No.

Sasha


From dorfman.eli at gmail.com  Tue Feb  3 06:03:08 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 16:03:08 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <20090203135350.GJ11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
	<20090203123706.GD11874@sashak.voltaire.com>
	<498849F9.1030700@gmail.com>
	<20090203135350.GJ11874@sashak.voltaire.com>
Message-ID: <49884E9C.1090704@gmail.com>

Sasha Khapyorsky wrote:
> On 15:43 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
>>> This probably can work with updn and minhops, but it certainly will be
>>> destructive when LASH routing engine is used. LASH stores internal data
>>> between sweep cycles, it is used to answer correct SL value in SA
>>> PathRecord queries. So I think routing engine "switch" should  be a bit
>>> smarter.
>>>
>> that means that destroy and setup routing engine functions should be improved.
>> what do you suggest in the meantime?
> 
> I meant that instead of destroy/setup pair we need an update function
> which will carefully compare a current routing engine list against
> requested one and in any case will not destroy an routing engine(s)
> which is in use.

so if we find the diff between the old routing engine list and the new one,
and use destroy to remove unused engines and setup for new ones - is it good enough?

> 
>> limit this to minhop/updn?
> 
> No.
> 
> Sasha


From sashak at voltaire.com  Tue Feb  3 06:05:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:05:46 +0200
Subject: [ofa-general]  [PATCH 4/4] opensm/osm_subnet.c support subnet
	configuration rescan and update
In-Reply-To: <49884E9C.1090704@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
	<20090203123706.GD11874@sashak.voltaire.com>
	<498849F9.1030700@gmail.com>
	<20090203135350.GJ11874@sashak.voltaire.com>
	<49884E9C.1090704@gmail.com>
Message-ID: <20090203140546.GL11874@sashak.voltaire.com>

On 16:03 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
> > 
> > I meant that instead of destroy/setup pair we need an update function
> > which will carefully compare a current routing engine list against
> > requested one and in any case will not destroy an routing engine(s)
> > which is in use.
> 
> so if we find the diff between the old routing engine and new one and
> use destroy to remove non used and setup for new engines - is it good enough?

Maybe (I didn't look at the code now). Also, if the currently used routing
engine is going to be switched, it should be cleaned up after the switch too.

Sasha


From dorfman.eli at gmail.com  Tue Feb  3 06:11:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 16:11:46 +0200
Subject: Re: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c
	rescan subnet configuration after SIGHUP
In-Reply-To: <20090203134831.GI11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
Message-ID: <498850A2.8090701@gmail.com>

 rescan configuration as first step on every heavy sweep
 this is a must in case of priority change (increase) for standby SM

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_state_mgr.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index fc7ceb9..622867b 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm)
 	ib_api_status_t status;
 	osm_remote_sm_t *p_remote_sm;
 
+	if (sm->p_subn->force_heavy_sweep && 
+	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
+			"osm_subn_rescan_conf_file failed\n");
+	}
+
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
 	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
 		return;
@@ -1131,11 +1137,6 @@ _repeat_discovery:
 	sm->p_subn->force_reroute = FALSE;
 	sm->p_subn->subnet_initialization_error = FALSE;
 
-	/* rescan configuration updates */
-	if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
-		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
-			"osm_subn_rescan_conf_file failed\n");
-
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
 		sm->p_subn->need_update = 1;
 
-- 
1.5.5


From devesh28 at gmail.com  Tue Feb  3 06:09:01 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 19:39:01 +0530
Subject: Re: [ofa-general][PATCH v1] compiling OFED-1.2
	with RHEL5.1
Message-ID: <309a667c0902030609m4ba4a685pa18a14d8fd34f7f2@mail.gmail.com>

Following is the patch that must be applied to ofa_kernel-1.2 to be
able to compile it with RHEL5.1, while there is one more patch I will be
posting after this. It deals with the declarations of kmem_cache_create().
A configuration script, derived from ofed_patch.sh, has also been written to
add the backport directory 2.6.18-EL5.1 to ofa_kernel-1.2 and build the rpm
with the changes. The script assumes the names of the patches are
OFED-1.2_RHEL5.1_fix.patch for this patch
kmem_cache_create_fix.patch for kmem_cache related patch.
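
The effect of the configure change in this patch can be modelled with a
small first-match table. This is only a sketch in C using POSIX fnmatch();
the real logic is a shell case statement with more entries, and
backport_dir() and uses_backport() are hypothetical names.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

struct backport_rule {
	const char *pattern;	/* glob matched against the kernel release */
	const char *dir;	/* backport directory to use */
};

/* Only the cases visible in the configure hunk; the first match wins. */
static const struct backport_rule rules[] = {
	{ "2.6.17*",		"2.6.17" },
	{ "2.6.18-*fc[56]*",	"2.6.18_FC6" },
	{ "2.6.18-8.el5",	"2.6.18_FC6" },
	{ "2.6.18-53.el5",	"2.6.18-EL5.1" },
	{ "2.6.18*",		"2.6.18" },
};

/* Returns the backport directory for 'release', or NULL if none matches. */
const char *backport_dir(const char *release)
{
	assert(release != NULL);

	for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
		if (fnmatch(rules[i].pattern, release, 0) == 0)
			return rules[i].dir;
	return NULL;
}

/* Returns 1 if 'release' maps to backport directory 'dir', else 0. */
int uses_backport(const char *release, const char *dir)
{
	const char *found = backport_dir(release);

	return found != NULL && strcmp(found, dir) == 0;
}
```

Because the specific el5 entries sit before the generic 2.6.18* entry, only
the exact releases 2.6.18-8.el5 and 2.6.18-53.el5 get a dedicated
directory in this model; any other el5 release falls through to 2.6.18.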

diff -ruN ofa_kernel-1.2/configure ofa_kernel-1.2_try2/configure
--- ofa_kernel-1.2/configure	2009-02-03 02:12:23.000000000 +0530
+++ ofa_kernel-1.2_try2/configure	2009-02-03 00:46:15.000000000 +0530
@@ -218,9 +218,12 @@
         2.6.17*)
                 echo 2.6.17
         ;;
-        2.6.18-*fc[56]*|2.6.18-*el5*)
+        2.6.18-*fc[56]*|2.6.18-8.el5)
                 echo 2.6.18_FC6
         ;;
+	2.6.18-53.el5)
+		echo 2.6.18-EL5.1
+	;;
         2.6.18*)
                 echo 2.6.18
         ;;
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,8 @@
+#ifndef ASM_PROM_BACKPORT_TO_2_6_21_H
+#define ASM_PROM_BACKPORT_TO_2_6_21_H
+
+#include_next <asm/prom.h>
+
+#define of_get_property(a, b, c)	get_property((a), (b), (c))
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,5 @@
+#if defined(__ia64__)
+#include <linux/pci.h>
+#endif
+#include <asm/types.h>
+#include_next <asm/scatterlist.h>
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,8 @@
+#ifndef BACKPORT_LINUX_COMPILER_TO_2_6_22_H
+#define BACKPORT_LINUX_COMPILER_TO_2_6_22_H
+
+#include_next <linux/compiler.h>
+
+#define uninitialized_var(x) x = x
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,54 @@
+#ifndef BACKPORT_LINUX_CRYPTO_H
+#define BACKPORT_LINUX_CRYPTO_H
+
+#include_next <linux/crypto.h>
+
+#define CRYPTO_ALG_ASYNC               0x00000080
+
+struct hash_desc
+{
+	struct crypto_tfm *tfm;
+	u32 flags;
+};
+
+static inline int crypto_hash_init(struct hash_desc *desc)
+{
+	crypto_digest_init(desc->tfm);
+	return 0;
+}
+
+static inline int crypto_hash_digest(struct hash_desc *desc,
+                                    struct scatterlist *sg,
+                                    unsigned int nbytes, u8 *out)
+{
+	crypto_digest_digest(desc->tfm, sg, 1, out);
+	return nbytes;
+}
+
+static inline int crypto_hash_update(struct hash_desc *desc,
+                                    struct scatterlist *sg,
+                                    unsigned int nbytes)
+{
+	crypto_digest_update(desc->tfm, sg, 1);
+	return nbytes;
+}
+
+static inline int crypto_hash_final(struct hash_desc *desc, u8 *out)
+{
+	crypto_digest_final(desc->tfm, out);
+	return 0;
+}
+
+static inline struct crypto_tfm *crypto_alloc_hash(const char *alg_name,
+                                                   u32 type, u32 mask)
+{
+	struct crypto_tfm *ret = crypto_alloc_tfm(alg_name ,type);
+	return ret ? ret : ERR_PTR(-ENOMEM);
+}
+
+static inline void crypto_free_hash(struct crypto_tfm *tfm)
+{
+	crypto_free_tfm(tfm);
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,15 @@
+#ifndef BACKPORT_LINUX_ETHERDEVICE
+#define BACKPORT_LINUX_ETHERDEVICE
+
+#include_next <linux/etherdevice.h>
+
+static inline unsigned short backport_eth_type_trans(struct sk_buff *skb,
+						     struct net_device *dev)
+{
+	skb->dev = dev;
+	return eth_type_trans(skb, dev);
+}
+
+#define eth_type_trans backport_eth_type_trans
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,42 @@
+/*
+ * Basic general purpose allocator for managing special purpose memory
+ * not managed by the regular kmalloc/kfree interface.
+ * Uses for this includes on-device special memory, uncached memory
+ * etc.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+
+/*
+ *  General purpose special memory pool descriptor.
+ */
+struct gen_pool {
+	rwlock_t lock;
+	struct list_head chunks;	/* list of chunks in this pool */
+	int min_alloc_order;		/* minimum allocation order */
+};
+
+/*
+ *  General purpose special memory pool chunk descriptor.
+ */
+struct gen_pool_chunk {
+	spinlock_t lock;
+	struct list_head next_chunk;	/* next chunk in pool */
+	unsigned long start_addr;	/* starting address of memory chunk */
+	unsigned long end_addr;		/* ending address of memory chunk */
+	unsigned long bits[0];		/* bitmap for allocating memory chunk */
+};
+
+extern struct gen_pool *ib_gen_pool_create(int, int);
+extern int ib_gen_pool_add(struct gen_pool *, unsigned long, size_t, int);
+extern void ib_gen_pool_destroy(struct gen_pool *);
+extern unsigned long ib_gen_pool_alloc(struct gen_pool *, size_t);
+extern void ib_gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+#define gen_pool_create ib_gen_pool_create
+#define gen_pool_add ib_gen_pool_add
+#define gen_pool_destroy ib_gen_pool_destroy
+#define gen_pool_alloc ib_gen_pool_alloc
+#define gen_pool_free ib_gen_pool_free
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,8 @@
+#ifndef __BACKPORT_LINUX_IF_ETHER_H_TO_2_6_21__
+#define __BACKPORT_LINUX_IF_ETHER_H_TO_2_6_21__
+
+#include_next <linux/if_ether.h>
+
+#define ETH_FCS_LEN     4               /* Octets in the FCS             */
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,17 @@
+#ifndef __BACKPORT_LINUX_IF_VLAN_H_TO_2_6_20__
+#define __BACKPORT_LINUX_IF_VLAN_H_TO_2_6_20__
+
+#include_next <linux/if_vlan.h>
+
+static inline struct net_device *vlan_group_get_device(struct vlan_group *vg, int vlan_id)
+{
+	return vg->vlan_devices[vlan_id];
+}
+
+static inline void vlan_group_set_device(struct vlan_group *vg, int vlan_id,
+					 struct net_device *dev)
+{
+	vg->vlan_devices[vlan_id] = dev;
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,20 @@
+#ifndef BACKPORT_LINUX_INTERRUPT_TO_2_6_18
+#define BACKPORT_LINUX_INTERRUPT_TO_2_6_18
+#include_next <linux/interrupt.h>
+
+typedef irqreturn_t (*backport_irq_handler_t)(int, void *);
+
+static inline int
+backport_request_irq(unsigned int irq,
+                     irqreturn_t (*handler)(int, void *),
+                     unsigned long flags, const char *dev_name, void *dev_id)
+{
+	return request_irq(irq,
+		           (irqreturn_t (*)(int, void *, struct pt_regs *))handler,
+			   flags, dev_name, dev_id);
+}
+
+#define request_irq backport_request_irq
+#define irq_handler_t backport_irq_handler_t
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,11 @@
+#ifndef __LINUX_IP_BACKPORT_TO_2_6_21__
+#define __LINUX_IP_BACKPORT_TO_2_6_21__
+
+#include_next <linux/ip.h>
+
+static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
+{
+	return (struct iphdr *)skb_network_header(skb);
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,14 @@
+#ifndef BACKPORT_KERNEL_H_2_6_22
+#define BACKPORT_KERNEL_H_2_6_22
+
+#include_next <linux/kernel.h>
+
+#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16))
+
+#endif
+#ifndef BACKPORT_KERNEL_H_2_6_19
+#define BACKPORT_KERNEL_H_2_6_19
+
+#include <linux/log2.h>
+
+#endif
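A note on the upper_32_bits() definition in kernel.h above: the double
16-bit shift avoids shifting a 32-bit operand by its full width, which C
leaves undefined and which compilers warn about. The following is only a
minimal user-space sketch of that behaviour. uint32_t stands in for the
kernel's u32, and lower_32_bits() and split_addr() are hypothetical names
added for illustration; they are not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as the backported macro, with uint32_t standing in for u32. */
#define upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Hypothetical companion for illustration only; not defined by this patch. */
#define lower_32_bits(n) ((uint32_t)(n))

/*
 * Hypothetical helper: split a 64-bit bus address into the two 32-bit
 * halves a device register pair would typically take.
 */
static void split_addr(uint64_t addr, uint32_t *hi, uint32_t *lo)
{
	*hi = upper_32_bits(addr);
	*lo = lower_32_bits(addr);
}
```

With a 32-bit argument the macro simply yields 0 and no shift-count
warning is produced, so the same caller code can be built for both 32-bit
and 64-bit dma_addr_t configurations.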
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,169 @@
+/* Integer base 2 logarithm calculation
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells at redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_LOG2_H
+#define _LINUX_LOG2_H
+
+#include_next <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+
+/*
+ * deal with unrepresentable constant logarithms
+ */
+extern __attribute__((const, noreturn))
+int ____ilog2_NaN(void);
+
+/*
+ * non-constant log of base 2 calculators
+ * - the arch may override these in asm/bitops.h if they can be implemented
+ *   more efficiently than using fls() and fls64()
+ * - the arch is not required to handle n==0 if implementing the fallback
+ */
+#ifndef CONFIG_ARCH_HAS_ILOG2_U32
+static inline __attribute__((const))
+int __ilog2_u32(u32 n)
+{
+	return fls(n) - 1;
+}
+#endif
+
+#ifndef CONFIG_ARCH_HAS_ILOG2_U64
+static inline __attribute__((const))
+int __ilog2_u64(u64 n)
+{
+	return fls64(n) - 1;
+}
+#endif
+
+/*
+ *  Determine whether some value is a power of two, where zero is
+ * *not* considered a power of two.
+ */
+
+static inline __attribute__((const))
+bool is_power_of_2(unsigned long n)
+{
+	return (n != 0 && ((n & (n - 1)) == 0));
+}
+
+/*
+ * round up to nearest power of two
+ */
+static inline __attribute__((const))
+unsigned long __roundup_pow_of_two(unsigned long n)
+{
+	return 1UL << fls_long(n - 1);
+}
+
+/**
+ * ilog2 - log of base 2 of 32-bit or a 64-bit unsigned value
+ * @n - parameter
+ *
+ * constant-capable log of base 2 calculation
+ * - this can be used to initialise global variables from constant data, hence
+ *   the massive ternary operator construction
+ *
+ * selects the appropriately-sized optimised version depending on sizeof(n)
+ */
+#define ilog2(n)				\
+(						\
+	__builtin_constant_p(n) ? (		\
+		(n) < 1 ? ____ilog2_NaN() :	\
+		(n) & (1ULL << 63) ? 63 :	\
+		(n) & (1ULL << 62) ? 62 :	\
+		(n) & (1ULL << 61) ? 61 :	\
+		(n) & (1ULL << 60) ? 60 :	\
+		(n) & (1ULL << 59) ? 59 :	\
+		(n) & (1ULL << 58) ? 58 :	\
+		(n) & (1ULL << 57) ? 57 :	\
+		(n) & (1ULL << 56) ? 56 :	\
+		(n) & (1ULL << 55) ? 55 :	\
+		(n) & (1ULL << 54) ? 54 :	\
+		(n) & (1ULL << 53) ? 53 :	\
+		(n) & (1ULL << 52) ? 52 :	\
+		(n) & (1ULL << 51) ? 51 :	\
+		(n) & (1ULL << 50) ? 50 :	\
+		(n) & (1ULL << 49) ? 49 :	\
+		(n) & (1ULL << 48) ? 48 :	\
+		(n) & (1ULL << 47) ? 47 :	\
+		(n) & (1ULL << 46) ? 46 :	\
+		(n) & (1ULL << 45) ? 45 :	\
+		(n) & (1ULL << 44) ? 44 :	\
+		(n) & (1ULL << 43) ? 43 :	\
+		(n) & (1ULL << 42) ? 42 :	\
+		(n) & (1ULL << 41) ? 41 :	\
+		(n) & (1ULL << 40) ? 40 :	\
+		(n) & (1ULL << 39) ? 39 :	\
+		(n) & (1ULL << 38) ? 38 :	\
+		(n) & (1ULL << 37) ? 37 :	\
+		(n) & (1ULL << 36) ? 36 :	\
+		(n) & (1ULL << 35) ? 35 :	\
+		(n) & (1ULL << 34) ? 34 :	\
+		(n) & (1ULL << 33) ? 33 :	\
+		(n) & (1ULL << 32) ? 32 :	\
+		(n) & (1ULL << 31) ? 31 :	\
+		(n) & (1ULL << 30) ? 30 :	\
+		(n) & (1ULL << 29) ? 29 :	\
+		(n) & (1ULL << 28) ? 28 :	\
+		(n) & (1ULL << 27) ? 27 :	\
+		(n) & (1ULL << 26) ? 26 :	\
+		(n) & (1ULL << 25) ? 25 :	\
+		(n) & (1ULL << 24) ? 24 :	\
+		(n) & (1ULL << 23) ? 23 :	\
+		(n) & (1ULL << 22) ? 22 :	\
+		(n) & (1ULL << 21) ? 21 :	\
+		(n) & (1ULL << 20) ? 20 :	\
+		(n) & (1ULL << 19) ? 19 :	\
+		(n) & (1ULL << 18) ? 18 :	\
+		(n) & (1ULL << 17) ? 17 :	\
+		(n) & (1ULL << 16) ? 16 :	\
+		(n) & (1ULL << 15) ? 15 :	\
+		(n) & (1ULL << 14) ? 14 :	\
+		(n) & (1ULL << 13) ? 13 :	\
+		(n) & (1ULL << 12) ? 12 :	\
+		(n) & (1ULL << 11) ? 11 :	\
+		(n) & (1ULL << 10) ? 10 :	\
+		(n) & (1ULL <<  9) ?  9 :	\
+		(n) & (1ULL <<  8) ?  8 :	\
+		(n) & (1ULL <<  7) ?  7 :	\
+		(n) & (1ULL <<  6) ?  6 :	\
+		(n) & (1ULL <<  5) ?  5 :	\
+		(n) & (1ULL <<  4) ?  4 :	\
+		(n) & (1ULL <<  3) ?  3 :	\
+		(n) & (1ULL <<  2) ?  2 :	\
+		(n) & (1ULL <<  1) ?  1 :	\
+		(n) & (1ULL <<  0) ?  0 :	\
+		____ilog2_NaN()			\
+				   ) :		\
+	(sizeof(n) <= 4) ?			\
+	__ilog2_u32(n) :			\
+	__ilog2_u64(n)				\
+ )
+
+/**
+ * roundup_pow_of_two - round the given value up to nearest power of two
+ * @n - parameter
+ *
+ * round the given value up to the nearest power of two
+ * - the result is undefined when n == 0
+ * - this can be used to initialise global variables from constant data
+ */
+#define roundup_pow_of_two(n)			\
+(						\
+	__builtin_constant_p(n) ? (		\
+		((n) == 1) ? 1 :		\
+		(1UL << (ilog2((n) - 1) + 1))	\
+				   ) :		\
+	__roundup_pow_of_two(n)			\
+ )
+
+#endif /* _LINUX_LOG2_H */
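For readers checking the log2.h helpers above, here is a minimal
user-space sketch of the semantics they are meant to have. It is a
loop-based stand-in, not the kernel implementation: the *_sketch names are
hypothetical, uint64_t replaces the kernel types, and fls()/fls64() are
replaced by a plain shift loop. Like the non-constant path
(__roundup_pow_of_two), the sketch returns 1 for an input of 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * floor(log2(n)): index of the highest set bit.
 * Returns -1 for n == 0, which is what fls(0) - 1 evaluates to.
 */
static int ilog2_sketch(uint64_t n)
{
	int log = -1;

	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/* Zero is not considered a power of two. */
static bool is_power_of_2_sketch(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Smallest power of two that is >= n.
 * Assumes 1 <= n <= 2^63; the result is undefined for n == 0,
 * as in the header above.
 */
static uint64_t roundup_pow_of_two_sketch(uint64_t n)
{
	if (n == 1)
		return 1;
	return 1ULL << (ilog2_sketch(n - 1) + 1);
}
```

The long ternary chain in ilog2() exists only so that the compiler can
fold the result for constant arguments; for run-time values it produces
the same answers as the loop above.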
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,16 @@
+#ifndef BACKPORT_LINUX_NETDEVICE_TO_2_6_18
+#define BACKPORT_LINUX_NETDEVICE_TO_2_6_18
+#include_next <linux/netdevice.h>
+
+static inline int skb_checksum_help_to_2_6_18(struct sk_buff *skb)
+{
+        return skb_checksum_help(skb, 0);
+}
+
+#define skb_checksum_help skb_checksum_help_to_2_6_18
+
+#undef SET_ETHTOOL_OPS
+#define SET_ETHTOOL_OPS(netdev, ops) \
+	(netdev)->ethtool_ops = (struct ethtool_ops *)(ops)
+
+#endif
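The netdevice.h backport above (and slab.h further down) relies on one
ordering trick: the compat wrapper is defined first, while the original
name still refers to the old function, and only afterwards is the name
redirected with #define. A minimal user-space sketch of that pattern
follows. checksum_help(), checksum_help_compat() and caller() are
hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an older API that takes an extra argument (hypothetical). */
static int checksum_help(const unsigned char *buf, size_t len, int inward)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return inward ? -(int)sum : (int)sum;
}

/*
 * Wrapper exposing the newer, shorter signature. It must appear BEFORE
 * the #define below, so that the call inside it still resolves to the
 * original three-argument function.
 */
static inline int checksum_help_compat(const unsigned char *buf, size_t len)
{
	return checksum_help(buf, len, 0);
}

/* From here on, every use of the old name goes through the wrapper. */
#define checksum_help checksum_help_compat

/* Driver-style caller written against the newer two-argument API. */
static int caller(void)
{
	static const unsigned char data[] = { 1, 2, 3, 4 };

	return checksum_help(data, sizeof(data));
}
```

If the #define were placed above the wrapper, the wrapper would call
itself with the wrong number of arguments and the build would fail, which
is why the headers in this patch always define the inline function first.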
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,7 @@
+#ifndef BACKPORT_LINUX_NET_H
+#define BACKPORT_LINUX_NET_H
+
+#include_next <linux/net.h>
+#include <linux/random.h>
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,14 @@
+#ifndef BACKPORT_LINUX_NETLINK_H
+#define BACKPORT_LINUX_NETLINK_H
+
+#include_next <linux/netlink.h>
+
+/*#define netlink_kernel_create(net, uint, groups, input, mutex, mod) \
+       netlink_kernel_create(uint, groups, input, mod)*/
+
+static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb)
+{
+	return (struct nlmsghdr *)skb->data;
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,19 @@
+#ifndef LINUX_NOTIFIER_BACKPORT_TO_2_6_21_H
+#define LINUX_NOTIFIER_BACKPORT_TO_2_6_21_H
+
+#include_next <linux/notifier.h>
+
+
+/* Used for CPU hotplug events occurring while tasks are frozen due to a suspend
+ * operation in progress
+ */
+#define CPU_TASKS_FROZEN       0x0010
+
+#define CPU_ONLINE_FROZEN      (CPU_ONLINE | CPU_TASKS_FROZEN)
+#define CPU_UP_PREPARE_FROZEN  (CPU_UP_PREPARE | CPU_TASKS_FROZEN)
+#define CPU_UP_CANCELED_FROZEN (CPU_UP_CANCELED | CPU_TASKS_FROZEN)
+#define CPU_DOWN_PREPARE_FROZEN        (CPU_DOWN_PREPARE | CPU_TASKS_FROZEN)
+#define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
+#define CPU_DEAD_FROZEN                (CPU_DEAD | CPU_TASKS_FROZEN)
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,21 @@
+#ifndef __BACKPORT_LINUX_PCI_TO_2_6_19__
+#define __BACKPORT_LINUX_PCI_TO_2_6_19__
+
+#include_next <linux/pci.h>
+
+/**
+ * PCI_VDEVICE - macro used to describe a specific pci device in short form
+ * @vend: the vendor name
+ * @dev: the 16 bit PCI Device ID
+ *
+ * This macro is used to create a struct pci_device_id that matches a
+ * specific PCI device.  The subvendor, and subdevice fields will be set
+ * to PCI_ANY_ID. The macro allows the next field to follow as the device
+ * private data.
+ */
+
+#define PCI_VDEVICE(vendor, device)            \
+	PCI_VENDOR_ID_##vendor, (device),       \
+	PCI_ANY_ID, PCI_ANY_ID, 0, 0
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,8 @@
+#ifndef BACKPORT_LINUX_RANDOM_TO_2_6_18
+#define BACKPORT_LINUX_RANDOM_TO_2_6_18
+#include_next <linux/random.h>
+#include_next <linux/net.h>
+
+#define random32() net_random()
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,10 @@
+#ifndef BACKPORT_LINUX_RBTREE_TO_2_6_18
+#define BACKPORT_LINUX_RBTREE_TO_2_6_18
+#include_next <linux/rbtree.h>
+
+/* Band-aid for buggy rbtree.h */
+#undef RB_EMPTY_NODE
+#define RB_EMPTY_NODE(node)	(rb_parent(node) == node)
+
+#endif
+
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,33 @@
+#ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_2_6_23__
+#define __BACKPORT_LINUX_SCATTERLIST_H_TO_2_6_23__
+#include_next<linux/scatterlist.h>
+
+static inline void sg_set_page(struct scatterlist *sg, struct page *page,
+                               unsigned int len, unsigned int offset)
+{
+	sg->page = page;
+	sg->offset = offset;
+	sg->length = len;
+}
+
+static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
+{
+	sg->page = page;
+}
+
+#define sg_page(a) ((a)->page)
+#define sg_init_table(a, b)
+
+#define for_each_sg(sglist, sg, nr, __i)	\
+	for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)
+
+static inline struct scatterlist *sg_next(struct scatterlist *sg)
+{
+	if (!sg) {
+		BUG();
+		return NULL;
+	}
+	return sg + 1;
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,85 @@
+#ifndef LINUX_SKBUFF_H_BACKPORT
+#define LINUX_SKBUFF_H_BACKPORT
+
+#include_next <linux/skbuff.h>
+
+#define CHECKSUM_PARTIAL CHECKSUM_HW
+#define CHECKSUM_COMPLETE CHECKSUM_HW
+
+#endif
+#ifndef __BACKPORT_LINUX_SKBUFF_H_TO_2_6_21__
+#define __BACKPORT_LINUX_SKBUFF_H_TO_2_6_21__
+
+#include_next <linux/skbuff.h>
+
+#define transport_header h.raw
+#define network_header nh.raw
+
+static inline void skb_reset_mac_header(struct sk_buff *skb)
+{
+	skb->mac.raw = skb->data;
+}
+
+static inline void skb_reset_network_header(struct sk_buff *skb)
+{
+	skb->network_header = skb->data;
+}
+
+#if 0
+static inline void skb_copy_from_linear_data(const struct sk_buff *skb,
+					     void *to,
+					     const unsigned int len)
+{
+	memcpy(to, skb->data, len);
+}
+
+static inline void skb_copy_to_linear_data(struct sk_buff *skb,
+                                           const void *from,
+                                           const unsigned int len)
+{
+        memcpy(skb->data, from, len);
+}
+#endif
+
+static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
+{
+	return skb->end;
+}
+
+static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
+{
+	return skb->transport_header;
+}
+
+static inline unsigned char *skb_network_header(const struct sk_buff *skb)
+{
+	return skb->network_header;
+}
+
+static inline void skb_reset_transport_header(struct sk_buff *skb)
+{
+	skb->transport_header = skb->data;
+}
+
+static inline int skb_transport_offset(const struct sk_buff *skb)
+{
+	return skb_transport_header(skb) - skb->data;
+}
+
+static inline int skb_network_offset(const struct sk_buff *skb)
+{
+	return skb_network_header(skb) - skb->data;
+}
+static inline void skb_set_transport_header(struct sk_buff *skb,
+                                            const int offset)
+{
+        skb->h.raw = skb->data + offset;
+}
+
+static inline void skb_set_network_header(struct sk_buff *skb,
+                                            const int offset)
+{
+        skb->nh.raw = skb->data + offset;
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,20 @@
+#include_next <linux/slab.h>
+
+#ifndef LINUX_SLAB_BACKPORT_TO_2_6_22_H
+#define LINUX_SLAB_BACKPORT_TO_2_6_22_H
+
+#include_next <linux/slab.h>
+
+static inline
+struct kmem_cache *
+kmem_cache_create_for_2_6_22 (const char *name, size_t size, size_t align,
+			      unsigned long flags,
+			      void (*ctor)(void*, struct kmem_cache *, unsigned long)
+			      )
+{
+	return kmem_cache_create(name, size, align, flags, ctor, NULL);
+}
+
+#define kmem_cache_create kmem_cache_create_for_2_6_22
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,11 @@
+#ifndef __BACKPORT_LINUX_TCP_TO_2_6_21__
+#define __BACKPORT_LINUX_TCP_TO_2_6_21__
+
+#include_next <linux/tcp.h>
+
+static inline struct tcphdr *tcp_hdr(const struct sk_buff *skb)
+{
+	return (struct tcphdr *)skb_transport_header(skb);
+}
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,9 @@
+#ifndef BACKPORT_LINUX_TYPES_TO_2_6_19
+#define BACKPORT_LINUX_TYPES_TO_2_6_19
+
+#include_next <linux/types.h>
+
+typedef _Bool bool;
+typedef __u16	__sum16;
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,62 @@
+#ifndef BACKPORT_LINUX_WORKQUEUE_TO_2_6_19
+#define BACKPORT_LINUX_WORKQUEUE_TO_2_6_19
+
+#include_next <linux/workqueue.h>
+
+struct delayed_work {
+	struct work_struct work;
+};
+
+static inline void
+backport_INIT_WORK(struct work_struct *work, void *func)
+{
+	INIT_WORK(work, func, work);
+}
+
+static inline int backport_queue_delayed_work(struct workqueue_struct *wq,
+					      struct delayed_work *work,
+					      unsigned long delay)
+{
+	if (likely(!delay))
+		return queue_work(wq, &work->work);
+	else
+		return queue_delayed_work(wq, &work->work, delay);
+}
+
+static inline int
+backport_cancel_delayed_work(struct delayed_work *work)
+{
+	return cancel_delayed_work(&work->work);
+}
+
+static inline void
+backport_cancel_rearming_delayed_workqueue(struct workqueue_struct *wq, struct delayed_work *work)
+{
+	cancel_rearming_delayed_workqueue(wq, &work->work);
+}
+
+static inline
+int backport_schedule_delayed_work(struct delayed_work *work, unsigned long delay)
+{
+	if (likely(!delay))
+		return schedule_work(&work->work);
+	else
+		return schedule_delayed_work(&work->work, delay);
+}
+
+#undef INIT_WORK
+#define INIT_WORK(_work, _func) backport_INIT_WORK(_work, _func)
+#define INIT_DELAYED_WORK(_work, _func) INIT_WORK(&(_work)->work, _func)
+
+#undef DECLARE_WORK
+#define DECLARE_WORK(n, f) \
+	struct work_struct n = __WORK_INITIALIZER(n, (void (*)(void *))f, &(n))
+#define DECLARE_DELAYED_WORK(n, f) \
+	struct delayed_work n = { .work = __WORK_INITIALIZER(n.work, (void (*)(void *))f, &(n.work)) }
+
+#define queue_delayed_work backport_queue_delayed_work
+#define cancel_delayed_work backport_cancel_delayed_work
+#define cancel_rearming_delayed_workqueue backport_cancel_rearming_delayed_workqueue
+#define schedule_delayed_work backport_schedule_delayed_work
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,7 @@
+#ifndef __BACKPORT_NET_IP_H_TO_2_6_23__
+#define __BACKPORT_NET_IP_H_TO_2_6_23__
+
+#include_next<net/ip.h>
+#define inet_get_local_port_range(a, b) { *(a) = sysctl_local_port_range[0]; *(b) = sysctl_local_port_range[1]; }
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,8 @@
+#ifndef __BACKPORT_NET_NEIGHBOUR_TO_2_6_20__
+#define __BACKPORT_NET_NEIGHBOUR_TO_2_6_20__
+
+#include_next <net/neighbour.h>
+
+#define neigh_cleanup neigh_destructor
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,23 @@
+#ifndef SCSI_SCSI_CMND_BACKPORT_TO_2_6_22_H
+#define SCSI_SCSI_CMND_BACKPORT_TO_2_6_22_H
+
+#include_next <scsi/scsi_cmnd.h>
+
+#define scsi_sg_count(cmd) ((cmd)->use_sg)
+#define scsi_sglist(cmd) ((struct scatterlist *)(cmd)->request_buffer)
+#define scsi_bufflen(cmd) ((cmd)->request_bufflen)
+
+static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid)
+{
+	cmd->resid = resid;
+}
+
+static inline int scsi_get_resid(struct scsi_cmnd *cmd)
+{
+	return cmd->resid;
+}
+
+#define scsi_for_each_sg(cmd, sg, nseg, __i)			\
+	for (__i = 0, sg = scsi_sglist(cmd); __i < (nseg); __i++, (sg)++)
+
+#endif
diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c
--- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c	2009-02-03 00:42:05.000000000 +0530
@@ -0,0 +1,198 @@
+/*
+ * Basic general purpose allocator for managing special purpose memory
+ * not managed by the regular kmalloc/kfree interface.
+ * Uses for this includes on-device special memory, uncached memory
+ * etc.
+ *
+ * Copyright 2005 (C) Jes Sorensen <jes at trained-monkey.org>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/genalloc.h>
+
+
+/**
+ * gen_pool_create - create a new special memory pool
+ * @min_alloc_order: log base 2 of number of bytes each bitmap bit represents
+ * @nid: node id of the node the pool structure should be allocated on, or -1
+ *
+ * Create a new special memory pool that can be used to manage special purpose
+ * memory not managed by the regular kmalloc/kfree interface.
+ */
+struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
+{
+	struct gen_pool *pool;
+
+	pool = kmalloc_node(sizeof(struct gen_pool), GFP_KERNEL, nid);
+	if (pool != NULL) {
+		rwlock_init(&pool->lock);
+		INIT_LIST_HEAD(&pool->chunks);
+		pool->min_alloc_order = min_alloc_order;
+	}
+	return pool;
+}
+EXPORT_SYMBOL(gen_pool_create);
+
+/**
+ * gen_pool_add - add a new chunk of special memory to the pool
+ * @pool: pool to add new memory chunk to
+ * @addr: starting address of memory chunk to add to pool
+ * @size: size in bytes of the memory chunk to add to pool
+ * @nid: node id of the node the chunk structure and bitmap should be
+ *       allocated on, or -1
+ *
+ * Add a new chunk of special memory to the specified pool.
+ */
+int gen_pool_add(struct gen_pool *pool, unsigned long addr, size_t size,
+		 int nid)
+{
+	struct gen_pool_chunk *chunk;
+	int nbits = size >> pool->min_alloc_order;
+	int nbytes = sizeof(struct gen_pool_chunk) +
+				(nbits + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
+
+	chunk = kmalloc_node(nbytes, GFP_KERNEL, nid);
+	if (unlikely(chunk == NULL))
+		return -1;
+
+	memset(chunk, 0, nbytes);
+	spin_lock_init(&chunk->lock);
+	chunk->start_addr = addr;
+	chunk->end_addr = addr + size;
+
+	write_lock(&pool->lock);
+	list_add(&chunk->next_chunk, &pool->chunks);
+	write_unlock(&pool->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(gen_pool_add);
+
+/**
+ * gen_pool_destroy - destroy a special memory pool
+ * @pool: pool to destroy
+ *
+ * Destroy the specified special memory pool. Verifies that there are no
+ * outstanding allocations.
+ */
+void gen_pool_destroy(struct gen_pool *pool)
+{
+	struct list_head *_chunk, *_next_chunk;
+	struct gen_pool_chunk *chunk;
+	int order = pool->min_alloc_order;
+	int bit, end_bit;
+
+
+	write_lock(&pool->lock);
+	list_for_each_safe(_chunk, _next_chunk, &pool->chunks) {
+		chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+		list_del(&chunk->next_chunk);
+
+		end_bit = (chunk->end_addr - chunk->start_addr) >> order;
+		bit = find_next_bit(chunk->bits, end_bit, 0);
+		BUG_ON(bit < end_bit);
+
+		kfree(chunk);
+	}
+	kfree(pool);
+	return;
+}
+EXPORT_SYMBOL(gen_pool_destroy);
+
+/**
+ * gen_pool_alloc - allocate special memory from the pool
+ * @pool: pool to allocate from
+ * @size: number of bytes to allocate from the pool
+ *
+ * Allocate the requested number of bytes from the specified pool.
+ * Uses a first-fit algorithm.
+ */
+unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
+{
+	struct list_head *_chunk;
+	struct gen_pool_chunk *chunk;
+	unsigned long addr, flags;
+	int order = pool->min_alloc_order;
+	int nbits, bit, start_bit, end_bit;
+
+	if (size == 0)
+		return 0;
+
+	nbits = (size + (1UL << order) - 1) >> order;
+
+	read_lock(&pool->lock);
+	list_for_each(_chunk, &pool->chunks) {
+		chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+
+		end_bit = (chunk->end_addr - chunk->start_addr) >> order;
+		end_bit -= nbits + 1;
+
+		spin_lock_irqsave(&chunk->lock, flags);
+		bit = -1;
+		while (bit + 1 < end_bit) {
+			bit = find_next_zero_bit(chunk->bits, end_bit, bit + 1);
+			if (bit >= end_bit)
+				break;
+
+			start_bit = bit;
+			if (nbits > 1) {
+				bit = find_next_bit(chunk->bits, bit + nbits,
+							bit + 1);
+				if (bit - start_bit < nbits)
+					continue;
+			}
+
+			addr = chunk->start_addr +
+					    ((unsigned long)start_bit << order);
+			while (nbits--)
+				__set_bit(start_bit++, &chunk->bits);
+			spin_unlock_irqrestore(&chunk->lock, flags);
+			read_unlock(&pool->lock);
+			return addr;
+		}
+		spin_unlock_irqrestore(&chunk->lock, flags);
+	}
+	read_unlock(&pool->lock);
+	return 0;
+}
+EXPORT_SYMBOL(gen_pool_alloc);
+
+/**
+ * gen_pool_free - free allocated special memory back to the pool
+ * @pool: pool to free to
+ * @addr: starting address of memory to free back to pool
+ * @size: size in bytes of memory to free
+ *
+ * Free previously allocated special memory back to the specified pool.
+ */
+void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
+{
+	struct list_head *_chunk;
+	struct gen_pool_chunk *chunk;
+	unsigned long flags;
+	int order = pool->min_alloc_order;
+	int bit, nbits;
+
+	nbits = (size + (1UL << order) - 1) >> order;
+
+	read_lock(&pool->lock);
+	list_for_each(_chunk, &pool->chunks) {
+		chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+
+		if (addr >= chunk->start_addr && addr < chunk->end_addr) {
+			BUG_ON(addr + size > chunk->end_addr);
+			spin_lock_irqsave(&chunk->lock, flags);
+			bit = (addr - chunk->start_addr) >> order;
+			while (nbits--)
+				__clear_bit(bit++, &chunk->bits);
+			spin_unlock_irqrestore(&chunk->lock, flags);
+			break;
+		}
+	}
+	BUG_ON(nbits > 0);
+	read_unlock(&pool->lock);
+}
+EXPORT_SYMBOL(gen_pool_free);
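The first-fit bitmap scheme used by gen_pool_alloc() and gen_pool_free()
above can be tried out in user space with the simplified sketch below.
It assumes a single chunk of fixed capacity, one byte per allocation unit
instead of a packed bitmap, and no locking. struct pool_sketch and the
pool_* functions are hypothetical names, not the genalloc API. As in
genalloc, a return value of 0 means failure, so the managed range must not
start at address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_UNITS 64	/* hypothetical fixed capacity, in allocation units */

struct pool_sketch {
	unsigned long base;		/* start address of the managed range */
	int order;			/* log2 of the bytes covered by one unit */
	unsigned char used[POOL_UNITS];	/* 1 = unit allocated, 0 = unit free */
};

static void pool_init(struct pool_sketch *p, unsigned long base, int order)
{
	memset(p, 0, sizeof(*p));
	p->base = base;
	p->order = order;
}

/* Round a byte count up to whole allocation units. */
static size_t units_for(const struct pool_sketch *p, size_t size)
{
	return (size + ((size_t)1 << p->order) - 1) >> p->order;
}

/* First fit: claim the first run of n free units; return 0 on failure. */
static unsigned long pool_alloc(struct pool_sketch *p, size_t size)
{
	size_t n = units_for(p, size), start, i;

	if (n == 0 || n > POOL_UNITS)
		return 0;

	for (start = 0; start + n <= POOL_UNITS; start++) {
		for (i = 0; i < n && !p->used[start + i]; i++)
			;
		if (i == n) {
			memset(&p->used[start], 1, n);
			return p->base + ((unsigned long)start << p->order);
		}
		start += i;	/* resume just past the used unit that ended the run */
	}
	return 0;
}

/* Return a previously allocated range; size must match the allocation. */
static void pool_free(struct pool_sketch *p, unsigned long addr, size_t size)
{
	size_t n = units_for(p, size);
	size_t start = (addr - p->base) >> p->order;

	assert(start + n <= POOL_UNITS);
	memset(&p->used[start], 0, n);
}
```

As with gen_pool_free(), the caller has to pass the same size on free that
it passed on allocation, because the pool itself records only which units
are in use, not where each allocation begins or ends.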
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,82 @@
+diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
+index a617ca7..4e16314 100644
+--- a/drivers/infiniband/core/uverbs_main.c
++++ b/drivers/infiniband/core/uverbs_main.c
+@@ -534,9 +534,9 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
+ 	 * module reference.
+ 	 */
+ 	filp->f_op 	   = fops_get(&uverbs_event_fops);
+-	filp->f_path.mnt 	   = mntget(uverbs_event_mnt);
+-	filp->f_path.dentry 	   = dget(uverbs_event_mnt->mnt_root);
+-	filp->f_mapping    = filp->f_path.dentry->d_inode->i_mapping;
++	filp->f_vfsmnt 	   = mntget(uverbs_event_mnt);
++	filp->f_dentry 	   = dget(uverbs_event_mnt->mnt_root);
++	filp->f_mapping    = filp->f_dentry->d_inode->i_mapping;
+ 	filp->f_flags      = O_RDONLY;
+ 	filp->f_mode       = FMODE_READ;
+ 	filp->private_data = ev_file;
+diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
+index b932bcb..ddbcabd 100644
+--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
++++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
+@@ -1744,9 +1744,9 @@ static int ipath_assign_port(struct file *fp,
+ 		goto done;
+ 	}
+
+-	i_minor = iminor(fp->f_path.dentry->d_inode) - IPATH_USER_MINOR_BASE;
++	i_minor = iminor(fp->f_dentry->d_inode) - IPATH_USER_MINOR_BASE;
+ 	ipath_cdbg(VERBOSE, "open on dev %lx (minor %d)\n",
+-		   (long)fp->f_path.dentry->d_inode->i_rdev, i_minor);
++		   (long)fp->f_dentry->d_inode->i_rdev, i_minor);
+
+ 	if (i_minor)
+ 		ret = find_free_port(i_minor - 1, fp, uinfo);
+diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
+index 79a60f0..d9ff283 100644
+--- a/drivers/infiniband/hw/ipath/ipath_fs.c
++++ b/drivers/infiniband/hw/ipath/ipath_fs.c
+@@ -118,7 +118,7 @@ static ssize_t atomic_counters_read(struct file *file, char __user *buf,
+ 	u16 i;
+ 	struct ipath_devdata *dd;
+
+-	dd = file->f_path.dentry->d_inode->i_private;
++	dd = file->f_dentry->d_inode->i_private;
+
+ 	for (i = 0; i < NUM_COUNTERS; i++)
+ 		counters[i] = ipath_snap_cntr(dd, i);
+@@ -138,7 +138,7 @@ static ssize_t atomic_node_info_read(struct file *file, char __user *buf,
+ 	struct ipath_devdata *dd;
+ 	u64 guid;
+
+-	dd = file->f_path.dentry->d_inode->i_private;
++	dd = file->f_dentry->d_inode->i_private;
+
+ 	guid = be64_to_cpu(dd->ipath_guid);
+
+@@ -177,7 +177,7 @@ static ssize_t atomic_port_info_read(struct file *file, char __user *buf,
+ 	u32 tmp, tmp2;
+ 	struct ipath_devdata *dd;
+
+-	dd = file->f_path.dentry->d_inode->i_private;
++	dd = file->f_dentry->d_inode->i_private;
+
+ 	/* so we only initialize non-zero fields. */
+ 	memset(portinfo, 0, sizeof portinfo);
+@@ -324,7 +324,7 @@ static ssize_t flash_read(struct file *file, char __user *buf,
+ 		goto bail;
+ 	}
+
+-	dd = file->f_path.dentry->d_inode->i_private;
++	dd = file->f_dentry->d_inode->i_private;
+ 	if (ipath_eeprom_read(dd, pos, tmp, count)) {
+ 		ipath_dev_err(dd, "failed to read from flash\n");
+ 		ret = -ENXIO;
+@@ -377,7 +377,7 @@ static ssize_t flash_write(struct file *file, const char __user *buf,
+ 		goto bail_tmp;
+ 	}
+
+-	dd = file->f_path.dentry->d_inode->i_private;
++	dd = file->f_dentry->d_inode->i_private;
+ 	if (ipath_eeprom_write(dd, pos, tmp, count)) {
+ 		ret = -ENXIO;
+ 		ipath_dev_err(dd, "failed to write to flash\n");
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,50 @@
+>Post a replacement to 2_misc_device_to_2_6_19.patch, we'll test.
+
+I did not test this patch, but you can try replacing the contents of
+the 2_misc_device_to_2_6_19.patch with the changes below.  (It's
+possible that this may lead to some conflict further down in the patch
+chain...)  The function prototype for show_abi_version changed between
+2.6.19 and 2.6.20; this was the missing piece in the original backport
+patch.  I would have expected a build warning for this.
+
+Signed-off-by: Sean Hefty <sean.hefty at intel.com>
+
+---
+--- ofa_kernel-1.2/drivers/infiniband/core/ucma.c	2007-03-08 12:11:37.000000000 -0800
++++ b/drivers/infiniband/core/ucma.c	2007-03-08 12:13:13.000000000 -0800
+@@ -847,13 +847,11 @@ static struct miscdevice ucma_misc = {
+ 	.fops	= &ucma_fops,
+ };
+
+-static ssize_t show_abi_version(struct device *dev,
+-				struct device_attribute *attr,
+-				char *buf)
++static ssize_t show_abi_version(struct class_device *class_dev, char *buf)
+ {
+ 	return sprintf(buf, "%d\n", RDMA_USER_CM_ABI_VERSION);
+ }
+-static DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
++static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
+
+ static int __init ucma_init(void)
+ {
+@@ -863,7 +861,8 @@ static int __init ucma_init(void)
+ 	if (ret)
+ 		return ret;
+
+-	ret = device_create_file(ucma_misc.this_device, &dev_attr_abi_version);
++	ret = class_device_create_file(ucma_misc.class,
++				       &class_device_attr_abi_version);
+ 	if (ret) {
+ 		printk(KERN_ERR "rdma_ucm: couldn't create abi_version attr\n");
+ 		goto err;
+@@ -876,7 +875,8 @@ err:
+
+ static void __exit ucma_cleanup(void)
+ {
+-	device_remove_file(ucma_misc.this_device, &dev_attr_abi_version);
++	class_device_remove_file(ucma_misc.class,
++				 &class_device_attr_abi_version);
+ 	misc_deregister(&ucma_misc);
+ 	idr_destroy(&ctx_idr);
+ }
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,12 @@
+diff --git a/drivers/net/cxgb3/Makefile b/drivers/net/cxgb3/Makefile
+index 3434679..bb008b6 100755
+--- a/drivers/net/cxgb3/Makefile
++++ b/drivers/net/cxgb3/Makefile
+@@ -1,6 +1,7 @@
+ #
+ # Chelsio T3 driver
+ #
++NOSTDINC_FLAGS:= $(NOSTDINC_FLAGS) $(LINUXINCLUDE)
+
+ obj-$(CONFIG_CHELSIO_T3) += cxgb3.o
+
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,352 @@
+BACKPORT - use old IRQ infrastructure on 2.6.18 and earlier
+
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/Kconfig
+--- a/drivers/infiniband/hw/ipath/Kconfig	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/Kconfig	Thu Mar 08 14:04:08 2007 -0800
+@@ -1,6 +1,6 @@ config INFINIBAND_IPATH
+ config INFINIBAND_IPATH
+ 	tristate "QLogic InfiniPath Driver"
+-	depends on (PCI_MSI || HT_IRQ) && 64BIT && INFINIBAND && NET
++	depends on PCI_MSI && 64BIT && INFINIBAND && NET
+ 	---help---
+ 	This is a driver for QLogic InfiniPath host channel adapters,
+ 	including InfiniBand verbs support.  This driver allows these
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/Makefile
+--- a/drivers/infiniband/hw/ipath/Makefile	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/Makefile	Thu Mar 08 14:04:08 2007 -0800
+@@ -32,7 +32,7 @@ ib_ipath-y := \
+ 	ipath_verbs_mcast.o \
+ 	ipath_verbs.o
+
+-ib_ipath-$(CONFIG_HT_IRQ) += ipath_iba6110.o
++ib_ipath-y += ipath_iba6110.o
+ ib_ipath-$(CONFIG_PCI_MSI) += ipath_iba6120.o
+
+ ib_ipath-$(CONFIG_X86_64) += ipath_wc_x86_64.o
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_driver.c
+--- a/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Mar 08 14:04:08 2007 -0800
+@@ -42,6 +42,8 @@
+ #include "ipath_verbs.h"
+ #include "ipath_common.h"
+
++#define CONFIG_HT_IRQ
++
+ static void ipath_update_pio_bufs(struct ipath_devdata *);
+
+ const char *ipath_get_unit_name(int unit)
+@@ -347,7 +349,7 @@ static int __devinit ipath_init_one(stru
+ 	}
+ 	addr = pci_resource_start(pdev, 0);
+ 	len = pci_resource_len(pdev, 0);
+-	ipath_cdbg(VERBOSE, "regbase (0) %llx len %d pdev->irq %d, vend %x/%x "
++	ipath_cdbg(VERBOSE, "regbase (0) %llx len %d irq %x, vend %x/%x "
+ 		   "driver_data %lx\n", addr, len, pdev->irq, ent->vendor,
+ 		   ent->device, ent->driver_data);
+
+@@ -530,15 +532,15 @@ static int __devinit ipath_init_one(stru
+ 	 * check 0 irq after we return from chip-specific bus setup, since
+ 	 * that can affect this due to setup
+ 	 */
+-	if (!dd->ipath_irq)
++	if (!pdev->irq)
+ 		ipath_dev_err(dd, "irq is 0, BIOS error?  Interrupts won't "
+ 			      "work\n");
+ 	else {
+-		ret = request_irq(dd->ipath_irq, ipath_intr, IRQF_SHARED,
++		ret = request_irq(pdev->irq, ipath_intr, IRQF_SHARED,
+ 				  IPATH_DRV_NAME, dd);
+ 		if (ret) {
+ 			ipath_dev_err(dd, "Couldn't setup irq handler, "
+-				      "irq=%d: %d\n", dd->ipath_irq, ret);
++				      "irq=%d: %d\n", pdev->irq, ret);
+ 			goto bail_iounmap;
+ 		}
+ 	}
+@@ -709,10 +711,11 @@ static void __devexit ipath_remove_one(s
+ 	 * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs
+ 	 * for all versions of the driver, if they were allocated
+ 	 */
+-	if (dd->ipath_irq) {
+-		ipath_cdbg(VERBOSE, "unit %u free irq %d\n",
+-			   dd->ipath_unit, dd->ipath_irq);
+-		dd->ipath_f_free_irq(dd);
++	if (pdev->irq) {
++		ipath_cdbg(VERBOSE,
++			   "unit %u free_irq of irq %x\n",
++			   dd->ipath_unit, pdev->irq);
++		free_irq(pdev->irq, dd);
+ 	} else
+ 		ipath_dbg("irq is 0, not doing free_irq "
+ 			  "for unit %u\n", dd->ipath_unit);
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_iba6110.c
+--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c	Thu Mar 08 14:04:08 2007 -0800
+@@ -38,7 +38,6 @@
+
+ #include <linux/pci.h>
+ #include <linux/delay.h>
+-#include <linux/htirq.h>
+
+ #include "ipath_kernel.h"
+ #include "ipath_registers.h"
+@@ -914,40 +913,49 @@ static void slave_or_pri_blk(struct ipat
+ 	}
+ }
+
+-static int ipath_ht_intconfig(struct ipath_devdata *dd)
+-{
+-	int ret;
+-
+-	if (dd->ipath_intconfig) {
+-		ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig,
+-				 dd->ipath_intconfig);	/* interrupt address */
+-		ret = 0;
+-	} else {
+-		ipath_dev_err(dd, "No interrupts enabled, couldn't setup "
+-			      "interrupt address\n");
+-		ret = -EINVAL;
+-	}
+-
+-	return ret;
+-}
+-
+-static void ipath_ht_irq_update(struct pci_dev *dev, int irq,
+-				struct ht_irq_msg *msg)
+-{
+-	struct ipath_devdata *dd = pci_get_drvdata(dev);
+-	u64 prev_intconfig = dd->ipath_intconfig;
+-
+-	dd->ipath_intconfig = msg->address_lo;
+-	dd->ipath_intconfig |= ((u64) msg->address_hi) << 32;
+-
+-	/*
+-	 * If the previous value of dd->ipath_intconfig is zero, we're
+-	 * getting configured for the first time, and must not program the
+-	 * intconfig register here (it will be programmed later, when the
+-	 * hardware is ready).  Otherwise, we should.
+-	 */
+-	if (prev_intconfig)
+-		ipath_ht_intconfig(dd);
++static int set_int_handler(struct ipath_devdata *dd, struct pci_dev *pdev,
++			    int pos)
++{
++	u32 int_handler_addr_lower;
++	u32 int_handler_addr_upper;
++	u64 ihandler;
++	u32 intvec;
++
++	/* use indirection register to get the intr handler */
++	pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x10);
++	pci_read_config_dword(pdev, pos + 4, &int_handler_addr_lower);
++	pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x11);
++	pci_read_config_dword(pdev, pos + 4, &int_handler_addr_upper);
++
++	ihandler = (u64) int_handler_addr_lower |
++		((u64) int_handler_addr_upper << 32);
++
++	/*
++	 * kernels with CONFIG_PCI_MSI set the vector in the irq field of
++	 * struct pci_device, so we use that to program the internal
++	 * interrupt register (not config space) with that value. The BIOS
++	 * must still have done the basic MSI setup.
++	 */
++	intvec = pdev->irq;
++	/*
++	 * clear any vector bits there; normally not set but we'll overload
++	 * this for some debug purposes (setting the HTC debug register
++	 * value from software, rather than GPIOs), so it might be set on a
++	 * driver reload.
++	 */
++	ihandler &= ~0xff0000;
++	/* x86 vector goes in intrinfo[23:16] */
++	ihandler |= intvec << 16;
++	ipath_cdbg(VERBOSE, "ihandler lower %x, upper %x, intvec %x, "
++		   "interruptconfig %llx\n", int_handler_addr_lower,
++		   int_handler_addr_upper, intvec,
++		   (unsigned long long) ihandler);
++
++	/* can't program yet, so save for interrupt setup */
++	dd->ipath_intconfig = ihandler;
++	/* keep going, so we find link control stuff also */
++
++	return ihandler != 0;
+ }
+
+ /**
+@@ -963,19 +971,12 @@ static int ipath_setup_ht_config(struct
+ static int ipath_setup_ht_config(struct ipath_devdata *dd,
+ 				 struct pci_dev *pdev)
+ {
+-	int pos, ret;
+-
+-	ret = __ht_create_irq(pdev, 0, ipath_ht_irq_update);
+-	if (ret < 0) {
+-		ipath_dev_err(dd, "Couldn't create interrupt handler: "
+-			      "err %d\n", ret);
+-		goto bail;
+-	}
+-	dd->ipath_irq = ret;
+-	ret = 0;
+-
+-	/*
+-	 * Handle clearing CRC errors in linkctrl register if necessary.  We
++	int pos, ret = 0;
++	int ihandler = 0;
++
++	/*
++	 * Read the capability info to find the interrupt info, and also
++	 * handle clearing CRC errors in linkctrl register if necessary.  We
+ 	 * do this early, before we ever enable errors or hardware errors,
+ 	 * mostly to avoid causing the chip to enter freeze mode.
+ 	 */
+@@ -999,8 +1000,16 @@ static int ipath_setup_ht_config(struct
+ 		}
+ 		if (!(cap_type & 0xE0))
+ 			slave_or_pri_blk(dd, pdev, pos, cap_type);
++		else if (cap_type == HT_INTR_DISC_CONFIG)
++			ihandler = set_int_handler(dd, pdev, pos);
+ 	} while ((pos = pci_find_next_capability(pdev, pos,
+ 						 PCI_CAP_ID_HT)));
++
++	if (!ihandler) {
++		ipath_dev_err(dd, "Couldn't find interrupt handler in "
++			      "config space\n");
++		ret = -ENODEV;
++	}
+
+ bail:
+ 	return ret;
+@@ -1351,6 +1360,25 @@ static void ipath_ht_quiet_serdes(struct
+ 	ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val);
+ }
+
++static int ipath_ht_intconfig(struct ipath_devdata *dd)
++{
++	int ret;
++
++	if (!dd->ipath_intconfig) {
++		ipath_dev_err(dd, "No interrupts enabled, couldn't setup "
++			      "interrupt address\n");
++		ret = 1;
++		goto bail;
++	}
++
++	ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig,
++			 dd->ipath_intconfig);	/* interrupt address */
++	ret = 0;
++
++bail:
++	return ret;
++}
++
+ /**
+  * ipath_pe_put_tid - write a TID in chip
+  * @dd: the infinipath device
+@@ -1546,14 +1574,6 @@ static int ipath_ht_get_base_info(struct
+ 	return 0;
+ }
+
+-static void ipath_ht_free_irq(struct ipath_devdata *dd)
+-{
+-	free_irq(dd->ipath_irq, dd);
+-	ht_destroy_irq(dd->ipath_irq);
+-	dd->ipath_irq = 0;
+-	dd->ipath_intconfig = 0;
+-}
+-
+ /**
+  * ipath_init_iba6110_funcs - set up the chip-specific function pointers
+  * @dd: the infinipath device
+@@ -1577,7 +1597,6 @@ void ipath_init_iba6110_funcs(struct ipa
+ 	dd->ipath_f_cleanup = ipath_setup_ht_cleanup;
+ 	dd->ipath_f_setextled = ipath_setup_ht_setextled;
+ 	dd->ipath_f_get_base_info = ipath_ht_get_base_info;
+-	dd->ipath_f_free_irq = ipath_ht_free_irq;
+
+ 	/*
+ 	 * initialize chip-specific variables
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_iba6120.c
+--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c	Thu Mar 08 14:04:08 2007 -0800
+@@ -856,7 +856,6 @@ static int ipath_setup_pe_config(struct
+ 		ipath_dev_err(dd, "pci_enable_msi failed: %d, "
+ 			      "interrupts may not work\n", ret);
+ 	/* continue even if it fails, we may still be OK... */
+-	dd->ipath_irq = pdev->irq;
+
+ 	if ((pos = pci_find_capability(dd->pcidev, PCI_CAP_ID_MSI))) {
+ 		u16 control;
+@@ -1324,12 +1323,6 @@ done:
+ 	return 0;
+ }
+
+-static void ipath_pe_free_irq(struct ipath_devdata *dd)
+-{
+-	free_irq(dd->ipath_irq, dd);
+-	dd->ipath_irq = 0;
+-}
+-
+ /**
+  * ipath_init_iba6120_funcs - set up the chip-specific function pointers
+  * @dd: the infinipath device
+@@ -1356,7 +1349,6 @@ void ipath_init_iba6120_funcs(struct ipa
+ 	dd->ipath_f_cleanup = ipath_setup_pe_cleanup;
+ 	dd->ipath_f_setextled = ipath_setup_pe_setextled;
+ 	dd->ipath_f_get_base_info = ipath_pe_get_base_info;
+-	dd->ipath_f_free_irq = ipath_pe_free_irq;
+
+ 	/* initialize chip-specific variables */
+ 	dd->ipath_f_tidtemplate = ipath_pe_tidtemplate;
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_intr.c
+--- a/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Mar 08 14:04:08 2007 -0800
+@@ -732,14 +732,14 @@ static void ipath_bad_intr(struct ipath_
+ 			 * linuxbios development work, and it may happen in
+ 			 * the future again.
+ 			 */
+-			if (dd->pcidev && dd->ipath_irq) {
++			if (dd->pcidev && dd->pcidev->irq) {
+ 				ipath_dev_err(dd, "Now %u unexpected "
+ 					      "interrupts, unregistering "
+ 					      "interrupt handler\n",
+ 					      *unexpectp);
+-				ipath_dbg("free_irq of irq %d\n",
+-					  dd->ipath_irq);
+-				dd->ipath_f_free_irq(dd);
++				ipath_dbg("free_irq of irq %x\n",
++					  dd->pcidev->irq);
++				free_irq(dd->pcidev->irq, dd);
+ 			}
+ 		}
+ 		if (ipath_read_kreg32(dd, dd->ipath_kregs->kr_intmask)) {
+@@ -775,7 +775,7 @@ static void ipath_bad_regread(struct ipa
+ 		if (allbits == 2) {
+ 			ipath_dev_err(dd, "Still bad interrupt status, "
+ 				      "unregistering interrupt\n");
+-			dd->ipath_f_free_irq(dd);
++			free_irq(dd->pcidev->irq, dd);
+ 		} else if (allbits > 2) {
+ 			if ((allbits % 10000) == 0)
+ 				printk(".");
+diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_kernel.h
+--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Mar 08 14:02:44 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Mar 08 14:04:08 2007 -0800
+@@ -213,8 +213,6 @@ struct ipath_devdata {
+ 	void (*ipath_f_setextled)(struct ipath_devdata *, u64, u64);
+ 	/* fill out chip-specific fields */
+ 	int (*ipath_f_get_base_info)(struct ipath_portdata *, void *);
+-	/* free irq */
+-	void (*ipath_f_free_irq)(struct ipath_devdata *);
+ 	struct ipath_ibdev *verbs_dev;
+ 	struct timer_list verbs_timer;
+ 	/* total dwords sent (summed from counter) */
+@@ -332,8 +330,6 @@ struct ipath_devdata {
+ 	/* so we can rewrite it after a chip reset */
+ 	u32 ipath_pcibar1;
+
+-	/* interrupt number */
+-	int ipath_irq;
+ 	/* HT/PCI Vendor ID (here for NodeInfo) */
+ 	u16 ipath_vendorid;
+ 	/* HT/PCI Device ID (here for NodeInfo) */
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,26 @@
+BACKPORT - interrupt handler signature changed in 2.6.19
+
+diff -r 8e3a2c4c9490 drivers/infiniband/hw/ipath/ipath_intr.c
+--- a/drivers/infiniband/hw/ipath/ipath_intr.c	Wed Jan 31 16:04:27 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_intr.c	Wed Jan 31 16:11:22 2007 -0800
+@@ -897,7 +897,7 @@ static void handle_urcv(struct ipath_dev
+ 	}
+ }
+
+-irqreturn_t ipath_intr(int irq, void *data)
++irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *ignored)
+ {
+ 	struct ipath_devdata *dd = data;
+ 	u32 istat, chk0rcv = 0;
+diff -r 8e3a2c4c9490 drivers/infiniband/hw/ipath/ipath_kernel.h
+--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Wed Jan 31 16:04:27 2007 -0800
++++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Wed Jan 31 16:11:22 2007 -0800
+@@ -637,7 +637,7 @@ struct sk_buff *ipath_alloc_skb(struct i
+
+ extern int ipath_diag_inuse;
+
+-irqreturn_t ipath_intr(int irq, void *devid);
++irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *);
+ int ipath_decode_err(char *buf, size_t blen, ipath_err_t err);
+ #if __IPATH_INFO || __IPATH_DBG
+ extern const char *ipath_ibcstatus_str[];
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,17 @@
+diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
+index 163d991..2cd239f 100644
+--- a/drivers/infiniband/core/Makefile
++++ b/drivers/infiniband/core/Makefile
+@@ -30,3 +30,5 @@ ib_ucm-y :=			ucm.o
+
+ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o \
+ 				uverbs_marshall.o
++
++ib_core-y +=			genalloc.o
+diff --git a/drivers/infiniband/core/genalloc.c b/drivers/infiniband/core/genalloc.c
+new file mode 100644
+index 0000000..96a48fe
+--- /dev/null
++++ b/drivers/infiniband/core/genalloc.c
+@@ -0,0 +1 @@
++#include "src/genalloc.c"
diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch
--- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch	1970-01-01 05:30:00.000000000 +0530
+++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch	2009-02-03 00:44:23.000000000 +0530
@@ -0,0 +1,277 @@
+Index: gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.c
+===================================================================
+--- gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check.orig/drivers/scsi/iscsi_tcp.c
++++ gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.c
+@@ -108,8 +108,8 @@ iscsi_hdr_digest(struct iscsi_conn *conn
+ {
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+
+-	crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
+-	buf->sg.length = tcp_conn->hdr_size;
++	crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
++	buf->sg.length += sizeof(uint32_t);
+ }
+
+ static inline int
+@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
+
+ 		sg_init_one(&sg, (u8 *)hdr,
+ 			    sizeof(struct iscsi_hdr) + ahslen);
+-		crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length,
+-				   (u8 *)&cdgst);
++		crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst);
+ 		rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) +
+ 				     ahslen);
+ 		if (cdgst != rdgst) {
+@@ -649,9 +648,10 @@ iscsi_ctask_copy(struct iscsi_tcp_conn *
+  *	byte counters.
+  **/
+ static inline int
+-iscsi_tcp_copy(struct iscsi_conn *conn, int buf_size)
++iscsi_tcp_copy(struct iscsi_conn *conn)
+ {
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
++	int buf_size = tcp_conn->in.datalen;
+ 	int buf_left = buf_size - tcp_conn->data_copied;
+ 	int size = min(tcp_conn->in.copy, buf_left);
+ 	int rc;
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn,
+ }
+
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_digest_update(tfm, &temp, 1);
+ }
+
+ static void
+@@ -693,7 +693,7 @@ iscsi_recv_digest_update(struct iscsi_tc
+ 	struct scatterlist tmp;
+
+ 	sg_init_one(&tmp, buf, len);
+-	crypto_hash_update(&tcp_conn->rx_hash, &tmp, len);
++	crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1);
+ }
+
+ static int iscsi_scsi_data_in(struct iscsi_conn *conn)
+@@ -747,12 +747,12 @@ static int iscsi_scsi_data_in(struct isc
+ 		if (!rc) {
+ 			if (conn->datadgst_en) {
+ 				if (!offset)
+-					crypto_hash_update(
+-							&tcp_conn->rx_hash,
++					crypto_digest_update(
++							tcp_conn->rx_tfm,
+ 							&sg[i], sg[i].length);
+ 				else
+ 					partial_sg_digest_update(
+-							&tcp_conn->rx_hash,
++							tcp_conn->rx_tfm,
+ 							&sg[i],
+ 							sg[i].offset + offset,
+ 							sg[i].length - offset);
+@@ -766,10 +766,9 @@ static int iscsi_scsi_data_in(struct isc
+ 				/*
+ 				 * data-in is complete, but buffer not...
+ 				 */
+-				partial_sg_digest_update(&tcp_conn->rx_hash,
+-							 &sg[i],
+-							 sg[i].offset,
+-							 sg[i].length-rc);
++				partial_sg_digest_update(tcp_conn->rx_tfm,
++						&sg[i],
++						sg[i].offset, sg[i].length-rc);
+ 			rc = 0;
+ 			break;
+ 		}
+@@ -813,7 +812,7 @@ iscsi_data_recv(struct iscsi_conn *conn)
+ 		 * Collect data segment to the connection's data
+ 		 * placeholder
+ 		 */
+-		if (iscsi_tcp_copy(conn, tcp_conn->in.datalen)) {
++		if (iscsi_tcp_copy(conn)) {
+ 			rc = -EAGAIN;
+ 			goto exit;
+ 		}
+@@ -887,7 +886,7 @@ more:
+ 		rc = iscsi_tcp_hdr_recv(conn);
+ 		if (!rc && tcp_conn->in.datalen) {
+ 			if (conn->datadgst_en)
+-				crypto_hash_init(&tcp_conn->rx_hash);
++				crypto_digest_init(tcp_conn->rx_tfm);
+ 			tcp_conn->in_progress = IN_PROGRESS_DATA_RECV;
+ 		} else if (rc) {
+ 			iscsi_conn_failure(conn, rc);
+@@ -900,15 +899,10 @@ more:
+
+ 		debug_tcp("extra data_recv offset %d copy %d\n",
+ 			  tcp_conn->in.offset, tcp_conn->in.copy);
+-		rc = iscsi_tcp_copy(conn, sizeof(uint32_t));
+-		if (rc) {
+-			if (rc == -EAGAIN)
+-				goto again;
+-			iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+-			return 0;
+-		}
+-
+-		memcpy(&recv_digest, conn->data, sizeof(uint32_t));
++		skb_copy_bits(tcp_conn->in.skb, tcp_conn->in.offset,
++				&recv_digest, 4);
++		tcp_conn->in.offset += 4;
++		tcp_conn->in.copy -= 4;
+ 		if (recv_digest != tcp_conn->in.datadgst) {
+ 			debug_tcp("iscsi_tcp: data digest error!"
+ 				  "0x%x != 0x%x\n", recv_digest,
+@@ -944,14 +938,13 @@ more:
+ 					  tcp_conn->in.padding);
+ 				memset(pad, 0, tcp_conn->in.padding);
+ 				sg_init_one(&sg, pad, tcp_conn->in.padding);
+-				crypto_hash_update(&tcp_conn->rx_hash,
+-						   &sg, sg.length);
++				crypto_digest_update(tcp_conn->rx_tfm,
++						     &sg, 1);
+ 			}
+-			crypto_hash_final(&tcp_conn->rx_hash,
+-					  (u8 *) &tcp_conn->in.datadgst);
++			crypto_digest_final(tcp_conn->rx_tfm,
++					    (u8 *) & tcp_conn->in.datadgst);
+ 			debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst);
+ 			tcp_conn->in_progress = IN_PROGRESS_DDIGEST_RECV;
+-			tcp_conn->data_copied = 0;
+ 		} else
+ 			tcp_conn->in_progress = IN_PROGRESS_WAIT_HEADER;
+ 	}
+@@ -1193,7 +1186,7 @@ static inline void
+ iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn,
+ 		      struct iscsi_tcp_cmd_task *tcp_ctask)
+ {
+-	crypto_hash_init(&tcp_conn->tx_hash);
++	crypto_digest_init(tcp_conn->tx_tfm);
+ 	tcp_ctask->digest_count = 4;
+ }
+
+@@ -1449,9 +1442,8 @@ iscsi_send_padding(struct iscsi_conn *co
+ 		iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad,
+ 				   tcp_ctask->pad_count);
+ 		if (conn->datadgst_en)
+-			crypto_hash_update(&tcp_conn->tx_hash,
+-					   &tcp_ctask->sendbuf.sg,
+-					   tcp_ctask->sendbuf.sg.length);
++			crypto_digest_update(tcp_conn->tx_tfm,
++					     &tcp_ctask->sendbuf.sg, 1);
+ 	} else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD))
+ 		return 0;
+
+@@ -1483,7 +1475,7 @@ iscsi_send_digest(struct iscsi_conn *con
+ 	tcp_conn = conn->dd_data;
+
+ 	if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) {
+-		crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest);
++		crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest);
+ 		iscsi_buf_init_iov(buf, (char*)digest, 4);
+ 	}
+ 	tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST;
+@@ -1517,7 +1509,7 @@ iscsi_send_data(struct iscsi_cmd_task *c
+ 		rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent);
+ 		*sent = *sent + buf_sent;
+ 		if (buf_sent && conn->datadgst_en)
+-			partial_sg_digest_update(&tcp_conn->tx_hash,
++			partial_sg_digest_update(tcp_conn->tx_tfm,
+ 				&sendbuf->sg, sendbuf->sg.offset + offset,
+ 				buf_sent);
+ 		if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) {
+@@ -1774,22 +1766,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_tfm)
+ 		goto free_tcp_conn;
+
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_tfm)
+ 		goto free_tx_tfm;
+
+ 	return cls_conn;
+
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_tfm);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,11 +1811,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
+-
++	if (tcp_conn->tx_tfm)
++		crypto_free_tfm(tcp_conn->tx_tfm);
++	if (tcp_conn->rx_tfm)
++		crypto_free_tfm(tcp_conn->rx_tfm);
+ 	kfree(tcp_conn);
+ }
+
+@@ -1835,11 +1822,9 @@ static void
+ iscsi_tcp_conn_stop(struct iscsi_cls_conn *cls_conn, int flag)
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+-	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+
+ 	iscsi_conn_stop(cls_conn, flag);
+ 	iscsi_tcp_release_conn(conn);
+-	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ }
+
+ static int
+Index: gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.h
+===================================================================
+--- gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check.orig/drivers/scsi/iscsi_tcp.h
++++ gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.h
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+
+-struct crypto_hash;
+ struct socket;
+
+ /* Socket connection recieve helper */
+@@ -82,7 +81,6 @@ struct iscsi_tcp_conn {
+ 						 * stop to terminate */
+ 	/* iSCSI connection-wide sequencing */
+ 	int			hdr_size;	/* PDU header size */
+-
+ 	/* control data */
+ 	struct iscsi_tcp_recv	in;		/* TCP receive context */
+ 	int			in_progress;	/* connection state machine */
+@@ -93,8 +91,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_tfm;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_tfm;	/* CRC32C (Rx) */
+
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
diff -ruN ofa_kernel-1.2/ofed_scripts/configure ofa_kernel-1.2_try2/ofed_scripts/configure
--- ofa_kernel-1.2/ofed_scripts/configure	2009-02-03 02:12:23.000000000 +0530
+++ ofa_kernel-1.2_try2/ofed_scripts/configure	2009-02-03 02:13:18.000000000 +0530
@@ -218,9 +218,12 @@
         2.6.17*)
                 echo 2.6.17
         ;;
-        2.6.18-*fc[56]*|2.6.18-*el5*)
+        2.6.18-*fc[56]*|2.6.18-8.el5)
                 echo 2.6.18_FC6
         ;;
+	2.6.18-53.el5)
+		echo 2.6.18-EL5.1
+	;;
         2.6.18*)
                 echo 2.6.18
         ;;

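For reference, the configure hunk above can be tried out on its own. The following is a minimal sketch, assuming the surrounding `case` statement simply echoes a backport directory name for a given kernel release string. The function name `backport_dir` and the catch-all branch are hypothetical and are not part of ofed_scripts/configure; only the four patterns are taken from the hunk.

```shell
#!/bin/sh
# Hypothetical wrapper around the patterns from the patched configure hunk.
backport_dir() {
    case "$1" in
        2.6.17*)
            echo 2.6.17
        ;;
        2.6.18-*fc[56]*|2.6.18-8.el5)
            echo 2.6.18_FC6
        ;;
        2.6.18-53.el5)
            echo 2.6.18-EL5.1
        ;;
        2.6.18*)
            echo 2.6.18
        ;;
        *)
            # Assumption: the real script handles other kernels elsewhere.
            echo unknown
        ;;
    esac
}

backport_dir 2.6.18-53.el5   # RHEL 5.1 kernel
backport_dir 2.6.18-8.el5    # RHEL 5.0 kernel
backport_dir 2.6.18-92.el5   # any other el5 release
```

Because `case` takes the first matching branch, the two el5 patterns have to stay ahead of the generic `2.6.18*` entry. With the patched patterns, an el5 release other than -8 or -53 no longer matches the FC6 branch and falls through to the plain 2.6.18 directory.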
Regards,
Devesh Sharma


From devesh28 at gmail.com  Tue Feb  3 06:11:34 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 19:41:34 +0530
Subject: Re: [ofa-general][PATCH v2] compiling OFED-1.2
	with RHEL5.1
Message-ID: <309a667c0902030611j6d0f23eav52afd361e378f968@mail.gmail.com>

Hello list, here is the second patch inline.

diff -ruN a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
--- a/drivers/infiniband/core/mad.c	2009-02-03 02:37:18.000000000 +0530
+++ b/drivers/infiniband/core/mad.c	2009-02-03 02:37:08.000000000 +0530
@@ -2955,7 +2955,6 @@
 					 sizeof(struct ib_mad_private),
 					 0,
 					 SLAB_HWCACHE_ALIGN,
-					 NULL,
 					 NULL);
 	if (!ib_mad_cache) {
 		printk(KERN_ERR PFX "Couldn't create ib_mad cache\n");
diff -ruN a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c
--- a/drivers/infiniband/hw/amso1100/c2_vq.c	2009-02-03 02:31:19.000000000 +0530
+++ b/drivers/infiniband/hw/amso1100/c2_vq.c	2009-02-03 02:51:30.000000000 +0530
@@ -85,7 +85,7 @@
 		(char) ('0' + c2dev->devnum));
 	c2dev->host_msg_cache =
 	    kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0,
-			      SLAB_HWCACHE_ALIGN, NULL, NULL);
+			      SLAB_HWCACHE_ALIGN, NULL);
 	if (c2dev->host_msg_cache == NULL) {
 		return -ENOMEM;
 	}
diff -ruN a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
--- a/drivers/infiniband/hw/ehca/ehca_av.c	2009-02-03 02:31:19.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_av.c	2009-02-03 02:49:53.000000000 +0530
@@ -257,7 +257,7 @@
 	av_cache = kmem_cache_create("ehca_cache_av",
 				   sizeof(struct ehca_av), 0,
 				   SLAB_HWCACHE_ALIGN,
-				   NULL, NULL);
+				   NULL);
 	if (!av_cache)
 		return -ENOMEM;
 	return 0;
diff -ruN a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
--- a/drivers/infiniband/hw/ehca/ehca_cq.c	2009-02-03 02:37:18.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c	2009-02-03 02:49:08.000000000 +0530
@@ -396,7 +396,7 @@
 	cq_cache = kmem_cache_create("ehca_cache_cq",
 				     sizeof(struct ehca_cq), 0,
 				     SLAB_HWCACHE_ALIGN,
-				     NULL, NULL);
+				     NULL);
 	if (!cq_cache)
 		return -ENOMEM;
 	return 0;
diff -ruN a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
--- a/drivers/infiniband/hw/ehca/ehca_main.c	2009-02-03 02:37:18.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_main.c	2009-02-03 02:49:30.000000000 +0530
@@ -165,7 +165,7 @@
 	ctblk_cache = kmem_cache_create("ehca_cache_ctblk",
 					EHCA_PAGESIZE, H_CB_ALIGNMENT,
 					SLAB_HWCACHE_ALIGN,
-					NULL, NULL);
+					NULL);
 	if (!ctblk_cache) {
 		ehca_gen_err("Cannot create ctblk SLAB cache.");
 		ehca_cleanup_mrmw_cache();
diff -ruN a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c	2009-02-03 02:37:18.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c	2009-02-03 02:51:00.000000000 +0530
@@ -2234,13 +2234,13 @@
 	mr_cache = kmem_cache_create("ehca_cache_mr",
 				     sizeof(struct ehca_mr), 0,
 				     SLAB_HWCACHE_ALIGN,
-				     NULL, NULL);
+				     NULL);
 	if (!mr_cache)
 		return -ENOMEM;
 	mw_cache = kmem_cache_create("ehca_cache_mw",
 				     sizeof(struct ehca_mw), 0,
 				     SLAB_HWCACHE_ALIGN,
-				     NULL, NULL);
+				     NULL);
 	if (!mw_cache) {
 		kmem_cache_destroy(mr_cache);
 		mr_cache = NULL;
diff -ruN a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c
--- a/drivers/infiniband/hw/ehca/ehca_pd.c	2009-02-03 02:31:19.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_pd.c	2009-02-03 02:50:11.000000000 +0530
@@ -101,7 +101,7 @@
 	pd_cache = kmem_cache_create("ehca_cache_pd",
 				     sizeof(struct ehca_pd), 0,
 				     SLAB_HWCACHE_ALIGN,
-				     NULL, NULL);
+				     NULL);
 	if (!pd_cache)
 		return -ENOMEM;
 	return 0;
diff -ruN a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
--- a/drivers/infiniband/hw/ehca/ehca_qp.c	2009-02-03 02:37:18.000000000 +0530
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c	2009-02-03 02:50:29.000000000 +0530
@@ -1440,7 +1440,7 @@
 	qp_cache = kmem_cache_create("ehca_cache_qp",
 				     sizeof(struct ehca_qp), 0,
 				     SLAB_HWCACHE_ALIGN,
-				     NULL, NULL);
+				     NULL);
 	if (!qp_cache)
 		return -ENOMEM;
 	return 0;
diff -ruN a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c	2009-02-03 02:31:19.000000000 +0530
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c	2009-02-03 02:51:56.000000000 +0530
@@ -625,7 +625,7 @@
 	ig.desc_cache = kmem_cache_create("iser_descriptors",
 					  sizeof (struct iser_desc),
 					  0, SLAB_HWCACHE_ALIGN,
-					  NULL, NULL);
+					  NULL);
 	if (ig.desc_cache == NULL)
 		return -ENOMEM;

diff -ruN a/net/rds/connection.c b/net/rds/connection.c
--- a/net/rds/connection.c	2009-02-03 02:31:19.000000000 +0530
+++ b/net/rds/connection.c	2009-02-03 07:00:14.000000000 +0530
@@ -332,7 +332,7 @@
 {
 	rds_conn_slab = kmem_cache_create("rds_connection",
 					  sizeof(struct rds_connection),
-				          0, 0, NULL, NULL);
+				          0, 0, NULL);
 	if (rds_conn_slab == NULL)
 		return -ENOMEM;

diff -ruN a/net/rds/ib_recv.c b/net/rds/ib_recv.c
--- a/net/rds/ib_recv.c	2009-02-03 02:37:18.000000000 +0530
+++ b/net/rds/ib_recv.c	2009-02-03 07:00:50.000000000 +0530
@@ -752,13 +752,13 @@

 	rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming",
 					sizeof(struct rds_ib_incoming),
-					0, 0, NULL, NULL);
+					0, 0, NULL);
 	if (rds_ib_incoming_slab == NULL)
 		goto out;

 	rds_ib_frag_slab = kmem_cache_create("rds_ib_frag",
 					sizeof(struct rds_page_frag),
-					0, 0, NULL, NULL);
+					0, 0, NULL);
 	if (rds_ib_frag_slab == NULL)
 		kmem_cache_destroy(rds_ib_incoming_slab);
 	else
diff -ruN a/net/rds/tcp.c b/net/rds/tcp.c
--- a/net/rds/tcp.c	2009-02-03 02:31:19.000000000 +0530
+++ b/net/rds/tcp.c	2009-02-03 07:01:35.000000000 +0530
@@ -254,7 +254,7 @@

 	rds_tcp_conn_slab = kmem_cache_create("rds_tcp_connection",
 					      sizeof(struct rds_tcp_connection),
-					      0, 0, NULL, NULL);
+					      0, 0, NULL);
 	if (rds_tcp_conn_slab == NULL) {
 		ret = -ENOMEM;
 		goto out;
diff -ruN a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
--- a/net/rds/tcp_recv.c	2009-02-03 02:31:19.000000000 +0530
+++ b/net/rds/tcp_recv.c	2009-02-03 07:01:59.000000000 +0530
@@ -344,7 +344,7 @@
 {
 	rds_tcp_incoming_slab = kmem_cache_create("rds_tcp_incoming",
 					sizeof(struct rds_tcp_incoming),
-					0, 0, NULL, NULL);
+					0, 0, NULL);
 	if (rds_tcp_incoming_slab == NULL)
 		return -ENOMEM;
 	return 0;

regards
Devesh Sharma


From tziporet at dev.mellanox.co.il  Tue Feb  3 06:14:00 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 3 Feb 2009 16:14:00 +0200
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
In-Reply-To: <1233654242.1364.39.camel@pyren.uio.no>
References: <1233654242.1364.39.camel@pyren.uio.no>
Message-ID: <3d47233f0902030614i29e567f8i46ea3df632936ac6@mail.gmail.com>

I am looking into how to help you.
Can you specify which FW version you are using?
Also, please make sure you have the most up-to-date BIOS for the AMD system.

Tziporet


On Tue, Feb 3, 2009 at 11:44 AM, Ole Widar Saastad
<o.w.saastad at usit.uio.no> wrote:

>
> I have problems using the OFED 1.4 software on the Sun x4600 nodes.
> Need help to get this to work. We plan to run GPFS over IB on these
> nodes in addition to MPI.
>
> Sun 4600 nodes with 8 quad core cpus,
> Quad-Core AMD Opteron(tm) Processor 8380
>
> OS is Rocks release 4.
> centos-release-4-4.2/x86_64/
>
> Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8
> 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Needless to say, our 300+ nodes (SUN x2200 with quad core) run fine with
> OFED 1.4 (and 1.3); they have almost the same kernel:
> Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> Same except ELsmp and not ELlargesmp.
>
> More information:
>
> dmesg prints out the following error message :
>
> Losing some ticks... checking if CPU frequency changed.
> modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp
> 0000007fbfffcfd8 error 6
> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
> mlx4_core: Initializing 0000:02:00.0
> ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193
> PCI: Setting latency timer of device 0000:02:00.0 to 64
> mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1,
> reducing to 1.
> MSI INIT SUCCESS
> mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1
> mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5)
> mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting.
> mlx4_core: probe of 0000:02:00.0 failed with error -5
>
> The following software is installed:
>
> Select Option [1-5]:3
> kernel-ib
> libibverbs
> libibverbs-devel
> libibverbs-utils
> libmthca
> libmlx4
> libcxgb3
> libnes
> libipathverbs
> libibcommon
> libibcommon-devel
> libibumad
> libibumad-devel
> ofed-docs
> ofed-scripts
> ibvexdmtools
> qlgc_vnic_daemon
>
>
> Just to be sure the card is present :
> lspci returns :
> 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
>
>
> --
> Ole W. Saastad, dr. scient.
> Scientific Computing Group, USIT, University of Oslo
> http://hpc.uio.no
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From devesh28 at gmail.com  Tue Feb  3 06:14:42 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 19:44:42 +0530
Subject: Re: [ofa-general][CONFIG SCRIPT] compiling
	OFED-1.2 with RHEL5.1
Message-ID: <309a667c0902030614j7a5bf5b7k5342fd021948fc2@mail.gmail.com>

This configuration script has to be run before following the normal
compile procedure. It must be run from the top-level OFED-1.2 directory,
with both patches in the same directory.

#!/bin/bash
# Repackage the ofa_kernel source rpm with the RHEL5.1 fixes applied.
ofed_top_dir=$(pwd)
package_name=ofa_kernel
package=ofa_kernel-1.2
package_rel=0

echo "Installing ${package_name} source rpm:"
if ! ( set -x && rpm -i --define "_topdir $(pwd)" \
        "SRPMS/${package}-${package_rel}.src.rpm" && set +x ); then
        echo "Failed to install ${package}-${package_rel}.src.rpm"
        exit 1
fi

cd SOURCES || exit 1
tar zxf ofa_kernel-1.2.tgz || exit 1

# Apply the build fix, then add the backport patch to the unpacked tree.
cd ofa_kernel-1.2 || exit 1
patch -p1 < "${ofed_top_dir}/OFED-1.2_RHEL5.1_fix.patch" || exit 1
cp "${ofed_top_dir}/kmem_cache_create_fix.patch" \
        "${ofed_top_dir}/SOURCES/ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/" || exit 1
cd - || exit 1

# Re-create the tarball that the spec file expects.
tar zcf ofa_kernel-1.2.tgz ofa_kernel-1.2 || exit 1
cd "${ofed_top_dir}" || exit 1

echo "Rebuilding ${package_name} source rpm:"
if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" \
        "SPECS/${package_name}.spec" && set +x ); then
        echo "Failed to create ${package}-${package_rel}.src.rpm"
        exit 1
fi

rm -rf "SOURCES/${package}"*

-regards
Devesh Sharma


From sashak at voltaire.com  Tue Feb  3 06:22:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:22:48 +0200
Subject: [ofa-general] Re: [ofw] saquery & osm vendor AL - ca_names missing
	from osm_vendor_t ?
In-Reply-To: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
Message-ID: <20090203142248.GM11874@sashak.voltaire.com>

On 14:11 Mon 02 Feb     , Sean Hefty wrote:
> Forwarding to general list and copying Sasha.
> 
> >Hello,
> >  The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in
> >opensm\user\include\vendor\osm_vendor_al.h) is missing
> >the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
> >saquery.c expects to find ca_names in osm_vendor_t.
> >
> >A couple of observations:
> >1) Windows currently supports a much older version of opensm than what OFED 1.4
> >tools expect.
> >
> >2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad
> >interfaces.
> >   If saquery used libibmad, then OpenSM would not need libibmad references for
> >Windows.
> >
> >3) How bad is it to create libibmad dependencies for OpenSM?

Why would we need to? Dependencies without a reason are not a good thing.

> >
> >4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
> >the rest use
> >   libibmad.

True.

> >
> >Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'.
> >This cmd-line input is resolved thru
> >libibmad interfaces.
> >Saquery is no exception in that it expects to match the '-C ca_name' against
> >osm_vendor_t.ca_names[]. 'ibstat -l' lists
> >CA names.
> >
> >The question becomes how best to resolve the missing ca_names?
> >
> >1) modify saquery to call libibmad interface to get CA names;

That is possible I guess.

> > leaves
> >osm_vendor_t unmodified.
> >   So far, saquery is the only diag pgm which uses OSM mad interfaces;
> >expecting ca_names
> >   in osm_vendor_t.

OpenSM (osm_vendor_ibumad layer) uses this too for port finding/choosing.

> >
> >2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate
> >ca_names
> >   from OpenSM code?

How does OpenSM in WinOF choose a port to use?

> > Creates libibmad dependencies for opensm.

ca_names[][] by itself doesn't create such dependencies. For instance
osm_vendor_ibumad.c has ca_names[][] and doesn't have any libibmad
dependency.

Sasha


From sashak at voltaire.com  Tue Feb  3 06:27:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:27:18 +0200
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
	<CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
Message-ID: <20090203142718.GN11874@sashak.voltaire.com>

On 14:51 Mon 02 Feb     , Sean Hefty wrote:
> >>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
> >>the rest use libibmad.
> 
> Looking briefly at the saquery code, I don't understand the benefit to using the
> opensm vendor interfaces, versus using libibmad or even libibumad directly, and
> switching to libibumad looks doable.  (It's not clear to me that there are
> benefits to using libibmad over libibumad for saquery.)
> 
> - osm_bind_handle_t looks like it could map to a libibumad port_id (int).
> - osmv_query_sa() could map to umad_send(), followed by umad_recv() to
>   obtain the result.  (Replace osmv_query_sa with a new function.)
> - There are a couple other calls that are used to loop through all returned
>   attributes in a response MAD.  We could use the MAD attribute offset
>   directly.  (Update loops where osmv_get_query_* is called.)
> 
> Are there technical reasons why the opensm vendor library was chosen for
> saquery?

AFAIK there are no such reasons.

> Would there be any objection to changing saquery to use libibumad
> directly?  

Not from me.

Sasha
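
A note on the last point in Sean's list (using the MAD attribute offset
directly instead of the osmv_get_query_* helpers). The following is a
minimal sketch in C, not saquery code. It assumes the response has already
been received and reassembled into one buffer that starts at the MAD header
(with libibumad, the pointer umad_get_mad() returns). It also assumes the
SA MAD layout as commonly documented for the IBA specification: a 16-bit
big-endian AttributeOffset at byte 44, counted in 8-byte units, with the
first record at byte 56. The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout assumptions for an SA MAD (not taken from the thread). */
#define SA_ATTR_OFFSET_POS 44	/* big-endian 16-bit AttributeOffset */
#define SA_DATA_POS        56	/* first returned record starts here */

/* Size in bytes of one record; AttributeOffset counts 8-byte units. */
static size_t sa_record_stride(const uint8_t *mad)
{
	uint16_t off = (uint16_t)((mad[SA_ATTR_OFFSET_POS] << 8) |
				  mad[SA_ATTR_OFFSET_POS + 1]);

	return (size_t)off * 8;
}

/* Number of complete records in a reassembled response of len bytes. */
static size_t sa_record_count(const uint8_t *mad, size_t len)
{
	size_t stride = sa_record_stride(mad);

	if (stride == 0 || len <= SA_DATA_POS)
		return 0;
	return (len - SA_DATA_POS) / stride;
}

/* Pointer to record i, or NULL if i is out of range. */
static const uint8_t *sa_record_at(const uint8_t *mad, size_t len, size_t i)
{
	if (i >= sa_record_count(mad, len))
		return NULL;
	return mad + SA_DATA_POS + i * sa_record_stride(mad);
}
```

A loop over sa_record_at(mad, len, i) for i below sa_record_count(mad, len)
could then stand in for the loops where osmv_get_query_* is called today.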


From sashak at voltaire.com  Tue Feb  3 06:30:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:30:40 +0200
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <9632920386E943489C39D8637052F404@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
	<CD74E5F748BA467794E0FA18E941775C@amr.corp.intel.com>
	<20090202150658.0af72134.weiny2@llnl.gov>
	<9632920386E943489C39D8637052F404@amr.corp.intel.com>
Message-ID: <20090203143032.GO11874@sashak.voltaire.com>

On 15:19 Mon 02 Feb     , Sean Hefty wrote:
> 
> libibumad does require the user to provide the address to the SA.  Providing a
> libibumad helper function to fill out ib_mad_addr_t for the local SA seems
> reasonable.  I guess we can look at what it would take to convert it in detail
> to see if anything is still missing from the lower libraries.

There are ib_resolve_smlid() and ib_resolve_smlid_via() functions in
libibmad already.

Sasha


From halr at obsidianresearch.com  Tue Feb  3 06:57:33 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 03 Feb 2009 07:57:33 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
Message-ID: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial description change to osm_node_get_num_physp.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-opensm-osm_node.h-osm_node_get_num_physp-descriptio.patch
Type: application/mbox
Size: 844 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090203/818ed80d/attachment.mbox>

From halr at obsidianresearch.com  Tue Feb  3 06:57:36 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 03 Feb 2009 07:57:36 -0700
Subject: [ofa-general] [PATCH] opensm/osm_perfmgr.c: Increase size of memory
	allocation in __collect_guids
Message-ID: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>

Sasha,

Patch to increase size of monitored node in
osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
port number.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-osm_perfmgr.c-Increase-size-of-memory-alloca.patch
Type: application/mbox
Size: 1508 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090203/21d1d534/attachment.mbox>

From halr at obsidianresearch.com  Tue Feb  3 06:57:50 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 03 Feb 2009 07:57:50 -0700
Subject: [ofa-general] [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port,
	allow queries on enhanced SP0
Message-ID: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>

Sasha,

Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced
SP0.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-opensm-osm_perfmgr_db.c-In-bad_node_port-allow-que.patch
Type: application/mbox
Size: 3685 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090203/fb50e030/attachment.mbox>

From jon at opengridcomputing.com  Tue Feb  3 07:31:25 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Tue, 3 Feb 2009 09:31:25 -0600
Subject: [ofa-general] Support for CXGB3 RNIC on P6
In-Reply-To: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
References: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
Message-ID: <20090203153124.GA13472@opengridcomputing.com>

On Tue, Feb 03, 2009 at 08:55:14AM +0530, Krishna Kumar2 wrote:
> 
> Hi,
> 
> My colleague (at a different site) is trying to get couple of Chelsio RNIC
> adapters working on
> p6 systems but for some reason the cards aren't recognized on bootup. The
> same cards works
> on my xseries systems, and following are the messages I get (there are no
> messages on his p6
> systems):
> 
> Feb  1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version
> 1.1.1-ko
> Feb  1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: PCI INT A -> GSI 17
> (level, low) -> IRQ 17
> Feb  1 11:42:49 localhost kernel: input: Power Button (FF) as
> /class/input/input1
> Feb  1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF]
> Feb  1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: Port 0 using 4 queue
> sets.
> Feb  1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4)
> PCI Express x8 MSI-X
> Feb  1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX,
> S/N: PT49070050
> 
> Is this revision of cxgb3 (rev4) not supported on p6? Or are we missing
> something to get it to work?

Does the adapter show up on his system when he runs lspci?


> 
> thanks,
> 
> - KK


From rdreier at cisco.com  Tue Feb  3 08:12:39 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 03 Feb 2009 08:12:39 -0800
Subject: [ofa-general] Support for CXGB3 RNIC on P6
In-Reply-To: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
	(Krishna Kumar2's message of "Tue, 3 Feb 2009 08:55:14 +0530")
References: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
Message-ID: <adaocxjo5xk.fsf@cisco.com>

 > My colleague (at a different site) is trying to get couple of Chelsio RNIC
 > adapters working on
 > p6 systems but for some reason the cards aren't recognized on bootup.

What do you mean by "aren't recognized on bootup"?  More details on the
specific problem are needed to diagnose it.

 - R.


From jackm at dev.mellanox.co.il  Tue Feb  3 08:16:41 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 3 Feb 2009 18:16:41 +0200
Subject: [ofa-general] Kernel panic in IPoIB stability testing
Message-ID: <200902031816.41784.jackm@dev.mellanox.co.il>

We saw the following kernel panic when testing ipoib stability intensively
by simultaneously (i.e., in separate processes, with random wait intervals) doing:
- ifconfig up/down
- opensm up/down
- ipoib ping
- arp delete
- driver up/down

ib0: ib_sa_path_rec_get failed: -11
ib0: ib_sa_path_rec_get failed: -11
Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP:  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
PGD 224ea0067 PUD 225ae9067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0
CPU 2
Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth
sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U)
ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac
cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff883ac404>]  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
RSP: 0018:ffff810121ee7de0  EFLAGS: 00010046
RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002
RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500
RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000
R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30
R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3
FS:  00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0
Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860)
Stack:  ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850  ffffffffffffffff 7fffffffffffffff
ffffffffffffffff ffff810121ee8688  ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4
Call Trace:  [<ffffffff883ae850>] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6
             [<ffffffff8004b2b4>] run_workqueue+0x94/0xe5
             [<ffffffff80047c13>] worker_thread+0x0/0x122
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff80047d03>] worker_thread+0xf0/0x122
             [<ffffffff80086c5f>] default_wake_function+0x0/0xe
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff8003216e>] kthread+0xfe/0x132
             [<ffffffff8005bfe5>] child_rip+0xa/0x11
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff80032070>] kthread+0x0/0x132
             [<ffffffff8005bfdb>] child_rip+0x0/0x11

Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49
RIP  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
RSP <ffff810121ee7de0>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

In objdump -ld, we get:
ipoib_mark_paths_invalid():
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365
    13f7:       c7 83 e0 00 00 00 00    movl   $0x0,0xe0(%rbx)
    13fe:       00 00 00
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
    1401:       4c 89 e3                mov    %r12,%rbx
==>    1404:       4d 8b a4 24 d0 00 00    mov    0xd0(%r12),%r12
    140b:       00
    140c:       48 8d 93 d0 00 00 00    lea    0xd0(%rbx),%rdx
    1413:       48 8d 45 38             lea    0x38(%rbp),%rax
    1417:       49 81 ec d0 00 00 00    sub    $0xd0,%r12
    141e:       48 39 c2                cmp    %rax,%rdx
    1421:       0f 85 4b ff ff ff       jne    1372 <ipoib_mark_paths_invalid+0x2a>
--------------------------------
and in the source code, we get:

void ipoib_mark_paths_invalid(struct net_device *dev)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        struct ipoib_path *path, *tp;

        spin_lock_irq(&priv->lock);

==>        list_for_each_entry_safe(path, tp, &priv->path_list, list) {
                ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
                        be16_to_cpu(path->pathrec.dlid),
                        IPOIB_GID_ARG(path->pathrec.dgid));
                path->valid =  0;
        }

        spin_unlock_irq(&priv->lock);
}
--------------------------------------------
Any ideas?

- Jack
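
The register dump lines up with the disassembly. Below is a small sketch of
the arithmetic in C. It assumes, from the `mov 0xd0(%r12),%r12` and
`sub $0xd0,%r12` instructions, that the `list` member sits at offset 0xd0
inside struct ipoib_path. The helper names are made up for illustration,
and plain unsigned integers are used so that the wrap-around is well
defined.

```c
#include <stdint.h>

/* Offset of the list member, as read off the disassembly (assumption). */
#define PATH_LIST_OFFSET UINT64_C(0xd0)

/* What container_of() computes from the address of a list link. */
static uint64_t entry_from_link(uint64_t link_addr)
{
	return link_addr - PATH_LIST_OFFSET;
}

/* Address read by `mov 0xd0(%r12),%r12` for a given entry pointer. */
static uint64_t faulting_address(uint64_t entry)
{
	return entry + PATH_LIST_OFFSET;
}
```

With a link address of 0, entry_from_link() gives 0xffffffffffffff30, which
is the value in RBX and R12, and faulting_address() of that value is 0,
which is what CR2 reports. That would be consistent with the previous
entry's list.next having been NULL when the loop read it.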


From yosefe at Voltaire.COM  Tue Feb  3 08:47:28 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Tue, 03 Feb 2009 18:47:28 +0200
Subject: [ofa-general] Kernel panic in IPoIB stability testing
In-Reply-To: <200902031816.41784.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
Message-ID: <49887520.9080906@Voltaire.COM>

What kernel and ofed version is it?


-- 
--Yossi


From sean.hefty at intel.com  Tue Feb  3 09:37:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 3 Feb 2009 09:37:28 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>
Message-ID: <F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>

>cma_acquire_dev --> cma_set_qkey/ps=IPOIB --> ib_sa_get_mcmember_rec where the
>latter returns EADDRNOTAVAIL since when the port went down the core multicast
>code

Why is ib_sa_get_mcmember_rec being called?  Or is this issue separate from what
udaddy is showing?

>I  assume there must be a way to defer this resolving to a later stage such
>that binding would be possible when the port is down, thoughts?

Can you determine what call in the kernel is actually failing during the bind?
(I can try testing later, but I'm not near any systems currently.)  I'm
wondering if the failure is coming from rdma_translate_ip()->ip_dev_find().

- Sean


From yosefe at Voltaire.COM  Tue Feb  3 09:56:40 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Tue, 03 Feb 2009 19:56:40 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902031816.41784.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
Message-ID: <49888558.3050506@Voltaire.COM>

I think it comes from unicast_arp_send.

Consider this scenario:
- Paths are flushed (opensm up/down).
- unicast_arp_send() is called with a path that is in priv->path_list; path->valid is 0.
- path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails: there is no SM AH (yet)
  (see the prints just before the panic).
- unicast_arp_send() calls path_free().
- The path memory is overwritten.
- __ipoib_ib_dev_flush() is called again.
- ipoib_mark_paths_invalid() tries to iterate over priv->path_list and hits the kernel panic
  because path->list has become invalid.

--Yossi
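
The scenario above can be modelled in user space. This is only a sketch
under stated assumptions, not the IPoIB code: struct path is a hypothetical,
much reduced stand-in for struct ipoib_path, the list helpers are minimal
copies of the kernel idiom, and "memory is overwritten" is modelled with
memset() on a live object so that the demo itself stays well defined. The
walker stops on a NULL link where the kernel would oops.

```c
#include <stddef.h>
#include <string.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Hypothetical reduced path entry. */
struct path { int valid; struct list_head list; };

/*
 * Mark every entry invalid. Returns the number of entries marked,
 * or -1 if a NULL link is found while walking.
 */
static int mark_paths_invalid(struct list_head *head)
{
	struct list_head *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (pos == NULL)
			return -1;	/* corrupted link */
		struct path *p = (struct path *)
			((char *)pos - offsetof(struct path, list));
		p->valid = 0;
		n++;
	}
	return n;
}

/* Entry is reused while still linked: the walk runs into a NULL link. */
static int scenario_still_linked(void)
{
	struct list_head head;
	struct path a = { 1, { NULL, NULL } }, b = { 1, { NULL, NULL } };

	list_init(&head);
	list_add_tail(&a.list, &head);
	list_add_tail(&b.list, &head);
	memset(&b, 0, sizeof(b));	/* models the overwritten path */
	return mark_paths_invalid(&head);
}

/* Entry is unlinked before its memory is reused: the walk is clean. */
static int scenario_unlinked_first(void)
{
	struct list_head head;
	struct path a = { 1, { NULL, NULL } }, b = { 1, { NULL, NULL } };

	list_init(&head);
	list_add_tail(&a.list, &head);
	list_add_tail(&b.list, &head);
	list_del(&b.list);
	memset(&b, 0, sizeof(b));
	return mark_paths_invalid(&head);
}
```

In this model scenario_still_linked() reports the corrupted link and
scenario_unlinked_first() visits only the remaining entry. If the theory
holds, the corresponding question for the driver is whether the path is
taken off priv->path_list, under priv->lock, before path_free() runs.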

Jack Morgenstein wrote:
> We saw the following kernel panic when testing ipoib stability intensively
> by simultaneously (i.e., in separate processes, with random wait intervals) doing:
> - ifconfig up/down
> - opensm up/down
> - ipoib ping
> - arp delete
> - driver up/down
> 
> ib0: ib_sa_path_rec_get failed: -11
> ib0: ib_sa_path_rec_get failed: -11
> Unable to handle kernel NULL pointer dereference at 0000000000000000
> RIP:  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> PGD 224ea0067 PUD 225ae9067 PMD 0
> Oops: 0000 [1] SMP
> last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0
> CPU 2
> Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth
> sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U)
> ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec
> i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac
> cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1
> RIP: 0010:[<ffffffff883ac404>]  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP: 0018:ffff810121ee7de0  EFLAGS: 00010046
> RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002
> RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500
> RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000
> R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30
> R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3
> FS:  00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0
> Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860)
> Stack:  ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850  ffffffffffffffff 7fffffffffffffff
> ffffffffffffffff ffff810121ee8688  ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4
> Call Trace:  [<ffffffff883ae850>] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6
>              [<ffffffff8004b2b4>] run_workqueue+0x94/0xe5
>              [<ffffffff80047c13>] worker_thread+0x0/0x122
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff80047d03>] worker_thread+0xf0/0x122
>              [<ffffffff80086c5f>] default_wake_function+0x0/0xe
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff8003216e>] kthread+0xfe/0x132
>              [<ffffffff8005bfe5>] child_rip+0xa/0x11
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff80032070>] kthread+0x0/0x132
>              [<ffffffff8005bfdb>] child_rip+0x0/0x11
> 
> Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49
> RIP  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP <ffff810121ee7de0>
> CR2: 0000000000000000
>  <0>Kernel panic - not syncing: Fatal exception
> 
> In objdump -ld, we get:
> ipoib_mark_paths_invalid():
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365
>     13f7:       c7 83 e0 00 00 00 00    movl   $0x0,0xe0(%rbx)
>     13fe:       00 00 00
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
>     1401:       4c 89 e3                mov    %r12,%rbx
> ==>    1404:       4d 8b a4 24 d0 00 00    mov    0xd0(%r12),%r12
>     140b:       00
>     140c:       48 8d 93 d0 00 00 00    lea    0xd0(%rbx),%rdx
>     1413:       48 8d 45 38             lea    0x38(%rbp),%rax
>     1417:       49 81 ec d0 00 00 00    sub    $0xd0,%r12
>     141e:       48 39 c2                cmp    %rax,%rdx
>     1421:       0f 85 4b ff ff ff       jne    1372 <ipoib_mark_paths_invalid+0x2a>
> --------------------------------
> and in the source code, we get:
> 
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
>         struct ipoib_path *path, *tp;
> 
>         spin_lock_irq(&priv->lock);
> 
> ==>        list_for_each_entry_safe(path, tp, &priv->path_list, list) {
>                 ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
>                         be16_to_cpu(path->pathrec.dlid),
>                         IPOIB_GID_ARG(path->pathrec.dgid));
>                 path->valid =  0;
>         }
> 
>         spin_unlock_irq(&priv->lock);
> }
> --------------------------------------------
> Any ideas?
> 
> - Jack
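
The register dump quoted above is consistent with a NULL list.next pointer being followed. The disassembly puts the list member at offset 0xd0 inside struct ipoib_path (lea 0xd0(%rbx) and sub $0xd0,%r12), and R12 = ffffffffffffff30 is exactly 0 - 0xd0, so the faulting load at 0xd0(%r12) touches address 0, which matches CR2. A minimal user-space check of that arithmetic follows; the helper names are hypothetical and only stand in for the container_of step:

```c
#include <stdint.h>

/* Offset of the list member inside struct ipoib_path, read off the
 * "lea 0xd0(%rbx)" / "sub $0xd0,%r12" instructions in the objdump output. */
#define PATH_LIST_OFFSET 0xd0ULL

/* Hypothetical stand-in for container_of(): entry address computed from
 * the address of its embedded list member (64-bit wraparound arithmetic). */
static uint64_t entry_from_list_ptr(uint64_t list_ptr)
{
	return list_ptr - PATH_LIST_OFFSET;
}

/* Address read by "mov 0xd0(%r12),%r12", i.e. &entry->list.next. */
static uint64_t next_field_addr(uint64_t entry)
{
	return entry + PATH_LIST_OFFSET;
}
```

Feeding a NULL list pointer through the first helper gives the R12/RBX value from the oops, and the second helper maps that value back to address 0.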


From krkumar2 at in.ibm.com  Tue Feb  3 10:21:02 2009
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Tue, 3 Feb 2009 23:51:02 +0530
Subject: [ofa-general] Support for CXGB3 RNIC on P6
In-Reply-To: <adaocxjo5xk.fsf@cisco.com>
References: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
	<adaocxjo5xk.fsf@cisco.com>
Message-ID: <OF9CEA4EE5.A53072C1-ON65257552.006455AF-65257552.0064CDC9@in.ibm.com>

Sorry for the vagueness. There are no messages in the /var/log (or the
console/dmesg).
To answer Jon's question: lspci doesn't show the device.

No lights come up on the adapter, which I guess is the normal behavior till
the device is successfully probed and recognized.

thanks,

- KK

Roland Dreier <rdreier at cisco.com> wrote on 02/03/2009 09:42:39 PM:

>  > My colleague (at a different site) is trying to get couple of Chelsio RNIC
>  > adapters working on p6 systems but for some reason the cards aren't
>  > recognized on bootup.
>
> What do you mean by "aren't recognized on bootup"?  More details on the
> specific problem are needed to diagnose it.
>
>  - R.


From jon at opengridcomputing.com  Tue Feb  3 10:43:44 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Tue, 3 Feb 2009 12:43:44 -0600
Subject: [ofa-general] Support for CXGB3 RNIC on P6
In-Reply-To: <OF9CEA4EE5.A53072C1-ON65257552.006455AF-65257552.0064CDC9@in.ibm.com>
References: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>
	<adaocxjo5xk.fsf@cisco.com>
	<OF9CEA4EE5.A53072C1-ON65257552.006455AF-65257552.0064CDC9@in.ibm.com>
Message-ID: <20090203184344.GB13472@opengridcomputing.com>

On Tue, Feb 03, 2009 at 11:51:02PM +0530, Krishna Kumar2 wrote:
> Sorry for the vagueness. There are no messages in the /var/log (or the
> console/dmesg).
> To answer Jon's question: lspci doesn't show the device.

Can you please include the dmesg output from the failing system, as well
as the lspci output (lspci -x)?

Thanks,
Jon

> 
> No lights come up on the adapter, which I guess is the normal behavior till
> the device is successfully probed and recognized.
> 
> thanks,
> 
> - KK
> 
> Roland Dreier <rdreier at cisco.com> wrote on 02/03/2009 09:42:39 PM:
> 
> >  > My colleague (at a different site) is trying to get couple of Chelsio RNIC
> >  > adapters working on p6 systems but for some reason the cards aren't
> >  > recognized on bootup.
> >
> > What do you mean by "aren't recognized on bootup"?  More details on the
> > specific problem are needed to diagnose it.
> >
> >  - R.


From ralph.campbell at qlogic.com  Tue Feb  3 11:26:12 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 3 Feb 2009 11:26:12 -0800
Subject: [ofa-general] Possible memory leak and null pointer dereference in
	local_completions()
Message-ID: <1233689172.23327.155.camel@chromite.mv.qlogic.com>

I was doing some tests with different MAD packets and
then reading the infiniband/core/mad.c code.

handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
on the mad_agent_priv->local_work work queue with
local->mad_priv == NULL if device->process_mad() returns
IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
(!ib_response_mad(&mad_priv->mad.mad) ||
 !mad_agent_priv->agent.recv_handler).

In this case, local_completions() will be called with
local->mad_priv == NULL. The code does check for this
case and skips calling recv_mad_agent->agent.recv_handler().
This means recv == 0 so kmem_cache_free() is called with a
NULL pointer.

Even if local->mad_priv != NULL, I don't see how local->mad_priv
is freed when recv == 1. Thus, it appears to be a memory leak.
So, I'm proposing the following patch:

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..93d80e5 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
 	struct ib_mad_local_private *local;
 	struct ib_mad_agent_private *recv_mad_agent;
 	unsigned long flags;
-	int recv = 0;
 	struct ib_wc wc;
 	struct ib_mad_send_wc mad_send_wc;
 
@@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
 				goto local_send_completion;
 			}
 
-			recv = 1;
 			/*
 			 * Defined behavior is to complete response
 			 * before request
@@ -2422,7 +2420,7 @@ local_send_completion:
 
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		atomic_dec(&mad_agent_priv->refcount);
-		if (!recv)
+		if (local->mad_priv)
 			kmem_cache_free(ib_mad_cache, local->mad_priv);
 		kfree(local);
 	}
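
As a rough user-space illustration of the cleanup condition in the patch above, the sketch below keys the release on the pointer itself rather than on a separate flag. The struct and function names are hypothetical stand-ins, free() replaces kmem_cache_free() and kfree(), and the question of whether the receive handler takes ownership of the buffer is deliberately not modeled:

```c
#include <stdlib.h>

/* Simplified stand-in for struct ib_mad_local_private; only the field
 * discussed above is modeled. */
struct local_work {
	void *mad_priv;		/* may be NULL, as described above */
};

/* Cleanup keyed on the pointer itself rather than on a separate flag.
 * Returns 1 if a buffer was released, 0 if there was nothing to release. */
static int release_local(struct local_work *local)
{
	int released = 0;

	if (local->mad_priv) {
		free(local->mad_priv);	/* stands in for kmem_cache_free() */
		local->mad_priv = NULL;
		released = 1;
	}
	free(local);			/* stands in for kfree(local) */
	return released;
}
```

With this shape the NULL case never reaches the allocator's free routine, and a non-NULL buffer is released exactly once on this path.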


From swise at opengridcomputing.com  Tue Feb  3 13:05:24 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 03 Feb 2009 15:05:24 -0600
Subject: [ofa-general] Support for CXGB3 RNIC on P6
In-Reply-To: <OF9CEA4EE5.A53072C1-ON65257552.006455AF-65257552.0064CDC9@in.ibm.com>
References: <OF9FD4325A.11764768-ON65257552.0012171D-65257552.0012CA29@in.ibm.com>	<adaocxjo5xk.fsf@cisco.com>
	<OF9CEA4EE5.A53072C1-ON65257552.006455AF-65257552.0064CDC9@in.ibm.com>
Message-ID: <4988B194.6010706@opengridcomputing.com>


Krishna Kumar2 wrote:
> Sorry for the vagueness. There are no messages in the /var/log (or the
> console/dmesg).
> To answer Jon's question: lspci doesn't show the device.
>
> No lights come up on the adapter, which I guess is the normal behavior till
> the device is successfully probed and recognized.
>   

True.

> thanks,
>
> - KK
>
>   

What pci-e slot config?  4x? 8x?


From weiny2 at llnl.gov  Tue Feb  3 15:47:32 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 3 Feb 2009 15:47:32 -0800
Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs
	for cleaner function interfaces
In-Reply-To: <475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com>
Message-ID: <20090203154732.1fc07a44.weiny2@llnl.gov>

On Mon, 2 Feb 2009 21:29:16 -0800
"Sean Hefty" <sean.hefty at intel.com> wrote:

> >@@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
> > #define MAD_DEF_RETRIES                3
> > #define MAD_DEF_TIMEOUT_MS     1000
> >
> >-enum {
> >+typedef enum {
> >        IB_DEST_LID,
> >        IB_DEST_DRPATH,
> >        IB_DEST_GUID,
> >        IB_DEST_DRSLID,
> >-};
> >+} mad_dest_t;
> >
> >-enum {
> >+typedef enum {
> >        IB_NODE_CA = 1,
> >        IB_NODE_SWITCH,
> >        IB_NODE_ROUTER,
> >        NODE_RNIC,
> >
> >        IB_NODE_MAX = NODE_RNIC
> >-};
> >+} mad_node_type_t;
> 
> For consistency, should these be named enums?  (MAD_DEST and MAD_NODE_TYPE)

Sure, patch attached.

Ira


>From ec5d9def3e92ee7d5ac245401c99de49c5a90e0e Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at llnl.gov>
Date: Mon, 2 Feb 2009 10:21:18 -0800
Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces


Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>
---
 libibmad/include/infiniband/mad.h |   38 ++++++++++++++++++------------------
 libibmad/src/fields.c             |   22 ++++++++++----------
 libibmad/src/resolve.c            |   10 ++++----
 3 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 9ff4a3e..61d0a73 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -203,7 +203,7 @@ typedef struct ib_field {
 	ib_mad_dump_fn *def_dump_fn;
 } ib_field_t;
 
-enum MAD_FIELDS {
+typedef enum MAD_FIELDS {
 	IB_NO_FIELD,
 
 	IB_GID_PREFIX_F,
@@ -525,7 +525,7 @@ enum MAD_FIELDS {
 	IB_GUID_GUID0_F,
 
 	IB_FIELD_LAST_		/* must be last */
-};
+} mad_field_t;
 
 /*
  * SA RMPP section
@@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
 #define MAD_DEF_RETRIES		3
 #define MAD_DEF_TIMEOUT_MS	1000
 
-enum {
+typedef enum MAD_DEST {
 	IB_DEST_LID,
 	IB_DEST_DRPATH,
 	IB_DEST_GUID,
 	IB_DEST_DRSLID,
-};
+} mad_dest_t;
 
-enum {
+typedef enum MAD_NODE_TYPE {
 	IB_NODE_CA = 1,
 	IB_NODE_SWITCH,
 	IB_NODE_ROUTER,
 	NODE_RNIC,
 
 	IB_NODE_MAX = NODE_RNIC
-};
+} mad_node_type_t;
 
 /******************************************************************************/
 
@@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey)
 }
 
 /* fields.c */
-MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field,
+MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field);
+MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field,
 			      uint32_t val);
 /* field must be byte aligned */
-MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field,
+MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field);
+MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field,
 				uint64_t val);
-MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT int mad_print_field(int field, const char *name, void *val);
-MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val);
-MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val);
+MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val);
+MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val);
+MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val);
+MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val);
+MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val);
+MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val);
+MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val);
 
 /* mad.c */
 MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath,
@@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
 			       ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
-				     int dest_type, ib_portid_t * sm_id);
+				     mad_dest_t dest, ib_portid_t * sm_id);
 MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
 			       ibmad_gid_t * gid);
 
@@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 			ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      mad_dest_t dest, ib_portid_t * sm_id,
 			      const void *srcport);
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 			const void *srcport);
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index d5a1eb4..d435a2f 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f,
 	memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8);
 }
 
-uint32_t mad_get_field(void *buf, int base_offs, int field)
+uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field)
 {
 	return _get_field(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field(void *buf, int base_offs, int field, uint32_t val)
+void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val)
 {
 	_set_field(buf, base_offs, ib_mad_f + field, val);
 }
 
-uint64_t mad_get_field64(void *buf, int base_offs, int field)
+uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field)
 {
 	return _get_field64(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field64(void *buf, int base_offs, int field, uint64_t val)
+void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val)
 {
 	_set_field64(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_set_array(void *buf, int base_offs, int field, void *val)
+void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val)
 {
 	_set_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_get_array(void *buf, int base_offs, int field, void *val)
+void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val)
 {
 	_get_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_decode_field(uint8_t * buf, int field, void *val)
+void mad_decode_field(uint8_t * buf, mad_field_t field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val)
 	_get_array(buf, 0, f, val);
 }
 
-void mad_encode_field(uint8_t * buf, int field, void *val)
+void mad_encode_field(uint8_t * buf, mad_field_t field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val,
 			 valsz ? valsz : ALIGN(f->bitlen, 8) / 8);
 }
 
-int mad_print_field(int field, const char *name, void *val)
+int mad_print_field(mad_field_t field, const char *name, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return -1;
 	return _mad_print_field(ib_mad_f + field, name, val, 0);
 }
 
-char *mad_dump_field(int field, char *buf, int bufsz, void *val)
+char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
 	return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val);
 }
 
-char *mad_dump_val(int field, char *buf, int bufsz, void *val)
+char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index b62360b..faac1f9 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 }
 
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      mad_dest_t dest, ib_portid_t * sm_id,
 			      const void *srcport)
 {
 	uint64_t guid;
@@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 	ib_portid_t selfportid = { 0 };
 	int selfport = 0;
 
-	switch (dest_type) {
+	switch (dest) {
 	case IB_DEST_LID:
 		lid = strtol(addr_str, 0, 0);
 		if (!IB_LID_VALID(lid))
@@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 		return 0;
 
 	default:
-		IBWARN("bad dest_type %d", dest_type);
+		IBWARN("bad dest %d", dest);
 	}
 
 	return -1;
 }
 
-int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
+int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest,
 			  ib_portid_t * sm_id)
 {
-	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
+	return ib_resolve_portid_str_via(portid, addr_str, dest,
 					 sm_id, NULL);
 }
 
-- 
1.5.4.5
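
For readers skimming the patch, the pattern it applies can be shown in isolation. The sketch below is illustrative only: the EX_ names and ex_dest_name() are made up for this example and are not part of libibmad. It shows a named, typedef'd enum used as a parameter type, which documents the expected values in the prototype and lets a compiler with switch checking enabled warn when an enumerator is not handled:

```c
/* Named, typedef'd enum in the style of the patch above; the names are
 * illustrative and prefixed EX_ to avoid clashing with mad.h. */
typedef enum EX_DEST {
	EX_DEST_LID,
	EX_DEST_DRPATH,
	EX_DEST_GUID,
	EX_DEST_DRSLID,
} ex_dest_t;

/* The parameter type now states which constants are expected. */
static const char *ex_dest_name(ex_dest_t dest)
{
	switch (dest) {
	case EX_DEST_LID:
		return "lid";
	case EX_DEST_DRPATH:
		return "drpath";
	case EX_DEST_GUID:
		return "guid";
	case EX_DEST_DRSLID:
		return "drslid";
	}
	return "unknown";	/* value outside the enumerators */
}
```

Leaving out a default label is deliberate here, so that a missing case can be reported at compile time rather than silently falling into a catch-all.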


From arlin.r.davis at intel.com  Tue Feb  3 16:17:13 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 3 Feb 2009 16:17:13 -0800
Subject: [ofa-general] RE: dapl attribute bug
In-Reply-To: <49871E6A.9000901@opengridcomputing.com>
References: <49871E6A.9000901@opengridcomputing.com>
Message-ID: <E3280858FA94444CA49D2BA02341C983382DC33A@orsmsx506.amr.corp.intel.com>

 
>The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code 
>maps this to the linux ib_device_attr->max_mr_size which is u64.
>
>This causes dapltest to fail in some cases when running over chelsio 
>which sets max_mr_size to 0x100000000 (4GB).  The dapl code truncates 
>the value to 0. See dapl/openib_cma/dapl_ib_util.c.
>
>I'm not sure what the fix should be, but maybe the dapl code 
>should set 
>anything over 32 bits to 0xffffffff?
>

This attribute changed with DAT 2.0 to match the 32-bit ibv_sge
length field. Since there are no direct max lmr segment mappings,
I will need to add some checks when setting max_lmr_block_size from
max_mr_size. Thanks.
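
One possible shape for such a check, following the 0xffffffff suggestion in the quoted report rather than any committed dapl change, is to saturate instead of truncate. The helper name clamp_to_u32 is hypothetical:

```c
#include <stdint.h>

/* Saturate a 64-bit size into a 32-bit attribute instead of truncating.
 * A plain cast of 0x100000000 (4GB) to uint32_t would yield 0. */
static uint32_t clamp_to_u32(uint64_t size)
{
	return size > UINT32_MAX ? UINT32_MAX : (uint32_t)size;
}
```

With this helper the 4GB value reported for chelsio maps to 0xffffffff, while values that already fit in 32 bits pass through unchanged.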

-arlin

From purdy at sgi.com  Tue Feb  3 18:09:08 2009
From: purdy at sgi.com (Dale Purdy)
Date: Tue, 3 Feb 2009 20:09:08 -0600
Subject: [ofa-general] ibdiagnet and ibdmchk credit loop checks
Message-ID: <20090204020908.GA29008@sgi.com>

The ibdiagnet and ibdmchk utilities can report on credit loops in the
topology, but are heavily oriented towards UpDown routing.  Each of
these utilities will try to rank the switches and automatically
determine root nodes for an UpDown routing engine.  It is important to
check for credit loops with other routing engines, but these utilities
can give incorrect information with the other routing engines.  If
ibdmchk thinks it finds root nodes, it determines credit loops by
checking whether the up/down rules are followed w.r.t. those roots,
which can be wrong.  If ibdmchk fails to find root nodes, it falls
back to a real credit loop check, doing a DFS in the dependency
graph.  This can be overridden by supplying an explicit root_guids
file.  Why doesn't it just do the real credit loop check in general?
Presumably checking the up/down rules is less costly when UpDown
routing is actually being used.  The following change fixes ibdmchk so
that it only uses root nodes and up/down rules when UpDown routing is
being used (by specifying -u on the command line) and otherwise does a
real credit loop check:

diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp
index 1c18c1c..d8a4202 100644
--- a/ibdm/src/osm_check.cpp
+++ b/ibdm/src/osm_check.cpp
@@ -568,7 +568,7 @@ int main (int argc, char **argv) {
       rootNodes = SubnMgtFindRootNodesByMinHop(&fabric);
     }
 
-  if (!rootNodes.empty()) {
+  if (UseUpDown && !rootNodes.empty()) {
     cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl;
     for (list <IBNode *>::iterator nI = rootNodes.begin();
          nI != rootNodes.end(); nI++) {


ibdiagnet -r is a slightly different story.  It is more useful for
checking a running machine.  However, it doesn't seem to have any
options for indicating whether UpDown routing is being used, or for
supplying a root_guids file, and just does the up/down rule checking
against its idea of root nodes whether that makes sense or not.  Can
ibdiagnet be changed to just do the real credit loop check?  I am not
familiar with tcl and haven't been able to determine what to change.
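
For reference, the "real" check mentioned above amounts to looking for a directed cycle in the dependency graph. The following is a generic sketch of DFS cycle detection, not the ibdm implementation; the adjacency-matrix representation, MAX_NODES, and all names are assumptions made for the example:

```c
#include <string.h>

#define MAX_NODES 16

/* Directed dependency graph as an adjacency matrix: adj[a][b] != 0 means
 * "a depends on b". Node count and representation are illustrative. */
struct dep_graph {
	int n;
	unsigned char adj[MAX_NODES][MAX_NODES];
};

enum { WHITE = 0, GREY, BLACK };	/* unvisited, on DFS stack, done */

static int dfs_finds_cycle(const struct dep_graph *g, int u, int *color)
{
	int v;

	color[u] = GREY;
	for (v = 0; v < g->n; v++) {
		if (!g->adj[u][v])
			continue;
		if (color[v] == GREY)
			return 1;	/* back edge to the DFS stack: cycle */
		if (color[v] == WHITE && dfs_finds_cycle(g, v, color))
			return 1;
	}
	color[u] = BLACK;
	return 0;
}

/* Returns 1 if the graph contains a directed cycle, else 0. */
static int has_credit_loop(const struct dep_graph *g)
{
	int color[MAX_NODES];
	int u;

	memset(color, 0, sizeof color);
	for (u = 0; u < g->n; u++)
		if (color[u] == WHITE && dfs_finds_cycle(g, u, color))
			return 1;
	return 0;
}
```

Unlike the up/down rule check, this makes no assumption about which nodes are roots, so it applies regardless of the routing engine that produced the dependencies.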

-- 
Dale


From jackm at dev.mellanox.co.il  Tue Feb  3 22:46:48 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 08:46:48 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49888558.3050506@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM>
Message-ID: <200902040846.48370.jackm@dev.mellanox.co.il>

On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> I think it comes from unicast_arp_send.
> Consider this scenario:
> - paths are flushed (opensm up/down).
> - unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
> - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - no sm ah (yet)
>   (see the prints just before the panic).
> - unicast_arp_send() calls path_free().
> - path memory is overwritten.
> - __ipoib_dev_flush() is called again.
> - mark_paths_invalid() tries to iterate over priv->path_list and gets kernel panic
>   because path->list became invalid.

I think you are right.

In unicast_arp_send, we have the following code:
	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}

It was originally written without the path->valid check in the "if", and so was based on the path record
being allocated within the "if".  In this case, the path record was not yet inserted into the path list.
When you added the "valid" processing, you did not take this into account.

You need code something like the following:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				if (had_path) {
					/* detach from path list here under spinlock */
				}
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else if (!had_path)
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}
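
The ordering issue can be reproduced outside the kernel. Below is a minimal user-space sketch, assuming a circular list with a sentinel head in the style of list_head; every name in it is hypothetical. It shows why an entry that is already on the list must be unlinked before it is freed, so that a later walk comparable to mark_paths_invalid() never follows pointers into freed memory:

```c
#include <stdlib.h>

struct node {
	struct node *prev, *next;
	int valid;
};

/* Circular list with a sentinel head, mirroring the kernel's list_head. */
static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

/* Unlink first, free second, so a later walk never sees freed memory. */
static void remove_and_free(struct node *n)
{
	list_unlink(n);
	free(n);
}

/* Walk comparable to mark_paths_invalid(); returns the entries visited. */
static int mark_all_invalid(struct node *head)
{
	struct node *n;
	int count = 0;

	for (n = head->next; n != head; n = n->next) {
		n->valid = 0;
		count++;
	}
	return count;
}
```

Calling free() on an entry without list_unlink() would leave its neighbours pointing at released memory, which is the situation described in the scenario quoted above.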

- Jack


From mkatiyar at gmail.com  Tue Feb  3 22:54:05 2009
From: mkatiyar at gmail.com (Manish Katiyar)
Date: Wed, 4 Feb 2009 12:24:05 +0530
Subject: [ofa-general] Re: [PATCH] : Define debugging variables
	only when CONFIG_INFINIBAND_NES_DEBUG is enabled
In-Reply-To: <ea11fea30901271028u70f559d5y656be5610ab83a41@mail.gmail.com>
References: <ea11fea30901271028u70f559d5y656be5610ab83a41@mail.gmail.com>
Message-ID: <ea11fea30902032254v22d95d35ua3eab9a5a6d4feab@mail.gmail.com>

On Tue, Jan 27, 2009 at 11:58 PM, Manish Katiyar <mkatiyar at gmail.com> wrote:
> Below patch removes following compilation warnings :
> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused variable 'tmp_addr'
> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused variable 'tmp_addr'
>

Hi,

Any feedback on this ?

Thanks -
manish

>
> Signed-off-by: Manish Katiyar <mkatiyar at gmail.com>
> ---
>  drivers/infiniband/hw/nes/nes_cm.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/hw/nes/nes_cm.c
> b/drivers/infiniband/hw/nes/nes_cm.c
> index a01b448..2b34859 100644
> --- a/drivers/infiniband/hw/nes/nes_cm.c
> +++ b/drivers/infiniband/hw/nes/nes_cm.c
> @@ -778,7 +778,9 @@ static struct nes_cm_node *find_node(struct
> nes_cm_core *cm_core,
>        unsigned long flags;
>        struct list_head *hte;
>        struct nes_cm_node *cm_node;
> +#ifdef CONFIG_INFINIBAND_NES_DEBUG
>        __be32 tmp_addr = cpu_to_be32(loc_addr);
> +#endif
>
>        /* get a handle on the hte */
>        hte = &cm_core->connected_nodes;
> @@ -817,7 +819,9 @@ static struct nes_cm_listener
> *find_listener(struct nes_cm_core *cm_core,
>  {
>        unsigned long flags;
>        struct nes_cm_listener *listen_node;
> +#ifdef CONFIG_INFINIBAND_NES_DEBUG
>        __be32 tmp_addr = cpu_to_be32(dst_addr);
> +#endif
>
>        /* walk list and find cm_node associated with this session ID */
>        spin_lock_irqsave(&cm_core->listen_list_lock, flags);
> --
> 1.5.4.3
>
>
> Thanks -
> Manish
>


From ogerlitz at voltaire.com  Tue Feb  3 23:11:43 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Feb 2009 09:11:43 +0200
Subject: [ofa-general] Re: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>
Message-ID: <49893FAF.3090007@voltaire.com>

Sean Hefty wrote:
> Why is ib_sa_get_mcmember_rec being called?  Or is this issue separate from what
> udaddy is showing?
The IPOIB port space allows for UD interoperability between IPoIB and 
RDMA-CM based apps. For that end, among other params such as the mgid 
derivation, the qkey used for the UD QP must be the same. To achieve 
that, a query on the broadcast group is done to the core multicast 
data-base to retrieve the  associated record from which the qkey is 
extracted. When the port goes down, this db is being flushed and 
ib_sa_get_mcmember_rec returns  EADDRNOTAVAIL which is exactly what 
udaddy is getting (you can also get it with mckey).

> Can you determine what call in the kernel is actually failing during the bind?
mcast_find being called from ib_sa_get_mcmember_rec

Or.


From jackm at dev.mellanox.co.il  Tue Feb  3 23:20:25 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 09:20:25 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902040846.48370.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM>
	<200902040846.48370.jackm@dev.mellanox.co.il>
Message-ID: <200902040920.26062.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 08:46, Jack Morgenstein wrote:
> On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> > I think it comes from unicast_arp_send.
> > Consider this scenario:
> > - paths are flushed (opensm up/down).
> > - unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
> > - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - no sm ah (yet)
> >   (see the prints just before the panic).
> > - unicast_arp_send() calls path_free().
> > - path memory is overwritten.
> > - __ipoib_dev_flush() is called again.
> > - mark_paths_invalid() tries to iterate over priv->path_list and gets kernel panic
> >   because path->list became invalid.
> 
> I think you are right.
How about this:
	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				if (had_path) {
					list_del(&path->list);
					rb_erase(&path->rb_node,
						 &priv->path_tree);
				}
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else if (!had_path)
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}

- Jack


From dorfman.eli at gmail.com  Wed Feb  4 00:00:05 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 04 Feb 2009 10:00:05 +0200
Subject: [ofa-general] [PATCH] libibmad/src/dump.c fix dump
	functions for big endian machines
Message-ID: <49894B05.1090608@gmail.com>

fix dump functions for big endian machines

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 libibmad/src/dump.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
index 1cf5232..3b49158 100644
--- a/libibmad/src/dump.c
+++ b/libibmad/src/dump.c
@@ -46,10 +46,10 @@ void mad_dump_int(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "%d", *(uint8_t *) val);
+		snprintf(buf, bufsz, "%d", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "%d", *(uint16_t *) val);
+		snprintf(buf, bufsz, "%d", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 	case 4:
@@ -71,10 +71,10 @@ void mad_dump_uint(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "%u", *(uint8_t *) val);
+		snprintf(buf, bufsz, "%u", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "%u", *(uint16_t *) val);
+		snprintf(buf, bufsz, "%u", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 	case 4:
@@ -96,10 +96,10 @@ void mad_dump_hex(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "0x%02x", *(uint8_t *) val);
+		snprintf(buf, bufsz, "0x%02x", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "0x%04x", *(uint16_t *) val);
+		snprintf(buf, bufsz, "0x%04x", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 		snprintf(buf, bufsz, "0x%06x", *(uint32_t *) val & 0xffffff);
@@ -132,10 +132,10 @@ void mad_dump_rhex(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "%02x", *(uint8_t *) val);
+		snprintf(buf, bufsz, "%02x", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "%04x", *(uint16_t *) val);
+		snprintf(buf, bufsz, "%04x", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 		snprintf(buf, bufsz, "%06x", *(uint32_t *) val & 0xffffff);
-- 
1.5.5
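
For readers following the patch, here is a minimal sketch of why the narrow cast misbehaves. It assumes (the patch implies this, but the message does not state it) that callers pass these functions a pointer to a 32-bit host-order integer holding the decoded field. The helper names are hypothetical and are not part of libibmad.

```c
#include <stdint.h>

/* Old behaviour: read only the first byte found in memory at val. */
static unsigned int read_first_byte(const void *val)
{
	return *(const uint8_t *)val;
}

/* Patched behaviour: load the whole 32-bit word, then mask the low byte. */
static unsigned int read_low_byte(const void *val)
{
	return *(const uint32_t *)val & 0xff;
}

/* Memory image of the 32-bit value 0x12 as a big-endian host stores it
 * (illustration only; a little-endian host stores 0x12 first). */
static const uint8_t be_image_of_0x12[4] = { 0x00, 0x00, 0x00, 0x12 };
```

Given a uint32_t holding 0x12, read_low_byte() yields 0x12 on any host. read_first_byte() applied to the big-endian image yields 0x00, which is the wrong output the patch is meant to remove.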


From kliteyn at dev.mellanox.co.il  Wed Feb  4 02:19:08 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 04 Feb 2009 12:19:08 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on
	index port order incrementation
In-Reply-To: <4981DC18.9030400@ext.bull.net>
References: <4981DC18.9030400@ext.bull.net>
Message-ID: <49896B9C.8040006@dev.mellanox.co.il>

Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> Hello,
> 
> While doing some routing analysis on fat tree using ibsim we found a 
> "bug" in the fat-tree algorithm.
> Problem happens with a 4 level Fat tree as below:
> 
> 
>                          L3  L3
>        ___________________|__|____________________
>       /          /               \               \     <= All the L2 are connected on 2 L3 switches
>    L2-1         L2-2            L2-1           L2-2
>   /             /                 \              \     <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
>   L1           L1                 L1             L1
>   /|\         /|\                 /|\           /|\
>  ==Fully mixed to L1==          ==Fully mixed to L1==  <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.
> 
>    L0           L0                 L0           L0
>  /   \        /    \             /    \       /     \
> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
> 
> 
> To detail:
> We have a bunch of sets. Each set contains compute node, L0 and L1 
> switches.
> Plus a common top of L2 and L3 switches.
> 
> In each set, there are groups of compute nodes. Each group is connected 
> to a single L0 switch.
> In a given set, all L0 are connected to all L1.
> 
> The Nth L1 of a set is connected to the Nth L2 and only to this one. (so 
> through a L2, the Nth L1 can only see the Nth L1 of the other sets)
> All the L2 are connected to a couple of L3.
> 
> 
> If we don't put in the L3, we have a perfectly balanced fat tree and 
> well balanced routes.
> But when we add the L3, it introduces a huge difference. As it is not 
> necessary, no route goes through L3 (which is fine).
> However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 
> 1/4 are used twice as much (compared to the balanced state).
> 
> This comes from the down_port_groups_idx which is incremented each time 
> the algorithm goes down through a node whether it creates routes to HCA 
> (port != switch)
> or not. As route coming up from a L1 reaches only one L2, the algorithm 
> goes through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific but in a case where your L1 doesn't have 
> full connectivity to all your L2, and another switch rank above, the 
> problem may appear.
> 
> To avoid this problem, I've changed the 
> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns a 
> value to indicate if routes to HCA (in fact to leaf switch) were created.
> With this information, we only increase the index when the algorithm has 
> created routes to HCA.
> After applying this patch and measuring the link usage, we are at 
> perfect equilibrium (L2<->L3 links are still not used but that is to be 
> expected).

Great! I've actually seen this problem on real clusters, but
couldn't understand what was causing the imbalance.

See a couple of questions below.

> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
>  1 files changed, 14 insertions(+), 9 deletions(-)
> 
> ------------------------------------------------------------------------
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index ebe6612..3474876 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
>   *        assign-up-going-port-by-descending-down to r-port node (recursion)
>   */
>  
> -static void
> +static int
>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  					       IN ftree_sw_t * p_sw,
>  					       IN ftree_sw_t * p_prev_sw,
> @@ -1932,21 +1932,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  	uint16_t i;
>  	uint16_t j;
>  	uint16_t k;
> +	uint8_t created_route=0;
>  
>  	/* we shouldn't enter here if both real_lid and main_path are false */
>  	CL_ASSERT(is_real_lid || is_main_path);
>  
>  	/* if there is no down-going ports */
>  	if (p_sw->down_port_groups_num == 0)
> -		return;
> +		return 1;

Shouldn't it return 0?

> -	/* promote the index that indicates which group should we
> -	   start with when going through all the downgoing groups */
> -	p_sw->down_port_groups_idx =
> -		(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +	/* If we are on a leaf switch we should be creating routes for real HCA  */
> +	/* This flag will be returned  so upper layers will incrementent shift index */
> +	if(p_sw->is_leaf == TRUE){
> +		created_route=1;
> +	}

The "is_leaf" flag will be TRUE only on leaf switches that have CNs connected to them.
If we want to solve the problem for all routes (CNs, IO nodes, management nodes),
the "created_route" flag should be updated elsewhere (see below).

>  	/* foreach down-going port group (in indexing order) */
> -	i = p_sw->down_port_groups_idx;
> +	i = (p_sw->down_port_groups_idx + 1) %  p_sw->down_port_groups_num;
>  	for (k = 0; k < p_sw->down_port_groups_num; k++) {

I think that since p_sw->down_port_groups_idx is promoted below,
there is no need to increase the starting value of i.

> +	if(created_route)
> +		p_sw->down_port_groups_idx = 
> +			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
...
> +	return created_route;
>  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */
>  
>  /***************************************************/

How about something like this:

@@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
   *        assign-up-going-port-by-descending-down to r-port node (recursion)
   */

-static void
+static boolean_t
  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  					       IN ftree_sw_t * p_sw,
  					       IN ftree_sw_t * p_prev_sw,
@@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  	uint16_t i;
  	uint16_t j;
  	uint16_t k;
+	boolean_t created_route = FALSE;

  	/* we shouldn't enter here if both real_lid and main_path are false */
  	CL_ASSERT(is_real_lid || is_main_path);

  	/* if there is no down-going ports */
  	if (p_sw->down_port_groups_num == 0)
-		return;
-
-	/* promote the index that indicates which group should we
-	   start with when going through all the downgoing groups */
-	p_sw->down_port_groups_idx =
-		(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+		return FALSE;

  	/* foreach down-going port group (in indexing order) */
  	i = p_sw->down_port_groups_idx;
@@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  		p_group = p_sw->down_port_groups[i];
  		i = (i + 1) % p_sw->down_port_groups_num;

-		/* Skip this port group unless it points to a switch */
-		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+		/* If this port group doesn't point to a switch, mark
+		   that the route was created and skip to the next group */
+		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
+			created_route = TRUE;
  			continue;
+		}

  		if (p_prev_sw
  		    && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
@@ -2073,16 +2072,25 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,

  		/* Recursion step:
  		   Assign upgoing ports by stepping down, starting on REMOTE switch */
-		__osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
-							       NULL,	/* prev. position - NULL to mark that we went down and not up */
-							       target_lid,	/* LID that we're routing to */
-							       target_rank,	/* rank of the LID that we're routing to */
-							       is_real_lid,	/* whether the target LID is real or dummy */
-							       is_main_path,	/* whether this is path to HCA that should by tracked by counters */
-							       highest_rank_in_route);	/* highest visited point in the tree before going down */
+		created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
+			p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
+			NULL,		/* prev. position - NULL to mark that we went down and not up */
+			target_lid,	/* LID that we're routing to */
+			target_rank,	/* rank of the LID that we're routing to */
+			is_real_lid,	/* whether the target LID is real or dummy */
+			is_main_path,	/* whether this is path to HCA that should by tracked by counters */
+			highest_rank_in_route);	/* highest visited point in the tree before going down */
  	}
  	/* done scanning all the down-going port groups */

+	/* if the route was created, promote the index that
+	   indicates which group should we start with when
+	   going through all the downgoing groups */
+	if (created_route)
+		p_sw->down_port_groups_idx =
+			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+
+	return created_route;
  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */

  /***************************************************/
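
To make the idea in the patch above easier to follow, here is a stripped-down sketch of "promote the round-robin index only if a route was created below this switch". The toy_sw structure and toy_route_down() are hypothetical stand-ins, not the OpenSM types; a NULL entry in down[] models a port group whose remote node is not a switch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ftree_sw_t. */
struct toy_sw {
	struct toy_sw **down;	/* one entry per down-going port group;
				   NULL = remote node is not a switch */
	size_t down_num;	/* number of down-going port groups */
	size_t down_idx;	/* group to start with on the next visit */
};

/* Walk down from sw; return true if any route to a non-switch node
 * was created in this subtree. */
static bool toy_route_down(struct toy_sw *sw)
{
	bool created_route = false;
	size_t i, k;

	/* no down-going ports: nothing can be routed below this switch */
	if (sw->down_num == 0)
		return false;

	i = sw->down_idx;
	for (k = 0; k < sw->down_num; k++) {
		struct toy_sw *remote = sw->down[i];

		i = (i + 1) % sw->down_num;

		/* group points to an end node: a route exists here */
		if (remote == NULL) {
			created_route = true;
			continue;
		}

		/* recursion step: descend into the remote switch */
		created_route |= toy_route_down(remote);
	}

	/* promote the starting index only when something was routed,
	 * so dead-end visits do not skew the rotation */
	if (created_route)
		sw->down_idx = (sw->down_idx + 1) % sw->down_num;

	return created_route;
}
```

With this shape, a switch that is only traversed on the way down, without any end node beneath it, keeps its starting group unchanged, which is the behaviour the discussion above is after.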


From nicolas.morey-chaisemartin at ext.bull.net  Wed Feb  4 02:37:43 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Wed, 04 Feb 2009 11:37:43 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on
	index port order incrementation
In-Reply-To: <49896B9C.8040006@dev.mellanox.co.il>
References: <4981DC18.9030400@ext.bull.net>
	<49896B9C.8040006@dev.mellanox.co.il>
Message-ID: <49896FF7.8060908@ext.bull.net>

Yevgeny Kliteynik wrote:
> Hi Nicolas,
>
> Nicolas Morey Chaisemartin wrote:
>> Hello,
>>
>> While doing some routing analysis on fat tree using ibsim we found a 
>> "bug" in the fat-tree algorithm.
>> Problem happens with a 4 level Fat tree as below:
>>
>>
>>                          L3  L3
>>        ___________________|__|____________________
>>       /          /               \               \     <= All the L2 are connected on 2 L3 switches
>>    L2-1         L2-2            L2-1           L2-2
>>   /             /                 \              \     <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
>>   L1           L1                 L1             L1
>>   /|\         /|\                 /|\           /|\
>>  ==Fully mixed to L1==          ==Fully mixed to L1==  <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.
>>
>>    L0           L0                 L0           L0
>>  /   \        /    \             /    \       /     \
>> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>>
>>
>> To detail:
>> We have a bunch of sets. Each set contains compute node, L0 and L1 
>> switches.
>> Plus a common top of L2 and L3 switches.
>>
>> In each set, there are groups of compute nodes. Each group is 
>> connected to a single L0 switch.
>> In a given set, all L0 are connected to all L1.
>>
>> The Nth L1 of a set is connected to the Nth L2 and only to this one. 
>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>> All the L2 are connected to a couple of L3.
>>
>>
>> If we don't put in the L3, we have a perfectly balanced fat tree and 
>> well balanced routes.
>> But when we add the L3, it introduces a huge difference. As it is not 
>> necessary, no route goes through L3 (which is fine).
>> However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 
>> 1/4 are used twice as much (compared to the balanced state).
>>
>> This comes from the down_port_groups_idx which is incremented each 
>> time the algorithm goes down through a node whether it creates routes 
>> to HCA (port != switch)
>> or not. As route coming up from a L1 reaches only one L2, the 
>> algorithm goes through all the other L2 while going down, 
>> incrementing their index.
>> Our case here is a bit specific but in a case where your L1 doesn't 
>> have full connectivity to all your L2, and another switch rank above, 
>> the problem may appear.
>>
>> To avoid this problem, I've changed the 
>> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns 
>> a value to indicate if routes to HCA (in fact to leaf switch) were 
>> created.
>> With this information, we only increase the index when the algorithm 
>> has created routes to HCA.
>> After applying this patch and measuring the link usage, we are at 
>> perfect equilibrium (L2<->L3 links are still not used but that is to 
>> be expected).
>
> Great! I've actually seen this problem on real clusters, but
> couldn't understand what was causing the imbalance.
>
> See a couple of questions below.
>
>> Signed-off-by: Nicolas Morey-Chaisemartin 
>> <nicolas.morey-chaisemartin at ext.bull.net>
>> ---
>>  opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
>>  1 files changed, 14 insertions(+), 9 deletions(-)
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c 
>> b/opensm/opensm/osm_ucast_ftree.c
>> index ebe6612..3474876 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN 
>> cl_map_item_t * const p_map_item,
>>   *        assign-up-going-port-by-descending-down to r-port node 
>> (recursion)
>>   */
>>  
>> -static void
>> +static int
>>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
>> p_ftree,
>>                             IN ftree_sw_t * p_sw,
>>                             IN ftree_sw_t * p_prev_sw,
>> @@ -1932,21 +1932,23 @@ 
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
>> p_ftree,
>>      uint16_t i;
>>      uint16_t j;
>>      uint16_t k;
>> +    uint8_t created_route=0;
>>  
>>      /* we shouldn't enter here if both real_lid and main_path are 
>> false */
>>      CL_ASSERT(is_real_lid || is_main_path);
>>  
>>      /* if there is no down-going ports */
>>      if (p_sw->down_port_groups_num == 0)
>> -        return;
>> +        return 1;
>
> Shouldn't it return 0?
Probably yes. I was thinking of the case where (using the notation from my 
diagram above) an L0 wouldn't have any CN (because they are shut down, 
broken, or reserved for future extension). In this case, I think it would 
smooth things a bit and not unbalance the network.
>> -    /* promote the index that indicates which group should we
>> -       start with when going through all the downgoing groups */
>> -    p_sw->down_port_groups_idx =
>> -        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
>> +    /* If we are on a leaf switch we should be creating routes for 
>> real HCA  */
>> +    /* This flag will be returned  so upper layers will incrementent 
>> shift index */
>> +    if(p_sw->is_leaf == TRUE){
>> +        created_route=1;
>> +    }
>
> The "is_leaf" flag will be TRUE only on leaf switches that have CNs 
> connected to them.
> If we want to solve the problem for all routes (CNs, IO nodes, 
> management nodes),
> the "created_route" flag should be updated elsewhere (see below).
I was not aware of this, so yes it should be done somewhere else (though 
it can also be done here).
>
>>      /* foreach down-going port group (in indexing order) */
>> -    i = p_sw->down_port_groups_idx;
>> +    i = (p_sw->down_port_groups_idx + 1) %  p_sw->down_port_groups_num;
>>      for (k = 0; k < p_sw->down_port_groups_num; k++) {
>
> I think that since p_sw->down_port_groups_idx is promoted below,
> there is no need to increase the starting value of i.
>
I tried that, but I ran into a problem (a segfault, probably due to the 
% p_sw->down_port_groups_num).
If it works without incrementing, it's fine with me.
>> +    if(created_route)
>> +        p_sw->down_port_groups_idx =
>> +            (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> ...
>> +    return created_route;
>>  }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>>  
>>  /***************************************************/
>
> How about something like this:
>
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN 
> cl_map_item_t * const p_map_item,
>   *        assign-up-going-port-by-descending-down to r-port node 
> (recursion)
>   */
>
> -static void
> +static boolean_t
>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>                             IN ftree_sw_t * p_sw,
>                             IN ftree_sw_t * p_prev_sw,
> @@ -1932,18 +1932,14 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>      uint16_t i;
>      uint16_t j;
>      uint16_t k;
> +    boolean_t created_route = FALSE;
>
>      /* we shouldn't enter here if both real_lid and main_path are 
> false */
>      CL_ASSERT(is_real_lid || is_main_path);
>
>      /* if there is no down-going ports */
>      if (p_sw->down_port_groups_num == 0)
> -        return;
> -
> -    /* promote the index that indicates which group should we
> -       start with when going through all the downgoing groups */
> -    p_sw->down_port_groups_idx =
> -        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +        return FALSE;
>
>      /* foreach down-going port group (in indexing order) */
>      i = p_sw->down_port_groups_idx;
> @@ -1952,9 +1948,12 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>          p_group = p_sw->down_port_groups[i];
>          i = (i + 1) % p_sw->down_port_groups_num;
>
> -        /* Skip this port group unless it points to a switch */
> -        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> +        /* If this port group doesn't point to a switch, mark
> +           that the route was created and skip to the next group */
> +        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
> +            created_route = TRUE;
>              continue;
> +        }
>
>          if (p_prev_sw
>              && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
> @@ -2073,16 +2072,25 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>
>          /* Recursion step:
>             Assign upgoing ports by stepping down, starting on REMOTE 
> switch */
> -        __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, 
> p_remote_sw,    /* remote switch - used as a route-upgoing alg. start 
> point */
> -                                   NULL,    /* prev. position - NULL 
> to mark that we went down and not up */
> -                                   target_lid,    /* LID that we're 
> routing to */
> -                                   target_rank,    /* rank of the LID 
> that we're routing to */
> -                                   is_real_lid,    /* whether the 
> target LID is real or dummy */
> -                                   is_main_path,    /* whether this 
> is path to HCA that should by tracked by counters */
> -                                   highest_rank_in_route);    /* 
> highest visited point in the tree before going down */
> +        created_route |= 
> __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
> +            p_remote_sw,    /* remote switch - used as a 
> route-upgoing alg. start point */
> +            NULL,        /* prev. position - NULL to mark that we 
> went down and not up */
> +            target_lid,    /* LID that we're routing to */
> +            target_rank,    /* rank of the LID that we're routing to */
> +            is_real_lid,    /* whether the target LID is real or 
> dummy */
> +            is_main_path,    /* whether this is path to HCA that 
> should by tracked by counters */
> +            highest_rank_in_route);    /* highest visited point in 
> the tree before going down */
>      }
>      /* done scanning all the down-going port groups */
>
> +    /* if the route was created, promote the index that
> +       indicates which group should we start with when
> +       going through all the downgoing groups */
> +    if (created_route)
> +        p_sw->down_port_groups_idx =
> +            (p_sw->down_port_groups_idx + 1) % 
> p_sw->down_port_groups_num;
> +
> +    return created_route;
>  }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>
>  /***************************************************/

That seems good.
I'm going to think a bit more about the case where there are no downports.

Best regards

Nicolas


From vlad at lists.openfabrics.org  Wed Feb  4 03:11:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed,  4 Feb 2009 03:11:13 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090204-0200 daily build status
Message-ID: <20090204111113.8A537E60DCF@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From hal.rosenstock at gmail.com  Wed Feb  4 04:29:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 07:29:36 -0500
Subject: [ofa-general] Possible memory leak and null pointer dereference 
	in local_completions()
In-Reply-To: <1233689172.23327.155.camel@chromite.mv.qlogic.com>
References: <1233689172.23327.155.camel@chromite.mv.qlogic.com>
Message-ID: <f0e08f230902040429p5c01abd0y349abb413e120277@mail.gmail.com>

On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell
<ralph.campbell at qlogic.com> wrote:
> I was doing some tests with different MAD packets and
> then reading the infiniband/core/mad.c code.
>
> handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
> on the mad_agent_priv->local_work work queue with
> local->mad_priv == NULL if device->process_mad() returns
> IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
> (!ib_response_mad(&mad_priv->mad.mad) ||
>  !mad_agent_priv->agent.recv_handler).
>
> In this case, local_completions() will be called with
> local->mad_priv == NULL. The code does check for this
> case and skips calling recv_mad_agent->agent.recv_handler().
> This means recv == 0 so kmem_cache_free() is called with a
> NULL pointer.

That could be fixed by changing the check for !recv prior to the
kmem_cache_free there to a check for (!recv && local->mad_priv).

> Even if local->mad_priv != NULL, I don't see how local->mad_priv
> is freed when recv == 1. Thus, it appears to be a memory leak.

For those cases, it's either freed in local_completions (as recv is
set to 1 for local->mad_priv != NULL except when there is no mad recv
agent but that is another bug (see below)) or earlier in the else
clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of
handle_outgoing_dr_smp(). That's another issue that this points out
where recv = 1 needs to be moved up in local_completions.

Would you try the untested patch below and see if it fixes the problem
you found ? Thanks.

-- Hal

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..cca87e6 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work)
                list_del(&local->completion_list);
                spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
                if (local->mad_priv) {
+                       recv = 1;
                        recv_mad_agent = local->recv_mad_agent;
                        if (!recv_mad_agent) {
                                printk(KERN_ERR PFX "No receive MAD agent for lo
                                goto local_send_completion;
                        }

-                       recv = 1;
                        /*
                         * Defined behavior is to complete response
                         * before request
@@ -2422,7 +2422,7 @@ local_send_completion:

                spin_lock_irqsave(&mad_agent_priv->lock, flags);
                atomic_dec(&mad_agent_priv->refcount);
-               if (!recv)
+               if (!recv && local->mad_priv)
                        kmem_cache_free(ib_mad_cache, local->mad_priv);
                kfree(local);
        }

> So, I'm proposing the following patch:
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 5c54fc2..93d80e5 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
>        struct ib_mad_local_private *local;
>        struct ib_mad_agent_private *recv_mad_agent;
>        unsigned long flags;
> -       int recv = 0;
>        struct ib_wc wc;
>        struct ib_mad_send_wc mad_send_wc;
>
> @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
>                                goto local_send_completion;
>                        }
>
> -                       recv = 1;
>                        /*
>                         * Defined behavior is to complete response
>                         * before request
> @@ -2422,7 +2420,7 @@ local_send_completion:
>
>                spin_lock_irqsave(&mad_agent_priv->lock, flags);
>                atomic_dec(&mad_agent_priv->refcount);
> -               if (!recv)
> +               if (local->mad_priv)
>                        kmem_cache_free(ib_mad_cache, local->mad_priv);
>                kfree(local);
>        }
>
>


From monis at Voltaire.COM  Wed Feb  4 05:30:03 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 15:30:03 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902040846.48370.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>	<49888558.3050506@Voltaire.COM>
	<200902040846.48370.jackm@dev.mellanox.co.il>
Message-ID: <4989985B.6010707@Voltaire.COM>


> It was originally written without the path->valid check in the "if", and so was based on the path record
> being allocated within the "if".  In this case, the path record was not yet inserted into the path list.
> When you added the "valid" processing, you did not take this into account.
> 
> You need code something like the following:
> 
> 	path = __path_find(dev, phdr->hwaddr + 4);
> 	if (!path || !path->valid) {
> 		int had_path = 0;
> 		if (!path)
> 			path = path_rec_create(dev, phdr->hwaddr + 4);
> 		else
> 		    had_path = 1;
> 		if (path) {
> 			/* put pseudoheader back on for next time */
> 			skb_push(skb, sizeof *phdr);
> 			__skb_queue_tail(&path->queue, skb);
> 
> 			if (path_rec_start(dev, path)) {
> 				if (had_path)
> 					/* detach from path list here under spinlock */
> 				spin_unlock(&priv->lock);
> 				path_free(dev, path);
> 				return;
> 			} else if (!had_path)
> 				__path_add(dev, path);
> 		} else {
> 			++dev->stats.tx_dropped;
> 			dev_kfree_skb_any(skb);
> 		}
> 
> 		spin_unlock(&priv->lock);
> 		return;
> 	}

I hope I'm not missing something, but __path_add() checks whether the path
already exists and, if so, returns -EEXIST without adding it.

                ret = memcmp(path->pathrec.dgid.raw, tpath->pathrec.dgid.raw,
                             sizeof (union ib_gid));
                if (ret < 0)
                        n = &pn->rb_left;
                else if (ret > 0)
                        n = &pn->rb_right;
                else
                        return -EEXIST;
        }

So the code you suggest may improve performance, but I don't see how it solves the bug.


From monis at Voltaire.COM  Wed Feb  4 05:33:38 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 15:33:38 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49888558.3050506@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM>
Message-ID: <49899932.5060507@Voltaire.COM>

Yossi Etigin wrote:
> I think it comes from unicast_arp_send.
> 
> Consider this scenario:
> - paths are flushed (opensm up/down).
> - unicast_arp_send() is called with a path in priv->path_list.
> path->valid is 0.
> - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails -
> no sm ah (yet)
>  (see the prints just before the panic).
> - unicast_arp_send() calls path_free().
> - path memory is overwritten.
> - __ipoib_dev_flush() is called again.
> - mark_paths_invalid() tries to iterate over priv->path_list and gets
> kernel panic
>  because path->list became invalid.
> 
> --Yossi
> 
I agree with Yossi's analysis.
Isn't the fix just as simple as this?

void ipoib_mark_paths_invalid(struct net_device *dev)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        struct ipoib_path *path, *tp;

        spin_lock_irq(&priv->lock);

        list_for_each_entry_safe(path, tp, &priv->path_list, list) {
                ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
                        be16_to_cpu(path->pathrec.dlid),
                        IPOIB_GID_ARG(path->pathrec.dgid));
-                path->valid =  0;
+                if (path)
+			path->valid =  0;
        }

        spin_unlock_irq(&priv->lock);
}


From jackm at dev.mellanox.co.il  Wed Feb  4 05:45:22 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 15:45:22 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49899932.5060507@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM> <49899932.5060507@Voltaire.COM>
Message-ID: <200902041545.22662.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 15:33, Moni Shoua wrote:
> Isn't the fix just as simple as this?
> 
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
>         struct ipoib_path *path, *tp;
> 
>         spin_lock_irq(&priv->lock);
> 
>         list_for_each_entry_safe(path, tp, &priv->path_list, list) {
>                 ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
>                         be16_to_cpu(path->pathrec.dlid),
>                         IPOIB_GID_ARG(path->pathrec.dgid));
> -                path->valid =  0;
> +                if (path)
> +			path->valid =  0;
>         }
> 
>         spin_unlock_irq(&priv->lock);
> }
> 
I doubt it.  You are leaving a deleted path record as part of the path list.
This is list corruption (since the list pointers themselves are part of the
path record structure -- what if this returned storage is re-allocated?).
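
To illustrate why the record has to be unlinked before it is freed, here is a
minimal userspace sketch. It assumes a hypothetical demo_path record and a
hand-rolled circular list modelled on the kernel's list_head; none of these
names come from the IPoIB code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

/* Hypothetical stand-in for struct ipoib_path: the links live in the record. */
struct demo_path {
	struct demo_list list;	/* must stay the first member, see below */
	int valid;
};

void demo_list_init(struct demo_list *head)
{
	head->next = head->prev = head;
}

void demo_list_add_tail(struct demo_list *entry, struct demo_list *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

void demo_list_del(struct demo_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

/* Unlink first, then free: afterwards no list pointer refers to 'path'. */
void demo_path_remove(struct demo_path *path)
{
	assert(path != NULL);
	demo_list_del(&path->list);
	free(path);
}

/* Mark every path on the list invalid; returns the number of records visited. */
int demo_mark_paths_invalid(struct demo_list *head)
{
	int count = 0;

	for (struct demo_list *p = head->next; p != head; p = p->next) {
		/* 'list' is the first member, so the cast recovers the record. */
		((struct demo_path *)p)->valid = 0;
		count++;
	}
	return count;
}
```

If demo_path_remove() called free() without demo_list_del(), the neighbours
would keep pointing into freed memory and the walk in
demo_mark_paths_invalid() would follow those stale pointers. A NULL check on
the iterator cannot detect that.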

I think the correct fix (after your previous posted comment) is:
        path = __path_find(dev, phdr->hwaddr + 4);
        if (!path || !path->valid) {
                int had_path = 0;
                if (!path)
                        path = path_rec_create(dev, phdr->hwaddr + 4);
                else
                        had_path = 1;
                if (path) {
                        /* put pseudoheader back on for next time */
                        skb_push(skb, sizeof *phdr);
                        __skb_queue_tail(&path->queue, skb);

                        if (path_rec_start(dev, path)) {
                                if (had_path) {
                                        list_del(&path->list);
                                        rb_erase(&path->rb_node,
                                                 &priv->path_tree);
                                }
                                spin_unlock_irqrestore(&priv->lock, flags);
                                path_free(dev, path);
                                return;
                        } else
                                __path_add(dev, path);
                } else {
                        ++dev->stats.tx_dropped;
                        dev_kfree_skb_any(skb);
                }

                spin_unlock_irqrestore(&priv->lock, flags);
                return;
        }

My only question here is:
Do we have to worry about netif_tx_lock_bh(dev) (as taken in ipoib_flush_paths)?
(If we do, we have a problem).

- Jack


From monis at Voltaire.COM  Wed Feb  4 07:45:28 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 17:45:28 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902041545.22662.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>	<49888558.3050506@Voltaire.COM>
	<49899932.5060507@Voltaire.COM>
	<200902041545.22662.jackm@dev.mellanox.co.il>
Message-ID: <4989B818.102@Voltaire.COM>


> I doubt it.  You are leaving a deleted path record as part of the path list.
> This is list corruption (since the list pointers themselves are part of the
> path record structure -- what if this returned storage is re-allocated?).
> 
You are right. 
> I think the correct fix (after your previous posted comment) is:
>         path = __path_find(dev, phdr->hwaddr + 4);
>         if (!path || !path->valid) {
>                 int had_path = 0;
>                 if (!path)
>                         path = path_rec_create(dev, phdr->hwaddr + 4);
>                 else
>                         had_path = 1;
>                 if (path) {
>                         /* put pseudoheader back on for next time */
>                         skb_push(skb, sizeof *phdr);
>                         __skb_queue_tail(&path->queue, skb);
> 
>                         if (path_rec_start(dev, path)) {
>                                 if (had_path) {
>                                         list_del(&path->list);
>                                         rb_erase(&path->rb_node,
>                                                  &priv->path_tree);
>                                 }
>                                 spin_unlock_irqrestore(&priv->lock, flags);
>                                 path_free(dev, path);
>                                 return;
>                         } else
>                                 __path_add(dev, path);
>                 } else {
>                         ++dev->stats.tx_dropped;
>                         dev_kfree_skb_any(skb);
>                 }
> 
>                 spin_unlock_irqrestore(&priv->lock, flags);
>                 return;
>         }
> 
> My only question here is:
> Do we have to worry about netif_tx_lock_bh(dev) (as taken in ipoib_flush_paths)?
> (If we do, we have a problem).
> 
Besides the locking issue, which I haven't thought about yet, this fix looks like
the right thing to do.
But what if we leave the path without freeing it even if path_rec_start() fails?
This would leave an invalid path in path_list, which I conclude is not a forbidden
state (after all, this is the state the function was called in).
This way, I think we don't have to worry about locks.

The code would then look like this:

        if (!path || !path->valid) {
                if (!path)
                        path = path_rec_create(dev, phdr->hwaddr + 4);
                if (path) {
                        /* put pseudoheader back on for next time */
                        skb_push(skb, sizeof *phdr);
                        __skb_queue_tail(&path->queue, skb);

                        if (!path->query && path_rec_start(dev, path)) {
                                spin_unlock_irqrestore(&priv->lock, flags);
-                               path_free(dev, path);
                                return;
                        } else
                                __path_add(dev, path);
                } else {
                        ++priv->stats.tx_dropped;
                        dev_kfree_skb_any(skb);
                }

                spin_unlock_irqrestore(&priv->lock, flags);
                return;
        }


From jackm at dev.mellanox.co.il  Wed Feb  4 08:03:57 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 18:03:57 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <4989B818.102@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<200902041545.22662.jackm@dev.mellanox.co.il>
	<4989B818.102@Voltaire.COM>
Message-ID: <200902041803.57457.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 17:45, Moni Shoua wrote:
> Besides the locking issue, which I haven't thought about yet, this fix looks like
> the right thing to do.
> But what if we leave the path without freeing it even if path_rec_start() fails?
> This would leave an invalid path in path_list, which I conclude is not a forbidden
> state (after all, this is the state the function was called in).
> This way, I think we don't have to worry about locks.
> 
> The code would then look like this:
> 
>         if (!path || !path->valid) {
>                 if (!path)
>                         path = path_rec_create(dev, phdr->hwaddr + 4);
>                 if (path) {
>                         /* put pseudoheader back on for next time */
>                         skb_push(skb, sizeof *phdr);
>                         __skb_queue_tail(&path->queue, skb);
> 
>                         if (!path->query && path_rec_start(dev, path)) {
>                                 spin_unlock_irqrestore(&priv->lock, flags);
> -                               path_free(dev, path);
>                                 return;
>                         } else
>                                 __path_add(dev, path);
>                 } else {
>                         ++priv->stats.tx_dropped;
>                         dev_kfree_skb_any(skb);
>                 }
> 
>                 spin_unlock_irqrestore(&priv->lock, flags);
>                 return;
>         }
> 
This still needs some correction.  If the path did not exist previously (i.e., !path is
true and, below, had_path = 0), then we need to call path_free() or we will have a leak.

Maybe the correct patch is:
       path = __path_find(dev, phdr->hwaddr + 4);
        if (!path || !path->valid) {
                int had_path = 0;
                if (!path)
                        path = path_rec_create(dev, phdr->hwaddr + 4);
                else
                        had_path = 1;
                if (path) {
                        /* put pseudoheader back on for next time */
                        skb_push(skb, sizeof *phdr);
                        __skb_queue_tail(&path->queue, skb);

                        if (!path->query && path_rec_start(dev, path)) {
                                spin_unlock_irqrestore(&priv->lock, flags);
				if (!had_path)
                                	path_free(dev, path);
                                return;
                        } else
                                __path_add(dev, path);
                } else {
                        ++dev->stats.tx_dropped;
                        dev_kfree_skb_any(skb);
                }
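
The ownership rule behind the had_path test can be sketched in userspace as
follows. This is a simplified illustration, not the driver code: demo_path,
its 'listed' flag and demo_query_failed() are hypothetical names introduced
only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ipoib_path. */
struct demo_path {
	bool listed;	/* true once the record is reachable from the path list */
};

/*
 * Handle a failed query start for 'path'. 'had_path' says whether the record
 * was found in the list (true) or was freshly allocated by the caller (false).
 * Returns the pointer the list still owns, or NULL if the record was freed.
 */
struct demo_path *demo_query_failed(struct demo_path *path, bool had_path)
{
	assert(path != NULL);
	/* A listed record must not be passed in as "new", and vice versa. */
	assert(path->listed == had_path);

	if (!had_path) {
		/* Nobody else can reach it: free it here or it leaks. */
		free(path);
		return NULL;
	}

	/* Still linked into the list: freeing it would corrupt the list. */
	return path;
}
```

The sketch covers only the failure branch. A record found in the list stays
there, still invalid, so that a later send can retry the query.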

From halr at obsidianresearch.com  Wed Feb  4 08:14:48 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:14:48 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/include/iba/ib_types.h: Add
	xmit_wait for PortCounters
Message-ID: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to ib_types.h to add xmit_wait field to PortCounters. Also,
updated a reference from IBA 1.2 to 1.2.1.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-include-iba-ib_types.h-Add-xmit_wait-for-Por.patch
Type: application/mbox
Size: 1123 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090204/a6d1c9e4/attachment.mbox>

From halr at obsidianresearch.com  Wed Feb  4 08:15:07 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:15:07 -0700
Subject: [ofa-general] [PATCH 0/3] OpenSM/PerfMgr improvements
Message-ID: <1233764107.8992.459.camel@bertha1.edm.orcorp.ca>

Sasha,

Following patch series improves PerfMgr:
1 - cosmetic cleanups
2 - Move ESP0 determination into __malloc_node
3 - Move ESP0 determination into monitored node

These patches are based on previous PerfMgr patches sent over the last
couple days.

-- Hal


From halr at obsidianresearch.com  Wed Feb  4 08:15:10 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:15:10 -0700
Subject: [ofa-general] [PATCH 1/3] opensm/PerfMgr: Mainly cosmetic changes
Message-ID: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca>

Sasha,

Cosmetic changes to PerfMgr:
Eliminated unneeded extra parentheses
Made some formatting consistent
Simplified some internal names
Also, removed inline from __init_monitored_nodes declaration

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-opensm-PerfMgr-Mainly-cosmetic-changes.patch
Type: application/mbox
Size: 19659 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090204/e8881100/attachment.mbox>

From halr at obsidianresearch.com  Wed Feb  4 08:15:15 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:15:15 -0700
Subject: [ofa-general] [PATCH 2/3] opensm/osm_perfmgr_db.(h c): Move ESP0
	determination into __malloc_node
Message-ID: <1233764115.8992.461.camel@bertha1.edm.orcorp.ca>

Sasha,

This patch performs the ESP0 determination once per db_node allocation
rather than in bad_node_port in the PerfMgr.

-- Hal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-opensm-osm_perfmgr_db.-h-c-Move-ESP0-determination.patch
Type: application/mbox
Size: 5609 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090204/7bf5d39c/attachment.mbox>

From halr at obsidianresearch.com  Wed Feb  4 08:15:19 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:15:19 -0700
Subject: [ofa-general] [PATCH 3/3] opensm/PerfMgr: Move ESP0 determination in
	monitored node
Message-ID: <1233764119.8992.462.camel@bertha1.edm.orcorp.ca>

Sasha,

This patch moves the ESP0 determination into monitored node and copies
into db_node when needed.

-- Hal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0004-opensm-PerfMgr-Move-ESP0-determination-in-monitored.patch
Type: application/mbox
Size: 5405 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090204/62828d29/attachment.mbox>

From monis at Voltaire.COM  Wed Feb  4 08:16:15 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 18:16:15 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902041803.57457.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>	<200902041545.22662.jackm@dev.mellanox.co.il>	<4989B818.102@Voltaire.COM>
	<200902041803.57457.jackm@dev.mellanox.co.il>
Message-ID: <4989BF4F.1060707@Voltaire.COM>


> This still needs some correction.  If the path did not exist previously (i.e., !path is
> true and, below, had_path = 0), then we need to call path_free() or we will have a leak.
> 
True
> Maybe the correct patch is:
>        path = __path_find(dev, phdr->hwaddr + 4);
>         if (!path || !path->valid) {
>                 int had_path = 0;
>                 if (!path)
>                         path = path_rec_create(dev, phdr->hwaddr + 4);
>                 else
>                         had_path = 1;
>                 if (path) {
>                         /* put pseudoheader back on for next time */
>                         skb_push(skb, sizeof *phdr);
>                         __skb_queue_tail(&path->queue, skb);
> 
>                         if (!path->query && path_rec_start(dev, path)) {
>                                 spin_unlock_irqrestore(&priv->lock, flags);
> 				if (!had_path)
>                                 	path_free(dev, path);
>                                 return;
>                         } else
>                                 __path_add(dev, path);
>                 } else {
>                         ++dev->stats.tx_dropped;
>                         dev_kfree_skb_any(skb);
>                 }
This one looks good  to me.
Are you going to make a patch and submit it?

I think it would be best if you run the same test on the patched IPoIB before submission.
Do you agree?

thanks


From jackm at dev.mellanox.co.il  Wed Feb  4 08:25:16 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 18:25:16 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <4989BF4F.1060707@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<200902041803.57457.jackm@dev.mellanox.co.il>
	<4989BF4F.1060707@Voltaire.COM>
Message-ID: <200902041825.16354.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 18:16, Moni Shoua wrote:
> This one looks good  to me.
> Are you going to make a patch and submit it?
> 
> I think it would be best if you run the same test on the patched IPoIB before submission.
> Do you agree?
> 
I'll do a patch tomorrow.
We'll run the test over the weekend.
I'll submit it on Sunday if all is well.

- Jack


From sean.hefty at intel.com  Wed Feb  4 08:41:35 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 4 Feb 2009 08:41:35 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <49893FAF.3090007@voltaire.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>
	<49893FAF.3090007@voltaire.com>
Message-ID: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>

I was mixing up ib_sa_get_mcmember_rec and ib_sa_mcmember_rec_query.  I'm
following you now.  There may be some way to defer setting the qkey if it's not
available when binding, but how does allowing the bind to proceed help?  Without
the qkey, the QP is basically unusable.

- Sean


From sashak at voltaire.com  Wed Feb  4 09:43:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 19:43:33 +0200
Subject: [ofa-general] Re: [PATCHv2] libibmad/(mad.h fields.c): Add support
	for PerfMgt ClassPortInfo
In-Reply-To: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca>
References: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204174333.GT11874@sashak.voltaire.com>

On 11:58 Mon 02 Feb     , Hal Rosenstock wrote:
> 
> Attached is v2 of a patch to add support for PerfMgt ClassPortInfo attribute
> into libibmad.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb  4 09:43:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 19:43:54 +0200
Subject: [ofa-general] Re: [PATCHv2] ibsim/sim_mad.c: Add sim support for
	PerfMgt ClassPortInfo
In-Reply-To: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca>
References: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204174354.GU11874@sashak.voltaire.com>

On 11:58 Mon 02 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Attached is v2 of a patch to add simulator support for PerfMgt ClassPortInfo
> (subsequent to previous libibmad patch).

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb  4 10:14:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 20:14:21 +0200
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs
	for cleaner function interfaces
In-Reply-To: <20090202185425.729a80b3.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
Message-ID: <20090204181421.GV11874@sashak.voltaire.com>

Hi Ira,

On 18:54 Mon 02 Feb     , Ira Weiny wrote:
> Beginning to clean up the libibmad interface.
> 
> Ira
> 
> 
> From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001
> From: weiny2 at llnl.gov <weiny2 at llnl.gov>
> Date: Mon, 2 Feb 2009 10:21:18 -0800
> Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces

I don't understand how typedefing the enums makes things cleaner - actually
this will force me to explicitly check the actual type in the header
files. Sometimes typedefs could help with porting, but that is not the
case here.

Sasha

> 
> 
> Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>
> ---
>  libibmad/include/infiniband/mad.h |   38 ++++++++++++++++++------------------
>  libibmad/src/fields.c             |   22 ++++++++++----------
>  libibmad/src/resolve.c            |   10 ++++----
>  3 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 9ff4a3e..f235ab0 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -203,7 +203,7 @@ typedef struct ib_field {
>  	ib_mad_dump_fn *def_dump_fn;
>  } ib_field_t;
>  
> -enum MAD_FIELDS {
> +typedef enum MAD_FIELDS {
>  	IB_NO_FIELD,
>  
>  	IB_GID_PREFIX_F,
> @@ -525,7 +525,7 @@ enum MAD_FIELDS {
>  	IB_GUID_GUID0_F,
>  
>  	IB_FIELD_LAST_		/* must be last */
> -};
> +} mad_field_t;
>  
>  /*
>   * SA RMPP section
> @@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
>  #define MAD_DEF_RETRIES		3
>  #define MAD_DEF_TIMEOUT_MS	1000
>  
> -enum {
> +typedef enum {
>  	IB_DEST_LID,
>  	IB_DEST_DRPATH,
>  	IB_DEST_GUID,
>  	IB_DEST_DRSLID,
> -};
> +} mad_dest_t;
>  
> -enum {
> +typedef enum {
>  	IB_NODE_CA = 1,
>  	IB_NODE_SWITCH,
>  	IB_NODE_ROUTER,
>  	NODE_RNIC,
>  
>  	IB_NODE_MAX = NODE_RNIC
> -};
> +} mad_node_type_t;
>  
>  /******************************************************************************/
>  
> @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey)
>  }
>  
>  /* fields.c */
> -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field);
> -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field,
> +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field);
> +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field,
>  			      uint32_t val);
>  /* field must be byte aligned */
> -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field);
> -MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field,
> +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field);
> +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field,
>  				uint64_t val);
> -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val);
> -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val);
> -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val);
> -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val);
> -MAD_EXPORT int mad_print_field(int field, const char *name, void *val);
> -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val);
> -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val);
> +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val);
> +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val);
> +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val);
> +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val);
> +MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val);
> +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val);
> +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val);
>  
>  /* mad.c */
>  MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath,
> @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
>  MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
>  			       ib_portid_t * sm_id, int timeout);
>  MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> -				     int dest_type, ib_portid_t * sm_id);
> +				     mad_dest_t dest, ib_portid_t * sm_id);
>  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
>  			       ibmad_gid_t * gid);
>  
> @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
>  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>  			ib_portid_t * sm_id, int timeout, const void *srcport);
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> -			      int dest_type, ib_portid_t * sm_id,
> +			      mad_dest_t dest, ib_portid_t * sm_id,
>  			      const void *srcport);
>  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
>  			const void *srcport);
> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
> index d5a1eb4..d435a2f 100644
> --- a/libibmad/src/fields.c
> +++ b/libibmad/src/fields.c
> @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f,
>  	memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8);
>  }
>  
> -uint32_t mad_get_field(void *buf, int base_offs, int field)
> +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field)
>  {
>  	return _get_field(buf, base_offs, ib_mad_f + field);
>  }
>  
> -void mad_set_field(void *buf, int base_offs, int field, uint32_t val)
> +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val)
>  {
>  	_set_field(buf, base_offs, ib_mad_f + field, val);
>  }
>  
> -uint64_t mad_get_field64(void *buf, int base_offs, int field)
> +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field)
>  {
>  	return _get_field64(buf, base_offs, ib_mad_f + field);
>  }
>  
> -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val)
> +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val)
>  {
>  	_set_field64(buf, base_offs, ib_mad_f + field, val);
>  }
>  
> -void mad_set_array(void *buf, int base_offs, int field, void *val)
> +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val)
>  {
>  	_set_array(buf, base_offs, ib_mad_f + field, val);
>  }
>  
> -void mad_get_array(void *buf, int base_offs, int field, void *val)
> +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val)
>  {
>  	_get_array(buf, base_offs, ib_mad_f + field, val);
>  }
>  
> -void mad_decode_field(uint8_t * buf, int field, void *val)
> +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val)
>  {
>  	const ib_field_t *f = ib_mad_f + field;
>  
> @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val)
>  	_get_array(buf, 0, f, val);
>  }
>  
> -void mad_encode_field(uint8_t * buf, int field, void *val)
> +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val)
>  {
>  	const ib_field_t *f = ib_mad_f + field;
>  
> @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val,
>  			 valsz ? valsz : ALIGN(f->bitlen, 8) / 8);
>  }
>  
> -int mad_print_field(int field, const char *name, void *val)
> +int mad_print_field(mad_field_t field, const char *name, void *val)
>  {
>  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
>  		return -1;
>  	return _mad_print_field(ib_mad_f + field, name, val, 0);
>  }
>  
> -char *mad_dump_field(int field, char *buf, int bufsz, void *val)
> +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val)
>  {
>  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
>  		return 0;
>  	return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val);
>  }
>  
> -char *mad_dump_val(int field, char *buf, int bufsz, void *val)
> +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val)
>  {
>  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
>  		return 0;
> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> index b62360b..faac1f9 100644
> --- a/libibmad/src/resolve.c
> +++ b/libibmad/src/resolve.c
> @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>  }
>  
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> -			      int dest_type, ib_portid_t * sm_id,
> +			      mad_dest_t dest, ib_portid_t * sm_id,
>  			      const void *srcport)
>  {
>  	uint64_t guid;
> @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>  	ib_portid_t selfportid = { 0 };
>  	int selfport = 0;
>  
> -	switch (dest_type) {
> +	switch (dest) {
>  	case IB_DEST_LID:
>  		lid = strtol(addr_str, 0, 0);
>  		if (!IB_LID_VALID(lid))
> @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>  		return 0;
>  
>  	default:
> -		IBWARN("bad dest_type %d", dest_type);
> +		IBWARN("bad dest %d", dest);
>  	}
>  
>  	return -1;
>  }
>  
> -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
> +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest,
>  			  ib_portid_t * sm_id)
>  {
> -	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
> +	return ib_resolve_portid_str_via(portid, addr_str, dest,
>  					 sm_id, NULL);
>  }
>  
> -- 
> 1.5.4.5
> 


From jgunthorpe at obsidianresearch.com  Wed Feb  4 10:20:23 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 4 Feb 2009 11:20:23 -0700
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as
	typedefs for cleaner function interfaces
In-Reply-To: <20090204181421.GV11874@sashak.voltaire.com>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
Message-ID: <20090204182023.GP7618@obsidianresearch.com>

On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote:

> I don't understand how typedefing the enums makes things cleaner - actually
> this will force me to explicitly check the actual type in the header
> files. Sometimes typedefs could help with porting, but that is not the
> case here.

Not typedefing per se, but passing an enum through an int is not that
great. You don't need the typedefs to do this, just 'enum MAD_FIELDS'
for instance will do.
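
As a rough sketch of what that looks like, assuming a hypothetical and heavily
shortened enum (demo_fields, demo_values and demo_get_field() are made up for
the example and are not part of libibmad):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened field list; the real enum MAD_FIELDS is far longer. */
enum demo_fields {
	DEMO_NO_FIELD,
	DEMO_LID_F,
	DEMO_GUID_F,
	DEMO_FIELD_LAST_	/* must be last */
};

/* Made-up values, indexed by field, standing in for decoded MAD data. */
static const uint32_t demo_values[DEMO_FIELD_LAST_] = { 0, 0x12, 0x34 };

/*
 * The parameter type names the enum directly, so no typedef is required and
 * tags/cscope lead straight from the prototype to the list of valid values.
 * Returns 0 and stores the value in *val, or -1 for an out-of-range field.
 */
int demo_get_field(enum demo_fields field, uint32_t *val)
{
	assert(val != NULL);

	if (field <= DEMO_NO_FIELD || field >= DEMO_FIELD_LAST_)
		return -1;

	*val = demo_values[field];
	return 0;
}
```

Callers that already pass the enumerators by name need no source change, since
only the declared parameter type differs from the int version.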

Jason


From sashak at voltaire.com  Wed Feb  4 10:25:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 20:25:20 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204182520.GW11874@sashak.voltaire.com>

Hi Hal,

On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
> 
> Trivial description change to osm_node_get_num_physp.

It is somewhat troublesome for me to comment on attachments... :(

In this comment line:

+*      Returns the number of physical ports (+1) for this node.

"(+1)" will not be true for switch nodes.

Sasha


From sashak at voltaire.com  Wed Feb  4 10:27:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 20:27:25 +0200
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as
	typedefs for cleaner function interfaces
In-Reply-To: <20090204182023.GP7618@obsidianresearch.com>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204182023.GP7618@obsidianresearch.com>
Message-ID: <20090204182725.GX11874@sashak.voltaire.com>

On 11:20 Wed 04 Feb     , Jason Gunthorpe wrote:
> On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote:
> 
> > I don't understand how typedefing the enums makes things cleaner - actually
> > this will force me to explicitly check the actual type in the header
> > files. Sometimes typedefs could help with porting, but that is not the
> > case here.
> 
> Not typedefing per se, but passing an enum through an int is not that
> great. You don't need the typedefs to do this, just 'enum MAD_FIELDS'
> for instance will do.

Yes, that would be fine to do.

Sasha


From weiny2 at llnl.gov  Wed Feb  4 10:30:05 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 4 Feb 2009 10:30:05 -0800
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs
 for cleaner function interfaces
In-Reply-To: <20090204181421.GV11874@sashak.voltaire.com>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
Message-ID: <20090204103005.4ef9256a.weiny2@llnl.gov>

On Wed, 4 Feb 2009 20:14:21 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 18:54 Mon 02 Feb     , Ira Weiny wrote:
> > Beginning to clean up the libibmad interface.
> > 
> > Ira
> > 
> > 
> > From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001
> > From: weiny2 at llnl.gov <weiny2 at llnl.gov>
> > Date: Mon, 2 Feb 2009 10:21:18 -0800
> > Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces
> 
> I don't understand how typedefing enums makes things cleaner - actually
> this will force me to explicitly check the actual type in header
> files. Sometimes typedefs can help with porting, but that is not the
> case here.

Yes, this will force you to use the correct type.  But I was looking at it from
the user's standpoint.  If I give the user a uint8_t buffer and tell them to use
this library to decode fields, how do they know which values to pass in this
call?

uint32_t mad_get_field(void *buf, int base_offs, int field);

Using mad_field_t or even enum MAD_FIELDS allows one to use tags/cscope to find
the valid values for that parameter easily.  Grepping will work but is still
cumbersome.
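
A minimal sketch of the point, assuming a shortened stand-in enum
(demo_field, demo_offs and demo_get_field are hypothetical names, not part
of libibmad). With an enum-typed parameter the prototype itself names the set
of valid values, so tags/cscope lead straight to the definition:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for enum MAD_FIELDS. */
enum demo_field {
	DEMO_NO_FIELD,
	DEMO_LID_F,
	DEMO_PORT_F,
	DEMO_FIELD_LAST_	/* must be last */
};

/* Byte offset of each (one-byte) demo field inside the buffer. */
static const int demo_offs[DEMO_FIELD_LAST_] = { 0, 0, 1 };

/*
 * The parameter type documents which values are valid; with a plain
 * "int field" the reader has to go hunting for them.
 */
static uint32_t demo_get_field(const void *buf, int base_offs,
			       enum demo_field field)
{
	const uint8_t *p = buf;

	assert(field > DEMO_NO_FIELD && field < DEMO_FIELD_LAST_);
	return p[base_offs + demo_offs[field]];
}
```

Note that C still converts a plain int to an enum parameter implicitly, so
the gain here is mostly documentation and tooling rather than strict type
checking.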

Again, I am trying to write a library which makes it easier for someone who
might not be familiar with IB to extract diagnostic data.  I understand you
wanting the decoding of the data to be more flexible and abstract but we should
make the interface for decoding that data easier to use.  I feel the following
patch does this.

Would you prefer to use:

uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field);

?

Ira


> 
> Sasha
> 
> > 
> > 
> > Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>
> > ---
> >  libibmad/include/infiniband/mad.h |   38 ++++++++++++++++++------------------
> >  libibmad/src/fields.c             |   22 ++++++++++----------
> >  libibmad/src/resolve.c            |   10 ++++----
> >  3 files changed, 35 insertions(+), 35 deletions(-)
> > 
> > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> > index 9ff4a3e..f235ab0 100644
> > --- a/libibmad/include/infiniband/mad.h
> > +++ b/libibmad/include/infiniband/mad.h
> > @@ -203,7 +203,7 @@ typedef struct ib_field {
> >  	ib_mad_dump_fn *def_dump_fn;
> >  } ib_field_t;
> >  
> > -enum MAD_FIELDS {
> > +typedef enum MAD_FIELDS {
> >  	IB_NO_FIELD,
> >  
> >  	IB_GID_PREFIX_F,
> > @@ -525,7 +525,7 @@ enum MAD_FIELDS {
> >  	IB_GUID_GUID0_F,
> >  
> >  	IB_FIELD_LAST_		/* must be last */
> > -};
> > +} mad_field_t;
> >  
> >  /*
> >   * SA RMPP section
> > @@ -595,21 +595,21 @@ typedef struct ib_vendor_call {
> >  #define MAD_DEF_RETRIES		3
> >  #define MAD_DEF_TIMEOUT_MS	1000
> >  
> > -enum {
> > +typedef enum {
> >  	IB_DEST_LID,
> >  	IB_DEST_DRPATH,
> >  	IB_DEST_GUID,
> >  	IB_DEST_DRSLID,
> > -};
> > +} mad_dest_t;
> >  
> > -enum {
> > +typedef enum {
> >  	IB_NODE_CA = 1,
> >  	IB_NODE_SWITCH,
> >  	IB_NODE_ROUTER,
> >  	NODE_RNIC,
> >  
> >  	IB_NODE_MAX = NODE_RNIC
> > -};
> > +} mad_node_type_t;
> >  
> >  /******************************************************************************/
> >  
> > @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey)
> >  }
> >  
> >  /* fields.c */
> > -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field);
> > -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field,
> > +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field);
> > +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field,
> >  			      uint32_t val);
> >  /* field must be byte aligned */
> > -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field);
> > -MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field,
> > +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field);
> > +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field,
> >  				uint64_t val);
> > -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val);
> > -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val);
> > -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val);
> > -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val);
> > -MAD_EXPORT int mad_print_field(int field, const char *name, void *val);
> > -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val);
> > -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val);
> > +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val);
> > +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val);
> > +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val);
> > +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val);
> > +MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val);
> > +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val);
> > +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val);
> >  
> >  /* mad.c */
> >  MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath,
> > @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
> >  MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
> >  			       ib_portid_t * sm_id, int timeout);
> >  MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> > -				     int dest_type, ib_portid_t * sm_id);
> > +				     mad_dest_t dest, ib_portid_t * sm_id);
> >  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
> >  			       ibmad_gid_t * gid);
> >  
> > @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> >  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> >  			ib_portid_t * sm_id, int timeout, const void *srcport);
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> > -			      int dest_type, ib_portid_t * sm_id,
> > +			      mad_dest_t dest, ib_portid_t * sm_id,
> >  			      const void *srcport);
> >  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> >  			const void *srcport);
> > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
> > index d5a1eb4..d435a2f 100644
> > --- a/libibmad/src/fields.c
> > +++ b/libibmad/src/fields.c
> > @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f,
> >  	memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8);
> >  }
> >  
> > -uint32_t mad_get_field(void *buf, int base_offs, int field)
> > +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field)
> >  {
> >  	return _get_field(buf, base_offs, ib_mad_f + field);
> >  }
> >  
> > -void mad_set_field(void *buf, int base_offs, int field, uint32_t val)
> > +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val)
> >  {
> >  	_set_field(buf, base_offs, ib_mad_f + field, val);
> >  }
> >  
> > -uint64_t mad_get_field64(void *buf, int base_offs, int field)
> > +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field)
> >  {
> >  	return _get_field64(buf, base_offs, ib_mad_f + field);
> >  }
> >  
> > -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val)
> > +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val)
> >  {
> >  	_set_field64(buf, base_offs, ib_mad_f + field, val);
> >  }
> >  
> > -void mad_set_array(void *buf, int base_offs, int field, void *val)
> > +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val)
> >  {
> >  	_set_array(buf, base_offs, ib_mad_f + field, val);
> >  }
> >  
> > -void mad_get_array(void *buf, int base_offs, int field, void *val)
> > +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val)
> >  {
> >  	_get_array(buf, base_offs, ib_mad_f + field, val);
> >  }
> >  
> > -void mad_decode_field(uint8_t * buf, int field, void *val)
> > +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val)
> >  {
> >  	const ib_field_t *f = ib_mad_f + field;
> >  
> > @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val)
> >  	_get_array(buf, 0, f, val);
> >  }
> >  
> > -void mad_encode_field(uint8_t * buf, int field, void *val)
> > +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val)
> >  {
> >  	const ib_field_t *f = ib_mad_f + field;
> >  
> > @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val,
> >  			 valsz ? valsz : ALIGN(f->bitlen, 8) / 8);
> >  }
> >  
> > -int mad_print_field(int field, const char *name, void *val)
> > +int mad_print_field(mad_field_t field, const char *name, void *val)
> >  {
> >  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
> >  		return -1;
> >  	return _mad_print_field(ib_mad_f + field, name, val, 0);
> >  }
> >  
> > -char *mad_dump_field(int field, char *buf, int bufsz, void *val)
> > +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val)
> >  {
> >  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
> >  		return 0;
> >  	return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val);
> >  }
> >  
> > -char *mad_dump_val(int field, char *buf, int bufsz, void *val)
> > +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val)
> >  {
> >  	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
> >  		return 0;
> > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> > index b62360b..faac1f9 100644
> > --- a/libibmad/src/resolve.c
> > +++ b/libibmad/src/resolve.c
> > @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> >  }
> >  
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> > -			      int dest_type, ib_portid_t * sm_id,
> > +			      mad_dest_t dest, ib_portid_t * sm_id,
> >  			      const void *srcport)
> >  {
> >  	uint64_t guid;
> > @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >  	ib_portid_t selfportid = { 0 };
> >  	int selfport = 0;
> >  
> > -	switch (dest_type) {
> > +	switch (dest) {
> >  	case IB_DEST_LID:
> >  		lid = strtol(addr_str, 0, 0);
> >  		if (!IB_LID_VALID(lid))
> > @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >  		return 0;
> >  
> >  	default:
> > -		IBWARN("bad dest_type %d", dest_type);
> > +		IBWARN("bad dest %d", dest);
> >  	}
> >  
> >  	return -1;
> >  }
> >  
> > -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
> > +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest,
> >  			  ib_portid_t * sm_id)
> >  {
> > -	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
> > +	return ib_resolve_portid_str_via(portid, addr_str, dest,
> >  					 sm_id, NULL);
> >  }
> >  
> > -- 
> > 1.5.4.5
> > 


From weiny2 at llnl.gov  Wed Feb  4 10:30:54 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 4 Feb 2009 10:30:54 -0800
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as
	typedefs for cleaner function interfaces
In-Reply-To: <20090204182725.GX11874@sashak.voltaire.com>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204182023.GP7618@obsidianresearch.com>
	<20090204182725.GX11874@sashak.voltaire.com>
Message-ID: <20090204103054.177aa6e2.weiny2@llnl.gov>

On Wed, 4 Feb 2009 20:27:25 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 11:20 Wed 04 Feb     , Jason Gunthorpe wrote:
> > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote:
> > 
> > > I don't understand how typedefing enums makes things cleaner - actually
> > > this will force me to explicitly check the actual type in header
> > > files. Sometimes typedefs can help with porting, but that is not the
> > > case here.
> > 
> > Not typedefing per se, but passing an enum through an int is not that
> > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS'
> > for instance will do.
> 
> Yes, that would be fine to do.

I will redo the patch with 'enum MAD_FIELDS'.

Ira

> 
> Sasha


From hal.rosenstock at gmail.com  Wed Feb  4 10:41:03 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 13:41:03 -0500
Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <20090204182520.GW11874@sashak.voltaire.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>

Sasha,

On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>>
>> Trivial description change to osm_node_get_num_physp.
>
> It is troublesome for me to comment on attachments... :(
>
> In this comment line:
>
> +*      Returns the number of physical ports (+1) for this node.
>
> "(+1)" will not be true for switch nodes.

Are you sure about that ? It's not what I see regardless of whether
base or enhanced SP0.

-- Hal

> Sasha
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From sashak at voltaire.com  Wed Feb  4 11:00:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:00:23 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of
	memory allocation in __collect_guids
In-Reply-To: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204190023.GY11874@sashak.voltaire.com>

On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
> 
> Patch to increase size of monitored node in
> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
> port number.

There are a couple of validations like (port > p_mon_node->redir_tbl_size)
in osm_perfmgr.c. Would they still be correct after the proposed change?
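
To make the question concrete, here is a small sketch of the two possible
checks. It assumes the size field counts table entries; whether
redir_tbl_size in osm_perfmgr.c counts entries or holds the highest port
number is not stated in this thread, and all demo_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * If tbl_size counts entries, valid indexes are 0 .. tbl_size - 1,
 * so an index is usable only when it is strictly below tbl_size.
 */
static bool demo_port_in_table(size_t port, size_t tbl_size)
{
	return port < tbl_size;
}

/*
 * Same shape as the quoted validation: it rejects only port > tbl_size,
 * which lets port == tbl_size through.
 */
static bool demo_port_rejected_gt(size_t port, size_t tbl_size)
{
	return port > tbl_size;
}
```

With a five-entry table, port 5 is outside the table yet passes the ">"
check, which is the kind of off-by-one to look for once the table is
indexed by the actual port number.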

Sasha


From sashak at voltaire.com  Wed Feb  4 11:02:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:02:56 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
	<f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
Message-ID: <20090204190256.GZ11874@sashak.voltaire.com>

On 13:41 Wed 04 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > Hi Hal,
> >
> > On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
> >>
> >> Trivial description change to osm_node_get_num_physp.
> >
> > It is troublesome for me to comment on attachments... :(
> >
> > In this comment line:
> >
> > +*      Returns the number of physical ports (+1) for this node.
> >
> > "(+1)" will not be true for switch nodes.
> 
> Are you sure about that ? It's not what I see regardless of whether
> base or enhanced SP0.

For switch it will be an actual number of allocated physical ports
(struct osm_physp) - port 0 plus number of external ports. For non
switch nodes entry '0' is not used.
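
A rough sketch of that distinction, as I read it. The demo_* names are
hypothetical and this is not the opensm data structure, only a model of a
port table indexed by port number:

```c
#include <assert.h>

enum demo_node_type { DEMO_NODE_CA = 1, DEMO_NODE_SWITCH };

/*
 * The table is indexed by port number, so it needs one slot per
 * port number 0 .. num_ext_ports for every node type.
 */
static unsigned demo_table_size(unsigned num_ext_ports)
{
	return num_ext_ports + 1;
}

/*
 * Slots that hold a real port: a switch uses slot 0 (port 0) as well,
 * a non-switch node leaves slot 0 unused.
 */
static unsigned demo_used_entries(enum demo_node_type type,
				  unsigned num_ext_ports)
{
	return type == DEMO_NODE_SWITCH ? num_ext_ports + 1 : num_ext_ports;
}
```

In this model the "(+1)" is an extra, unused slot for a CA but a real port
for a switch.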

Sasha


From yosefe at Voltaire.COM  Wed Feb  4 11:04:54 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 04 Feb 2009 21:04:54 +0200
Subject: [ofa-general] RE: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
Message-ID: <4989E6D6.5030109@Voltaire.COM>

How about this patch?
If there is no QKey, QP creation (and anything else that needs the QKey) fails.
However, rdma_resolve_addr() succeeds.

---

  When rdma_resolve_addr() is called while the relevant port is down, the
function fails and the rdma_cm id is not bound to the device. Therefore, the
application does not have a device handle and cannot wait for the port to
become active. The function fails because ipoib has not joined the multicast
group, and therefore the SA does not have a multicast record to take a qkey
from.
  The proposed patch makes qkey resolution lazy: cma_set_qkey will set
id_priv->qkey if it was not set, and will be called just before the qkey is
really required.
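
The lazy-resolution idea can be sketched in user space as follows. This is
an illustration only: struct demo_id, demo_lookup_qkey and the constants are
hypothetical stand-ins, not the kernel structures or SA calls used in the
patch below.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_UDP_QKEY	0x01234567u	/* hypothetical fixed qkey */
#define DEMO_MCAST_QKEY	0x00000b1bu	/* hypothetical looked-up qkey */

enum demo_ps { DEMO_PS_UDP, DEMO_PS_IPOIB };

struct demo_id {
	enum demo_ps ps;
	uint32_t qkey;		/* 0 means "not resolved yet" */
	int port_active;	/* stands in for "a multicast record exists" */
	int lookups;		/* counts how often the lookup ran */
};

/* Stand-in for the multicast record query; fails while the port is down. */
static int demo_lookup_qkey(struct demo_id *id, uint32_t *qkey)
{
	id->lookups++;
	if (!id->port_active)
		return -1;
	*qkey = DEMO_MCAST_QKEY;
	return 0;
}

/* Resolve the qkey on first use; later calls return the cached value. */
static int demo_set_qkey(struct demo_id *id)
{
	uint32_t qkey;
	int ret = 0;

	if (id->qkey)
		return 0;

	switch (id->ps) {
	case DEMO_PS_UDP:
		id->qkey = DEMO_UDP_QKEY;
		break;
	case DEMO_PS_IPOIB:
		ret = demo_lookup_qkey(id, &qkey);
		if (!ret)
			id->qkey = qkey;
		break;
	}
	return ret;
}
```

A failed lookup caches nothing, so a caller can retry once the port comes
up, while binding to the device no longer depends on the qkey at all.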

Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

---
 drivers/infiniband/core/cma.c |   41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2009-02-04 20:40:20.000000000 +0200
+++ b/drivers/infiniband/core/cma.c	2009-02-04 20:57:59.000000000 +0200
@@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r
 	id_priv->cma_dev = NULL;
 }
 
-static int cma_set_qkey(struct ib_device *device, u8 port_num,
-			enum rdma_port_space ps,
-			struct rdma_dev_addr *dev_addr, u32 *qkey)
+static int cma_set_qkey(struct rdma_id_private *id_priv)
 {
 	struct ib_sa_mcmember_rec rec;
 	int ret = 0;
 
-	switch (ps) {
+	if (id_priv->qkey)
+		return 0;
+
+	switch (id_priv->id.ps) {
 	case RDMA_PS_UDP:
-		*qkey = RDMA_UDP_QKEY;
+		id_priv->qkey = RDMA_UDP_QKEY;
 		break;
 	case RDMA_PS_IPOIB:
-		ib_addr_get_mgid(dev_addr, &rec.mgid);
-		ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec);
-		*qkey = be32_to_cpu(rec.qkey);
+		ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid);
+		ret = ib_sa_get_mcmember_rec(id_priv->id.device,
+		                             id_priv->id.port_num, &rec.mgid,
+		                             &rec);
+		if (!ret)
+			id_priv->qkey = be32_to_cpu(rec.qkey);
 		break;
 	default:
 		break;
@@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i
 		ret = ib_find_cached_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
-			ret = cma_set_qkey(cma_dev->device,
-					   id_priv->id.port_num,
-					   id_priv->id.ps, dev_addr,
-					   &id_priv->qkey);
-			if (!ret)
-				cma_attach_to_dev(id_priv, cma_dev);
+			cma_attach_to_dev(id_priv, cma_dev);
 			break;
 		}
 	}
@@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd
 	*qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT;
 
 	if (cma_is_ud_ps(id_priv->id.ps)) {
+		ret = cma_set_qkey(id_priv);
+		if (ret)
+			return ret;
+
 		qp_attr->qkey = id_priv->qkey;
 		*qp_attr_mask |= IB_QP_QKEY;
 	} else {
@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i
 			event.status = ib_event->param.sidr_rep_rcvd.status;
 			break;
 		}
+		ret = cma_set_qkey(id_priv);
+		if (ret) {
+			event.event = RDMA_CM_EVENT_ADDR_ERROR;
+			event.status = -EINVAL;
+			break;
+		}
 		if (id_priv->qkey != rep->qkey) {
 			event.event = RDMA_CM_EVENT_UNREACHABLE;
 			event.status = -EINVAL;
@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma
 			     const void *private_data, int private_data_len)
 {
 	struct ib_cm_sidr_rep_param rep;
+	int ret;
 
 	memset(&rep, 0, sizeof rep);
 	rep.status = status;
 	if (status == IB_SIDR_SUCCESS) {
+		ret = cma_set_qkey(id_priv);
+		if (ret)
+			return ret;
 		rep.qp_num = id_priv->qp_num;
 		rep.qkey = id_priv->qkey;
 	}
-- 
--Yossi


From sashak at voltaire.com  Wed Feb  4 11:15:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:15:32 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port,
	allow queries on enhanced SP0
In-Reply-To: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204191523.GA11874@sashak.voltaire.com>

On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
> 
> Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced
> SP0.

This:

+       osm_node = osm_get_node_by_guid(pm->subn, cl_hton64(node->node_guid));
+       if (!osm_node)
+               return (PERFMGR_EVENT_DB_GUIDNOTFOUND);
+       if ((!(osm_node_get_type(osm_node) == IB_NODE_TYPE_SWITCH) ||
+           !osm_node->sw ||
+           !ib_switch_info_is_enhanced_port0(&osm_node->sw->switch_info)) &&
+          (port == 0))
+               return (PERFMGR_EVENT_DB_PORTNOTFOUND);

osm_get_node_by_guid() is an expensive operation. If you only need to
determine the port 0 type - store it as part of the struct monitored_node
structure. Another (even more universal) approach would be to keep there
a reference to the related osm_node object.
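
Something along these lines, as a sketch only. struct demo_mon_node and the
helpers are hypothetical, and the "odd GUID means enhanced SP0" rule is an
arbitrary stand-in for the real lookup:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_mon_node {
	uint64_t guid;
	bool esp0;	/* cached once, when the node is first monitored */
};

static int demo_slow_lookups;

/* Stand-in for the expensive by-GUID lookup; the rule is arbitrary. */
static bool demo_slow_is_esp0(uint64_t guid)
{
	demo_slow_lookups++;
	return (guid & 1) != 0;
}

/* Pay for the lookup once, at creation time. */
static void demo_mon_node_init(struct demo_mon_node *n, uint64_t guid)
{
	n->guid = guid;
	n->esp0 = demo_slow_is_esp0(guid);
}

/* Per-query check uses only the cached flag. */
static bool demo_bad_port(const struct demo_mon_node *n, unsigned port)
{
	return port == 0 && !n->esp0;
}
```

Every later port 0 check then costs a flag test instead of a lookup by GUID.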

Sasha


From sashak at voltaire.com  Wed Feb  4 11:29:14 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:29:14 +0200
Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs
	for cleaner function interfaces
In-Reply-To: <20090204103005.4ef9256a.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204103005.4ef9256a.weiny2@llnl.gov>
Message-ID: <20090204192914.GB11874@sashak.voltaire.com>

On 10:30 Wed 04 Feb     , Ira Weiny wrote:
> > 
> > I don't understand how enum typedefing makes things cleaner - actually
> > this will enforce me explicitly to verify an actual type in header
> > files. Sometimes typedefs could help with porting, but it is not the
> > case here.
> 
> Yes, this will force you to use the correct type.

It is not the "typedef" that will do it, but proper prototypes.

> Again, I am trying to write a library which makes it easier for someone who
> might not be familiar with IB to extract diagnostic data. I understand you
> wanting the decoding of the data to be more flexible and abstract but we should
> make the interface for decoding that data easier to use.  I feel the following
> patch does this.
> 
> Would you prefer to use:
> 
> uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field);
> 
> ?

Yes, this would be correct and clear.

Sasha


From sashak at voltaire.com  Wed Feb  4 11:38:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:38:34 +0200
Subject: [ofa-general] Re: [PATCH] libibmad/src/dump.c fix dump functions for
	big endian machines
In-Reply-To: <49894B05.1090608@gmail.com>
References: <49894B05.1090608@gmail.com>
Message-ID: <20090204193834.GC11874@sashak.voltaire.com>

On 10:00 Wed 04 Feb     , Eli Dorfman (Voltaire) wrote:
> fix dump functions for big endian machines
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb  4 11:42:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:42:51 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/include/iba/ib_types.h:
	Add xmit_wait for PortCounters
In-Reply-To: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca>
References: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204194251.GD11874@sashak.voltaire.com>

On 09:14 Wed 04 Feb     , Hal Rosenstock wrote:
> 
> Trivial patch to ib_types.h to add xmit_wait field to PortCounters. Also,
> updated a reference from IBA 1.2 to 1.2.1.

Applied, Thanks.

Sasha


From sashak at voltaire.com  Wed Feb  4 11:56:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 21:56:28 +0200
Subject: [ofa-general] Re: [PATCH 1/3] opensm/PerfMgr: Mainly cosmetic
	changes
In-Reply-To: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca>
References: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204195628.GE11874@sashak.voltaire.com>

On 09:15 Wed 04 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Cosmetic changes to PerfMgr:
> Eliminated unneeded extra parentheses
> Made some formatting consistent
> Simplified some internal names
> Also, removed inline from __init_monitored_nodes declaration

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Wed Feb  4 11:54:41 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 14:54:41 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In
	bad_node_port, allow queries on enhanced SP0
In-Reply-To: <20090204191523.GA11874@sashak.voltaire.com>
References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
	<20090204191523.GA11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041154j6570089ej145f9dbc3f2860df@mail.gmail.com>

On Wed, Feb 4, 2009 at 2:15 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>>
>> Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced
>> SP0.
>
> This:
>
> +       osm_node = osm_get_node_by_guid(pm->subn, cl_hton64(node->node_guid));
> +       if (!osm_node)
> +               return (PERFMGR_EVENT_DB_GUIDNOTFOUND);
> +       if ((!(osm_node_get_type(osm_node) == IB_NODE_TYPE_SWITCH) ||
> +           !osm_node->sw ||
> +           !ib_switch_info_is_enhanced_port0(&osm_node->sw->switch_info)) &&
> +          (port == 0))
> +               return (PERFMGR_EVENT_DB_PORTNOTFOUND);
>
> osm_get_node_by_guid() is an expensive operation. If you only need to
> determine the port 0 type - store it as part of the struct monitored_node
> structure. Another (even more universal) approach would be to keep there
> a reference to the related osm_node object.

This was done later in the patch series.

-- Hal

> Sasha


From ralph.campbell at qlogic.com  Wed Feb  4 11:58:05 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 04 Feb 2009 11:58:05 -0800
Subject: [ofa-general] Possible memory leak and null pointer
	dereference in local_completions()
In-Reply-To: <f0e08f230902040429p5c01abd0y349abb413e120277@mail.gmail.com>
References: <1233689172.23327.155.camel@chromite.mv.qlogic.com>
	<f0e08f230902040429p5c01abd0y349abb413e120277@mail.gmail.com>
Message-ID: <1233777486.23327.172.camel@chromite.mv.qlogic.com>

On Wed, 2009-02-04 at 04:29 -0800, Hal Rosenstock wrote:
> On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell
> <ralph.campbell at qlogic.com> wrote:
> > I was doing some tests with different MAD packets and
> > then reading the infiniband/core/mad.c code.
> >
> > handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
> > on the mad_agent_priv->local_work work queue with
> > local->mad_priv == NULL if device->process_mad() returns
> > IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
> > (!ib_response_mad(&mad_priv->mad.mad) ||
> >  !mad_agent_priv->agent.recv_handler).
> >
> > In this case, local_completions() will be called with
> > local->mad_priv == NULL. The code does check for this
> > case and skips calling recv_mad_agent->agent.recv_handler().
> > This means recv == 0 so kmem_cache_free() is called with a
> > NULL pointer.
> 
> That could be fixed by changing the check for !recv prior to the
> kmem_cache_free there to a check for (!recv && local->mad_priv).

This is what we did to continue making progress so I know
it works.

> > Even if local->mad_priv != NULL, I don't see how local->mad_priv
> > is freed when recv == 1. Thus, it appears to be a memory leak.
> 
> For those cases, it's either freed in local_completions (as recv is
> set to 1 for local->mad_priv != NULL except when there is no mad recv
> agent but that is another bug (see below)) or earlier in the else
> clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of
> handle_outgoing_dr_smp(). That's another issue that this points out
> where recv = 1 needs to be moved up in local_completions.

The other problem I noticed with setting recv = 1 is that recv = 0
is outside the while (!list_empty) loop, so it is never reset back
to zero.

I'm not really following why recv = 1 needs to be moved up in
local_completions.

What I was really looking for was a confirmation that the original
code had a memory leak. I don't see any reason to special case the
call to kmem_cache_free(). It seems to me that it is needed any time
local->mad_priv != NULL.
The NULL pointer bug is easily fixed in a number of different ways.
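
Here is a simplified user-space model of the flag handling, to show why a
flag that is never reset matters. It is not the kernel code: struct
demo_local and demo_count_freed are hypothetical, and the model assumes a
buffer handed to a receive agent becomes the receiver's responsibility.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_local {
	bool has_buf;	/* stands in for local->mad_priv != NULL */
	bool has_agent;	/* stands in for a receive agent being present */
};

/*
 * Count how many undelivered buffers the loop frees.  With sticky set,
 * recv is initialised once outside the loop (as in the code under
 * discussion); otherwise it is reset for every work item.
 */
static size_t demo_count_freed(const struct demo_local *w, size_t n,
			       bool sticky)
{
	size_t freed = 0;
	bool recv = false;

	for (size_t i = 0; i < n; i++) {
		if (!sticky)
			recv = false;
		if (w[i].has_buf && w[i].has_agent)
			recv = true;	/* buffer handed to the receiver */
		if (!recv && w[i].has_buf)
			freed++;	/* undelivered buffer is released */
	}
	return freed;
}
```

For a work list of one delivered item followed by one undelivered item with
a buffer, the sticky flag frees nothing and the second buffer is leaked,
while the per-item flag frees exactly that one buffer.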

> Would you try the untested patch below and see if it fixes the problem
> you found ? Thanks.

We are in the middle of moving our office so I won't be able to
reproduce this until next week.

> -- Hal
> 
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 5c54fc2..cca87e6 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work)
>                 list_del(&local->completion_list);
>                 spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>                 if (local->mad_priv) {
> +                       recv = 1;
>                         recv_mad_agent = local->recv_mad_agent;
>                         if (!recv_mad_agent) {
>                                 printk(KERN_ERR PFX "No receive MAD agent for lo
>                                 goto local_send_completion;
>                         }
> 
> -                       recv = 1;
>                         /*
>                          * Defined behavior is to complete response
>                          * before request
> @@ -2422,7 +2422,7 @@ local_send_completion:
> 
>                 spin_lock_irqsave(&mad_agent_priv->lock, flags);
>                 atomic_dec(&mad_agent_priv->refcount);
> -               if (!recv)
> +               if (!recv && local->mad_priv)
>                         kmem_cache_free(ib_mad_cache, local->mad_priv);
>                 kfree(local);
>         }
> 
> > So, I'm proposing the following patch:
> >
> > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> > index 5c54fc2..93d80e5 100644
> > --- a/drivers/infiniband/core/mad.c
> > +++ b/drivers/infiniband/core/mad.c
> > @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
> >        struct ib_mad_local_private *local;
> >        struct ib_mad_agent_private *recv_mad_agent;
> >        unsigned long flags;
> > -       int recv = 0;
> >        struct ib_wc wc;
> >        struct ib_mad_send_wc mad_send_wc;
> >
> > @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
> >                                goto local_send_completion;
> >                        }
> >
> > -                       recv = 1;
> >                        /*
> >                         * Defined behavior is to complete response
> >                         * before request
> > @@ -2422,7 +2420,7 @@ local_send_completion:
> >
> >                spin_lock_irqsave(&mad_agent_priv->lock, flags);
> >                atomic_dec(&mad_agent_priv->refcount);
> > -               if (!recv)
> > +               if (local->mad_priv)
> >                        kmem_cache_free(ib_mad_cache, local->mad_priv);
> >                kfree(local);
> >        }
> >
> >


From hal.rosenstock at gmail.com  Wed Feb  4 12:03:33 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 15:03:33 -0500
Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <20090204190256.GZ11874@sashak.voltaire.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
	<f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
	<20090204190256.GZ11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041203o27eeac6fm7fc64d4ea9462859@mail.gmail.com>

On Wed, Feb 4, 2009 at 2:02 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 13:41 Wed 04 Feb     , Hal Rosenstock wrote:
>> Sasha,
>>
>> On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> > Hi Hal,
>> >
>> > On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>> >>
>> >> Trivial description change to osm_node_get_num_physp.
>> >
>> > It makes some troubles for me to comment over attachments... :(
>> >
>> > In this comment line:
>> >
>> > +*      Returns the number of physical ports (+1) for this node.
>> >
>> > "(+1)" will not be true for switch nodes.
>>
>> Are you sure about that ? It's not what I see regardless of whether
>> base or enhanced SP0.
>
> For switch it will be an actual number of allocated physical ports
> (struct osm_physp) - port 0 plus number of external ports. For non
> switch nodes entry '0' is not used.

Right. In my terms, physical is another name for an external port and
port 0 is not a physical (external) port so I think we're quibbling
about words. What do you think it should say ?

-- Hal

> Sasha
>


From swise at opengridcomputing.com  Wed Feb  4 12:20:45 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 04 Feb 2009 14:20:45 -0600
Subject: [ofa-general] Re: dapl attribute bug
In-Reply-To: <E3280858FA94444CA49D2BA02341C983382DC33A@orsmsx506.amr.corp.intel.com>
References: <49871E6A.9000901@opengridcomputing.com>
	<E3280858FA94444CA49D2BA02341C983382DC33A@orsmsx506.amr.corp.intel.com>
Message-ID: <4989F89D.8020905@opengridcomputing.com>

Davis, Arlin R wrote:
>  
>
>   
>> The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code 
>> maps this to the linux ib_device_attr->max_mr_size which is u64.
>>
>> This causes dapltest to fail in some cases when running over chelsio 
>> which sets max_mr_size to 0x100000000 (4GB).  The dapl code truncates 
>> the value to 0. See dapl/openib_cma/dapl_ib_util.c.
>>
>> I'm not sure what the fix should be, but maybe the dapl code should set
>> anything over 32 bits to 0xffffffff?
>>
>>     
>
> This attribute changed with DAT 2.0 to match the 32-bit ibv_sge
> length field. Since there are no direct max lmr segments mappings
> I will need to add some checks when setting max_lmr_block_size from
> max_mr_size. Thanks.
>
> -arlin

I'll test your fix when it's ready.  Lemme know.


Steve.
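
A minimal sketch of the truncation described above and of the clamp
suggested as a possible fix. The helper names are hypothetical and are
not part of dapl; the eventual change in dapl_ib_util.c may look different.

```c
#include <assert.h>
#include <stdint.h>

/* What a plain u64 -> u32 assignment does: only the low 32 bits survive. */
static uint32_t truncate_mr_size(uint64_t max_mr_size)
{
	return (uint32_t)max_mr_size;
}

/* Hypothetical helper (not in dapl): saturate instead of wrapping. */
static uint32_t clamp_mr_size(uint64_t max_mr_size)
{
	return max_mr_size > UINT32_MAX ? UINT32_MAX : (uint32_t)max_mr_size;
}

/* Self-check using the 4GB value from the report. */
static void check_mr_size(void)
{
	assert(truncate_mr_size(0x100000000ULL) == 0);
	assert(clamp_mr_size(0x100000000ULL) == 0xffffffffu);
}
```

Values that already fit in 32 bits pass through the clamp unchanged, so
only devices reporting 4GB or more would see a different result.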


From swise at opengridcomputing.com  Wed Feb  4 12:26:12 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 04 Feb 2009 14:26:12 -0600
Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset
	calculation is 64b.
Message-ID: <20090204202612.27031.78831.stgit@dell3.ogc.int>

From: Steve Wise <swise at opengridcomputing.com>

The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 19661b2..2cf6f13 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -195,15 +195,12 @@ static int build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr,
 	return 0;
 }
 
-/*
- * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
- */
 static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
 			    u32 num_sgle, u32 * pbl_addr, u8 * page_size)
 {
 	int i;
 	struct iwch_mr *mhp;
-	u32 offset;
+	u64 offset;
 	for (i = 0; i < num_sgle; i++) {
 
 		mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8);
@@ -235,7 +232,7 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
 			return -EINVAL;
 		}
 		offset = sg_list[i].addr - mhp->attr.va_fbo;
-		offset += ((u32) mhp->attr.va_fbo) %
+		offset += ((u64) mhp->attr.va_fbo) %
 		          (1UL << (12 + mhp->attr.page_size));
 		pbl_addr[i] = ((mhp->attr.pbl_addr -
 			        rhp->rdev.rnic_info.pbl_base) >> 3) +


From swise at opengridcomputing.com  Wed Feb  4 12:26:14 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 04 Feb 2009 14:26:14 -0600
Subject: [ofa-general] [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination
	fixes.
In-Reply-To: <20090204202612.27031.78831.stgit@dell3.ogc.int>
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
Message-ID: <20090204202614.27031.22248.stgit@dell3.ogc.int>

From: Steve Wise <swise at opengridcomputing.com>

The poll and flush code needs to handle all send opcodes:
SEND, SEND_WITH_SE, SEND_WITH_INV, and SEND_WITH_SE_INV.

Ignore TERM indications if the connection is already gone.

Ignore hw recv completions if the RQ is empty.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/cxio_hal.c |   13 +++++++++++--
 drivers/infiniband/hw/cxgb3/cxio_wr.h  |    6 ++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.c  |    3 +++
 drivers/infiniband/hw/cxgb3/iwch_ev.c  |    5 -----
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index 4dcf08b..c2740e7 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -450,7 +450,7 @@ static int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq)
 	if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe))
 		return 0;
 
-	if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) &&
+	if (CQE_SEND_OPCODE(*cqe) && RQ_TYPE(*cqe) &&
 	    Q_EMPTY(wq->rq_rptr, wq->rq_wptr))
 		return 0;
 
@@ -1204,11 +1204,12 @@ int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
 		}
 
 		/* incoming SEND with no receive posted failures */
-		if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) &&
+		if (CQE_SEND_OPCODE(*hw_cqe) && RQ_TYPE(*hw_cqe) &&
 		    Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
 			ret = -1;
 			goto skip_cqe;
 		}
+		BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe));
 		goto proc_cqe;
 	}
 
@@ -1223,6 +1224,13 @@ int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
 		 * then we complete this with TPT_ERR_MSN and mark the wq in
 		 * error.
 		 */
+
+		if (Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
+			wq->error = 1;
+			ret = -1;
+			goto skip_cqe;
+		}
+
 		if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) {
 			wq->error = 1;
 			hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN));
@@ -1277,6 +1285,7 @@ proc_cqe:
 			cxio_hal_pblpool_free(wq->rdev,
 				wq->rq[Q_PTR2IDX(wq->rq_rptr,
 				wq->rq_size_log2)].pbl_addr, T3_STAG0_PBL_SIZE);
+		BUG_ON(Q_EMPTY(wq->rq_rptr, wq->rq_wptr));
 		wq->rq_rptr++;
 	}
 
diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h
index 04618f7..ff9be1a 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_wr.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h
@@ -604,6 +604,12 @@ struct t3_cqe {
 #define CQE_STATUS(x)     (G_CQE_STATUS(be32_to_cpu((x).header)))
 #define CQE_OPCODE(x)     (G_CQE_OPCODE(be32_to_cpu((x).header)))
 
+#define CQE_SEND_OPCODE(x)( \
+	(G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND) || \
+	(G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_SE) || \
+	(G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_INV) || \
+	(G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_SE_INV))
+
 #define CQE_LEN(x)        (be32_to_cpu((x).len))
 
 /* used for RQ completion processing */
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 44e936e..8699947 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1678,6 +1678,9 @@ static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 {
 	struct iwch_ep *ep = ctx;
 
+	if (state_read(&ep->com) != FPDU_MODE)
+		return CPL_RET_BUF_DONE;
+
 	PDBG("%s ep %p\n", __func__, ep);
 	skb_pull(skb, sizeof(struct cpl_rdma_terminate));
 	PDBG("%s saving %d bytes of term msg\n", __func__, skb->len);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c
index 7b67a67..743c5d8 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_ev.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -179,11 +179,6 @@ void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
 	case TPT_ERR_BOUND:
 	case TPT_ERR_INVALIDATE_SHARED_MR:
 	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
-		printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x "
-		       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __func__,
-		       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe),
-		       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
-		       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
 		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
 		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1);
 		break;
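
The "any send opcode" test that the patch introduces can also be pictured
as a small predicate over an enum. This is only a sketch: the SK_* names
and their values are placeholders, not the real T3_* definitions from
cxio_wr.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder opcodes for illustration; values are arbitrary. */
enum sketch_opcode {
	SK_RDMA_WRITE,
	SK_READ_RESP,
	SK_SEND,
	SK_SEND_WITH_SE,
	SK_SEND_WITH_INV,
	SK_SEND_WITH_SE_INV,
	SK_TERMINATE
};

/* True for any of the four send variants; the opcode is evaluated once. */
static bool is_send_opcode(enum sketch_opcode op)
{
	switch (op) {
	case SK_SEND:
	case SK_SEND_WITH_SE:
	case SK_SEND_WITH_INV:
	case SK_SEND_WITH_SE_INV:
		return true;
	default:
		return false;
	}
}

/* Self-check: plain SEND and SEND_WITH_INV match, READ_RESP does not. */
static void check_send_opcodes(void)
{
	assert(is_send_opcode(SK_SEND));
	assert(is_send_opcode(SK_SEND_WITH_INV));
	assert(!is_send_opcode(SK_READ_RESP));
}
```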


From sashak at voltaire.com  Wed Feb  4 12:47:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 22:47:31 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <f0e08f230902041203o27eeac6fm7fc64d4ea9462859@mail.gmail.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
	<f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
	<20090204190256.GZ11874@sashak.voltaire.com>
	<f0e08f230902041203o27eeac6fm7fc64d4ea9462859@mail.gmail.com>
Message-ID: <20090204204731.GG11874@sashak.voltaire.com>

On 15:03 Wed 04 Feb     , Hal Rosenstock wrote:
> >
> > For switch it will be an actual number of allocated physical ports
> > (struct osm_physp) - port 0 plus number of external ports. For non
> > switch nodes entry '0' is not used.
> 
> Right. In my terms, physical is another name for an external port and
> port 0 is not a physical (external) port so I think we're quibbling
> about words. What do you think it should say ?

I don't really have a good opinion :(. Maybe something like:

 *	Returns the number of osm_physp ports allocated for this node
 *	(for switches it is number of external physical ports plus port
 *	0 and number of physical ports + 1 for non-switch nodes).

It is long... :(

Sasha
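
A rough sketch of the counting being discussed, assuming the table always
has one entry more than the number of external ports and that only the
meaning of entry 0 differs by node type. The struct and function names
are stand-ins, not the real osm_node_t API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the relevant parts of a node; not the real osm_node_t. */
struct node_sketch {
	bool is_switch;
	uint8_t num_ports;	/* external physical ports */
};

/* Number of allocated physp entries: indexes 0..num_ports. */
static unsigned int physp_count(const struct node_sketch *n)
{
	assert(n != NULL);
	return (unsigned int)n->num_ports + 1;
}

/* Entry 0 is a real port (port 0) only on switches; elsewhere it is unused. */
static bool entry0_is_port(const struct node_sketch *n)
{
	assert(n != NULL);
	return n->is_switch;
}
```

Under these assumptions a 24-port switch and a 24-port CA both report 25
entries, which is why the wording of the description is awkward.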


From hal.rosenstock at gmail.com  Wed Feb  4 12:50:44 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 15:50:44 -0500
Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <20090204204731.GG11874@sashak.voltaire.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
	<f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
	<20090204190256.GZ11874@sashak.voltaire.com>
	<f0e08f230902041203o27eeac6fm7fc64d4ea9462859@mail.gmail.com>
	<20090204204731.GG11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041250m79badc67rb61b5ed1c040a35e@mail.gmail.com>

On Wed, Feb 4, 2009 at 3:47 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 15:03 Wed 04 Feb     , Hal Rosenstock wrote:
>> >
>> > For switch it will be an actual number of allocated physical ports
>> > (struct osm_physp) - port 0 plus number of external ports. For non
>> > switch nodes entry '0' is not used.
>>
>> Right. In my terms, physical is another name for an external port and
>> port 0 is not a physical (external) port so I think we're quibbling
>> about words. What do you think it should say ?
>
> I don't really have a good opinion :(. Maybe something like:
>
>  *      Returns the number of osm_physp ports allocated for this node
>  *      (for switches it is number of external physical ports plus port
>  *      0 and number of physical ports + 1 for non-switch nodes).

Fine with me. Let me know if you want a patch for this.

-- Hal

> It is long... :(
>
> Sasha
>


From sashak at voltaire.com  Wed Feb  4 12:55:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 22:55:20 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In
	bad_node_port, allow queries on enhanced SP0
In-Reply-To: <f0e08f230902041154j6570089ej145f9dbc3f2860df@mail.gmail.com>
References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
	<20090204191523.GA11874@sashak.voltaire.com>
	<f0e08f230902041154j6570089ej145f9dbc3f2860df@mail.gmail.com>
Message-ID: <20090204205520.GH11874@sashak.voltaire.com>

On 14:54 Wed 04 Feb     , Hal Rosenstock wrote:
> >
> > (osm_get_node_by_guid()) is expensive operation. If you only need to
> > determine port 0 type - store it as part of struct monitored_node
> > structure. Another (even more universal) approach would be to keep there
> > a reference to related osm_node object.
> 
> This was done later in the patch series.

Good, but why do we need this intermediate version then? It would be
better to do right things from beginning I think (and also this patch
depends on previous one where redirection table size was changed so I
cannot apply it anyway until things will be clarified or fixed there).

Sasha
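
The first suggestion, recording the port 0 type once when the monitored
node is created instead of looking the node up by GUID on every query,
might look roughly like this. The struct and function names are
stand-ins, not the real PerfMgr definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a monitored node entry; not the real struct. */
struct mon_node_sketch {
	uint64_t guid;
	bool esp0;	/* cached: switch with enhanced port 0 */
};

/* Done once, at the point where the node is added to the monitored map. */
static void mon_node_init(struct mon_node_sketch *m, uint64_t guid,
			  bool is_switch, bool enhanced_port0)
{
	assert(m != NULL);
	m->guid = guid;
	m->esp0 = is_switch && enhanced_port0;
}

/* Done on every sweep: no node lookup needed, just read the cached flag. */
static unsigned int first_port_to_query(const struct mon_node_sketch *m)
{
	assert(m != NULL);
	return m->esp0 ? 0 : 1;
}
```

The trade-off is that the cached flag has to be refreshed if the switch
information can change while the node stays in the map.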


From sashak at voltaire.com  Wed Feb  4 12:59:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 22:59:25 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h:
	osm_node_get_num_physp description fix
In-Reply-To: <f0e08f230902041250m79badc67rb61b5ed1c040a35e@mail.gmail.com>
References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca>
	<20090204182520.GW11874@sashak.voltaire.com>
	<f0e08f230902041041u49b6a76cxb294cc1473058af2@mail.gmail.com>
	<20090204190256.GZ11874@sashak.voltaire.com>
	<f0e08f230902041203o27eeac6fm7fc64d4ea9462859@mail.gmail.com>
	<20090204204731.GG11874@sashak.voltaire.com>
	<f0e08f230902041250m79badc67rb61b5ed1c040a35e@mail.gmail.com>
Message-ID: <20090204205917.GI11874@sashak.voltaire.com>

On 15:50 Wed 04 Feb     , Hal Rosenstock wrote:
> On Wed, Feb 4, 2009 at 3:47 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 15:03 Wed 04 Feb     , Hal Rosenstock wrote:
> >> >
> >> > For switch it will be an actual number of allocated physical ports
> >> > (struct osm_physp) - port 0 plus number of external ports. For non
> >> > switch nodes entry '0' is not used.
> >>
> >> Right. In my terms, physical is another name for an external port and
> >> port 0 is not a physical (external) port so I think we're quibbling
> >> about words. What do you think it should say ?
> >
> > I don't really have a good opinion :(. Maybe something like:
> >
> >  *      Returns the number of osm_physp ports allocated for this node
> >  *      (for switches it is number of external physical ports plus port
> >  *      0 and number of physical ports + 1 for non-switch nodes).
> 
> Fine with me. Let me know if you want a patch for this.

If we are out of ideas then yes, send a new patch (still hope that you
will find better description during this... :)).

Sasha


From hal.rosenstock at gmail.com  Wed Feb  4 12:58:10 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 15:58:10 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c:
	Increase size of memory allocation in __collect_guids
In-Reply-To: <20090204190023.GY11874@sashak.voltaire.com>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
	<20090204190023.GY11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>

On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>>
>> Patch to increase size of monitored node in
>> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
>> port number.
>
> There are a couple of validations like (port > p_mon_node->redir_tbl_size)
> in osm_perfmgr.c. Would it be correct after proposed change?

I see an issue with those tests which I will fix in a subsequent patch.

-- Hal

>
> Sasha
>


From hal.rosenstock at gmail.com  Wed Feb  4 13:01:41 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 16:01:41 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In
	bad_node_port, allow queries on enhanced SP0
In-Reply-To: <20090204205520.GH11874@sashak.voltaire.com>
References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
	<20090204191523.GA11874@sashak.voltaire.com>
	<f0e08f230902041154j6570089ej145f9dbc3f2860df@mail.gmail.com>
	<20090204205520.GH11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041301y76638a10jcd2840794f046b82@mail.gmail.com>

On Wed, Feb 4, 2009 at 3:55 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 14:54 Wed 04 Feb     , Hal Rosenstock wrote:
>> >
>> > (osm_get_node_by_guid()) is expensive operation. If you only need to
>> > determine port 0 type - store it as part of struct monitored_node
>> > structure. Another (even more universal) approach would be to keep there
>> > a reference to related osm_node object.
>>
>> This was done later in the patch series.
>
> Good, but why do we need this intermediate version then?

Just as a time saver; it's just the path I took in development.

> It would be better to do right things from beginning I think

Sure it's better but does it really matter ?

> (and also this patch
> depends on previous one where redirection table size was changed so I
> cannot apply it anyway until things will be clarified or fixed there).

I think that position is extreme. I don't think I broke anything that
wasn't already broken.

Anyhow, if you really want, I'll produce one patch for these changes.

-- Hal

> Sasha
>


From sashak at voltaire.com  Wed Feb  4 13:11:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 23:11:06 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size
	of memory allocation in __collect_guids
In-Reply-To: <f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
	<20090204190023.GY11874@sashak.voltaire.com>
	<f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>
Message-ID: <20090204211106.GJ11874@sashak.voltaire.com>

On 15:58 Wed 04 Feb     , Hal Rosenstock wrote:
> On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
> >>
> >> Patch to increase size of monitored node in
> >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
> >> port number.
> >
> > There are a couple of validations like (port > p_mon_node->redir_tbl_size)
> > in osm_perfmgr.c. Would it be correct after proposed change?
> 
> I see an issue with those tests which I will fix in a subsequent patch.

Could you fix this and post v2? - putting bugs in a main stream is bad
in general and practically also complicates things like bisecting.

Sasha


From hal.rosenstock at gmail.com  Wed Feb  4 13:09:20 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 16:09:20 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c:
	Increase size of memory allocation in __collect_guids
In-Reply-To: <20090204211106.GJ11874@sashak.voltaire.com>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
	<20090204190023.GY11874@sashak.voltaire.com>
	<f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>
	<20090204211106.GJ11874@sashak.voltaire.com>
Message-ID: <f0e08f230902041309o5d4b1744j4cb2399d27f3f2b5@mail.gmail.com>

On Wed, Feb 4, 2009 at 4:11 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 15:58 Wed 04 Feb     , Hal Rosenstock wrote:
>> On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> > On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>> >>
>> >> Patch to increase size of monitored node in
>> >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
>> >> port number.
>> >
>> > There are a couple of validations like (port > p_mon_node->redir_tbl_size)
>> > in osm_perfmgr.c. Would it be correct after proposed change?
>>
>> I see an issue with those tests which I will fix in a subsequent patch.
>
> Could you fix this and post v2?

I can.

> - putting bugs in a main stream is bad
> in general and practically also complicates things like bisecting.

It was leaving an old bug in rather than adding a new one.

-- Hal

> Sasha
>


From halr at obsidianresearch.com  Wed Feb  4 13:26:08 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 14:26:08 -0700
Subject: [ofa-general] [PATCHv2] opensm/osm_node.h: Fix
	osm_node_get_num_physp description
Message-ID: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca>

Sasha,

v2 of patch to update/fix opensm/include/opensm/osm_node.h as requested.

-- Hal
-------------- next part --------------
opensm/include/opensm/osm_node.h: Fix osm_node_get_num_physp description
    
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h
index 50b3598..fec24ba 100644
--- a/opensm/include/opensm/osm_node.h
+++ b/opensm/include/opensm/osm_node.h
@@ -269,7 +269,10 @@ static inline uint8_t osm_node_get_type(IN const osm_node_t * const p_node)
 *	osm_node_get_num_physp
 *
 * DESCRIPTION
-*	Returns the type of this node.
+*	Returns the number of osm_physp ports allocated for this node.
+*	For switches, it is the number of external physical ports plus
+*	port 0. For CAs and routers, it is the number of external physical
+*	ports plus 1.
 *
 * SYNOPSIS
 */

From sashak at voltaire.com  Wed Feb  4 13:40:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 23:40:18 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In
	bad_node_port, allow queries on enhanced SP0
In-Reply-To: <f0e08f230902041301y76638a10jcd2840794f046b82@mail.gmail.com>
References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca>
	<20090204191523.GA11874@sashak.voltaire.com>
	<f0e08f230902041154j6570089ej145f9dbc3f2860df@mail.gmail.com>
	<20090204205520.GH11874@sashak.voltaire.com>
	<f0e08f230902041301y76638a10jcd2840794f046b82@mail.gmail.com>
Message-ID: <20090204214018.GK11874@sashak.voltaire.com>

On 16:01 Wed 04 Feb     , Hal Rosenstock wrote:
> 
> I think that position is extreme. I don't think I broke anything that
> wasn't already broken.

At least after fast look: (port_num > p_mon_node->redir_tbl_size) and
similar tests look broken, using osm_get_node_by_guid() likely slows
down existing PerfMgr. :( Both triggered by those patches.

I would be fine with subsequent patches if there were no degradations.

> Anyhow, if you really want, I'll produce one patch for these changes.

Thanks.

Sasha
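
The bound test in question can be illustrated in isolation, assuming a
zero-based table that holds exactly redir_tbl_size entries. The struct
and function names here are stand-ins, not code from osm_perfmgr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the part of the monitored node that matters here. */
struct redir_tbl_sketch {
	uint32_t redir_tbl_size;	/* number of entries, indexes 0..size-1 */
};

/* A "port > size" test lets port == size through, one past the end. */
static bool rejected_by_gt(const struct redir_tbl_sketch *t, uint32_t port)
{
	assert(t != NULL);
	return port > t->redir_tbl_size;
}

/* A "port >= size" test rejects every index outside 0..size-1. */
static bool rejected_by_ge(const struct redir_tbl_sketch *t, uint32_t port)
{
	assert(t != NULL);
	return port >= t->redir_tbl_size;
}
```

With a 25-entry table, port 25 is accepted by the first test and rejected
by the second, while port 24 is accepted by both.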


From hal.rosenstock at gmail.com  Wed Feb  4 13:37:47 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 16:37:47 -0500
Subject: Re: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c:
	Increase size of memory allocation in __collect_guids
In-Reply-To: <f0e08f230902041309o5d4b1744j4cb2399d27f3f2b5@mail.gmail.com>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
	<20090204190023.GY11874@sashak.voltaire.com>
	<f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>
	<20090204211106.GJ11874@sashak.voltaire.com>
	<f0e08f230902041309o5d4b1744j4cb2399d27f3f2b5@mail.gmail.com>
Message-ID: <f0e08f230902041337k55c8d3c3p406563548ee4b0e0@mail.gmail.com>

On Wed, Feb 4, 2009 at 4:09 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> On Wed, Feb 4, 2009 at 4:11 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> On 15:58 Wed 04 Feb     , Hal Rosenstock wrote:
>>> On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>>> > On 07:57 Tue 03 Feb     , Hal Rosenstock wrote:
>>> >>
>>> >> Patch to increase size of monitored node in
>>> >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual
>>> >> port number.
>>> >
>>> > There are a couple of validations like (port > p_mon_node->redir_tbl_size)
>>> > in osm_perfmgr.c. Would it be correct after proposed change?
>>>
>>> I see an issue with those tests which I will fix in a subsequent patch.
>>
>> Could you fix this and post v2?
>
> I can.

Would you push the latest changes you've accepted up to the management
repo on the OFA server as they impact this ?

-- Hal

>> - putting bugs in a main stream is bad
>> in general and practically also complicates things like bisecting.
>
> It was leaving an old bug in rather than adding a new one.
>
> -- Hal
>
>> Sasha
>>
>


From sashak at voltaire.com  Wed Feb  4 13:42:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 23:42:53 +0200
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c:
	Increase size of memory allocation in __collect_guids
In-Reply-To: <f0e08f230902041337k55c8d3c3p406563548ee4b0e0@mail.gmail.com>
References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca>
	<20090204190023.GY11874@sashak.voltaire.com>
	<f0e08f230902041258s540e8a38la9e31e27f542c82d@mail.gmail.com>
	<20090204211106.GJ11874@sashak.voltaire.com>
	<f0e08f230902041309o5d4b1744j4cb2399d27f3f2b5@mail.gmail.com>
	<f0e08f230902041337k55c8d3c3p406563548ee4b0e0@mail.gmail.com>
Message-ID: <20090204214253.GL11874@sashak.voltaire.com>

On 16:37 Wed 04 Feb     , Hal Rosenstock wrote:
> 
> Would you push the latest changes you've accepted up to the management
> repo on the OFA server as they impact this ?

Sure. Pushing now.

Sasha


From sashak at voltaire.com  Wed Feb  4 13:45:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 4 Feb 2009 23:45:00 +0200
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_node.h: Fix
	osm_node_get_num_physp description
In-Reply-To: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca>
References: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090204214500.GM11874@sashak.voltaire.com>

On 14:26 Wed 04 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> v2 of patch to update/fix opensm/include/opensm/osm_node.h as requested.
> 
> -- Hal

> opensm/include/opensm/osm_node.h: Fix osm_node_get_num_physp description
>     
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From chien.tin.tung at intel.com  Wed Feb  4 13:48:11 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Wed, 4 Feb 2009 14:48:11 -0700
Subject: [ofa-general] RE: [PATCH] : Define debugging variables only when
 CONFIG_INFINIBAND_NES_DEBUG is enabled
In-Reply-To: <ea11fea30902032254v22d95d35ua3eab9a5a6d4feab@mail.gmail.com>
References: <ea11fea30901271028u70f559d5y656be5610ab83a41@mail.gmail.com>
	<ea11fea30902032254v22d95d35ua3eab9a5a6d4feab@mail.gmail.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com>


>> Below patch removes following compilation warnings :
>> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused variable 'tmp_addr'
>> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused variable 'tmp_addr'
>>
>
>Any feedback on this ?


Manish,

Thank you for the patch to take care of the warnings.  Upon closer
examination on the usage of tmp_addr in the subsequent NES_DEBUG,
it seems to be nonsense.  I am creating a patch to take out
tmp_addr and the subsequent NES_DEBUG.

Thanks,

Chien

From or.gerlitz at gmail.com  Wed Feb  4 13:52:07 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 4 Feb 2009 23:52:07 +0200
Subject: [ofa-general] RE: impossibility to bind a device/port with the 
	rdma-cm when the port is down
In-Reply-To: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>
	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
Message-ID: <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>

On Wed, Feb 4, 2009 at 6:41 PM, Sean Hefty <sean.hefty at intel.com> wrote:
> There may be some way to defer setting the qkey if it's not available when binding, but how
> does allowing the bind to proceed help?  Without the qkey, the QP is basically unusable.

We have two usage cases:

- an rdma-cm based app wants to determine whether the route for a multicast
group leads to an IPoIB interface/device, based on the outcome of
rdma_bind_addr etc.

- for an HA scheme, an app wants to resolve the device/port and then use
IB events as a trigger to actually start doing things such as QP
creation, joining multicast groups, etc.

Or


From halr at obsidianresearch.com  Wed Feb  4 14:06:06 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 15:06:06 -0700
Subject: [ofa-general] [PATCHv2] opensm/PerfMgr: Primarily fix enhanced
	switch port 0 perf manager operation
Message-ID: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is a revised patch superseding any outstanding perfmgr
patches. This version fixes esp0 perfmgr operation. It determines ESP0
for the monitored node and subsequently copies this into the db node.
Also, it fixes redirection table size and port number validation.

-- Hal

-------------- next part --------------

opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation 

Determine ESP0 for monitored node and copy into db node
Also, fix redirection table size and port number validation

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h
index c8a4add..87fae37 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -100,6 +100,7 @@ typedef struct _monitored_node {
 	cl_map_item_t map_item;
 	struct _monitored_node *next;
 	uint64_t guid;
+	boolean_t esp0;
 	char *name;
 	uint32_t redir_tbl_size;
 	redir_t redir_port[1];	/* redirection on a per port basis */
diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h
index 5c96378..cb5c40a 100644
--- a/opensm/include/opensm/osm_perfmgr_db.h
+++ b/opensm/include/opensm/osm_perfmgr_db.h
@@ -134,6 +134,7 @@ typedef struct _db_port {
 typedef struct _db_node {
 	cl_map_item_t map_item;	/* must be first */
 	uint64_t node_guid;
+	boolean_t esp0;
 	_db_port_t *ports;
 	uint8_t num_ports;
 	char node_name[NODE_NAME_SIZE];
@@ -155,7 +156,8 @@ perfmgr_db_t *perfmgr_db_construct(struct osm_perfmgr *perfmgr);
 void perfmgr_db_destroy(perfmgr_db_t * db);
 
 perfmgr_db_err_t perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid,
-					 uint8_t num_ports, char *node_name);
+					 boolean_t esp0, uint8_t num_ports,
+					 char *node_name);
 
 perfmgr_db_err_t perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid,
 					    uint8_t port,
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index a2ce50f..b01d612 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -438,7 +438,7 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context)
 	if (cl_qmap_get(&pm->monitored_map, node_guid)
 	    == cl_qmap_end(&pm->monitored_map)) {
 		/* if not already in our map add it */
-		size = node->node_info.num_ports;
+		size = osm_node_get_num_physp(node);
 		mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * size);
 		if (!mon_node) {
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: "
@@ -449,7 +449,15 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context)
 		memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * size);
 		mon_node->guid = node_guid;
 		mon_node->name = strdup(node->print_desc);
-		mon_node->redir_tbl_size = size + 1;
+		mon_node->redir_tbl_size = size;
+		/* check for enhanced switch port 0 */
+		if (node && osm_node_get_type(node) == IB_NODE_TYPE_SWITCH &&
+		    node->sw &&
+		    ib_switch_info_is_enhanced_port0(&node->sw->switch_info))
+			mon_node->esp0 = TRUE;
+		else
+			mon_node->esp0 = FALSE;
+
 		cl_qmap_insert(&(pm->monitored_map), node_guid,
 			       (cl_map_item_t *) mon_node);
 	}
@@ -491,8 +499,8 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context)
 	node_guid = cl_ntoh64(node->node_info.node_guid);
 
 	/* make sure we have a database object ready to store this information */
-	if (perfmgr_db_create_entry(pm->db, node_guid, num_ports,
-				    node->print_desc) !=
+	if (perfmgr_db_create_entry(pm->db, node_guid, mon_node->esp0,
+				    num_ports, node->print_desc) !=
 	    PERFMGR_EVENT_DB_SUCCESS) {
 		OSM_LOG(pm->log, OSM_LOG_ERROR,
 			"ERR 4C08: DB create entry failed for 0x%"
@@ -501,10 +509,8 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context)
 		goto Exit;
 	}
 
-	/* if switch, check for enhanced port 0 */
-	if (osm_node_get_type(node) == IB_NODE_TYPE_SWITCH &&
-	    node->sw &&
-	    ib_switch_info_is_enhanced_port0(&node->sw->switch_info))
+	/* check for switch enhanced port 0 */
+	if (mon_node->esp0)
 		startport = 0;
 
 	/* issue the query for each port */
@@ -1136,7 +1142,7 @@ static void osm_pc_rcv_process(void *context, void *data)
 		/* LID redirection support (easier than GID redirection) */
 		cl_plock_acquire(pm->lock);
 		/* Now, validate port number */
-		if (port > p_mon_node->redir_tbl_size) {
+		if (port >= p_mon_node->redir_tbl_size) {
 			cl_plock_release(pm->lock);
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C13: "
 				"Invalid port num %d for GUID 0x%016"
diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c
index bff9a0f..ef47ce3 100644
--- a/opensm/opensm/osm_perfmgr_db.c
+++ b/opensm/opensm/osm_perfmgr_db.c
@@ -90,14 +90,15 @@ static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port)
 {
 	if (!node)
 		return (PERFMGR_EVENT_DB_GUIDNOTFOUND);
-	if (port == 0 || port >= node->num_ports)
+	if (port >= node->num_ports || (!node->esp0 && port == 0))
 		return (PERFMGR_EVENT_DB_PORTNOTFOUND);
 	return (PERFMGR_EVENT_DB_SUCCESS);
 }
 
 /** =========================================================================
  */
-static _db_node_t *__malloc_node(uint64_t guid, uint8_t num_ports, char *name)
+static _db_node_t *__malloc_node(uint64_t guid, boolean_t esp0,
+				 uint8_t num_ports, char *name)
 {
 	int i = 0;
 	time_t cur_time = 0;
@@ -110,6 +111,7 @@ static _db_node_t *__malloc_node(uint64_t guid, uint8_t num_ports, char *name)
 		goto free_rc;
 	rc->num_ports = num_ports;
 	rc->node_guid = guid;
+	rc->esp0 = esp0;
 
 	cur_time = time(NULL);
 	for (i = 0; i < num_ports; i++) {
@@ -151,14 +153,15 @@ static perfmgr_db_err_t __insert(perfmgr_db_t * db, _db_node_t * node)
 /**********************************************************************
  **********************************************************************/
 perfmgr_db_err_t
-perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid,
+perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, boolean_t esp0,
 			uint8_t num_ports, char *name)
 {
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_excl_acquire(&db->lock);
 	if (!_get(db, guid)) {
-		_db_node_t *pc_node = __malloc_node(guid, num_ports, name);
+		_db_node_t *pc_node = __malloc_node(guid, esp0, num_ports,
+						    name);
 		if (!pc_node) {
 			rc = PERFMGR_EVENT_DB_NOMEM;
 			goto Exit;

From hal.rosenstock at gmail.com  Wed Feb  4 15:35:15 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 4 Feb 2009 18:35:15 -0500
Subject: [ofa-general] Possible memory leak and null pointer dereference 
	in local_completions()
In-Reply-To: <1233777486.23327.172.camel@chromite.mv.qlogic.com>
References: <1233689172.23327.155.camel@chromite.mv.qlogic.com>
	<f0e08f230902040429p5c01abd0y349abb413e120277@mail.gmail.com>
	<1233777486.23327.172.camel@chromite.mv.qlogic.com>
Message-ID: <f0e08f230902041535j690f91cfq10ec16cab942298d@mail.gmail.com>

On Wed, Feb 4, 2009 at 2:58 PM, Ralph Campbell
<ralph.campbell at qlogic.com> wrote:
> On Wed, 2009-02-04 at 04:29 -0800, Hal Rosenstock wrote:
>> On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell
>> <ralph.campbell at qlogic.com> wrote:
>> > I was doing some tests with different MAD packets and
>> > then reading the infiniband/core/mad.c code.
>> >
>> > handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
>> > on the mad_agent_priv->local_work work queue with
>> > local->mad_priv == NULL if device->process_mad() returns
>> > IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
>> > (!ib_response_mad(&mad_priv->mad.mad) ||
>> >  !mad_agent_priv->agent.recv_handler).
>> >
>> > In this case, local_completions() will be called with
>> > local->mad_priv == NULL. The code does check for this
>> > case and skips calling recv_mad_agent->agent.recv_handler().
>> > This means recv == 0 so kmem_cache_free() is called with a
>> > NULL pointer.
>>
>> That could be fixed by changing the check for !recv prior to the
>> kmem_cache_free there to a check for (!recv && local->mad_priv).
>
> This is what we did to continue making progress so I know
> it works.
>
>> > Even if local->mad_priv != NULL, I don't see how local->mad_priv
>> > is freed when recv == 1. Thus, it appears to be a memory leak.
>>
>> For those cases, it's either freed in local_completions (as recv is
>> set to 1 for local->mad_priv != NULL except when there is no mad recv
>> agent but that is another bug (see below)) or earlier in the else
>> clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of
>> handle_outgoing_dr_smp(). That's another issue that this points out
>> where recv = 1 needs to be moved up in local_completions.
>
> The other problem I noticed with setting recv = 1, is that recv = 0
> is outside the while (!list_empty) loop so it is never reset back
> to zero.
>
> I'm not really following you about recv = 1 needs to be moved up in
> local_completions.

I was referring to handling the case where local->mad_priv != NULL and
there is no mad recv agent:

                if (local->mad_priv) {
                        recv_mad_agent = local->recv_mad_agent;
                        if (!recv_mad_agent) {
                                printk(KERN_ERR PFX "No receive MAD agent for local completion\n");
                                goto local_send_completion;
                        }

That was another case where there was a leak so I moved recv = 1 from
below this to above it just after the check of local->mad_priv in the
patch I proposed.

> What I was really looking for was a confirmation that the original
> code had a memory leak.

I need to look at this further. I haven't looked at this code
much in the past couple of years.

> I don't see any reason to special case the
> call to kmem_cache_free(). It seems to me that it is needed any time
> local->mad_priv != NULL.
> The NULL pointer bug is easily fixed in a number of different ways.

I agree that if it turns out that this case was missed, then your
patch is simpler but it will take me a little bit to check this out.
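
The two hazards discussed above (a flag that is never reset per list entry,
and a free of a NULL mad_priv) can be illustrated with a small user-space
sketch. This is not the kernel code and not either of the proposed patches:
struct local_item, drain_local_list() and release_leftovers() are
hypothetical names, and the sketch assumes that the receive path takes
ownership of the buffer whenever a receive agent exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_mad_local_private. */
struct local_item {
	void *mad_priv;     /* may be NULL, as in the directed-route case */
	int has_recv_agent; /* nonzero if a receive handler takes the buffer */
};

/*
 * Walk n entries and return how many mad_priv buffers the loop freed itself.
 * With reset_per_item == 0, 'recv' behaves like a function-scope flag that
 * stays set once any entry has set it.
 */
static int drain_local_list(struct local_item *items, int n, int reset_per_item)
{
	int freed = 0;
	int recv = 0;
	int i;

	assert(items != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (reset_per_item)
			recv = 0;
		if (items[i].mad_priv && items[i].has_recv_agent) {
			recv = 1;
			/* Assumption: the receive path owns and releases the buffer. */
			free(items[i].mad_priv);
			items[i].mad_priv = NULL;
		}
		/* Free only a buffer that exists and was not handed off. */
		if (!recv && items[i].mad_priv) {
			free(items[i].mad_priv);
			items[i].mad_priv = NULL;
			freed++;
		}
	}
	return freed;
}

/* Count (and release) buffers the loop left behind; these would be leaks. */
static int release_leftovers(struct local_item *items, int n)
{
	int left = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (items[i].mad_priv) {
			free(items[i].mad_priv);
			items[i].mad_priv = NULL;
			left++;
		}
	}
	return left;
}
```

With three entries (buffer plus agent, buffer without agent, no buffer), the
stale flag leaves the second buffer behind, while the per-entry reset frees
it. The NULL entry is skipped in both cases because of the explicit check.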

>> Would you try the untested patch below and see if it fixes the problem
>> you found ? Thanks.
>
> We are in the middle of moving our office so I won't be able to
> reproduce this until next week.

I no longer have any test bed setup for this. Any chance you can
regress with the Mellanox HCAs to be sure this works there ? Part of
that testing should be running OpenSM as it creates some of those
cases.

-- Hal

>> -- Hal
>>
>> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>> index 5c54fc2..cca87e6 100644
>> --- a/drivers/infiniband/core/mad.c
>> +++ b/drivers/infiniband/core/mad.c
>> @@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work)
>>                 list_del(&local->completion_list);
>>                 spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>>                 if (local->mad_priv) {
>> +                       recv = 1;
>>                         recv_mad_agent = local->recv_mad_agent;
>>                         if (!recv_mad_agent) {
>>                                 printk(KERN_ERR PFX "No receive MAD agent for lo
>>                                 goto local_send_completion;
>>                         }
>>
>> -                       recv = 1;
>>                         /*
>>                          * Defined behavior is to complete response
>>                          * before request
>> @@ -2422,7 +2422,7 @@ local_send_completion:
>>
>>                 spin_lock_irqsave(&mad_agent_priv->lock, flags);
>>                 atomic_dec(&mad_agent_priv->refcount);
>> -               if (!recv)
>> +               if (!recv && local->mad_priv)
>>                         kmem_cache_free(ib_mad_cache, local->mad_priv);
>>                 kfree(local);
>>         }
>>
>> > So, I'm proposing the following patch:
>> >
>> > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>> > index 5c54fc2..93d80e5 100644
>> > --- a/drivers/infiniband/core/mad.c
>> > +++ b/drivers/infiniband/core/mad.c
>> > @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
>> >        struct ib_mad_local_private *local;
>> >        struct ib_mad_agent_private *recv_mad_agent;
>> >        unsigned long flags;
>> > -       int recv = 0;
>> >        struct ib_wc wc;
>> >        struct ib_mad_send_wc mad_send_wc;
>> >
>> > @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
>> >                                goto local_send_completion;
>> >                        }
>> >
>> > -                       recv = 1;
>> >                        /*
>> >                         * Defined behavior is to complete response
>> >                         * before request
>> > @@ -2422,7 +2420,7 @@ local_send_completion:
>> >
>> >                spin_lock_irqsave(&mad_agent_priv->lock, flags);
>> >                atomic_dec(&mad_agent_priv->refcount);
>> > -               if (!recv)
>> > +               if (local->mad_priv)
>> >                        kmem_cache_free(ib_mad_cache, local->mad_priv);
>> >                kfree(local);
>> >        }
>> >
>> >
>
>


From chien.tin.tung at intel.com  Wed Feb  4 15:44:34 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Wed, 4 Feb 2009 17:44:34 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: ibv_devinfo displays 0 for vendor_id
	and vendor_part_id
Message-ID: <20090204234434.GA1856@ctung-MOBL>

ibv_devinfo displays 0 for vendor_id and vendor_part_id.  Fill in
OUI and device_id for those two fields.

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
---

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index cb4a5f3..da966a5 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -254,6 +254,7 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) {
 	u32 adapter_size;
 	u32 arp_table_size;
 	u16 vendor_id;
+	u16 device_id;
 	u8  OneG_Mode;
 	u8  func_index;
 
@@ -356,6 +357,13 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) {
 		return NULL;
 	}
 
+	nesadapter->vendor_id = (((u32) nesadapter->mac_addr_high) << 8) |
+				(nesadapter->mac_addr_low >> 24);
+
+	pci_bus_read_config_word(nesdev->pcidev->bus, nesdev->pcidev->devfn,
+				 PCI_DEVICE_ID, &device_id);
+	nesadapter->vendor_part_id = device_id;
+
 	if (nes_init_serdes(nesdev, hw_rev, port_count, nesadapter,
 							OneG_Mode)) {
 		kfree(nesadapter);
-- 
1.5.3.3
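
As an illustration of the vendor_id expression in the patch above, here is a
small user-space sketch. It assumes the MAC address is stored as a 16-bit
high part (first two bytes) and a 32-bit low part (last four bytes), which is
what the shift amounts in the patch suggest; oui_from_split_mac() and
oui_from_bytes() are hypothetical helper names, and the MAC used in the usage
note is an arbitrary example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the first three MAC bytes (the OUI) from a split representation. */
static uint32_t oui_from_split_mac(uint16_t mac_high, uint32_t mac_low)
{
	return ((uint32_t)mac_high << 8) | (mac_low >> 24);
}

/* Split a 6-byte MAC into high/low parts, then extract the OUI. */
static uint32_t oui_from_bytes(const uint8_t mac[6])
{
	uint16_t high;
	uint32_t low;

	assert(mac != NULL);
	high = (uint16_t)(((uint16_t)mac[0] << 8) | mac[1]);
	low = ((uint32_t)mac[2] << 24) | ((uint32_t)mac[3] << 16) |
	      ((uint32_t)mac[4] << 8) | (uint32_t)mac[5];
	return oui_from_split_mac(high, low);
}
```

For the example MAC 00:12:55:ab:cd:ef the high part is 0x0012 and the low
part is 0x55abcdef, so the result is 0x001255, i.e. the first three bytes.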


From sashak at voltaire.com  Wed Feb  4 16:03:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 02:03:23 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan
	subnet configuration after SIGHUP
In-Reply-To: <498850A2.8090701@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
Message-ID: <20090205000323.GN11874@sashak.voltaire.com>

Hi Eli,

On 16:11 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
>  rescan configuration as first step on every heavy sweep
>  this is a must in case of priority change (increase) for standby SM
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/opensm/osm_state_mgr.c |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index fc7ceb9..622867b 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm)
>  	ib_api_status_t status;
>  	osm_remote_sm_t *p_remote_sm;
>  
> +	if (sm->p_subn->force_heavy_sweep && 
> +	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
> +		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> +			"osm_subn_rescan_conf_file failed\n");
> +	}
> +
>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
>  	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
>  		return;
> @@ -1131,11 +1137,6 @@ _repeat_discovery:
>  	sm->p_subn->force_reroute = FALSE;
>  	sm->p_subn->subnet_initialization_error = FALSE;
>  
> -	/* rescan configuration updates */
> -	if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
> -		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> -			"osm_subn_rescan_conf_file failed\n");
> -
>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
>  		sm->p_subn->need_update = 1;

The 'force_heavy_sweep' flag can be raised during a light sweep too. In this
case you will miss config rescanning before the incoming heavy sweep. I
guess the patch should be similar to (not tested):


diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index aecfac6..f5d3837 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1041,11 +1041,14 @@ static void do_sweep(osm_sm_t * sm)
 {
 	ib_api_status_t status;
 	osm_remote_sm_t *p_remote_sm;
+	unsigned config_parsed = 0;
 
-	if (sm->p_subn->force_heavy_sweep &&
-	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
-		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
-			"osm_subn_rescan_conf_file failed\n");
+	if (sm->p_subn->force_heavy_sweep) {
+		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
+			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
+				"osm_subn_rescan_conf_file failed\n");
+		else
+			config_parsed = 1;
 	}
 
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
@@ -1137,6 +1140,11 @@ _repeat_discovery:
 	sm->p_subn->force_reroute = FALSE;
 	sm->p_subn->subnet_initialization_error = FALSE;
 
+	/* rescan configuration updates */
+	if (!config_parsed && osm_subn_rescan_conf_files(sm->p_subn) < 0)
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
+			"osm_subn_rescan_conf_file failed\n");
+
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
 		sm->p_subn->need_update = 1;

Sasha
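
The control flow of the untested patch above can be modeled in user space as
follows. This is only a sketch: sweep_sketch() and struct sweep_stats are
hypothetical, the rescan is reduced to a success/failure flag, and
'reaches_heavy_sweep' stands in for all the conditions under which do_sweep()
gets as far as the _repeat_discovery section.

```c
#include <assert.h>

struct sweep_stats {
	int rescans; /* how many times the config rescan was attempted */
	int errors;  /* how many of those attempts failed */
};

/* Model one do_sweep() call: rescan early if a heavy sweep is forced,
 * and rescan at the heavy sweep only if no earlier rescan succeeded. */
static struct sweep_stats sweep_sketch(int force_heavy_sweep,
				       int reaches_heavy_sweep,
				       int rescan_fails)
{
	struct sweep_stats st = { 0, 0 };
	int config_parsed = 0;

	assert(rescan_fails == 0 || rescan_fails == 1);
	if (force_heavy_sweep) {
		st.rescans++;
		if (rescan_fails)
			st.errors++;
		else
			config_parsed = 1;
	}
	if (!reaches_heavy_sweep)
		return st;
	if (!config_parsed) {
		st.rescans++;
		if (rescan_fails)
			st.errors++;
	}
	return st;
}
```

In this model a successful rescan happens exactly once per heavy sweep,
whether or not the flag was already raised on entry, and a failed early
rescan is retried at the heavy sweep.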


From sean.hefty at intel.com  Wed Feb  4 16:32:04 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 4 Feb 2009 16:32:04 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	
	<49893FAF.3090007@voltaire.com>	
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>
Message-ID: <D36D0D1763CA48FD98DCDCAD64BCA89F@amr.corp.intel.com>

>- an rdma-cm based app wants to determine if the route for a multicast
>group leads to IPoIB interface/device based on the outcome of
>rdma_bind_addr etc

I'm not quite following this yet.  Are you wanting a list of IP addresses that
map to RDMA devices?

>- for HA scheme, an app want to resolve the device/port and then use
>IB events as a trigger to actually start doing things such as QP
>creation, Joining multicast groups, etc

Thanks - I'll look at Yossi's patch in detail.  The general principle looks fine
to me.

Is there some notification for IP addresses becoming usable that could be used
instead?

- Sean


From sashak at voltaire.com  Wed Feb  4 16:45:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 02:45:39 +0200
Subject: [ofa-general] Re: [PATCHv2] opensm/PerfMgr: Primarily fix enhanced
	switch port 0 perf manager operation
In-Reply-To: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca>
References: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090205004539.GO11874@sashak.voltaire.com>

On 15:06 Wed 04 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Attached is a revised patch superseding any outstanding perfmgr
> patches. This version fixes esp0 perfmgr operation. It determines ESP0
> for the monitored node and subsequently copies this into the db node.
> Also, it fixes redirection table size and port number validation.
> 
> -- Hal
> 

> 
> opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation 
> 
> Determine ESP0 for monitored node and copy into db node
> Also, fix redirection table size and port number validation
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

With one change: there was one more (port > p_mon_node->redir_tbl_size)
test in osm_perfmgr_mad_send_err_callback(), which I fixed to
(port >= p_mon_node->redir_tbl_size).
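
The off-by-one in the port check can be shown with a short sketch. It assumes
(as the patch suggests) that redir_tbl_size is the number of table entries
including port 0, so valid indexes run from 0 to size - 1; redir_index_ok()
and db_port_ok() are hypothetical names that mirror the two validations, not
OpenSM functions.

```c
#include <assert.h>
#include <stdint.h>

/* A port is a valid index into a table of redir_tbl_size entries. */
static int redir_index_ok(uint8_t port, uint32_t redir_tbl_size)
{
	return port < redir_tbl_size;
}

/* DB-side check: port 0 is only valid on an enhanced switch port 0 node. */
static int db_port_ok(uint8_t port, uint8_t num_ports, int esp0)
{
	assert(esp0 == 0 || esp0 == 1);
	if (port >= num_ports)
		return 0;
	if (port == 0 && !esp0)
		return 0;
	return 1;
}
```

With a table of 4 entries, port 4 is rejected here, whereas a test written as
(port > size) would have accepted it and indexed one past the end.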

Sasha


From ralph.campbell at qlogic.com  Wed Feb  4 17:56:37 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 04 Feb 2009 17:56:37 -0800
Subject: [ofa-general] Possible memory leak and null pointer
	dereference in local_completions()
In-Reply-To: <f0e08f230902041535j690f91cfq10ec16cab942298d@mail.gmail.com>
References: <1233689172.23327.155.camel@chromite.mv.qlogic.com>
	<f0e08f230902040429p5c01abd0y349abb413e120277@mail.gmail.com>
	<1233777486.23327.172.camel@chromite.mv.qlogic.com>
	<f0e08f230902041535j690f91cfq10ec16cab942298d@mail.gmail.com>
Message-ID: <1233798997.23327.180.camel@chromite.mv.qlogic.com>


> >> Would you try the untested patch below and see if it fixes the problem
> >> you found ? Thanks.
> >
> > We are in the middle of moving our office so I won't be able to
> > reproduce this until next week.
> 
> I no longer have any test bed setup for this. Any chance you can
> regress with the Mellanox HCAs to be sure this works there ? Part of
> that testing should be running OpenSM as it creates some of those
> cases.
> 
> -- Hal

We have a variety of Mellanox HCAs so next week when our lab is back up
I can test it.


From mkatiyar at gmail.com  Wed Feb  4 19:01:30 2009
From: mkatiyar at gmail.com (Manish Katiyar)
Date: Thu, 5 Feb 2009 08:31:30 +0530
Subject: [ofa-general] ***SPAM*** Re: [PATCH] : Define debugging variables
	only when CONFIG_INFINIBAND_NES_DEBUG is enabled
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com>
References: <ea11fea30901271028u70f559d5y656be5610ab83a41@mail.gmail.com>
	<ea11fea30902032254v22d95d35ua3eab9a5a6d4feab@mail.gmail.com>
	<60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com>
Message-ID: <ea11fea30902041901m20ae5a95o950af24c95fcc6c2@mail.gmail.com>

On Thu, Feb 5, 2009 at 3:18 AM, Tung, Chien Tin
<chien.tin.tung at intel.com> wrote:
>
>>> Below patch removes following compilation warnings :
>>> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused
>>variable 'tmp_addr'
>>> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused
>>variable 'tmp_addr'
>>>
>>
>>Any feedback on this ?
>
>
> Manish,
>
> Thank you for the patch to take care of the warnings.  Upon closer
> examination on the usage of tmp_addr in the subsequent NES_DEBUG,
> it seems to be nonsense.  I am creating a patch to take out
> tmp_addr and the subsequent NES_DEBUG.

Thanks a lot Chien

Thanks -
Manish

>
> Thanks,
>
> Chien


From He.Huang at Sun.COM  Wed Feb  4 20:47:28 2009
From: He.Huang at Sun.COM (Isaac Huang)
Date: Wed, 04 Feb 2009 23:47:28 -0500
Subject: [ofa-general] troubleshooting IB_CM_REJ_INVALID_SERVICE_ID
 in	RDMA_CM_EVENT_REJECTED at active side of the connection
Message-ID: <20090205044728.GL18580@sun.com>

Hi,

I got some RDMA_CM_EVENT_REJECTED errors at active sides (i.e. nodes
doing rdma_connect), after RDMA_CM_EVENT_ADDR_RESOLVED and
RDMA_CM_EVENT_ROUTE_RESOLVED.

Poking around in CM code told me that the passive side couldn't find a
listener with requested service_id on the incoming device of the
connection request.

I suspected that either the active side or passive side could have
been bound to a wrong IB device - both sides did have multiple IB
interfaces on the fabric. Our code did bind to correct local IP
addresses at both sides, src_addr in rdma_resolve_addr and
rdma_bind_addr before rdma_listen. However, I seemed to remember that
some old OFED versions had issues in rdma_translate_ip so that a wrong
interface could be returned, e.g. bug 726 and 325. Also, the active
side was running OFED 1.3.1 and passive side could be an older
version. Could you guys give me some tips for troubleshooting? Any
debugging options or /proc file to look at? Is there any netstat-like
tool in OFED (e.g. something like a "netstat -ltp" to find out who is
listening on which device)?

The other possible cause could be ARP flux, but unfortunately arping
via IPoIB always segfaults on our systems. Is there any other way to
troubleshoot possible ARP flux issues?

BTW, pinging over IPoIB addresses worked fine.

Your suggestion is greatly appreciated.

Thanks,
Isaac


From dorfman.eli at gmail.com  Wed Feb  4 23:43:04 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 05 Feb 2009 09:43:04 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <20090205000323.GN11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
Message-ID: <498A9888.5010003@gmail.com>

Sasha Khapyorsky wrote:
> Hi Eli,
> 
> On 16:11 Tue 03 Feb     , Eli Dorfman (Voltaire) wrote:
>>  rescan configuration as first step on every heavy sweep
>>  this is a must in case of priority change (increase) for standby SM
>>
>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>> ---
>>  opensm/opensm/osm_state_mgr.c |   11 ++++++-----
>>  1 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
>> index fc7ceb9..622867b 100644
>> --- a/opensm/opensm/osm_state_mgr.c
>> +++ b/opensm/opensm/osm_state_mgr.c
>> @@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm)
>>  	ib_api_status_t status;
>>  	osm_remote_sm_t *p_remote_sm;
>>  
>> +	if (sm->p_subn->force_heavy_sweep && 
>> +	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
>> +		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
>> +			"osm_subn_rescan_conf_file failed\n");
>> +	}
>> +
>>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
>>  	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
>>  		return;
>> @@ -1131,11 +1137,6 @@ _repeat_discovery:
>>  	sm->p_subn->force_reroute = FALSE;
>>  	sm->p_subn->subnet_initialization_error = FALSE;
>>  
>> -	/* rescan configuration updates */
>> -	if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
>> -		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
>> -			"osm_subn_rescan_conf_file failed\n");
>> -
>>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
>>  		sm->p_subn->need_update = 1;
> 
> The 'force_heavy_sweep' flag can be raised during a light sweep too. In this
> case you will miss config rescanning before the incoming heavy sweep. I
> guess the patch should be similar to (not tested):
> 
> 
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index aecfac6..f5d3837 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -1041,11 +1041,14 @@ static void do_sweep(osm_sm_t * sm)
>  {
>  	ib_api_status_t status;
>  	osm_remote_sm_t *p_remote_sm;
> +	unsigned config_parsed = 0;
>  
> -	if (sm->p_subn->force_heavy_sweep &&
> -	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
> -		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> -			"osm_subn_rescan_conf_file failed\n");
> +	if (sm->p_subn->force_heavy_sweep) {
> +		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
> +			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> +				"osm_subn_rescan_conf_file failed\n");
> +		else
> +			config_parsed = 1;
>  	}
>  
>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
> @@ -1137,6 +1140,11 @@ _repeat_discovery:
>  	sm->p_subn->force_reroute = FALSE;
>  	sm->p_subn->subnet_initialization_error = FALSE;
>  
> +	/* rescan configuration updates */
> +	if (!config_parsed && osm_subn_rescan_conf_files(sm->p_subn) < 0)
> +		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> +			"osm_subn_rescan_conf_file failed\n");
> +
>  	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
>  		sm->p_subn->need_update = 1;
> 

ok. Please apply the fixed patch.

Thanks,
Eli


From ruffing at motama.com  Thu Feb  5 02:03:19 2009
From: ruffing at motama.com (Jan Ruffing)
Date: Thu, 05 Feb 2009 11:03:19 +0100
Subject: [ofa-general] RDMA transfers: Buffer status communications?
Message-ID: <498AB967.9010108@motama.com>

Hello,

when planning a data transfer system using Infiniband's RDMA mechanisms, I stumbled upon the following question: Is there a standard approach to inform the sender after an RDMA_write operation that the receiving buffer has been processed by the receiver and is now ready to receive new data? 

My understanding is as follows:
- As soon as an IBV_WR_RDMA_WRITE[_WITH_IMM] operation has finished transferring data into the target buffer on the receiver side, a work completion gets put onto the sender side completion queue [and optionally the receiver's completion queue, too].
- The receiver processes the data in the buffer without the sender side noticing.
- If the receiver wants to inform the sender that the buffer has been processed and is ready to accept new data, the receiver has to manually send a message to the sender (e.g. by filing a send work request containing some kind of buffer identifier).

Is my understanding of the mechanisms correct? Since locking and unlocking of data receiving buffers is a standard use case in most transport strategies, I wanted to ask if there's a more elegant way to manage this using the Infiniband architecture? Like for example delaying the sender side work completion till the buffer has been processed by the receiver?
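
The bookkeeping implied by the third point above could be sketched as
follows. This is a plain user-space model, not verbs code: struct
sender_view, sender_acquire() and sender_on_release() are hypothetical names,
NUM_BUFS is an arbitrary choice, and the receiver's "buffer processed"
message is modeled as a direct function call (in a real application it would
presumably arrive as a received send or an RDMA write with immediate data).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUFS 4

/* The sender's view of the remote receive buffers. */
struct sender_view {
	unsigned char busy[NUM_BUFS]; /* 1 while the receiver still owns the data */
};

/* Pick a free remote buffer for the next RDMA write, or -1 if all are busy. */
static int sender_acquire(struct sender_view *s)
{
	int i;

	assert(s != NULL);
	for (i = 0; i < NUM_BUFS; i++) {
		if (!s->busy[i]) {
			s->busy[i] = 1;
			return i;
		}
	}
	return -1;
}

/* Handle the receiver's notification that buf_id has been processed. */
static void sender_on_release(struct sender_view *s, int buf_id)
{
	assert(s != NULL && buf_id >= 0 && buf_id < NUM_BUFS);
	s->busy[buf_id] = 0;
}
```

The sender only posts a write when sender_acquire() returns a buffer index,
so it stalls once all buffers are in use and resumes when a release arrives.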

Thanks,
Jan


-- 
Jan Ruffing
Software Developer

Motama GmbH
Lortzingstraße 10 · 66111 Saarbrücken · Germany
tel +49 681 940 85 50 · fax +49 681 940 85 49
ruffing at motama.com · www.motama.com

Companies register · district council Saarbrücken · HRB 15249
CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger



From vlad at lists.openfabrics.org  Thu Feb  5 03:11:56 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu,  5 Feb 2009 03:11:56 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090205-0200 daily build status
Message-ID: <20090205111156.AD4E4E611D7@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From ogerlitz at voltaire.com  Thu Feb  5 03:44:53 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 5 Feb 2009 13:44:53 +0200 (IST)
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
Message-ID: <Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>

Hi Sean,

It seems that even when the rdma-cm consumer binds to a specific address,
the rdma-cm address resolution code follows the order of the devices/rules
in the routing table. So the user can't really dictate an outgoing interface
based on the src address provided to rdma_resolve_addr. This problem seems to
happen even if the user first called rdma_bind_addr, so it's either the same
issue or rdma_resolve_addr is somehow stepping on the device/port
"resolved" by rdma_bind_addr.

Consider this system, with two IPoIB interfaces on the same IP subnet using
the same HCA, each on a different port. The first match for 192.168.10.0/24
would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket
bind to a different interface. First, I see that two neighbours have been
created, each on a different interface, and second, from sampling the interface
packet counters (not shown here) I see that each ping uses the correct interface.

Repeating the same test with rds-ping -I (rds-ping is a user space utility provided
by the rds-devel package, sending packets through the rds kernel driver), I can see
that the two rds rdma-cm ids (rds would have two connections in that case) are using
the same port, the one corresponding to ib3, the first routing match.
Below is some info on my system.

Or, when running with multiple HCAs on Linux - we run into a problem with RDS - in that
rdma_resolve_addr does not pick the outgoing NIC based on the IP we bind to.. it seems
to always be using the destination IP.

We put this patch together - which solves the problem on Linux... note that this
behavior only fails on Linux - it works correctly on HPUX...as an example.

Do you see a problem with proposing that this patch be picked up by OFED ?

Rick Frank, who brought this to my attention, also handed me this patch,
which is claimed to work around this issue. It's badly formatted and I
couldn't really understand what it does. I hoped to be able to reproduce
this with rping or ucmatose, but neither allows me to specify a -I address
on the client side, and I don't have the time now for this enhancement.

--- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
+++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
@@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
  struct flowi fl;
  struct rtable *rt;
  struct neighbour *neigh;
+ struct net_device *dev;
  int ret;

  memset(&fl, 0, sizeof fl);
  fl.nl_u.ip4_u.daddr = dst_ip;
  fl.nl_u.ip4_u.saddr = src_ip;
+
+ if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
+ fl.oif = dev->ifindex;
+ dev_put(dev);
+
+ ret = ip_route_output_key(&rt, &fl);
+ if (ret == 0)
+ goto found;
+ /* Fall back to using any local device */
+ fl.oif = 0;
+ }
  ret = ip_route_output_key(&rt, &fl);
  if (ret)
  goto out;

+found: ;
+
  /* If the device does ARP internally, return 'done' */
  if (rt->idev->dev->flags & IFF_NOARP) {
  rdma_copy_addr(addr, rt->idev->dev, NULL);


[root at anise ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib3
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib2


[root at anise ~]# ip addr show ib2
11: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd
    inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2
    inet6 fe80::202:c903:3:17c1/64 scope link

[root at anise ~]# ip addr show ib3
12: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd
    inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3
    inet6 fe80::202:c903:3:17c2/64 scope link

[root at anise ~]# ping -I 192.168.10.60 192.168.10.89
2 packets transmitted, 2 received, 0% packet loss, time 999ms

[root at anise ~]# ping -I 192.168.10.61 192.168.10.89
3 packets transmitted, 3 received, 0% packet loss, time 1999ms

[root at anise ~]# ip n s
192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE

[root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
   3: 33 usec

[root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
   3: 33 usec

[root at anise ~]# rds-info -I
RDS IB Connections:
      LocalAddr      RemoteAddr                         LocalDev                        RemoteDev
  192.168.10.61   192.168.10.89              fe80::2:c903:3:17c2             fe80::2:c902:22:efe5
  192.168.10.60   192.168.10.89              fe80::2:c903:3:17c2             fe80::2:c902:22:efe5
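
The control flow of the addr.c patch quoted in this message can be pictured
with a toy model. What follows is a minimal user-space sketch, not kernel
code: the toy_* names are hypothetical, the lookup is a plain first-match scan
rather than the real routing code, and the two routes and interface indices
(ib3 = 12, ib2 = 11) are copied from the route -n and ip addr output above.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy route entry: destination network, netmask, output interface index. */
struct toy_route {
	uint32_t net;
	uint32_t mask;
	int ifindex;
};

/* Same order as the `route -n` listing: ib3 first, then ib2. */
static const struct toy_route toy_routes[] = {
	{ 0xC0A80A00u, 0xFFFFFF00u, 12 },	/* 192.168.10.0/24 dev ib3 */
	{ 0xC0A80A00u, 0xFFFFFF00u, 11 },	/* 192.168.10.0/24 dev ib2 */
};

/* First-match lookup; oif == 0 means "any interface". Returns ifindex or -1. */
static int toy_route_lookup(uint32_t daddr, int oif)
{
	size_t i;

	for (i = 0; i < sizeof toy_routes / sizeof toy_routes[0]; i++) {
		if ((daddr & toy_routes[i].mask) != toy_routes[i].net)
			continue;
		if (oif && toy_routes[i].ifindex != oif)
			continue;
		return toy_routes[i].ifindex;
	}
	return -1;
}

/* Patch-style flow: try the source address's interface, then fall back. */
static int toy_resolve(uint32_t daddr, int src_ifindex)
{
	int ret = -1;

	if (src_ifindex)
		ret = toy_route_lookup(daddr, src_ifindex);
	if (ret < 0)
		ret = toy_route_lookup(daddr, 0);	/* any local device */
	return ret;
}
```

With dst = 0xC0A80A59 (192.168.10.89), toy_resolve(dst, 0) returns 12 (ib3, the
first match), toy_resolve(dst, 11) returns 11 (ib2), and an unknown interface
index such as 99 falls back to 12.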


From ogerlitz at voltaire.com  Thu Feb  5 04:03:42 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 05 Feb 2009 14:03:42 +0200
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used
	for	bind
In-Reply-To: <Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
Message-ID: <498AD59E.4030003@voltaire.com>

Or Gerlitz wrote:
> Rick Frank who brought this to my attention, also handed me this patch
> which is claimed to workaround this issue, 
> --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
> +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
> @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
>   struct flowi fl;
>   struct rtable *rt;
>   struct neighbour *neigh;
> + struct net_device *dev;
>   int ret;
>
>   memset(&fl, 0, sizeof fl);
>   fl.nl_u.ip4_u.daddr = dst_ip;
>   fl.nl_u.ip4_u.saddr = src_ip;
> +
> + if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
> + fl.oif = dev->ifindex;
> + dev_put(dev);
> +
> + ret = ip_route_output_key(&rt, &fl);
> + if (ret == 0)
> + goto found;
I assume the trick here is to somehow enforce the interface returned by
ip_dev_find and not the one resolved by the routing table. At least as I
understand the addr.c code, it takes the interface later from neigh->dev,
correct?

Or.


> + /* Fall back to using any local device */
> + fl.oif = 0;
> + }
>   ret = ip_route_output_key(&rt, &fl);
>   if (ret)
>   goto out;
>
> +found: ;
> +
>   /* If the device does ARP internally, return 'done' */
>   if (rt->idev->dev->flags & IFF_NOARP) {
>   rdma_copy_addr(addr, rt->idev->dev, NULL);


From sashak at voltaire.com  Thu Feb  5 04:16:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 14:16:34 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan
	subnet configuration after SIGHUP
In-Reply-To: <498A9888.5010003@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
Message-ID: <20090205121634.GQ11874@sashak.voltaire.com>

On 09:43 Thu 05 Feb     , Eli Dorfman (Voltaire) wrote:
> 
> ok. Please apply the fixed patch.

Did you test it?

Sasha


From Alexr at voltaire.com  Wed Feb  4 21:21:07 2009
From: Alexr at voltaire.com (Alex Rosenbaum)
Date: Thu, 5 Feb 2009 07:21:07 +0200
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <D36D0D1763CA48FD98DCDCAD64BCA89F@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	
	<49893FAF.3090007@voltaire.com>	
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>
	<D36D0D1763CA48FD98DCDCAD64BCA89F@amr.corp.intel.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com>

>- I'm not quite following this yet.  Are you wanting a list of IP
>addresses that map to RDMA devices?

Consider the case where the user specifies a local interface IP address
that it wants to work with. The application does not know whether the IP
address maps to an rdma-cm capable device (IB or iWARP) or not (e.g.
1GigE).
In the current implementation, if the IB port is down (e.g. cable
unplugged) but the interface is up, rdma_bind_addr fails. The same
happens if rdma_bind_addr is called with the IP address of the 1GigE
interface.
The application cannot tell whether the failure is due to trying to bind
to a 1GigE device, which is not rdma-cm capable, or to an rdma-cm capable
device which is in a temporary 'bad' state.
Assuming this is an rdma-cm capable device in a 'bad' state, the user
space application can wait for async ibv events (PORT_ACTIVE) from the
device. Once the device is active again it can retry the rdma_create_qp
or rdma_join_mc.

Alex


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb  5 06:34:24 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 05 Feb 2009 15:34:24 +0100
Subject: [ofa-general] [ibsim][PATCH] Socket name can be forced by exporting
 IBSIM_SOCKNAME
 before starting ibsim and/or preloading umad2sim so multiple simulator can
 run on the same system at the same time
Message-ID: <498AF8F0.2080707@ext.bull.net>

As we do a lot of routing tests with ibsim, we needed to be able to launch multiple simulators on the same system.
With this patch, ibsim (and umad2sim) will try to read the socket basename using a getenv("IBSIM_SOCKNAME") which makes it possible.
If IBSIM_SOCKNAME is not set, SIM_BASENAME is still used.


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  ibsim/ibsim.c         |   10 ++++++++--
  umad2sim/sim_client.c |   14 +++++++++-----
  2 files changed, 17 insertions(+), 7 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 51201d225702489862648d1380d84c1570c11c71.diff
Type: text/x-patch
Size: 3286 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/c9852f49/attachment.bin>
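
Since the patch itself was scrubbed from the archive, here is a minimal sketch
of the lookup the description implies. It is an assumption, not the attached
code: the helper name sim_socket_basename and the "sim" default are
placeholders (the real SIM_BASENAME lives in the ibsim headers), and treating
an empty variable as unset is a choice made here, not something the patch
description states.

```c
#include <stdlib.h>

/* Placeholder default; the real value is defined by ibsim. */
#define SIM_BASENAME "sim"

/* Hypothetical helper: IBSIM_SOCKNAME if exported and non-empty, else default. */
static const char *sim_socket_basename(void)
{
	const char *name = getenv("IBSIM_SOCKNAME");

	return (name && *name) ? name : SIM_BASENAME;
}
```

Two simulators could then be told apart by exporting a different
IBSIM_SOCKNAME in each environment before starting ibsim and its clients.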

From dorfman.eli at gmail.com  Thu Feb  5 07:00:19 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 05 Feb 2009 17:00:19 +0200
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_subnet.c enable
	log_max_size opt update
Message-ID: <498AFF03.7090903@gmail.com>

enable log_max_size opt update

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index f589180..d6d39a6 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -132,7 +132,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "connect_roots", OPT_OFFSET(connect_roots), opts_parse_boolean, NULL, 1 },
 	{ "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_parse_boolean, NULL, 1 },
 	{ "log_file", OPT_OFFSET(log_file), opts_parse_charp, NULL, 0 },
-	{ "log_max_size", OPT_OFFSET(log_max_size), opts_parse_uint32, opts_setup_log_max_size },
+	{ "log_max_size", OPT_OFFSET(log_max_size), opts_parse_uint32, opts_setup_log_max_size, 1 },
 	{ "log_flags", OPT_OFFSET(log_flags), opts_parse_uint8, opts_setup_log_flags, 1 },
 	{ "force_log_flush", OPT_OFFSET(force_log_flush), opts_parse_boolean, opts_setup_force_log_flush, 1 },
 	{ "accum_log_file", OPT_OFFSET(accum_log_file), opts_parse_boolean, opts_setup_accum_log_file, 1 },
-- 
1.5.5
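
Why the one-character change matters: in C, a trailing member left out of a
brace initializer is implicitly zero. The sketch below uses a simplified
stand-in for opt_rec_t; the struct layout and the can_update field name are
assumptions made for illustration, chosen to match the five-column shape of
the table in the diff.

```c
#include <stddef.h>

/* Simplified stand-in for an option table record (names are illustrative). */
struct toy_opt_rec {
	const char *name;
	size_t offset;
	int can_update;	/* nonzero: option may be changed on a live rescan */
};

static const struct toy_opt_rec toy_tbl[] = {
	{ "log_max_size (before)", 8 },		/* last member omitted: 0 */
	{ "log_max_size (after)", 8, 1 },	/* explicitly updatable */
};
```

Here toy_tbl[0].can_update is 0 and toy_tbl[1].can_update is 1, so under this
reading the unpatched entry was silently treated as not updatable.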


From dorfman.eli at gmail.com  Thu Feb  5 07:19:41 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 05 Feb 2009 17:19:41 +0200
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_subnet.c fix parse
	functions for big endian machines
Message-ID: <498B038D.4020009@gmail.com>

fix parse functions for big endian machines

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_subnet.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index d6d39a6..7b33659 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn,
 		  IN void *p_v, IN setup_fn_t pfn)
 {
 	uint16_t *p_val = p_v;
-	uint32_t val = strtoul(p_val_str, NULL, 0);
+	uint16_t val = strtoul(p_val_str, NULL, 0);
 
 	CL_ASSERT(val < 0x10000);
-	if (cl_hton32(val) != *p_val) {
+	if (cl_hton16(val) != *p_val) {
 		log_config_value(p_key, "0x%04x", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = cl_hton16((uint16_t) val);
+		*p_val = cl_hton16(val);
 	}
 }
 
@@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn,
 		  IN void *p_v, IN setup_fn_t pfn)
 {
 	uint8_t *p_val = p_v;
-	uint32_t val = strtoul(p_val_str, NULL, 0);
+	uint8_t val = strtoul(p_val_str, NULL, 0);
 
 	CL_ASSERT(val < 0x100);
 	if (val != *p_val) {
 		log_config_value(p_key, "%u", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = (uint8_t) val;
+		*p_val = val;
 	}
 }
 
-- 
1.5.5


From chien.tin.tung at intel.com  Thu Feb  5 07:21:06 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Thu, 5 Feb 2009 09:21:06 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: tmp_addr compilation warning
Message-ID: <20090205152106.GA2304@ctung-MOBL>

As reported by Manish Katiyar <mkatiyar at gmail.com>, tmp_addr is
causing a compilation warning when INFINIBAND_NES_DEBUG is not defined.

tmp_addr is used in a NES_DEBUG and the print does not make sense.
Taking out tmp_addr and the NES_DEBUG.

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
---
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 6f42ab6..bd918df 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -778,14 +778,10 @@ static struct nes_cm_node *find_node(struct nes_cm_core *cm_core,
 	unsigned long flags;
 	struct list_head *hte;
 	struct nes_cm_node *cm_node;
-	__be32 tmp_addr = cpu_to_be32(loc_addr);
 
 	/* get a handle on the hte */
 	hte = &cm_core->connected_nodes;
 
-	nes_debug(NES_DBG_CM, "Searching for an owner node: %pI4:%x from core %p->%p\n",
-		  &tmp_addr, loc_port, cm_core, hte);
-
 	/* walk list and find cm_node associated with this session ID */
 	spin_lock_irqsave(&cm_core->ht_lock, flags);
 	list_for_each_entry(cm_node, hte, list) {
-- 
1.5.3.3


From richard.frank at oracle.com  Thu Feb  5 07:23:53 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Thu, 5 Feb 2009 10:23:53 -0500
Subject: [ofa-general] ***SPAM*** Re: pick the outgoing HCA based on the IP
	used for bind
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
Message-ID: <ED6909C60F5A49CAA2643B9A5AAF0546@us.oracle.com>

FWIW - I tested with this patch to rdma_resolve_ip - and found no difference
in behavior.

At this point I do not think the addr.c patch resolves this... at one point
we had two patches that were overlapping - both possibly solving the same
problem... now that rds is explicitly binding to an IP...the resolve_ip
patch appears not to be needed.

The original problem is that we were not getting to either the HCA or port
associated with an IP - even in a dual HCA configuration. Now that rds is
explicitly binding we do get the correct HCA ( based on Or's tests ),
however, we really want to resolve down to the port backing the IP.

----- Original Message ----- 
From: "Or Gerlitz" <ogerlitz at voltaire.com>
To: "Sean Hefty" <sean.hefty at intel.com>
Cc: <general at lists.openfabrics.org>; <rds-devel at oss.oracle.com>; "Richard 
Frank" <richard.frank at oracle.com>
Sent: Thursday, February 05, 2009 6:44 AM
Subject: Re: pick the outgoing HCA based on the IP used for bind


> Hi Sean,
>
> It seems that even when the rdma-cm consumer binds to a specific address,
> the rdma-cm address resolution code follows the order of the devices/rules
> in routing table. So the user can't really dictate an outgoing interface
> based on the src address provided to rdma_resolve_addr. This problem seem 
> to
> happen even if the user first called rdma_bind_addr, so its either same
> issue or that rdma_resolve_addr somehow stepping on the device/port
> "resolved" by rdma_bind_addr.
>
> Consider this system, with two IPoIB intefaces on the same IP subnet using
> the same HCA, each on a different port. The first match for 
> 192.168.10.0/24
> would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket
> bind to a diffrent interface. First, I see that two neighbours has been
> created, each on a different interface, and second from sampling the 
> interface
> packet counters (not brought here) I see that each ping uses the correct 
> interface.
>
> Repeating the same test with rds-ping -I (rds-ping is a user space utility 
> provided
> by the rds-devel package, sending packets through the rds kernel driver) - 
> I can see
> that the two rds rdma-cm ids (rds would have two connections in that case) 
> is using
> the same port, the one corresponding to ib3, the first routing match.
> Below is some info on my system.
>
> Or, when running with multiple HCAs on Linux - we run into an problem with 
> RDS - in that
> rdma_resolve_addr does not pick the outgoing NIC based on the IP we bind 
> to.. it seems
> to always be using the destination IP.
>
> We put this patch together - which solves the problem on Linux... note 
> that this is
> behavior only fails on Linux - it works correctly on HPUX...as an example.
>
> Do you see a problem with proposing that this patch be picked up by OFED ?
>
> Rick Frank who brought this to my attention, also handed me this patch
> which is claimed to workaround this issue, its badly formatted and I
> couldn't really understand what it does. I hoped to be able and reproduce
> this with rping or ucmatose, but neither allow me to specify a -I address
> to the client side, and I don't have the time now for this enhancement.
>
> --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
> +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
> @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
>  struct flowi fl;
>  struct rtable *rt;
>  struct neighbour *neigh;
> + struct net_device *dev;
>  int ret;
>
>  memset(&fl, 0, sizeof fl);
>  fl.nl_u.ip4_u.daddr = dst_ip;
>  fl.nl_u.ip4_u.saddr = src_ip;
> +
> + if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
> + fl.oif = dev->ifindex;
> + dev_put(dev);
> +
> + ret = ip_route_output_key(&rt, &fl);
> + if (ret == 0)
> + goto found;
> + /* Fall back to using any local device */
> + fl.oif = 0;
> + }
>  ret = ip_route_output_key(&rt, &fl);
>  if (ret)
>  goto out;
>
> +found: ;
> +
>  /* If the device does ARP internally, return 'done' */
>  if (rt->idev->dev->flags & IFF_NOARP) {
>  rdma_copy_addr(addr, rt->idev->dev, NULL);
>
>
>
>
> [root at anise ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use 
> Iface
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 
> ib3
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 
> ib2
>
>
> [root at anise ~]# ip addr show ib2
> 11: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 
> 256
>    link/infiniband 
> 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd
>    inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2
>    inet6 fe80::202:c903:3:17c1/64 scope link
>
> [root at anise ~]# ip addr show ib3
> 12: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 
> 256
>    link/infiniband 
> 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd
>    inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3
>    inet6 fe80::202:c903:3:17c2/64 scope link
>
> [root at anise ~]# ping -I 192.168.10.60 192.168.10.89
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>
> [root at anise ~]# ping -I 192.168.10.61 192.168.10.89
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>
> [root at anise ~]# ip n s
> 192.168.10.89 dev ib3 lladdr 
> 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
> 192.168.10.89 dev ib2 lladdr 
> 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
>
> [root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
>   3: 33 usec
>
> [root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
>   3: 33 usec
>
> [root at anise ~]# rds-info -I
> RDS IB Connections:
>      LocalAddr      RemoteAddr                         LocalDev 
> RemoteDev
>  192.168.10.61   192.168.10.89              fe80::2:c903:3:17c2 
> fe80::2:c902:22:efe5
>  192.168.10.60   192.168.10.89              fe80::2:c903:3:17c2 
> fe80::2:c902:22:efe5
> 


From PHF at zurich.ibm.com  Thu Feb  5 08:22:50 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Thu, 5 Feb 2009 17:22:50 +0100
Subject: [ofa-general] Chelsio T3: Aggregate Throughput
Message-ID: <OF470F5E1D.BC00EC13-ONC1257554.0058CDE7-C1257554.0059FBAB@ch.ibm.com>

Hello,

We are currently looking into the scalability of the T3 in terms of
connections. We are using a 1-to-n scenario where one server
has a chunk of data and n clients fetch this chunk over and over
again using RDMA reads (each 1MB in size).

The clients do that such that they get an average data rate of about
9Mbps each. Every second we connect a new client to the server
and see how far it goes.

What puzzles us now is that after about 800 clients, they no longer
seem to receive much data.

The first interesting thing is that the aggregate throughput actually
drops (we expected it to stall). The second interesting thing is that it
does so already at about 6.3Gbps, which is just a bit more than half of
what the card can do. We do not see this behaviour when using far fewer
clients that RDMA read the data at a much higher data rate.

Is there any limitation on the RNIC that would explain this?

(Setup: T3 RNICs, OFED-1.4, 2.6.26 kernel, MTU=9000)

Many thanks for your advice,
 Philip

-- 
   Philip Frey 
   IBM Zurich Research Laboratory
   Saumerstrasse 4                  |  Phone: +41 44 724 8613
   CH-8803 Rueschlikon/Switzerland  |  Email: phf at zurich.ibm.com

From sean.hefty at intel.com  Thu Feb  5 09:22:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 09:22:00 -0800
Subject: [ofa-general] RDMA transfers: Buffer status communications?
In-Reply-To: <498AB967.9010108@motama.com>
References: <498AB967.9010108@motama.com>
Message-ID: <748E9553FCD94EA498B89E46765F309D@amr.corp.intel.com>

>Is my understanding of the mechanisms correct? Since locking and unlocking of
>data receiving buffers is a standard use case in most transport strategies, I
>wanted to ask if there's a more elegant way to manage this using the Infiniband
>architecture? Like for example delaying the sender side work completion till
>the buffer has been processed by the receiver?

Application level acks are needed to indicate when processing is complete.  The
hardware cannot determine this, so I don't know of any solution that's more
elegant in a general case.

- Sean
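
A minimal sketch of the application-level ack bookkeeping described above,
seen from the sender. This is a toy model with hypothetical toy_* names and a
fixed pool of four buffers; it leaves out the actual RDMA writes and the
message that carries the ack, and is not taken from any verbs API.

```c
#include <stdbool.h>

#define TOY_NUM_BUFS 4

/* Sender-side view of the receiver's buffers. */
struct toy_sender {
	bool in_use[TOY_NUM_BUFS];	/* true: written, not yet acked */
};

/* Claim a free remote buffer before writing; returns its index or -1. */
static int toy_claim(struct toy_sender *s)
{
	int i;

	for (i = 0; i < TOY_NUM_BUFS; i++) {
		if (!s->in_use[i]) {
			s->in_use[i] = true;
			return i;
		}
	}
	return -1;	/* every buffer is still owned by the receiver */
}

/* Application-level ack: the receiver has finished processing buffer i. */
static void toy_ack(struct toy_sender *s, int i)
{
	if (i >= 0 && i < TOY_NUM_BUFS)
		s->in_use[i] = false;
}
```

The sender stalls when toy_claim returns -1 and resumes once an ack releases
a buffer, which is the "unlock" step that the hardware cannot infer by itself.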


From sean.hefty at intel.com  Thu Feb  5 09:28:55 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 09:28:55 -0800
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
Message-ID: <FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>

>Rick Frank who brought this to my attention, also handed me this patch
>which is claimed to workaround this issue, its badly formatted and I
>couldn't really understand what it does. I hoped to be able and reproduce
>this with rping or ucmatose, but neither allow me to specify a -I address
>to the client side, and I don't have the time now for this enhancement.

ucmatose allows binding to a specific address using -b.  I haven't used rds-ping
to know if it's the same as -I in that case.  I don't have any systems myself
with dual HCAs; I don't think they have enough slots to support more than one.

- Sean 


From sashak at voltaire.com  Thu Feb  5 09:44:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 19:44:49 +0200
Subject: [ofa-general] Re: [ibsim][PATCH] Socket name can be forced by
	exporting
	IBSIM_SOCKNAME before starting ibsim and/or preloading umad2sim so
	multiple simulator can run on the same system at the same time
In-Reply-To: <498AF8F0.2080707@ext.bull.net>
References: <498AF8F0.2080707@ext.bull.net>
Message-ID: <20090205174449.GH5910@sashak.voltaire.com>

On 15:34 Thu 05 Feb     , Nicolas Morey Chaisemartin wrote:
> As we do a lot of routing tests with ibsim we had the need to be able to 
> launch multiple simulator on the same system.
> With this patch, ibsim (and umad2sim) will try to read the socket basename 
> using a getenv("IBSIM_SOCKNAME") which makes it possible.
> If IBSIM_SOCKNAME is not set, SIM_BASENAME is still used.
>
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

I just changed 'socket_basename' to be static in both ibsim and
umad2sim.

Sasha


From sashak at voltaire.com  Thu Feb  5 09:46:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 19:46:38 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c enable log_max_size
	opt update
In-Reply-To: <498AFF03.7090903@gmail.com>
References: <498AFF03.7090903@gmail.com>
Message-ID: <20090205174638.GI5910@sashak.voltaire.com>

On 17:00 Thu 05 Feb     , Eli Dorfman (Voltaire) wrote:
> enable log_max_size opt update
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Feb  5 10:04:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 20:04:00 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c fix parse functions
	for big endian machines
In-Reply-To: <498B038D.4020009@gmail.com>
References: <498B038D.4020009@gmail.com>
Message-ID: <20090205180400.GJ5910@sashak.voltaire.com>

On 17:19 Thu 05 Feb     , Eli Dorfman (Voltaire) wrote:
> fix parse functions for big endian machines
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

I'm fine with this patch - the code looks cleaner than it was before.

But could you please explain what the problem was with the original code
on big endian machines? (I don't see it.)

Also it would be helpful to have more detailed patch comments.

Sasha

> ---
>  opensm/opensm/osm_subnet.c |   10 +++++-----
>  1 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index d6d39a6..7b33659 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn,
>  		  IN void *p_v, IN setup_fn_t pfn)
>  {
>  	uint16_t *p_val = p_v;
> -	uint32_t val = strtoul(p_val_str, NULL, 0);
> +	uint16_t val = strtoul(p_val_str, NULL, 0);
>  
>  	CL_ASSERT(val < 0x10000);
> -	if (cl_hton32(val) != *p_val) {
> +	if (cl_hton16(val) != *p_val) {
>  		log_config_value(p_key, "0x%04x", val);
>  		if (pfn)
>  			pfn(p_subn, &val);
> -		*p_val = cl_hton16((uint16_t) val);
> +		*p_val = cl_hton16(val);
>  	}
>  }
>  
> @@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn,
>  		  IN void *p_v, IN setup_fn_t pfn)
>  {
>  	uint8_t *p_val = p_v;
> -	uint32_t val = strtoul(p_val_str, NULL, 0);
> +	uint8_t val = strtoul(p_val_str, NULL, 0);
>  
>  	CL_ASSERT(val < 0x100);
>  	if (val != *p_val) {
>  		log_config_value(p_key, "%u", val);
>  		if (pfn)
>  			pfn(p_subn, &val);
> -		*p_val = (uint8_t) val;
> +		*p_val = val;
>  	}
>  }
>  
> -- 
> 1.5.5
> 
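
One possible answer to the question above, offered as an assumption rather
than as the patch author's explanation: the old code passed the address of a
32-bit temporary to the setup callback through a void pointer. If the callback
then reads a uint8_t (or uint16_t) through that pointer, it gets the
lowest-addressed bytes, which hold the value on little-endian hosts but are
zero on big-endian ones. The sketch below is self-contained and uses
hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Read one byte the way a callback taking `void *` might: *(uint8_t *)p. */
static uint8_t read_as_u8(const void *p)
{
	uint8_t b;

	memcpy(&b, p, sizeof b);
	return b;
}

/* 1 on little-endian hosts, 0 on big-endian hosts. */
static int host_is_little_endian(void)
{
	const uint32_t one = 1;

	return read_as_u8(&one) == 1;
}

/* Old pattern: 32-bit temporary, callback reads only its first byte. */
static uint8_t callback_sees_wide(uint32_t parsed)
{
	return read_as_u8(&parsed);
}

/* Patched pattern: the temporary has the width the callback reads. */
static uint8_t callback_sees_narrow(uint32_t parsed)
{
	uint8_t val = (uint8_t)parsed;

	return read_as_u8(&val);
}
```

callback_sees_narrow(0x2a) yields 0x2a on any host, while
callback_sees_wide(0x2a) yields 0x2a only where host_is_little_endian()
returns 1 and yields 0 otherwise.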


From jgunthorpe at obsidianresearch.com  Thu Feb  5 10:02:03 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 5 Feb 2009 11:02:03 -0700
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used
	for	bind
In-Reply-To: <498AD59E.4030003@voltaire.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<498AD59E.4030003@voltaire.com>
Message-ID: <20090205180203.GD3288@obsidianresearch.com>

On Thu, Feb 05, 2009 at 02:03:42PM +0200, Or Gerlitz wrote:
> Or Gerlitz wrote:
> >Rick Frank who brought this to my attention, also handed me this patch
> >which is claimed to workaround this issue, 
> >+++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
> >@@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
> >  struct flowi fl;
> >  struct rtable *rt;
> >  struct neighbour *neigh;
> >+ struct net_device *dev;
> >  int ret;
> >
> >  memset(&fl, 0, sizeof fl);
> >  fl.nl_u.ip4_u.daddr = dst_ip;
> >  fl.nl_u.ip4_u.saddr = src_ip;
> >+
> >+ if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
> >+ fl.oif = dev->ifindex;
> >+ dev_put(dev);
> >+
> >+ ret = ip_route_output_key(&rt, &fl);
> >+ if (ret == 0)
> >+ goto found;

> I assume the trick here is to somehow enforce the interface returned by 
> ip_dev_find and not the one resolved by the routing table. At least as I 
> understand the addr.c code, it takes the interface later from neigh->dev 
> , correct?

That does seem to be what it is doing, but I can't see how that is
correct? The output interface is selected by the routing table, except
in very special cases (ie SO_BINDTODEVICE).

Why doesn't the original code work? It passes src_ip into the route
lookup which should be good enough.. Does 'ip route get <dest> from
<src>' return the right thing?

Jason


From weiny2 at llnl.gov  Thu Feb  5 10:03:31 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 5 Feb 2009 10:03:31 -0800
Subject: [ofa-general] [PATCH] libibmad: Use enum types for function
 parameters (WAS)
 Declare some enums as typedefs for cleaner function interfaces
In-Reply-To: <20090204103054.177aa6e2.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204182023.GP7618@obsidianresearch.com>
	<20090204182725.GX11874@sashak.voltaire.com>
	<20090204103054.177aa6e2.weiny2@llnl.gov>
Message-ID: <20090205100331.5ab5de76.weiny2@llnl.gov>

Sasha,

On Wed, 4 Feb 2009 10:30:54 -0800
Ira Weiny <weiny2 at llnl.gov> wrote:

> On Wed, 4 Feb 2009 20:27:25 +0200
> Sasha Khapyorsky <sashak at voltaire.com> wrote:
> 
> > On 11:20 Wed 04 Feb     , Jason Gunthorpe wrote:
> > > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote:
> > > 
> > > > I don't understand how enum typedefing makes things cleaner - actually
> > > > this will enforce me explicitly to verify an actual type in header
> > > > files. Sometimes typedefs could help with porting, but it is not the
> > > > case here.
> > > 
> > > Not typedefing per say, but passing an enum through an int is not that
> > > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS'
> > > for instance will do.
> > 
> > Yes, that would be fine to do.
> 
> I will redo the patch with 'enum MAD_FIELDS'.
> 

Patch below,
Ira

>From 3a52d32d7c6964a8078402c3712a58d1e43975de Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at llnl.gov>
Date: Mon, 2 Feb 2009 10:21:18 -0800
Subject: [PATCH] Use enum types for function parameters


Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>
---
 libibmad/include/infiniband/mad.h |   30 +++++++++++++++---------------
 libibmad/src/fields.c             |   22 +++++++++++-----------
 libibmad/src/resolve.c            |    6 +++---
 3 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 9ff4a3e..33a233c 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -595,14 +595,14 @@ typedef struct ib_vendor_call {
 #define MAD_DEF_RETRIES		3
 #define MAD_DEF_TIMEOUT_MS	1000
 
-enum {
+enum MAD_DEST {
 	IB_DEST_LID,
 	IB_DEST_DRPATH,
 	IB_DEST_GUID,
 	IB_DEST_DRSLID,
 };
 
-enum {
+enum MAD_NODE_TYPE {
 	IB_NODE_CA = 1,
 	IB_NODE_SWITCH,
 	IB_NODE_ROUTER,
@@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey)
 }
 
 /* fields.c */
-MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field,
+MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field);
+MAD_EXPORT void mad_set_field(void *buf, int base_offs, enum MAD_FIELDS field,
 			      uint32_t val);
 /* field must be byte aligned */
-MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field);
-MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field,
+MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, enum MAD_FIELDS field);
+MAD_EXPORT void mad_set_field64(void *buf, int base_offs, enum MAD_FIELDS field,
 				uint64_t val);
-MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val);
-MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val);
-MAD_EXPORT int mad_print_field(int field, const char *name, void *val);
-MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val);
-MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val);
+MAD_EXPORT void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val);
+MAD_EXPORT void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val);
+MAD_EXPORT void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val);
+MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val);
+MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val);
+MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val);
+MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val);
 
 /* mad.c */
 MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath,
@@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
 			       ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
-				     int dest_type, ib_portid_t * sm_id);
+				     enum MAD_DEST dest, ib_portid_t * sm_id);
 MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
 			       ibmad_gid_t * gid);
 
@@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 			ib_portid_t * sm_id, int timeout, const void *srcport);
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      enum MAD_DEST dest, ib_portid_t * sm_id,
 			      const void *srcport);
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 			const void *srcport);
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index d5a1eb4..588c57f 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f,
 	memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8);
 }
 
-uint32_t mad_get_field(void *buf, int base_offs, int field)
+uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field)
 {
 	return _get_field(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field(void *buf, int base_offs, int field, uint32_t val)
+void mad_set_field(void *buf, int base_offs, enum MAD_FIELDS field, uint32_t val)
 {
 	_set_field(buf, base_offs, ib_mad_f + field, val);
 }
 
-uint64_t mad_get_field64(void *buf, int base_offs, int field)
+uint64_t mad_get_field64(void *buf, int base_offs, enum MAD_FIELDS field)
 {
 	return _get_field64(buf, base_offs, ib_mad_f + field);
 }
 
-void mad_set_field64(void *buf, int base_offs, int field, uint64_t val)
+void mad_set_field64(void *buf, int base_offs, enum MAD_FIELDS field, uint64_t val)
 {
 	_set_field64(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_set_array(void *buf, int base_offs, int field, void *val)
+void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val)
 {
 	_set_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_get_array(void *buf, int base_offs, int field, void *val)
+void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val)
 {
 	_get_array(buf, base_offs, ib_mad_f + field, val);
 }
 
-void mad_decode_field(uint8_t * buf, int field, void *val)
+void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val)
 	_get_array(buf, 0, f, val);
 }
 
-void mad_encode_field(uint8_t * buf, int field, void *val)
+void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val)
 {
 	const ib_field_t *f = ib_mad_f + field;
 
@@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val,
 			 valsz ? valsz : ALIGN(f->bitlen, 8) / 8);
 }
 
-int mad_print_field(int field, const char *name, void *val)
+int mad_print_field(enum MAD_FIELDS field, const char *name, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return -1;
 	return _mad_print_field(ib_mad_f + field, name, val, 0);
 }
 
-char *mad_dump_field(int field, char *buf, int bufsz, void *val)
+char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
 	return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val);
 }
 
-char *mad_dump_val(int field, char *buf, int bufsz, void *val)
+char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val)
 {
 	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
 		return 0;
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index b62360b..553949d 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 }
 
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
-			      int dest_type, ib_portid_t * sm_id,
+			      enum MAD_DEST dest_type, ib_portid_t * sm_id,
 			      const void *srcport)
 {
 	uint64_t guid;
@@ -142,8 +142,8 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 	return -1;
 }
 
-int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
-			  ib_portid_t * sm_id)
+int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
+			enum MAD_DEST dest_type, ib_portid_t * sm_id)
 {
 	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
 					 sm_id, NULL);
-- 
1.5.4.5


From sashak at voltaire.com  Thu Feb  5 10:21:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 20:21:28 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/common: use enum MAD_DEST as
	ibd_dest_type type
In-Reply-To: <20090205100331.5ab5de76.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204182023.GP7618@obsidianresearch.com>
	<20090204182725.GX11874@sashak.voltaire.com>
	<20090204103054.177aa6e2.weiny2@llnl.gov>
	<20090205100331.5ab5de76.weiny2@llnl.gov>
Message-ID: <20090205182128.GK5910@sashak.voltaire.com>


Use introduced 'enum MAD_DEST' as type of ibd_dest_type variable.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/include/ibdiag_common.h |    2 +-
 infiniband-diags/src/ibdiag_common.c     |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h
index b92aa4d..4783b8e 100644
--- a/infiniband-diags/include/ibdiag_common.h
+++ b/infiniband-diags/include/ibdiag_common.h
@@ -41,7 +41,7 @@ extern int ibdebug;
 extern int ibverbose;
 extern char *ibd_ca;
 extern int ibd_ca_port;
-extern int ibd_dest_type;
+extern enum MAD_DEST ibd_dest_type;
 extern ib_portid_t *ibd_sm_id;
 extern int ibd_timeout;
 
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index 7d6e772..bda1efa 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -57,7 +57,7 @@ int ibdebug;
 int ibverbose;
 char *ibd_ca;
 int ibd_ca_port;
-int ibd_dest_type = IB_DEST_LID;
+enum MAD_DEST ibd_dest_type = IB_DEST_LID;
 ib_portid_t *ibd_sm_id;
 int ibd_timeout;
 
-- 
1.6.1.rc1.45.g123ed


From sashak at voltaire.com  Thu Feb  5 10:21:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 20:21:49 +0200
Subject: [ofa-general] Re: [PATCH] libibmad: Use enum types for function
	parameters (WAS)
	Declare some enums as typedefs for cleaner function interfaces
In-Reply-To: <20090205100331.5ab5de76.weiny2@llnl.gov>
References: <20090202185425.729a80b3.weiny2@llnl.gov>
	<20090204181421.GV11874@sashak.voltaire.com>
	<20090204182023.GP7618@obsidianresearch.com>
	<20090204182725.GX11874@sashak.voltaire.com>
	<20090204103054.177aa6e2.weiny2@llnl.gov>
	<20090205100331.5ab5de76.weiny2@llnl.gov>
Message-ID: <20090205182149.GL5910@sashak.voltaire.com>

On 10:03 Thu 05 Feb     , Ira Weiny wrote:
> Sasha,
> 
> On Wed, 4 Feb 2009 10:30:54 -0800
> Ira Weiny <weiny2 at llnl.gov> wrote:
> 
> > On Wed, 4 Feb 2009 20:27:25 +0200
> > Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > 
> > > On 11:20 Wed 04 Feb     , Jason Gunthorpe wrote:
> > > > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote:
> > > > 
> > > > > I don't understand how enum typedefing makes things cleaner - actually
> > > > > this will enforce me explicitly to verify an actual type in header
> > > > > files. Sometimes typedefs could help with porting, but it is not the
> > > > > case here.
> > > > 
> > > > Not typedefing per se, but passing an enum through an int is not that
> > > > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS'
> > > > for instance will do.
> > > 
> > > Yes, that would be fine to do.
> > 
> > I will redo the patch with 'enum MAD_FIELDS'.
> > 
> 
> Patch below,
> Ira
> 
> From 3a52d32d7c6964a8078402c3712a58d1e43975de Mon Sep 17 00:00:00 2001
> From: weiny2 at llnl.gov <weiny2 at llnl.gov>
> Date: Mon, 2 Feb 2009 10:21:18 -0800
> Subject: [PATCH] Use enum types for function parameters
> 
> 
> Signed-off-by: weiny2 at llnl.gov <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sean.hefty at intel.com  Thu Feb  5 11:17:32 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 11:17:32 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	
	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	
	<49893FAF.3090007@voltaire.com>	
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>
	<D36D0D1763CA48FD98DCDCAD64BCA89F@amr.corp.intel.com>
	<39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com>
Message-ID: <6C22667CA9024C9780E41AA468FC9153@amr.corp.intel.com>

>Assuming this is an rdma-cm capable device in a 'bad' state, the user
>space application can wait for async ibv events (PORT_ACTIVE) from the
>device. Once the device is active again it can retry the rdma_create_qp
>or rdma_join_mc.

Will this work?  Even once the port goes active, what the application is really
waiting for is for IPoIB to come back up and rejoin its 'broadcast' multicast
group.  I guess you could just continue to retry the operation until it
succeeds...

- Sean
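
A rough sketch of the "just retry until it succeeds" idea, assuming a
bounded number of attempts. Everything here is hypothetical: retry_op_t,
retry_until_success() and fake_join() are illustration names, and
fake_join() only stands in for a real call such as rdma_join_multicast().

```c
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, nonzero on failure. */
typedef int (*retry_op_t)(void *ctx);

/* Call op until it succeeds or max_attempts is used up.
 * Returns the number of attempts taken on success, or -1 on failure.
 * A real caller would sleep or wait for an event between attempts. */
static int retry_until_success(retry_op_t op, void *ctx, int max_attempts)
{
	int attempt;

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		if (op(ctx) == 0)
			return attempt;
	}
	return -1;
}

/* Stand-in for the real operation: fails until the counter in ctx
 * reaches zero, imitating IPoIB not yet having rejoined its broadcast
 * group after the port went active. */
static int fake_join(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -1;
	}
	return 0;
}
```

The bound matters because PORT_ACTIVE alone does not say when the IPoIB
join has completed, so an unbounded loop could spin indefinitely.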


From sean.hefty at intel.com  Thu Feb  5 11:17:54 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 11:17:54 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <4989E6D6.5030109@Voltaire.COM>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
Message-ID: <3522BA7F49834878A674F2908834D747@amr.corp.intel.com>

>@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i
> 			event.status = ib_event->param.sidr_rep_rcvd.status;
> 			break;
> 		}
>+		ret = cma_set_qkey(id_priv);
>+		if (ret) {
>+			event.event = RDMA_CM_EVENT_ADDR_ERROR;
>+			event.status = -EINVAL;
>+			break;
>+		}
> 		if (id_priv->qkey != rep->qkey) {
> 			event.event = RDMA_CM_EVENT_UNREACHABLE;
> 			event.status = -EINVAL;
>@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma
> 			     const void *private_data, int private_data_len)
> {
> 	struct ib_cm_sidr_rep_param rep;
>+	int ret;
>
> 	memset(&rep, 0, sizeof rep);
> 	rep.status = status;
> 	if (status == IB_SIDR_SUCCESS) {
>+		ret = cma_set_qkey(id_priv);
>+		if (ret)
>+			return ret;
> 		rep.qp_num = id_priv->qp_num;
> 		rep.qkey = id_priv->qkey;
> 	}

Looking at this, I keep wanting to set the qkey when sending or receiving the
sidr req, not rep.  This is earlier than the qkey is needed, but catching the
error sooner in this case seems better to me than deferring.  Thoughts?

- Sean


From yosefe at Voltaire.COM  Thu Feb  5 11:26:54 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Thu, 05 Feb 2009 21:26:54 +0200
Subject: [ofa-general] Re: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <3522BA7F49834878A674F2908834D747@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
	<3522BA7F49834878A674F2908834D747@amr.corp.intel.com>
Message-ID: <498B3D7E.6010300@Voltaire.COM>

Sean Hefty wrote:
>> @@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i
>> 			event.status = ib_event->param.sidr_rep_rcvd.status;
>> 			break;
>> 		}
>> +		ret = cma_set_qkey(id_priv);
>> +		if (ret) {
>> +			event.event = RDMA_CM_EVENT_ADDR_ERROR;
>> +			event.status = -EINVAL;
>> +			break;
>> +		}
>> 		if (id_priv->qkey != rep->qkey) {
>> 			event.event = RDMA_CM_EVENT_UNREACHABLE;
>> 			event.status = -EINVAL;
>> @@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma
>> 			     const void *private_data, int private_data_len)
>> {
>> 	struct ib_cm_sidr_rep_param rep;
>> +	int ret;
>>
>> 	memset(&rep, 0, sizeof rep);
>> 	rep.status = status;
>> 	if (status == IB_SIDR_SUCCESS) {
>> +		ret = cma_set_qkey(id_priv);
>> +		if (ret)
>> +			return ret;
>> 		rep.qp_num = id_priv->qp_num;
>> 		rep.qkey = id_priv->qkey;
>> 	}
> 
> Looking at this, I keep wanting to set the qkey when sending or receiving the
> sidr req, not rep.  This is earlier than the qkey is needed, but catching the
> error sooner in this case seems better to me than deferring.  Thoughts?
> 
> - Sean
> 

It might be better to catch errors earlier, but there is a risk that the
flow might change somehow, and we would lose the (now obvious) logical
connection between retrieving the qkey and actually using it.

-- 
--Yossi


From sean.hefty at intel.com  Thu Feb  5 11:31:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 11:31:09 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <498B3D7E.6010300@Voltaire.COM>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
	<3522BA7F49834878A674F2908834D747@amr.corp.intel.com>
	<498B3D7E.6010300@Voltaire.COM>
Message-ID: <F6F1C8DBB03A4CCB882ED455673DD576@amr.corp.intel.com>

>It might be better to catch errors earlier, but there is the risk that the
>flow might change somehow, and losing the (now obvious) logical connection
>between retrieving the qkey and actually using it.

I can go with that.  I don't have a strong preference.  Have you tested the
patch and verified that it works for you?

- Sean


From yosefe at Voltaire.COM  Thu Feb  5 11:41:42 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Thu, 05 Feb 2009 21:41:42 +0200
Subject: [ofa-general] Re: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <F6F1C8DBB03A4CCB882ED455673DD576@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
	<3522BA7F49834878A674F2908834D747@amr.corp.intel.com>
	<498B3D7E.6010300@Voltaire.COM>
	<F6F1C8DBB03A4CCB882ED455673DD576@amr.corp.intel.com>
Message-ID: <498B40F6.7060904@Voltaire.COM>

Sean Hefty wrote:
>> It might be better to catch errors earlier, but there is the risk that the
>> flow might change somehow, and losing the (now obvious) logical connection
>> between retrieving the qkey and actually using it.
> 
> I can go with that.  I don't have a strong preference.  Have you tested the
> patch and verified that it works for you?
> 
> - Sean
> 

Yes I did, with mckey.

When the HCA port is down:
 Without the patch, mckey fails on rdma_resolve_route (except when ipoib is 
trying to join at the same time - then there will be a join error).
 With the patch, mckey fails on rdma_create_qp (again, except when ipoib is trying
to join at the same time).

When the HCA port is up, mckey works normally.

-- 
--Yossi


From sean.hefty at intel.com  Thu Feb  5 11:49:24 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 11:49:24 -0800
Subject: [ofa-general] RE: impossibility to bind a device/port with the
	rdma-cm when the port is down
In-Reply-To: <4989E6D6.5030109@Voltaire.COM>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
Message-ID: <DC4530D43E764B5A90F1D74B196FC595@amr.corp.intel.com>

From: Yossi Etigin <yosefe at Voltaire.COM>

  When rdma_resolve_addr() is called while the relevant port is down, the
function fails and the rdma_cm id is not bound to the device. The application
therefore has no device handle and cannot wait for the port to become active.
The function fails because ipoib has not joined the multicast group, so the SA
does not have a multicast record to take a qkey from.
  The proposed patch makes qkey resolution lazy: cma_set_qkey sets
id_priv->qkey if it is not already set, and is called just before the qkey is
actually required.

Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

Acked-by: Sean Hefty <sean.hefty at intel.com>
---
Roland, any objection to queuing this for 2.6.30?

> drivers/infiniband/core/cma.c |   41 +++++++++++++++++++++++++++--------------
> 1 file changed, 27 insertions(+), 14 deletions(-)
>
>Index: b/drivers/infiniband/core/cma.c
>===================================================================
>--- a/drivers/infiniband/core/cma.c	2009-02-04 20:40:20.000000000 +0200
>+++ b/drivers/infiniband/core/cma.c	2009-02-04 20:57:59.000000000 +0200
>@@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r
> 	id_priv->cma_dev = NULL;
> }
>
>-static int cma_set_qkey(struct ib_device *device, u8 port_num,
>-			enum rdma_port_space ps,
>-			struct rdma_dev_addr *dev_addr, u32 *qkey)
>+static int cma_set_qkey(struct rdma_id_private *id_priv)
> {
> 	struct ib_sa_mcmember_rec rec;
> 	int ret = 0;
>
>-	switch (ps) {
>+	if (id_priv->qkey)
>+		return 0;
>+
>+	switch (id_priv->id.ps) {
> 	case RDMA_PS_UDP:
>-		*qkey = RDMA_UDP_QKEY;
>+		id_priv->qkey = RDMA_UDP_QKEY;
> 		break;
> 	case RDMA_PS_IPOIB:
>-		ib_addr_get_mgid(dev_addr, &rec.mgid);
>-		ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec);
>-		*qkey = be32_to_cpu(rec.qkey);
>+		ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid);
>+		ret = ib_sa_get_mcmember_rec(id_priv->id.device,
>+		                             id_priv->id.port_num, &rec.mgid,
>+		                             &rec);
>+		if (!ret)
>+			id_priv->qkey = be32_to_cpu(rec.qkey);
> 		break;
> 	default:
> 		break;
>@@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i
> 		ret = ib_find_cached_gid(cma_dev->device, &gid,
> 					 &id_priv->id.port_num, NULL);
> 		if (!ret) {
>-			ret = cma_set_qkey(cma_dev->device,
>-					   id_priv->id.port_num,
>-					   id_priv->id.ps, dev_addr,
>-					   &id_priv->qkey);
>-			if (!ret)
>-				cma_attach_to_dev(id_priv, cma_dev);
>+			cma_attach_to_dev(id_priv, cma_dev);
> 			break;
> 		}
> 	}
>@@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd
> 	*qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT;
>
> 	if (cma_is_ud_ps(id_priv->id.ps)) {
>+		ret = cma_set_qkey(id_priv);
>+		if (ret)
>+			return ret;
>+
> 		qp_attr->qkey = id_priv->qkey;
> 		*qp_attr_mask |= IB_QP_QKEY;
> 	} else {
>@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i
> 			event.status = ib_event->param.sidr_rep_rcvd.status;
> 			break;
> 		}
>+		ret = cma_set_qkey(id_priv);
>+		if (ret) {
>+			event.event = RDMA_CM_EVENT_ADDR_ERROR;
>+			event.status = -EINVAL;
>+			break;
>+		}
> 		if (id_priv->qkey != rep->qkey) {
> 			event.event = RDMA_CM_EVENT_UNREACHABLE;
> 			event.status = -EINVAL;
>@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma
> 			     const void *private_data, int private_data_len)
> {
> 	struct ib_cm_sidr_rep_param rep;
>+	int ret;
>
> 	memset(&rep, 0, sizeof rep);
> 	rep.status = status;
> 	if (status == IB_SIDR_SUCCESS) {
>+		ret = cma_set_qkey(id_priv);
>+		if (ret)
>+			return ret;
> 		rep.qp_num = id_priv->qp_num;
> 		rep.qkey = id_priv->qkey;
> 	}
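
For readers skimming the diff, a self-contained userspace sketch of the lazy
resolution pattern the patch describes. This is not the kernel code: all
demo_* names, the qkey constants, and the port_up flag are assumptions made
up for the example. demo_lookup_qkey() stands in for the SA multicast record
query, which fails while the port is down.

```c
#include <stdint.h>

#define DEMO_UDP_QKEY 0x01234567u	/* placeholder value for the sketch */

enum demo_port_space { DEMO_PS_UDP, DEMO_PS_IPOIB };

struct demo_id {
	enum demo_port_space ps;
	uint32_t qkey;		/* 0 means "not resolved yet" */
	int port_up;		/* stand-in for "SA has a multicast record" */
	int lookups;		/* counts how often the slow path ran */
};

/* Stand-in for the SA query: fails while the port is down. */
static int demo_lookup_qkey(struct demo_id *id, uint32_t *qkey)
{
	id->lookups++;
	if (!id->port_up)
		return -1;
	*qkey = 0x0b1bu;	/* arbitrary value for the example */
	return 0;
}

/* Resolve the qkey on first use; later calls return immediately. */
static int demo_set_qkey(struct demo_id *id)
{
	uint32_t qkey;
	int ret = 0;

	if (id->qkey)
		return 0;

	switch (id->ps) {
	case DEMO_PS_UDP:
		id->qkey = DEMO_UDP_QKEY;
		break;
	case DEMO_PS_IPOIB:
		ret = demo_lookup_qkey(id, &qkey);
		if (!ret)
			id->qkey = qkey;	/* cache only on success */
		break;
	}
	return ret;
}
```

Because a failed lookup leaves qkey at 0, a later call retries the lookup,
which is what lets address resolution succeed while the port is still down.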


From brian at sun.com  Thu Feb  5 11:54:02 2009
From: brian at sun.com (Brian J. Murrell)
Date: Thu, 05 Feb 2009 14:54:02 -0500
Subject: [ofa-general] 1.3.1 and 1.4 compatibility
Message-ID: <1233863642.22864.3203.camel@pc.interlinx.bc.ca>

I'm sure I know the answer to this, or will be floored if it's other
than I think, but just to do due diligence... are OFED 1.3.1 and 1.4
compatible?  That is, nodes running one version will talk to nodes of
the other version without problem, yes?

Is it complete compatibility or are there any known caveats?

Thanx!

b.


From swise at opengridcomputing.com  Thu Feb  5 13:05:49 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Feb 2009 15:05:49 -0600
Subject: [ofa-general] Chelsio T3: Aggregate Throughput
In-Reply-To: <OF470F5E1D.BC00EC13-ONC1257554.0058CDE7-C1257554.0059FBAB@ch.ibm.com>
References: <OF470F5E1D.BC00EC13-ONC1257554.0058CDE7-C1257554.0059FBAB@ch.ibm.com>
Message-ID: <498B54AD.1010802@opengridcomputing.com>

Philip Frey1 wrote:
>
> Hello,
>
> we are currently looking into the scalability of the T3 in terms of
> connections. We are using a 1-to-n scenario where the one server
> has a chunk of data and n clients fetch this chunk over and over
> again using RDMA reads (each 1MB in size).
>
> The clients do that such that they get an average data rate of about
> 9Mbps each. Every second we connect a new client to the server
> and see how far it goes.
>
> What puzzles us now is that after about 800 clients, they no longer
> seem to receive much data.
>
> The first interesting thing is that the aggregate throughput actually
> drops (we expected it to stall). And the second interesting thing is
> that it does so already at about 6.3Gbps which is just a bit more than
> half of what the card can do. We do not experience this kind of
> situation when using far fewer clients that RDMA read the data at a
> much higher data rate.
>
> Is there any limitation on the RNIC that would give an explanation for
> this?
>

Are the RNICs experiencing lots of pause frames during the test? 

ethtool -S ethX|grep Pause

Also, are the iWARP stacks retransmitting a lot during the test? 

cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs


Steve.


From andy.grover at oracle.com  Thu Feb  5 14:08:06 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 05 Feb 2009 14:08:06 -0800
Subject: [ofa-general] IB credit-based flow control
Message-ID: <498B6346.7000208@oracle.com>

Hi,

Steve and I have been working to debug RDS's credit-based flow control, 
and I happened to notice that IB already implements this (see ib spec 
section 9.7.7.2).

So, why is it necessary for a ULP like RDS to implement its own flow 
control? It looks like IB's flow control should result in no RNR 
retries, yet without protocol-level FC, we see RNR retries.

Thanks -- Regards -- Andy


From sean.hefty at intel.com  Thu Feb  5 14:23:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 14:23:10 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <498B6346.7000208@oracle.com>
References: <498B6346.7000208@oracle.com>
Message-ID: <6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>

>So, why is it necessary for a ULP like RDS to implement its own flow
>control? It looks like IB's flow control should result in no RNR
>retries, yet without protocol-level FC, we see RNR retries.

If you're using a shared receive queue, end to end flow control is disabled.
Also, see 9.7.7.2.5 C9-162 - an HCA is allowed to send up to one packet for a
send request even if it doesn't have any credits available.

- Sean 
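
A toy model of the second point, assuming the reading given above (a packet
may go out even when the sender holds no credits). It is not taken from the
spec; struct demo_link and both functions are invented for illustration, and
credit updates are idealized as instantaneous.

```c
/* Toy model of one connection, not real verbs code. */
struct demo_link {
	int rx_buffers;	/* receive buffers currently posted */
	int tx_credits;	/* credits the sender believes it holds */
	int delivered;	/* packets that found a receive buffer */
	int rnr_naks;	/* packets answered with an RNR NAK */
};

/* Receiver posts n buffers; the sender learns of them as credits. */
static void demo_post_recv(struct demo_link *l, int n)
{
	l->rx_buffers += n;
	l->tx_credits += n;
}

/* Send one packet. Returns 0 if delivered, -1 on RNR NAK. */
static int demo_send(struct demo_link *l)
{
	if (l->tx_credits > 0)
		l->tx_credits--;
	/* With zero credits the packet is sent anyway, modelling the
	 * one-packet allowance described above. */
	if (l->rx_buffers > 0) {
		l->rx_buffers--;
		l->delivered++;
		return 0;
	}
	l->rnr_naks++;
	return -1;
}
```

In this model the third send after two posted buffers still goes out and
draws an RNR NAK, which is the behaviour a ULP-level credit scheme is meant
to avoid.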


From hal.rosenstock at gmail.com  Thu Feb  5 14:55:22 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 5 Feb 2009 17:55:22 -0500
Subject: [ofa-general] [RFC] infiniband-diags/perfquery.c: Any
	objections to changing an option name ?
Message-ID: <f0e08f230902051455o3f38ee1va4f878f0c1f953cb@mail.gmail.com>

In infiniband-diags/perfquery, -e is used for extended counters and
covers up using the common errors option so I'd like to change this to
be -x for xtended. Any objections ? Without this change when perfquery
fails you can't get the more detailed error information which is very
useful for debugging problems.

-- Hal


From halr at obsidianresearch.com  Thu Feb  5 15:31:22 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:31:22 -0700
Subject: [ofa-general] [PATCH] ibsim: Eliminate unused modified variable
Message-ID: <1233876682.8992.492.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to eliminate the unused 'modified' variable.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ibsim-Eliminate-unused-modified-variable.patch
Type: application/mbox
Size: 1398 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/1b79cf70/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:31:31 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:31:31 -0700
Subject: [ofa-general] [PATCH] ibsim: Change lid print format to unsigned
Message-ID: <1233876691.8992.494.camel@bertha1.edm.orcorp.ca>

Sasha,

Patch to change lid print format to unsigned to be consistent with elsewhere.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-ibsim-Change-lid-prints-to-unsigned.patch
Type: application/mbox
Size: 6637 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/ad1a09e7/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:41:39 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:41:39 -0700
Subject: [ofa-general] [PATCH] opensm/doc/perf-manager-arch.txt: Fix some
	commentary typos
Message-ID: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to fix some typos in this doc.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-opensm-doc-perf-manager-arch.txt-Fix-some-commentar.patch
Type: application/mbox
Size: 1761 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/fa399d3f/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:42:23 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:42:23 -0700
Subject: [ofa-general] [PATCH] opensm/PerfMgr: Add copyrights
Message-ID: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca>

Sasha,

This just adds copyrights missed in previous patches.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-PerfMgr-Add-copyright.patch
Type: application/mbox
Size: 2700 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/b69c252f/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:42:59 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:42:59 -0700
Subject: [ofa-general] [PATCH] libibmad: lid print format changed to unsigned
Message-ID: <1233877379.8992.511.camel@bertha1.edm.orcorp.ca>

Sasha,

This changes libibmad lid print format to unsigned to be consistent with
OpenSM and diag tools.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-libibmad-lid-printing-changed-to-unsigned-as-was-d.patch
Type: application/mbox
Size: 1698 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/f12b1a3e/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:43:34 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:43:34 -0700
Subject: [ofa-general] libibumad/umad.c: Change lid print format to unsigned
Message-ID: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca>

Sasha,

This patch changes umad.c lid print format to unsigned.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0007-libibumad-umad.c-Change-lid-prints-to-unsigned.patch
Type: application/mbox
Size: 1563 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/7c744501/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:47:33 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:47:33 -0700
Subject: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp,
	set rpc attribute ID from response
Message-ID: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>

Sasha,

This patch sets the attribute ID based on what is in the response.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0009-libibmad-rpc.c-In-mad_rpc-and-mad_rpc_rmpp-set-rpc.patch
Type: application/mbox
Size: 1458 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/0037566c/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 15:48:08 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 16:48:08 -0700
Subject: [ofa-general] [PATCH] libibmad/gs.c: Factor out common code
Message-ID: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca>

Sasha,

This patch factors out some common code in gs.c. common_query_setup is
used by both pma_query_via and performance_reset_via.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0010-libibmad-gs.c-Factor-out-common-code.patch
Type: application/mbox
Size: 3036 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/9b69dae0/attachment.mbox>

From halr at obsidianresearch.com  Thu Feb  5 16:00:02 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Thu, 05 Feb 2009 17:00:02 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/perfquery: Change option name
	for extended counters
Message-ID: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca>

Sasha,

Per the RFC, this patch changes the option name for extended counters so
that it does not cover up the common errors option. It changes from
-e/--extended to -x/--xtended so -e/--errors can be used to get error
information, as is common with the IB diags.

-- Hal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0012-infiniband-diags-perfquery-Change-option-name-for-e.patch
Type: application/mbox
Size: 4217 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090205/7a47e307/attachment.mbox>

From andy.grover at oracle.com  Thu Feb  5 16:26:14 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 05 Feb 2009 16:26:14 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>
References: <498B6346.7000208@oracle.com>
	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>
Message-ID: <498B83A6.9030702@oracle.com>

Sean Hefty wrote:
>> So, why is it necessary for a ULP like RDS to implement its own flow
>> control? It looks like IB's flow control should result in no RNR
>> retries, yet without protocol-level FC, we see RNR retries.
> 
> If you're using a shared receive queue, end to end flow control is disabled.
> Also, see 9.7.7.2.5 C9-162 - an HCA is allowed to send up to one packet for a
> send request even if it doesn't have any credits available.

Good point, but just looking at the non-SRQ case:

I'm reading C9-162 and still not seeing why (according to the spec 
anyways) there should ever be RNR retries on a connection. I would think 
the receiving HCA would not credit its last WQE to the sender, and thus 
retries should never happen?

The whole point of this feature is to eliminate RNR retries, no?

Thanks -- Regards -- Andy


From sean.hefty at intel.com  Thu Feb  5 16:57:27 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 16:57:27 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <498B83A6.9030702@oracle.com>
References: <498B6346.7000208@oracle.com>
	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>
	<498B83A6.9030702@oracle.com>
Message-ID: <031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>

>I'm reading C9-162 and still not seeing why (according to the spec
>anyways) there should ever be RNR retries on a connection. I would think
>the receiving HCA would not credit its last WQE to the sender, and thus
>retries should never happen?
>
>The whole point of this feature is to eliminate RNR retries, no?

What I'm looking at for C9-162 is:

C9-162: When the requester encounters a WQE on its send queue for
which it has no available credits, that WQE is said to be limited.
If the limited request WQE is a SEND request, the send queue shall
transmit no more than a single packet of the request message before
it must stop transmission and wait for an acknowledge packet.

My assumption is that if no credits are available when the SEND request arrives,
then the receiver generates a RNR message, but I didn't read through the entire
section to verify this.

This is totally a guess, but there needs to be some sort of recovery mechanism
in place to handle a lost credit update message.  Allowing the requester to
issue a limited request in the absence of credits will force a credit update if
any are available.

Did you verify that the HCAs you're using implement e2e flow control?

- Sean


From andy.grover at oracle.com  Thu Feb  5 18:28:25 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 05 Feb 2009 18:28:25 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>
References: <498B6346.7000208@oracle.com>
	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>
	<498B83A6.9030702@oracle.com>
	<031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>
Message-ID: <498BA049.6090006@oracle.com>

Sean Hefty wrote:
> My assumption is that if no credits are available when the SEND request arrives,
> then the receiver generates a RNR message, but I didn't read through the entire
> section to verify this.
> 
> This is totally a guess, but there needs to be some sort of recovery mechanism
> in place to handle a lost credit update message.  Allowing the requester to
> issue a limited request in the absence of credits will force a credit update if
> any are available.
> 
> Did you verify that the HCAs you're using implement e2e flow control?

How would I verify that? I'm using current HCAs (mlx4), so I'm assuming 
if the spec says an HCA must support something, it is supported?

We definitely still need ulp-level flow control for iwarp so it's not 
wasted work. But if IB doesn't, then it would be great to not incur the 
overhead.

Thanks -- Regards -- Andy


From swise at opengridcomputing.com  Thu Feb  5 19:36:53 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Feb 2009 21:36:53 -0600
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <498BA049.6090006@oracle.com>
References: <498B6346.7000208@oracle.com>	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>	<498B83A6.9030702@oracle.com>	<031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>
	<498BA049.6090006@oracle.com>
Message-ID: <498BB055.8040608@opengridcomputing.com>


Andy Grover wrote:
> Sean Hefty wrote:
>> My assumption is that if no credits are available when the SEND request
>> arrives, then the receiver generates a RNR message, but I didn't read
>> through the entire section to verify this.
>>
>> This is totally a guess, but there needs to be some sort of recovery
>> mechanism in place to handle a lost credit update message.  Allowing the
>> requester to issue a limited request in the absence of credits will force
>> a credit update if any are available.
>>
>> Did you verify that the HCAs you're using implement e2e flow control?
>
> How would I verify that? I'm using current HCAs (mlx4), so I'm 
> assuming if the spec says an HCA must support something, it is supported?
>
> We definitely still need ulp-level flow control for iwarp so it's not 
> wasted work. But if IB doesn't, then it would be great to not incur 
> the overhead.
>

 From what I've seen in the various IB ULPs, the only way to remove RNRs 
is to do correct ULP flow control. 

But I never knew about this IB transport level credit stuff until you 
brought it up! :)

Steve


From andy.grover at oracle.com  Thu Feb  5 19:40:29 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 05 Feb 2009 19:40:29 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <498BA049.6090006@oracle.com>
References: <498B6346.7000208@oracle.com>	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>	<498B83A6.9030702@oracle.com>	<031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>
	<498BA049.6090006@oracle.com>
Message-ID: <498BB12D.5080107@oracle.com>

Andy Grover wrote:
> How would I verify that? I'm using current HCAs (mlx4), so I'm assuming 
> if the spec says an HCA must support something, it is supported?
> 
> We definitely still need ulp-level flow control for iwarp so it's not 
> wasted work. But if IB doesn't, then it would be great to not incur the 
> overhead.

Mystery solved, RDS has ulp-level flow control specifically to support 
iwarp, so this is not needed on IB connections, due to the HW FC we've 
been discussing.

Thanks -- Regards -- Andy


From rdreier at cisco.com  Thu Feb  5 20:39:55 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Feb 2009 20:39:55 -0800
Subject: [ofa-general] IB credit-based flow control
In-Reply-To: <498BA049.6090006@oracle.com> (Andy Grover's message of "Thu, 05
	Feb 2009 18:28:25 -0800")
References: <498B6346.7000208@oracle.com>
	<6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com>
	<498B83A6.9030702@oracle.com>
	<031DEB206CEA4802860C38861660EC87@amr.corp.intel.com>
	<498BA049.6090006@oracle.com>
Message-ID: <adamyd0i3fo.fsf@cisco.com>

 > How would I verify that? I'm using current HCAs (mlx4), so I'm
 > assuming if the spec says an HCA must support something, it is
 > supported?
 > 
 > We definitely still need ulp-level flow control for iwarp so it's not
 > wasted work. But if IB doesn't, then it would be great to not incur
 > the overhead.

mlx4 HCAs do support end-to-end credits.  However, as you've discovered,
that transport level flow control is not necessarily that useful: if a
sender overruns the receives that are posted, then it triggers an RNR
NAK which leads to a large delay in the connection, which can be very
bad for throughput.  So for best performance, application level flow
control is required, even with IB end-to-end credit flow control at the
transport level.

 - R.
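
The application-level flow control described above can be sketched as a
toy model in C. This is only an illustration under stated assumptions:
the names (struct receiver, struct sender, deliver, send_with_credit,
repost_and_grant) are hypothetical, no verbs calls are made, and the
credit return path is collapsed into a single function call. The point
it tries to show is that a sender which never issues more SENDs than it
holds credits for never hits an empty receive queue in this model.

```c
#include <stdbool.h>

/* Toy model of the receiving side of a connection. */
struct receiver {
    int posted;    /* receive buffers currently posted */
    int rnr_naks;  /* SENDs that arrived with no buffer posted */
};

/* Toy model of the sending application (hypothetical, not an RDS structure). */
struct sender {
    int credits;   /* SENDs the application may still issue */
};

/* Deliver one SEND; counts an RNR NAK if no receive buffer is posted. */
bool deliver(struct receiver *r)
{
    if (r->posted == 0) {
        r->rnr_naks++;
        return false;
    }
    r->posted--;
    return true;
}

/* Send only while holding a credit; false means the sender must wait. */
bool send_with_credit(struct sender *s, struct receiver *r)
{
    if (s->credits == 0)
        return false;   /* back off instead of risking an RNR NAK */
    s->credits--;
    return deliver(r);
}

/* Receiver reposts n buffers and hands the same number of credits back. */
void repost_and_grant(struct sender *s, struct receiver *r, int n)
{
    r->posted += n;
    s->credits += n;
}
```

In a real ULP the credits would travel back in a protocol header or a
dedicated message, so the sender's view lags the receiver's; the sketch
ignores that delay.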


From PHF at zurich.ibm.com  Fri Feb  6 01:59:27 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Fri, 6 Feb 2009 10:59:27 +0100
Subject: [ofa-general] Chelsio T3: Aggregate Throughput
In-Reply-To: <498B54AD.1010802@opengridcomputing.com>
References: <OF470F5E1D.BC00EC13-ONC1257554.0058CDE7-C1257554.0059FBAB@ch.ibm.com>
	<498B54AD.1010802@opengridcomputing.com>
Message-ID: <OF22B0B8F7.BEE92BD5-ONC1257555.00364D24-C1257555.0036E216@ch.ibm.com>

> Are the RNICs experiencing lots of pause frames during the test? 
> 
> ethtool -S ethX|grep Pause

(cheiron was the server and the others were RDMA reading from it)

[root at ajax]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 248428611
     RxPauseFrames      : 0

[root at achilles]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 250937599
     RxPauseFrames      : 0

[root at bacchus]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 21153321
     RxPauseFrames      : 70

[root at borus]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 22056840
     RxPauseFrames      : 70

[root at car]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 23540619
     RxPauseFrames      : 70

[root at cheiron]$ ethtool -S eth2 | grep Pause
     TxPauseFrames      : 0
     RxPauseFrames      : 26569935


> Also, are the iWARP stacks retransmitting a lot during the test? 
> 
> cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs

[root at ajax]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

[root at achilles]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

[root at bacchus]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

[root at borus]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

[root at car]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

[root at cheiron]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
 0

From Line.Holen at Sun.COM  Fri Feb  6 02:14:13 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Fri, 06 Feb 2009 11:14:13 +0100
Subject: [ofa-general] 1.4 git repository for the management SW
Message-ID: <498C0D75.6090904@Sun.COM>

Hi,

I would like to get hold of the source for the 1.4 release of the
management SW.
I've tried to clone ofed_1_4/management.git, but that seems to be about
2 weeks newer than the release.
Where / how can I find the correct version?
I was expecting to find OpenSM version 3.2.5 in the above source
repository, but it shows up as 3.3.0.

Line


From vlad at lists.openfabrics.org  Fri Feb  6 03:11:51 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri,  6 Feb 2009 03:11:51 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090206-0200 daily build status
Message-ID: <20090206111151.B1134E610D5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Fri Feb  6 03:53:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 6 Feb 2009 13:53:11 +0200
Subject: [ofa-general] 1.4 git repository for the management SW
In-Reply-To: <498C0D75.6090904@Sun.COM>
References: <498C0D75.6090904@Sun.COM>
Message-ID: <20090206115311.GA17713@sashak.voltaire.com>

On 11:14 Fri 06 Feb     , Line.Holen at Sun.COM wrote:
>
> I would like to get hold of the source for the 1.4 release of the
> management SW.
> I've tried to clone ofed_1_4/management.git, but that seems to be about
> 2 weeks newer than the release.
> Where / how can I find the correct version?

Get opensm-3.2 branch of git://git.openfabrics.org/~sashak/management
tree (or opensm-3.2.5 tag).

Sasha


From mossy.boulders at gmail.com  Fri Feb  6 04:34:10 2009
From: mossy.boulders at gmail.com (Markus Uhlmann)
Date: Fri, 6 Feb 2009 13:34:10 +0100
Subject: [ofa-general] debian/ofed-1.4 - mpi global communication
	performance
Message-ID: <962e48ae0902060434n19759d92x673e9289d9915059@mail.gmail.com>

Hi all,

we have been struggling with the performance of a supermicro
(quad-core xeon) / qlogic (9024-FC) system running Debian, kernel
2.6.24-x86_64, and ofed-1.4 (from http://www.openfabrics.org/).
There are 8 nodes attached to the switch.

What happens is that the performance of MPI global communication is
extremely low (i.e. roughly a factor of 10 slower when 16 procs spread
over only 2 nodes communicate). This number comes from comparison with a
*similar* system (dell/cisco).

Some tests which we have performed:

* local memory bandwidth test ("stream" benchmark on 8-way node
  returns >8GB/s)

* firmware: since the hca's are on-board supermicro (board_id:
  SM_2001000001; firmware-version: 1.2.0) I don't know how/where to
  check adequacy.

* openib low-level communication tests seem okay (see output from
  ib_write_lat, ib_write_bw below)

* However, I see errors of type "RcvSwRelayErrors" when checking
  "ibcheckerrors". Is this normal?

* MPI benchmarks reveal slow all-to-all communication (see output
  below for "osu_alltoall" test
  https://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/OMB-3.1/osu_alltoall.c,
  compiled with openmpi-1.3 and intel compiler 11.0)


Some questions I have:

1) Do I have to configure the switch?
   So far I have not attempted to install the "ofed+" etc. software
   which came with the qlogic hardware. Is there any chance that it
   would be compatible with ofed-1.4? Or even installable under Debian
   (without too much tweaking)?

2) Is it okay for this system to run "opensm" on one of the nodes?
   NOTE: the version is "OpenSM 3.2.5_20081207"

Any other lead or things I should test?

Thanks in advance,

MU

==============================================================
------------------------------------------------------------------
                    RDMA_Write Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           3.10          22.88             3.15
      4        1000           3.13           6.29             3.16
      8        1000           3.14           6.24             3.18
     16        1000           3.17           6.25             3.21
     32        1000           3.25           7.60             3.38
     64        1000           3.32           6.43             3.45
    128        1000           3.48           6.40             3.57
    256        1000           3.77           6.63             3.82
    512        1000           4.71           8.44             4.76
   1024        1000           5.58           7.53             5.63
   2048        1000           7.38           8.17             7.51
   4096        1000           8.64           9.04             8.77
   8192        1000          11.41          11.81            11.57
  16384        1000          16.55          17.27            16.71
  32768        1000          26.81          28.12            27.01
  65536        1000          47.41          49.43            47.62
 131072        1000          89.86          91.98            90.81
 262144        1000         174.25         176.34           175.35
 524288        1000         343.03         344.79           343.51
1048576        1000         679.04         680.57           679.72
2097152        1000        1350.88        1352.80          1351.75
4194304        1000        2693.31        2696.13          2694.50
8388608        1000        5380.45        5383.29          5381.62
------------------------------------------------------------------
------------------------------------------------------------------
                    RDMA_Write BW Test
Number of qp's running 1
Connection type : RC
Each Qp will post up to 100 messages each time
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
      2        5000               2.51                  2.51
      4        5000               5.03                  5.03
      8        5000              10.09                 10.09
     16        5000              19.71                 19.70
     32        5000              39.23                 39.22
     64        5000              77.91                 77.84
    128        5000             146.67                146.53
    256        5000             223.14                222.82
    512        5000             640.09                639.80
   1024        5000            1106.72               1106.22
   2048        5000            1271.61               1270.87
   4096        5000            1379.58               1379.44
   8192        5000            1446.01               1445.95
  16384        5000            1477.11               1477.09
  32768        5000            1498.18               1498.17
  65536        5000            1507.23               1507.22
 131072        5000            1511.83               1511.82
 262144        5000            1487.64               1487.62
 524288        5000            1485.76               1485.75
1048576        5000            1487.13               1486.54
2097152        5000            1487.95               1487.95
4194304        5000            1488.11               1488.10
8388608        5000            1488.22               1488.22
------------------------------------------------------------------
***************OUR-SYSTEM /supermicro-qlogic:********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size            Latency (us)
1                         7.87
2                         7.80
4                         7.77
8                         7.78
16                        7.81
32                        9.00
64                        9.00
128                      10.15
256                      11.75
512                      15.55
1024                     23.54
2048                     40.57
4096                    107.12
8192                    187.28
16384                   343.61
32768                   602.17
65536                  1135.20
131072                 3086.28
262144                 9086.50
524288                18713.30
1048576               37378.61
------------------------------------------------------------------
**************REFERENCE_SYSTEM / dell-cisco:***********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size            Latency (us)
1                        16.14
2                        15.93
4                        16.25
8                        16.60
16                       25.83
32                       28.66
64                       33.57
128                      40.94
256                      56.20
512                      91.24
1024                    156.13
2048                    373.17
4096                    696.95
8192                   1464.89
16384                  1367.96
32768                  2499.21
65536                  5686.46
131072                11065.98
262144                23922.69
524288                49294.71
1048576              101290.67
==============================================================

From mossy.boulders at gmail.com  Fri Feb  6 04:38:06 2009
From: mossy.boulders at gmail.com (Markus Uhlmann)
Date: Fri, 6 Feb 2009 13:38:06 +0100
Subject: [ofa-general] debian/ofed-1.4 - mpi global communication
	performance
Message-ID: <962e48ae0902060438y26e58a7s649a7fa8cb8ac2d2@mail.gmail.com>

Sorry, the numbers for one of the tests were inserted wrongly. It should be:

***************OUR-SYSTEM /supermicro-qlogic:********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size            Latency (us)
1                       137.32
2                       136.23
4                       135.97
8                       135.63
16                      138.00
32                      139.19
64                      139.26
128                     140.06
256                    1770.24
512                    1772.94
1024                   1776.16
2048                   1811.75
4096                    584.51
8192                    746.64
16384                  3927.21
32768                  4576.17
65536                  6052.26
131072                 9898.08
262144                19566.90
524288                37515.47
1048576               74443.69

From or.gerlitz at gmail.com  Fri Feb  6 08:43:51 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 6 Feb 2009 18:43:51 +0200
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for 
	bind
In-Reply-To: <FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
Message-ID: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>

> ucmatose allows binding to a specific address using -b.  I haven't used
> rds-ping to know if it's the same as -I in that case.  I don't have any
> systems myself with dual HCAs; I don't think they have enough slots to
> support more than one.


Hi Sean,

ucmatose doesn't do anything with the address provided with the -b param on
its --active-- side, where this problem takes place. Yes, -I to rds-ping is
the same as -I to ping (other than the fact that the former doesn't seem to
work well). As I wrote you in detail, there's no need for two HCAs to get
the problem reproduced, just have one node with two active ports, each
assigned a different IP address.

Or.

From richard.frank at oracle.com  Fri Feb  6 09:05:48 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Fri, 06 Feb 2009 12:05:48 -0500
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used
	for  bind
In-Reply-To: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
	<15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
Message-ID: <498C6DEC.70805@oracle.com>

I played around with this a bit more yesterday - and it looks like 
rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning 
the first matching entry in the routing table... even though we are 
providing the source ip for the bind...

Keeping in mind that both IB ports have IPs on the same subnet...

[root at vosib8 rds-tools-1.1-2]# ip a s ib0
33: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:3b:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 11.0.0.8/24 brd 11.0.0.255 scope global ib0
    inet6 fe80::202:c902:20:3b61/64 scope link
       valid_lft forever preferred_lft forever

[root at vosib8 rds-tools-1.1-2]# ip a s ib1
34: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:04:05:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:3b:62 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 11.0.0.18/24 brd 11.0.0.255 scope global ib1
    inet6 fe80::202:c902:20:3b62/64 scope link
       valid_lft forever preferred_lft forever

[root at vosib8 rds-tools-1.1-2]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
11.0.0.0        *               255.255.255.0   U     0      0        0 ib0
11.0.0.0        *               255.255.255.0   U     0      0        0 ib1
10.10.0.0       *               255.255.255.0   U     0      0        0 eth3
42.2.0.0        *               255.255.255.0   U     0      0        0 eth2
139.185.139.0   *               255.255.255.0   U     0      0        0 eth1
10.12.0.0       *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 ib1
default         whq2op-swi-1-rt 0.0.0.0         UG    0      0        0 eth1
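
To make the symptom concrete, here is a toy C model of the two ib
routes shown above. It is not kernel code and does not reproduce
ip_dev_find(); struct route, lookup_first_match and lookup_with_source
are hypothetical names used only for illustration. It contrasts a lookup
that returns the first entry matching the destination with one that
first looks for an entry whose interface owns the requested source
address.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a dotted quad into a host-order 32-bit value. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Simplified routing entry (hypothetical layout). */
struct route {
    uint32_t dest;       /* network address */
    uint32_t mask;       /* netmask */
    uint32_t if_addr;    /* address assigned to the outgoing interface */
    const char *ifname;  /* interface name, e.g. "ib0" */
};

/* Destination-only lookup: the first matching entry wins. */
const struct route *lookup_first_match(const struct route *tbl, size_t n,
                                       uint32_t dst)
{
    for (size_t i = 0; i < n; i++)
        if ((dst & tbl[i].mask) == tbl[i].dest)
            return &tbl[i];
    return NULL;
}

/* Source-aware lookup: prefer the entry whose interface owns src,
 * and fall back to the destination-only result otherwise. */
const struct route *lookup_with_source(const struct route *tbl, size_t n,
                                       uint32_t dst, uint32_t src)
{
    for (size_t i = 0; i < n; i++)
        if ((dst & tbl[i].mask) == tbl[i].dest && tbl[i].if_addr == src)
            return &tbl[i];
    return lookup_first_match(tbl, n, dst);
}
```

With the ib0 entry listed first, the destination-only lookup picks ib0
for any 11.0.0.x peer, whatever source address was bound; the
source-aware variant picks ib1 when the source is 11.0.0.18.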


Or Gerlitz wrote:
>
>     ucmatose allows binding to a specific address using -b.  I haven't
>     used rds-ping
>     to know if it's the same as -I in that case.  I don't have any
>     systems myself
>     with dual HCAs; I don't think they have enough slots to support
>     more than one.
>
>
> Hi Sean,
>
> ucmatose doesn't do anything with the address provided with the -b 
> param on its --active-- side, where this problem takes place. Yes, -I 
> to rds-ping is the same as -I to ping (other than the fact that the 
> former doesn't seem to work well). As I wrote you in detail, there's 
> no need for two HCAs to get the problem reproduced, just have one node 
> with two active ports, each assigned a different IP address.
>
> Or.


From sean.hefty at intel.com  Fri Feb  6 10:10:11 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 6 Feb 2009 10:10:11 -0800
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>	
	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
	<15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
Message-ID: <BC7760FC5EFE4EAF8B911FD4485C0A8A@amr.corp.intel.com>

>ucmatose doesn't do anything with the address provided with the -b param on its
>--active-- side, where this problem takes place.

It passes the address into rdma_resolve_addr() as the source address, which
results in binding to that address.

- Sean


From hal.rosenstock at gmail.com  Fri Feb  6 11:12:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 6 Feb 2009 14:12:08 -0500
Subject: [ofa-general] [RFC] OpenSM vendor layer
Message-ID: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>

Hi,

I'm looking at adding pkey support into the OpenSM vendor layer. The
pkey table is a per port structure and is part of ib_port_attr_t. That
structure also includes num_pkeys. There is only one related API:
osm_vendor_get_all_port_attr which takes several pointers, the second
one is a pointer to a preallocated array of port attributes (memory
allocation for that is done by the client). ib_port_attr_t includes a
pointer to the pkey table. So the only way this can work is if that
allocation is also done by the client which makes that a valid
parameter on input (as well as output). Similarly for num_pkeys so the
vendor layer doesn't go past the end of the supplied table. So both
num_pkeys and p_pkey_table in that struct need to be in/out
parameters. num_pkeys could always be returned as the total number of
pkeys for the port when num_pkeys is set to 0 on input.

The same is true for the gid table in ib_port_attr_t.
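
A minimal sketch of the in/out convention for the pkey fields might look
like the following. It is an illustration only: struct port_attr_sketch
and fill_pkeys are hypothetical stand-ins, not the real ib_port_attr_t or
any existing vendor-layer function, and the choice to always write the
port's total back into num_pkeys (which covers the num_pkeys = 0 query
case) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the pkey part of ib_port_attr_t. */
struct port_attr_sketch {
    uint16_t  num_pkeys;     /* in: capacity of p_pkey_table; out: total on port */
    uint16_t *p_pkey_table;  /* in: caller-allocated array, may be NULL if capacity is 0 */
};

/*
 * Hypothetical vendor-layer helper: copies at most the caller's capacity,
 * never writing past the supplied table, and reports the port's total
 * number of pkeys through num_pkeys. Returns the number of entries copied.
 */
uint16_t fill_pkeys(struct port_attr_sketch *attr,
                    const uint16_t *port_pkeys, uint16_t port_total)
{
    uint16_t capacity = attr->num_pkeys;
    uint16_t copied = 0;

    if (attr->p_pkey_table != NULL) {
        copied = capacity < port_total ? capacity : port_total;
        for (uint16_t i = 0; i < copied; i++)
            attr->p_pkey_table[i] = port_pkeys[i];
    }

    attr->num_pkeys = port_total;  /* out: lets the client size a retry */
    return copied;
}
```

A client could call this once with num_pkeys set to 0 to learn the
table size, allocate that many entries, and call it again.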

I'm also not sure which vendor layers are important. I don't see how
to fix them all (e.g. osm_vendor_al.c is one, there are some others)
as some of them appear to do a straight memory to memory copy of the
ib_port_attr_t structure (others are OK and fixable).

The only other alternative I see is to change this API and possibly
this structure which is way more disruptive and risky (especially with
the inability to test anything but one of the vendor layers).

Thoughts ?

-- Hal


From brian at sun.com  Fri Feb  6 11:39:58 2009
From: brian at sun.com (Brian J. Murrell)
Date: Fri, 06 Feb 2009 14:39:58 -0500
Subject: [ofa-general] build warnings on rhel4 U6
Message-ID: <1233949198.3257.19.camel@pc.interlinx.bc.ca>

I get these warnings trying to build with RHEL4U6 and ofa_kernel from OFED 1.4:

include/linux/jbd.h:1204:1: warning: "assert_spin_locked" redefined
In file included from include/linux/wait.h:25,
                 from include/linux/fs.h:12,
                 from /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/fs.h:4,
                 from /cache/build/BUILD/lustre-1.6.7.50/lustre/lvfs/fsfilt.c:42:
/cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/spinlock.h:8:1: warning: this is the location of the previous definition

The code in question is (from jbd.h):

#ifdef __KERNEL__

#ifdef CONFIG_SMP
#define assert_spin_locked(lock)	J_ASSERT(spin_is_locked(lock))
#else
#define assert_spin_locked(lock)	do {} while(0)
#endif

and (from the backport spinlock.h):

#ifndef BACKPORT_LINUX_SPINLOCK_H
#define BACKPORT_LINUX_SPINLOCK_H

#include_next <linux/spinlock.h>

#define spin_lock_nested(lock, subclass) spin_lock(lock)

#define assert_spin_locked(lock)  do { (void)(lock); } while(0)

#endif

Any thoughts on how to resolve?
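
One possible direction, sketched here outside the kernel tree: drop the
earlier macro before the later definition so the preprocessor no longer
sees a redefinition. The two #define lines are copied from the headers
quoted above; where such a guard should actually live (jbd.h, the
backport header, or the including source file) is an assumption left
open.

```c
#include <assert.h>

/* Stand-in for the backport spinlock.h definition, which is seen first. */
#define assert_spin_locked(lock)  do { (void)(lock); } while (0)

/*
 * Stand-in for the later definition from jbd.h (non-SMP branch).
 * Undefining the earlier macro first avoids the "redefined" warning.
 */
#ifdef assert_spin_locked
#undef assert_spin_locked
#endif
#define assert_spin_locked(lock)  do { } while (0)

/* Uses the macro so the sketch compiles as a complete unit. */
static int touch_lock(int *lock)
{
	assert_spin_locked(lock);
	return *lock;
}
```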

b.


From hal.rosenstock at gmail.com  Fri Feb  6 11:47:17 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 6 Feb 2009 14:47:17 -0500
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
Message-ID: <f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>

On Fri, Feb 6, 2009 at 2:12 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> Hi,
>
> I'm looking at adding pkey support into the OpenSM vendor layer. The
> pkey table is a per port structure and is part of ib_port_attr_t. That
> structure also includes num_pkeys. There is only one related API:
> osm_vendor_get_all_port_attr which takes several pointers, the second
> one is a pointer to a preallocated array of port attributes (memory
> allocation for that is done by the client). ib_port_attr_t includes a
> pointer to the pkey table. So the only way this can work is if that
> allocation is also done by the client which makes that a valid
> parameter on input (as well as output). Similarly for num_pkeys so the
> vendor layer doesn't go past the end of the supplied table. So both
> num_pkeys and p_pkey_table in that struct need to be in/out
> parameters. num_pkeys could always be returned as the total number of
> pkeys for the port when num_pkeys is set to 0 on input.
>
> The same is true for the gid table in ib_port_attr_t.
>
> I'm also not sure which vendor layers are important. I don't see how
> to fix them all (e.g. osm_vendor_al.c is one, there are some others)
> as some of them appear to do a straight memory to memory copy of the
> ib_port_attr_t structure (others are OK and fixable).
>
> The only other alternative I see is to change this API and possibly
> this structure which is way more disruptive and risky (especially with
> the inability to test anything but one of the vendor layers).

Actually, although more disruptive, it might be cleaner (and safer in
the long run) to add to the vendor API. There could be additional osm
vendor APIs for pkeys and gids and these could return some suitable
IB_ error from ib_types in vendor layers where they are unimplemented.
IB_UNSUPPORTED looks good to me. I'm likely to head down this approach
unless I hear otherwise.
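
A minimal sketch of that approach, assuming a hypothetical new per-port
pkey call. The function names and the local status enum are invented for
illustration; a real version would return the ib_types status codes:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins; the real status codes come from ib_types. */
typedef enum {
	SKETCH_IB_SUCCESS = 0,
	SKETCH_IB_UNSUPPORTED
} sketch_status_t;

/* A vendor layer that implements the hypothetical call. */
static sketch_status_t impl_get_port_pkeys(uint16_t *p_table, uint16_t *p_num)
{
	/* This pretend port holds a single pkey, the default 0xffff. */
	if (p_table && *p_num >= 1)
		p_table[0] = 0xffff;
	*p_num = 1;
	return SKETCH_IB_SUCCESS;
}

/* A vendor layer that has not been updated just says so. */
static sketch_status_t stub_get_port_pkeys(uint16_t *p_table, uint16_t *p_num)
{
	(void)p_table;
	(void)p_num;
	return SKETCH_IB_UNSUPPORTED;
}
```

Callers can then tell "no pkeys" apart from "this layer cannot say"
without touching the existing structure.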

-- Hal

> Thoughts ?
>
> -- Hal
>


From or.gerlitz at gmail.com  Fri Feb  6 12:58:11 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 6 Feb 2009 22:58:11 +0200
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for 
	bind
In-Reply-To: <BC7760FC5EFE4EAF8B911FD4485C0A8A@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
	<15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
	<BC7760FC5EFE4EAF8B911FD4485C0A8A@amr.corp.intel.com>
Message-ID: <15ddcffd0902061258p3c0c1971y17fcb2401bd03ef4@mail.gmail.com>

On Fri, Feb 6, 2009 at 8:10 PM, Sean Hefty <sean.hefty at intel.com> wrote:
> It passes the address into rdma_resolve_addr() as the source address, which
> results in binding to that address.

OK, I managed to reproduce the problem with ucmatose in the same
manner it happened with rds-ping: two running interfaces, two runs,
telling ucmatose to bind a different interface address on each run,
and in both runs the same local port was used (as ucmatose doesn't
have prints, I used perfquery to see which port the data really goes through).

Or.


From or.gerlitz at gmail.com  Fri Feb  6 13:11:24 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 6 Feb 2009 23:11:24 +0200
Subject: [ofa-general] troubleshooting IB_CM_REJ_INVALID_SERVICE_ID in 
	RDMA_CM_EVENT_REJECTED at active side of the connection
In-Reply-To: <20090205044728.GL18580@sun.com>
References: <20090205044728.GL18580@sun.com>
Message-ID: <15ddcffd0902061311kbb4c4d7j24ad93dc51791609@mail.gmail.com>

On Thu, Feb 5, 2009 at 6:47 AM, Isaac Huang <He.Huang at sun.com> wrote:
> I got some RDMA_CM_EVENT_REJECTED errors at active sides (i.e. nodes
> Poking around in CM code told me that the passive side couldn't find a listener with
> requested service_id on the incoming device of the connection request.

For this rdma-cm event, the status field would be a value from the
ib_cm_rej_reason enum, so I assume you were getting
IB_CM_REJ_INVALID_SERVICE_ID.

> Could you guys give me some tips for troubleshooting? Any
> debugging options or /proc file to look at? Is there any netstat-like
> tool (e.g. something like a "netstat -ltp" to find out who is
> listening on which device)?

yes, this is a pain in the ass; currently there's no netstat-like support
for RDMA connections

> The other possible cause could be ARP flux, but unfortunately arping
> via IPoIB always segfault on our systems. Is there any other way to
> troubleshoot possible ARP flux issues?

yes, ping can serve you in that respect: run it, then look at the
resulting neighbours with "ip neigh show" and compare them with
"ip addr show" on the system you are pinging. Your problem may be
solved by setting the arp_ignore sysctl attribute correctly;
take a look at the known issues section in the ipoib release notes
provided with the ofed-docs package.

Or.


From rdreier at cisco.com  Fri Feb  6 13:18:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:18:22 -0800
Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset
	calculation is 64b.
In-Reply-To: <20090204202612.27031.78831.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 04 Feb 2009 14:26:12 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
Message-ID: <adatz77jmch.fsf@cisco.com>

 > The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.

I assume this fixes an overflow.  What's the impact of this overflow,
and when does it trigger?  ie is this urgent enough for 2.6.29 maybe?
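
For reference, the kind of truncation being asked about can be sketched
generically as below. This is not the iwch_sgl2pbl_map() code; the helper
names and the way the offset is computed are assumptions made only to
show what a 32-bit offset loses:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: offset of addr within a region starting at base, 32-bit result. */
static uint32_t offset_u32(uint64_t base, uint64_t addr)
{
	return (uint32_t)(addr - base);  /* bits 32..63 are silently dropped */
}

/* Same computation with a 64-bit result, as the patch description asks for. */
static uint64_t offset_u64(uint64_t base, uint64_t addr)
{
	return addr - base;
}
```

With a difference of exactly 4 GB, the 32-bit version yields 0 while the
64-bit version keeps the full value.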

 - R.


From rdreier at cisco.com  Fri Feb  6 13:19:45 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:19:45 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection
	termination fixes.
In-Reply-To: <20090204202614.27031.22248.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 04 Feb 2009 14:26:14 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
	<20090204202614.27031.22248.stgit@dell3.ogc.int>
Message-ID: <adaprhvjma6.fsf@cisco.com>

 > +		BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe));

BUG_ON()s are kind of nasty -- possibly killing the whole box because of
a driver issue or an unanticipated HW quirk -- is there any way to
report this problem and try to limp on?

 - R.


From swise at opengridcomputing.com  Fri Feb  6 13:27:10 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:27:10 -0600
Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset
	calculation is 64b.
In-Reply-To: <adatz77jmch.fsf@cisco.com>
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
	<adatz77jmch.fsf@cisco.com>
Message-ID: <498CAB2E.2070105@opengridcomputing.com>

Roland Dreier wrote:
>  > The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.
>
> I assume this fixes an overflow.  What's the impact of this overflow,
> and when does it trigger?  ie is this urgent enough for 2.6.29 maybe?
>
>  - R.
>   
This was actually found by a customer using another OS derived from the 
ofed code. 2.6.30 is ok with me.

Steve.


From swise at opengridcomputing.com  Fri Feb  6 13:32:54 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:32:54 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection
	termination fixes.
In-Reply-To: <adaprhvjma6.fsf@cisco.com>
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>	<20090204202614.27031.22248.stgit@dell3.ogc.int>
	<adaprhvjma6.fsf@cisco.com>
Message-ID: <498CAC86.3090005@opengridcomputing.com>

Roland Dreier wrote:
>  > +		BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe));
>
> BUG_ON()s are kind of nasty -- possibly killing the whole box because of
> a driver issue or an unanticipated HW quirk -- is there any way to
> report this problem and try to limp on?
>   

I'm not sure I agree with trying to limp on. This BUG_ON() doesn't 
indicate a HW quirk. It indicates the driver logic is busted. Isn't that 
what BUG_ON() should be used for?

Steve.


From rdreier at cisco.com  Fri Feb  6 13:38:00 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:38:00 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection
	termination fixes.
In-Reply-To: <498CAC86.3090005@opengridcomputing.com> (Steve Wise's message of
	"Fri, 06 Feb 2009 15:32:54 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
	<20090204202614.27031.22248.stgit@dell3.ogc.int>
	<adaprhvjma6.fsf@cisco.com> <498CAC86.3090005@opengridcomputing.com>
Message-ID: <adawsc3i6vb.fsf@cisco.com>

 > I'm not sure I agree with trying to limp on. This BUG_ON() doesn't
 > indicate a HW quirk. It indicates the driver logic is busted. Isn't
 > that what BUG_ON() should be used for?

Yeah, I guess so -- the only issue is that it's very annoying for some
buggy driver to kill the whole system when only some non-critical piece
is busted.  But it's not a killer.


From rdreier at cisco.com  Fri Feb  6 13:39:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:39:49 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection
	termination fixes.
In-Reply-To: <20090204202614.27031.22248.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 04 Feb 2009 14:26:14 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
	<20090204202614.27031.22248.stgit@dell3.ogc.int>
Message-ID: <adaskmri6sa.fsf@cisco.com>

applied 1-2


From rdreier at cisco.com  Fri Feb  6 13:40:50 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:40:50 -0800
Subject: [ofa-general] [PATCH] RDMA/nes: ibv_devinfo displays 0 for
	vendor_id and vendor_part_id
In-Reply-To: <20090204234434.GA1856@ctung-MOBL> (Chien Tung's message of "Wed, 
	4 Feb 2009 17:44:34 -0600")
References: <20090204234434.GA1856@ctung-MOBL>
Message-ID: <adaocxfi6ql.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Fri Feb  6 13:42:34 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:42:34 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: tmp_addr compilation warning
In-Reply-To: <20090205152106.GA2304@ctung-MOBL> (Chien Tung's message of "Thu, 
	5 Feb 2009 09:21:06 -0600")
References: <20090205152106.GA2304@ctung-MOBL>
Message-ID: <adak583i6np.fsf@cisco.com>

thanks, applied


From swise at opengridcomputing.com  Fri Feb  6 13:44:24 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:44:24 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection
	termination fixes.
In-Reply-To: <adawsc3i6vb.fsf@cisco.com>
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>	<20090204202614.27031.22248.stgit@dell3.ogc.int>	<adaprhvjma6.fsf@cisco.com>
	<498CAC86.3090005@opengridcomputing.com>
	<adawsc3i6vb.fsf@cisco.com>
Message-ID: <498CAF38.80106@opengridcomputing.com>

Roland Dreier wrote:
>  > I'm not sure I agree with trying to limp on. This BUG_ON() doesn't
>  > indicate a HW quirk. It indicates the driver logic is busted. Isn't
>  > that what BUG_ON() should be used for?
>
> Yeah, I guess so -- the only issue is that it's very annoying for some
> buggy driver to kill the whole system when only some non-critical piece
> is busted.  
>   
I agree with you there.  But I think trying to gracefully bail out on 
these conditions is painful and complicated and prone to resulting in a 
BUG_ON() somewhere else. :)


From jgunthorpe at obsidianresearch.com  Fri Feb  6 13:52:52 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 6 Feb 2009 14:52:52 -0700
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used
	for  bind
In-Reply-To: <498C6DEC.70805@oracle.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
	<15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
	<498C6DEC.70805@oracle.com>
Message-ID: <20090206215252.GE19892@obsidianresearch.com>

On Fri, Feb 06, 2009 at 12:05:48PM -0500, Richard Frank wrote:
> I played around with this a bit more yesterday - and it looks like 
> rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning the 
> first matching entry in the routing table... even though we are providing 
> the source ip for the bind...

Right, that's the trouble, it shouldn't be calling ip_dev_find on the
bind path with any address.. ip_route_output_key needs to be used to
get the device.

Just looking at the 2.6.27 upstream it looks like ip_dev_find is used
in many places where a route lookup would probably be more appropriate..

Jason


From rdreier at cisco.com  Fri Feb  6 13:59:59 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:59:59 -0800
Subject: [ofa-general] Re: [PATCH 1 of 2 for 2.6.28] core: Fix Raw Ethertype
	QP support
In-Reply-To: <200808121720.11878.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 12 Aug 2008 17:20:11 +0300")
References: <200808121720.11878.jackm@dev.mellanox.co.il>
Message-ID: <adaeiybi5uo.fsf@cisco.com>

 > @@ -752,6 +752,11 @@ struct ib_send_wr {
 >  			int				access_flags;
 >  			u32				rkey;
 >  		} fast_reg;
 > +		struct {
 > +			struct ib_unpacked_lrh	*lrh;
 > +			u32			eth_type;
 > +			u8			static_rate;
 > +		} raw_ety;

Would it be more sensible to make eth_type __be16, since it's limited to
16 bits, and ethertype is usually specified in network endian?
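
A userspace sketch of what that would look like, with a plain typedef
standing in for the kernel's __be16 annotation; the struct and accessor
names are hypothetical:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Stand-in for __be16: a 16-bit value kept in network byte order. */
typedef uint16_t be16_sketch_t;

struct raw_ety_sketch {
	be16_sketch_t eth_type;
};

/* Convert once when the work request is filled in. */
static void set_eth_type(struct raw_ety_sketch *w, uint16_t host_val)
{
	w->eth_type = htons(host_val);
}

/* Convert back only when a host-order value is needed. */
static uint16_t get_eth_type(const struct raw_ety_sketch *w)
{
	return ntohs(w->eth_type);
}
```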

Also rather than an LRH structure, would it make more sense to give
dlid, source path bits and service level?  Otherwise it seems the
consumer needs to keep track of the port's assigned LID to make sure the
SLID field is correct (not to mention computing packet length, setting
LNH properly, etc).


It seems there are some changes needed to the ib_wc structure to be able
to return the ethertype on receive?

 - R.


From rdreier at cisco.com  Fri Feb  6 14:05:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 14:05:44 -0800
Subject: [ofa-general] [PATCH 2 of 2 for 2.6.28] mlx4: Add Raw Ethertype
	QP support
In-Reply-To: <200812151312.53603.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 15 Dec 2008 13:12:53 +0200")
References: <200812151312.53603.jackm@dev.mellanox.co.il>
Message-ID: <ada7i43i5l3.fsf@cisco.com>

 > -	    type != IB_QPT_SMI && type != IB_QPT_GSI)
 > +	    type != IB_QPT_SMI && type != IB_QPT_GSI && type != IB_QPT_RAW_ETY)

Seems we're at the point where mlx4 could use a "is_special_qpt()"
helper maybe?

 >  		err = create_qp_common(dev, pd, init_attr, udata,
 >  				       dev->dev->caps.sqp_start +
 > -				       (init_attr->qp_type == IB_QPT_SMI ? 0 : 2) +
 > +				       (init_attr->qp_type == IB_QPT_RAW_ETY ? 4 :
 > +				       (init_attr->qp_type == IB_QPT_SMI ? 0 : 2)) +
 >  				       init_attr->port_num - 1,

I think this is now way past the point where we should use a helper
function to compute this?
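
Something along these lines, as a standalone sketch. The enum is a local
stand-in for the IB_QPT_* values and both helper names are only
suggestions; the offsets (SMI 0, GSI 2, raw ethertype 4) are read off the
nested ?: expression quoted above:

```c
#include <assert.h>

/* Local stand-ins for the IB_QPT_* values used in the patch. */
enum qpt_sketch { QPT_SMI, QPT_GSI, QPT_RAW_ETY, QPT_RC };

/* True for the QP types the patch treats as special. */
static int is_special_qpt(enum qpt_sketch type)
{
	return type == QPT_SMI || type == QPT_GSI || type == QPT_RAW_ETY;
}

/* Computes the special QP number that the ?: chain computes inline. */
static int special_qpn(int sqp_start, enum qpt_sketch type, int port_num)
{
	int base;

	switch (type) {
	case QPT_SMI:
		base = 0;
		break;
	case QPT_RAW_ETY:
		base = 4;
		break;
	default:	/* GSI, matching the final else branch of the ?: chain */
		base = 2;
		break;
	}
	return sqp_start + base + port_num - 1;
}
```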

 > @@ -60,6 +60,7 @@ enum {
 >  	MLX4_DEV_CAP_FLAG_IPOIB_CSUM	= 1 <<  7,
 >  	MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR	= 1 <<  8,
 >  	MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR	= 1 <<  9,
 > +	MLX4_DEV_CAP_FLAG_RAW_ETY	= 1 << 13,
 >  	MLX4_DEV_CAP_FLAG_MEM_WINDOW	= 1 << 16,
 >  	MLX4_DEV_CAP_FLAG_APM		= 1 << 17,
 >  	MLX4_DEV_CAP_FLAG_ATOMIC	= 1 << 18,

probably nice to add this in dump_dev_cap_flags() so someone can check
dmesg output to see if raw ethertype is supported.

I don't see any changes to the poll cq side of things.  Is there
anything required to handle receiving raw ethertype datagrams?

 - R.


From or.gerlitz at gmail.com  Fri Feb  6 15:02:28 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sat, 7 Feb 2009 01:02:28 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad/rpc.c: In
	mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response
In-Reply-To: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
Message-ID: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>

On Fri, Feb 6, 2009 at 1:47 AM, Hal Rosenstock
<halr at obsidianresearch.com> wrote:
> Sasha,
> This patch sets the attribute ID based on what is in the response.

Hal,

Your patches can't really be reviewed when sent as attachments;
any reason not to send them inline in the email message?

Or.


From richard.frank at oracle.com  Fri Feb  6 15:31:40 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Fri, 06 Feb 2009 18:31:40 -0500
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used
	for  bind
In-Reply-To: <20090206215252.GE19892@obsidianresearch.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<FC05439E12A44FA3BF723E4F20A096ED@amr.corp.intel.com>
	<15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
	<498C6DEC.70805@oracle.com>
	<20090206215252.GE19892@obsidianresearch.com>
Message-ID: <498CC85C.8070903@oracle.com>

Interesting - Andy Grover pointed this out too - and I totally (as 
usual) missed the point. :(

Jason Gunthorpe wrote:
> On Fri, Feb 06, 2009 at 12:05:48PM -0500, Richard Frank wrote:
>   
>> I played around with this a bit more yesterday - and it looks like 
>> rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning the 
>> first matching entry in the routing table... even though we are providing 
>> the source ip for the bind...
>>     
>
> Right, that's the trouble, it shouldn't be calling ip_dev_find on the
> bind path with any address.. ip_route_output_key needs to be used to
> get the device.
>
> Just looking at the 2.6.27 upstream it looks like ip_dev_find is used
> in many places where a route lookup would probably be more appropriate..
>
> Jason
>   


From acceptany at gmail.com  Fri Feb  6 19:01:45 2009
From: acceptany at gmail.com (=?GB2312?B?zfXUyrHy?=)
Date: Sat, 7 Feb 2009 11:01:45 +0800
Subject: [ofa-general] ***SPAM*** problem about the installation of the OPED
Message-ID: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>

I want to install OFED on a PC without any InfiniBand devices. Can this
work? If not, what hardware do I need in order to install this software
on a general-purpose computer?

From cameron at harr.org  Fri Feb  6 19:34:29 2009
From: cameron at harr.org (Cameron Harr)
Date: Fri, 06 Feb 2009 20:34:29 -0700
Subject: [ofa-general] ***SPAM*** problem about the installation of the
	OPED
In-Reply-To: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>
References: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>
Message-ID: <498D0145.7030801@harr.org>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090206/6c21fc75/attachment.html>

From sashak at voltaire.com  Sat Feb  7 01:43:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 11:43:24 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: clean_val() remove
	trailing quotation
Message-ID: <20090207094324.GD17713@sashak.voltaire.com>


Remove trailing quotation character from parsed string values.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 2b3f463..bd52f76 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -794,7 +794,7 @@ static char *clean_val(char *val)
 	/* clean quotas */
 	if ((*val == '\"' && *p == '\"') || (*val == '\'' && *p == '\'')) {
 		val++;
-		p--;
+		*p-- = '\0';
 	}
 	return val;
 }
-- 
1.6.1.2.319.gbd9e
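
The effect of the one-line change can be seen in a standalone sketch.
This is not the OpenSM function itself: how clean_val() positions p is
not shown in the hunk, so the strlen()-based setup and the name
strip_quotes() are assumptions.

```c
#include <assert.h>
#include <string.h>

/* Strips one pair of matching surrounding quotes, in place. */
static char *strip_quotes(char *val)
{
	size_t len = strlen(val);
	char *p;

	if (len < 2)
		return val;
	p = val + len - 1;	/* last character of the value */
	if ((*val == '"' && *p == '"') || (*val == '\'' && *p == '\'')) {
		val++;		/* skip the opening quote */
		*p = '\0';	/* overwrite the closing quote */
	}
	return val;
}
```

Only moving the pointer back, as the old code did, leaves the closing
quote in the returned string; writing the terminator removes it.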


From sashak at voltaire.com  Sat Feb  7 01:43:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 11:43:56 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: break matching when
	config parameter already found
Message-ID: <20090207094356.GE17713@sashak.voltaire.com>


Stop the config parameter matching loop once the parameter has been found.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 42c5682..3324af9 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1165,6 +1165,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 			p_field = (void *)p_opts + r->opt_offset;
 			/* don't call setup function first time */
 			r->parse_fn(NULL, p_key, p_val, p_field, NULL);
+			break;
 		}
 	}
 	fclose(opts_file);
@@ -1216,6 +1217,7 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 
 			p_field = (void *)p_opts + r->opt_offset;
 			r->parse_fn(p_subn, p_key, p_val, p_field, r->setup_fn);
+			break;
 		}
 	}
 	fclose(opts_file);
-- 
1.6.1.2.319.gbd9e
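
As a generic illustration of the pattern (the table layout and names
below are hypothetical, not the OpenSM option table), breaking out of the
scan means the remaining entries are not compared once a match has been
handled:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical option-table entry: a name and a count of parse calls. */
struct rule_sketch {
	const char *name;
	int hits;
};

/* Applies the first matching rule; returns how many entries were examined. */
static int apply_first_match(struct rule_sketch *rules, int n, const char *key)
{
	int i;

	for (i = 0; i < n; i++) {
		if (strcmp(rules[i].name, key) == 0) {
			rules[i].hits++;
			break;	/* parameter found, stop scanning */
		}
	}
	return i < n ? i + 1 : n;
}
```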


From sashak at voltaire.com  Sat Feb  7 01:48:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 11:48:10 +0200
Subject: [ofa-general] [PATCH] opensm: avoid memory leaks on config
	parameters reloading
In-Reply-To: <20090203122450.GB11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com>
	<20090203122450.GB11874@sashak.voltaire.com>
Message-ID: <20090207094810.GF17713@sashak.voltaire.com>


When OpenSM string config parameters are loaded, memory is now always
allocated for them (except for NULL values), and is freed and reallocated
on reloading.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

On 14:24 Tue 03 Feb     , Sasha Khapyorsky wrote:
> 
> I'm applying this with several changes:
> 
> - disable update option and setup function for all string parameter -
>   as I commented originally opts_parse_charp() will leak memory and this
>   cannot be ignored if config file is rescanned. Exception is QoS string
>   parameters where memory leak is handled.

This probably solves an issue with potential memory leaks....

 opensm/opensm/main.c       |   33 +++++++++++++++-----------
 opensm/opensm/osm_subnet.c |   55 +++++++++++++------------------------------
 2 files changed, 36 insertions(+), 52 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index c09a54e..a8dc9e6 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -507,6 +507,11 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
 
 /**********************************************************************
  **********************************************************************/
+#define SET_STR_OPT(opt, val) do { \
+	if (opt) free(opt); \
+	opt = val ? strdup(val) : NULL ; \
+} while (0)
+
 int main(int argc, char *argv[])
 {
 	osm_opensm_t osm;
@@ -650,7 +655,7 @@ int main(int argc, char *argv[])
 			/*
 			   Specifies ignore guids file.
 			 */
-			opt.port_prof_ignore_file = optarg;
+			SET_STR_OPT(opt.port_prof_ignore_file, optarg);
 			printf(" Ignore Guids File = %s\n",
 			       opt.port_prof_ignore_file);
 			break;
@@ -706,7 +711,7 @@ int main(int argc, char *argv[])
 			    || strcmp(optarg, OSM_LOOPBACK_CONSOLE) == 0
 #endif
 			    )
-				opt.console = optarg;
+				SET_STR_OPT(opt.console, optarg);
 			else
 				printf("-console %s option not understood\n",
 				       optarg);
@@ -763,7 +768,7 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'f':
-			opt.log_file = optarg;
+			SET_STR_OPT(opt.log_file, optarg);
 			break;
 
 		case 'L':
@@ -778,7 +783,7 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'P':
-			opt.partition_config_file = optarg;
+			SET_STR_OPT(opt.partition_config_file, optarg);
 			break;
 
 		case 'N':
@@ -790,7 +795,7 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'Y':
-			opt.qos_policy_file = optarg;
+			SET_STR_OPT(opt.qos_policy_file, optarg);
 			printf(" QoS policy file \'%s\'\n", optarg);
 			break;
 
@@ -829,7 +834,7 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'R':
-			opt.routing_engine_names = optarg;
+			SET_STR_OPT(opt.routing_engine_names, optarg);
 			printf(" Activate \'%s\' routing engine(s)\n", optarg);
 			break;
 
@@ -844,17 +849,17 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'M':
-			opt.lid_matrix_dump_file = optarg;
+			SET_STR_OPT(opt.lid_matrix_dump_file, optarg);
 			printf(" Lid matrix dump file is \'%s\'\n", optarg);
 			break;
 
 		case 'U':
-			opt.lfts_file = optarg;
+			SET_STR_OPT(opt.lfts_file, optarg);
 			printf(" LFTs file is \'%s\'\n", optarg);
 			break;
 
 		case 'S':
-			opt.sa_db_file = optarg;
+			SET_STR_OPT(opt.sa_db_file, optarg);
 			printf(" SA DB file is \'%s\'\n", optarg);
 			break;
 
@@ -862,7 +867,7 @@ int main(int argc, char *argv[])
 			/*
 			   Specifies root guids file
 			 */
-			opt.root_guid_file = optarg;
+			SET_STR_OPT(opt.root_guid_file, optarg);
 			printf(" Root Guid File: %s\n", opt.root_guid_file);
 			break;
 
@@ -870,20 +875,20 @@ int main(int argc, char *argv[])
 			/*
 			   Specifies compute node guids file
 			 */
-			opt.cn_guid_file = optarg;
+			SET_STR_OPT(opt.cn_guid_file, optarg);
 			printf(" Compute Node Guid File: %s\n",
 			       opt.cn_guid_file);
 			break;
 
 		case 'm':
 			/* Specifies ids guid file */
-			opt.ids_guid_file = optarg;
+			SET_STR_OPT(opt.ids_guid_file, optarg);
 			printf(" IDs Guid File: %s\n", opt.ids_guid_file);
 			break;
 
 		case 'X':
 			/* Specifies guid routing order file */
-			opt.guid_routing_order_file = optarg;
+			SET_STR_OPT(opt.guid_routing_order_file, optarg);
 			printf(" GUID Routing Order File: %s\n", opt.guid_routing_order_file);
 			break;
 
@@ -912,7 +917,7 @@ int main(int argc, char *argv[])
 #endif				/* ENABLE_OSM_PERF_MGR */
 
 		case 3:
-			opt.prefix_routes_file = optarg;
+			SET_STR_OPT(opt.prefix_routes_file, optarg);
 			break;
 		case 4:
 			opt.consolidate_ipv6_snm_req = TRUE;
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index bd52f76..42c5682 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -488,21 +488,15 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt)
 {
 	opt->max_vls = 0;
 	opt->high_limit = -1;
-	opt->vlarb_high = NULL;
-	opt->vlarb_low = NULL;
-	opt->sl2vl = NULL;
-}
-
-static void subn_free_qos_options(IN osm_qos_options_t * opt)
-{
 	if (opt->vlarb_high)
 		free(opt->vlarb_high);
-
+	opt->vlarb_high = NULL;
 	if (opt->vlarb_low)
 		free(opt->vlarb_low);
-
+	opt->vlarb_low = NULL;
 	if (opt->sl2vl)
 		free(opt->sl2vl);
+	opt->sl2vl = NULL;
 }
 
 /**********************************************************************
@@ -518,7 +512,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->m_key_lease_period = 0;
 	p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
 	p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
-	p_opt->console = OSM_DEFAULT_CONSOLE;
+	p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
 	p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
 	p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
 	/* by default we will consider waiting for 50x transaction timeout normal */
@@ -566,13 +560,13 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->dump_files_dir = getenv("OSM_TMP_DIR");
 	if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir))
 		p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR;
-
-	p_opt->log_file = OSM_DEFAULT_LOG_FILE;
+	p_opt->dump_files_dir = strdup(p_opt->dump_files_dir);
+	p_opt->log_file = strdup(OSM_DEFAULT_LOG_FILE);
 	p_opt->log_max_size = 0;
-	p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE;
+	p_opt->partition_config_file = strdup(OSM_DEFAULT_PARTITION_CONFIG_FILE);
 	p_opt->no_partition_enforcement = FALSE;
 	p_opt->qos = FALSE;
-	p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE;
+	p_opt->qos_policy_file = strdup(OSM_DEFAULT_QOS_POLICY_FILE);
 	p_opt->accum_log_file = TRUE;
 	p_opt->port_prof_ignore_file = NULL;
 	p_opt->port_profile_switch_nodes = FALSE;
@@ -591,7 +585,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->exit_on_fatal = TRUE;
 	p_opt->enable_quirks = FALSE;
 	p_opt->no_clients_rereg = FALSE;
-	p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE;
+	p_opt->prefix_routes_file = strdup(OSM_DEFAULT_PREFIX_ROUTES_FILE);
 	p_opt->consolidate_ipv6_snm_req = FALSE;
 	subn_init_qos_options(&p_opt->qos_options);
 	subn_init_qos_options(&p_opt->qos_ca_options);
@@ -753,25 +747,16 @@ static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key,
 	char **p_val = p_v;
 	const char *current_str = *p_val ? *p_val : null_str ;
 
-	if (!p_val_str)
-		return;
-
-	if (strcmp(p_val_str, current_str)) {
+	if (p_val_str && strcmp(p_val_str, current_str)) {
+		char *new;
 		log_config_value(p_key, "%s", p_val_str);
 		/* special case the "(null)" string */
-		if (strcmp(null_str, p_val_str) == 0) {
-			if (pfn)
-				pfn(p_subn, NULL);
-			*p_val = NULL;
-		} else {
-			if (pfn)
-				pfn(p_subn, p_val_str);
-			/*
-			  Ignore the possible memory leak here;
-			  the pointer may be to a static default.
-			*/
-			*p_val = strdup(p_val_str);
-		}
+		new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL;
+		if (pfn)
+			pfn(p_subn, new);
+		if (*p_val)
+			free(*p_val);
+		*p_val = new;
 	}
 }
 
@@ -1211,12 +1196,6 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 		return -1;
 	}
 
-	subn_free_qos_options(&p_opts->qos_options);
-	subn_free_qos_options(&p_opts->qos_ca_options);
-	subn_free_qos_options(&p_opts->qos_sw0_options);
-	subn_free_qos_options(&p_opts->qos_swe_options);
-	subn_free_qos_options(&p_opts->qos_rtr_options);
-
 	subn_init_qos_options(&p_opts->qos_options);
 	subn_init_qos_options(&p_opts->qos_ca_options);
 	subn_init_qos_options(&p_opts->qos_sw0_options);
-- 
1.6.1.2.319.gbd9e
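
For readers following the opts_parse_charp() hunk above: the patch appears
to establish one ownership rule, namely that every string option is heap
memory (defaults now come from strdup()), so a setter can free the old
value unconditionally. Below is a minimal standalone sketch of that rule.
The names set_str_opt() and dup_str() are hypothetical and are not part of
OpenSM; the "(null)" sentinel mirrors null_str in the diff.

```c
#include <stdlib.h>
#include <string.h>

#define NULL_STR "(null)"	/* sentinel meaning "clear this option" */

/* Portable equivalent of POSIX strdup(), spelled out to keep this ISO C. */
static char *dup_str(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p)
		memcpy(p, s, n);
	return p;
}

/* Hypothetical helper: replace *p_val with a private copy of val_str.
 * Assumes *p_val is always either NULL or heap-allocated. */
static void set_str_opt(char **p_val, const char *val_str)
{
	const char *current = *p_val ? *p_val : NULL_STR;
	char *new_val;

	if (!val_str || !strcmp(val_str, current))
		return;		/* no value given, or value unchanged */

	new_val = strcmp(val_str, NULL_STR) ? dup_str(val_str) : NULL;
	free(*p_val);		/* safe: never points at a static default */
	*p_val = new_val;
}
```

The sketch only holds if no code path ever stores a string literal in the
option; a single static default would make the free() call invalid.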


From sashak at voltaire.com  Sat Feb  7 01:53:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 11:53:41 +0200
Subject: [ofa-general] [RFC] infiniband-diags/perfquery.c:
	Any objections to changing an option name ?
In-Reply-To: <f0e08f230902051455o3f38ee1va4f878f0c1f953cb@mail.gmail.com>
References: <f0e08f230902051455o3f38ee1va4f878f0c1f953cb@mail.gmail.com>
Message-ID: <20090207095341.GG17713@sashak.voltaire.com>

Hi Hal,

On 17:55 Thu 05 Feb     , Hal Rosenstock wrote:
> In infiniband-diags/perfquery, -e is used for extended counters and
> covers up using the common errors option so I'd like to change this to
> be -x for xtended. Any objections ?

AFAIK '-e' is not used in the infiniband-diags scripts, so the proposed
change likely will not break any known usage. I'm fine with the change.

Sasha

> Without this change when perfquery
> fails you can't get the more detailed error information which is very
> useful for debugging problems.
> 
> -- Hal


From sashak at voltaire.com  Sat Feb  7 02:25:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 12:25:10 +0200
Subject: [ofa-general] Re: [PATCH] ibsim: Eliminate unused modified variable
In-Reply-To: <1233876682.8992.492.camel@bertha1.edm.orcorp.ca>
References: <1233876682.8992.492.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207102510.GH17713@sashak.voltaire.com>

On 16:31 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Trivial patch to eliminate the unused 'modified' variable.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb  7 02:44:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 12:44:21 +0200
Subject: [ofa-general] Re: [PATCH] ibsim: Change lid print format to unsigned
In-Reply-To: <1233876691.8992.494.camel@bertha1.edm.orcorp.ca>
References: <1233876691.8992.494.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207104421.GI17713@sashak.voltaire.com>

On 16:31 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Patch to change lid print format to unsigned to be consistent elsewhere.

dev_sysfs_create() in umad2sim.c generates the simulated sysfs tree. In the
native sysfs tree the port lid and sm_lid files store lid values in hex form
(core/sysfs.c lid_show() and sm_lid_show()), so I don't see any good
reason to make the simulation behave differently (unless you are going to
change this in the kernel first). I'm removing this part from the patch.
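
As an illustration of why the text format matters to readers of these
files, here is a small sketch of a parser for a "lid" file's contents. It
assumes the file holds a value such as "0x1a" followed by a newline;
parse_lid() is a hypothetical helper, not code from ibsim or libibumad.
Using base 0 with strtoul() accepts both hex with a 0x prefix and plain
decimal, so such a reader would tolerate either format.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of a sysfs "lid" file.
 * Base 0 lets strtoul() accept "0x..." hex as well as plain decimal.
 * Returns 0 and stores the LID in *lid, or -1 on malformed input. */
static int parse_lid(const char *text, unsigned *lid)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(text, &end, 0);
	if (errno || end == text || v > 0xffff)
		return -1;	/* not a number, or too big for a 16-bit LID */
	if (*end != '\0' && *end != '\n')
		return -1;	/* trailing garbage other than a newline */
	*lid = (unsigned)v;
	return 0;
}
```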

Sasha


From sashak at voltaire.com  Sat Feb  7 02:47:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 12:47:24 +0200
Subject: [ofa-general] Re: [PATCH] opensm/doc/perf-manager-arch.txt: Fix some
	commentary typos
In-Reply-To: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca>
References: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207104724.GJ17713@sashak.voltaire.com>

On 16:41 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Trivial patch to fix some typos in this doc.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb  7 02:50:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 12:50:24 +0200
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Add copyrights
In-Reply-To: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca>
References: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207105024.GK17713@sashak.voltaire.com>

On 16:42 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> This just adds copyrights missed in previous patches.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb  7 02:55:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 12:55:45 +0200
Subject: [ofa-general] Re: libibumad/umad.c: Change lid print format to
	unsigned
In-Reply-To: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca>
References: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207105545.GL17713@sashak.voltaire.com>

On 16:43 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> This patch changes umad.c lid print format to unsigned.

Both applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb  7 03:10:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 13:10:00 +0200
Subject: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp,
	set rpc attribute ID from response
In-Reply-To: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>
References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
	<15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>
Message-ID: <20090207110953.GM17713@sashak.voltaire.com>

On 01:02 Sat 07 Feb     , Or Gerlitz wrote:
> On Fri, Feb 6, 2009 at 1:47 AM, Hal Rosenstock
> <halr at obsidianresearch.com> wrote:
> > Sasha,
> > This patch sets the attribute ID based on what is in the response.
> 
> Hal,
> 
> Your patches can't really be reviewed when being sent as attachment,

This is true :(. And likely at some point I will need to start
rejecting such patches.

Sasha


From vlad at lists.openfabrics.org  Sat Feb  7 03:14:17 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  7 Feb 2009 03:14:17 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090207-0200 daily build status
Message-ID: <20090207111418.0BB91E61085@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Sat Feb  7 03:57:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 13:57:56 +0200
Subject: [ofa-general] Re: [PATCH] libibmad/gs.c: Factor out common code
In-Reply-To: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca>
References: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207115750.GN17713@sashak.voltaire.com>

On 16:48 Thu 05 Feb     , Hal Rosenstock wrote:
> 
> This patch factors out some common code in gs.c. common_query_setup is
> used by both pma_query_via and performance_reset_via.

Should rcvbuf be initialized in the common code? I'm not sure, but if that
is valid then the mad_rpc call could look like:

	mad_rpc(srcport, &rpc, dest, NULL, rcvbuf);

to prevent empty payload copying in mad_build_pkt().

Sasha


From sashak at voltaire.com  Sat Feb  7 04:09:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 14:09:31 +0200
Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change option
	name for extended counters
In-Reply-To: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca>
References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090207120924.GO17713@sashak.voltaire.com>

On 17:00 Thu 05 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> Per the RFC, this patch changes the option name for extended counters to
> to not cover up common errors option. This changes it from -e/--extended
> to -x/--xtended so -e/--errors can be used to get error information as
> is common with the IB diags.

To avoid typos this can be done as -x/--extended and -e/--errors:

	{ "extended", 'x', ... },

getopt*() will handle this properly.
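
A minimal sketch of the idea using plain getopt_long(): the long name
stays "extended" while the short letter becomes 'x', leaving 'e' free for
errors. The struct perf_opts and parse_perf_opts() names are hypothetical,
and perfquery's own option table has a different layout than the standard
struct option used here.

```c
#include <getopt.h>
#include <string.h>

/* Hypothetical result structure for this sketch. */
struct perf_opts {
	int extended;
	int errors;
};

/* Parse argv; returns 0 on success, -1 on an unknown option. */
static int parse_perf_opts(int argc, char **argv, struct perf_opts *o)
{
	static const struct option long_opts[] = {
		{ "extended", no_argument, NULL, 'x' },	/* --extended == -x */
		{ "errors",   no_argument, NULL, 'e' },	/* --errors   == -e */
		{ NULL, 0, NULL, 0 }
	};
	int c;

	memset(o, 0, sizeof(*o));
	optind = 1;	/* restart scanning so the function can be reused */
	while ((c = getopt_long(argc, argv, "xe", long_opts, NULL)) != -1) {
		switch (c) {
		case 'x':
			o->extended = 1;
			break;
		case 'e':
			o->errors = 1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

The long name and the short letter are independent fields, so nothing
forces the long spelling to start with the short character.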

Sasha


From sashak at voltaire.com  Sat Feb  7 04:33:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 14:33:55 +0200
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
Message-ID: <20090207123355.GP17713@sashak.voltaire.com>

On 14:12 Fri 06 Feb     , Hal Rosenstock wrote:
> 
> I'm looking at adding pkey support into the OpenSM vendor layer. The
> pkey table is a per port structure and is part of ib_port_attr_t. That
> structure also include num_pkeys. There is only related API:
> osm_vendor_get_all_port_attr which takes several pointers, the second
> one is a pointer to a preallocated array of port attributes (memory
> allocation for that is done by the client). ib_port_attr_t includes a
> pointer to the pkey table. So the only way this can work is if that
> allocation is also done by the client which makes that a valid
> parameter on input (as well as output).

This could be the client's choice: if the pkey table pointer is initialized
to NULL, osm_vendor_get_all_port_attr() allocates the memory and initializes
the table and its size; otherwise it fills in only the pkey table entries
provided by the client.
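
A rough sketch of that convention, with a hypothetical struct port_attr
standing in for the pkey fields of ib_port_attr_t and a hypothetical
fill_pkeys() standing in for the vendor call (src/n_src represent whatever
the port actually reports):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the pkey part of ib_port_attr_t. */
struct port_attr {
	uint16_t num_pkeys;	/* in: capacity when the table is supplied;
				   out: number of entries filled */
	uint16_t *p_pkey_table;	/* in: NULL asks the callee to allocate */
};

/* Returns 0 on success, -1 if allocation fails. */
static int fill_pkeys(struct port_attr *attr, const uint16_t *src,
		      uint16_t n_src)
{
	uint16_t i, n;

	if (!attr->p_pkey_table) {
		/* Callee allocates; the caller frees it later. */
		if (n_src) {
			attr->p_pkey_table = malloc(n_src * sizeof(*src));
			if (!attr->p_pkey_table)
				return -1;
		}
		n = n_src;
	} else {
		/* Caller-provided table: never write past its capacity. */
		n = attr->num_pkeys < n_src ? attr->num_pkeys : n_src;
	}

	for (i = 0; i < n; i++)
		attr->p_pkey_table[i] = src[i];
	attr->num_pkeys = n;
	return 0;
}
```

With this convention the pointer is an input as well as an output, so a
caller that passes a structure without zeroing it first would hand the
callee a garbage pointer.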

Sasha


From sashak at voltaire.com  Sat Feb  7 04:38:37 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 14:38:37 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
Message-ID: <20090207123830.GQ17713@sashak.voltaire.com>

On 14:47 Fri 06 Feb     , Hal Rosenstock wrote:
> 
> Actually, although more disruptive, it might be cleaner (and safer in
> the long run) to add to the vendor API. There could be additional osm
> vendor APIs for pkeys and gids

I don't think so - the existing osm_vendor_get_all_port_attr() call,
following its name, could/should already provide *all* port attributes;
there is no need for new APIs.

Sasha


From hal.rosenstock at gmail.com  Sat Feb  7 04:39:49 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 07:39:49 -0500
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090207123355.GP17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070439u629e2884t12bf90674199aba9@mail.gmail.com>

On Sat, Feb 7, 2009 at 7:33 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 14:12 Fri 06 Feb     , Hal Rosenstock wrote:
>>
>> I'm looking at adding pkey support into the OpenSM vendor layer. The
>> pkey table is a per port structure and is part of ib_port_attr_t. That
>> structure also include num_pkeys. There is only related API:
>> osm_vendor_get_all_port_attr which takes several pointers, the second
>> one is a pointer to a preallocated array of port attributes (memory
>> allocation for that is done by the client). ib_port_attr_t includes a
>> pointer to the pkey table. So the only way this can work is if that
>> allocation is also done by the client which makes that a valid
>> parameter on input (as well as output).
>
> This could be a client choice: if pkey table pointer is initialized as
> NULL osm_vendor_get_all_port_attr() allocates memory and initialize the
> table and its size, otherwise it fills up only provided by client pkey
> table entries.

Right; that's what I was trying to describe. The downside of this
approach is that it breaks in and out of tree uses of this API as the
passed in structure is uninitialized. I can fix the in tree ones (I
know about).

-- Hal

> Sasha
>


From hal.rosenstock at gmail.com  Sat Feb  7 04:41:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 07:41:12 -0500
Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <20090207123830.GQ17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>

On Sat, Feb 7, 2009 at 7:38 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 14:47 Fri 06 Feb     , Hal Rosenstock wrote:
>>
>> Actually, although more disruptive, it might be cleaner (and safer in
>> the long run) to add to the vendor API. There could be additional osm
>> vendor APIs for pkeys and gids
>
> I don't think so - existing osm_vendor_get_all_port_attr() call
> following its name could/should provide *all* port attributes already,
> no needs for new APIs.

I can see cases where rather than getting all port attr, it would be
useful to get the bound port's attributes without all the rest.

-- Hal

> Sasha
>


From sashak at voltaire.com  Sat Feb  7 05:20:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 15:20:19 +0200
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902070439u629e2884t12bf90674199aba9@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902070439u629e2884t12bf90674199aba9@mail.gmail.com>
Message-ID: <20090207132019.GR17713@sashak.voltaire.com>

On 07:39 Sat 07 Feb     , Hal Rosenstock wrote:
> 
> Right; that's what I was trying to describe. The downside of this
> approach is that it breaks in and out of tree uses of this API as the
> passed in structure is uninitialized. I can fix the in tree ones (I
> know about).

All OpenSM vendor layer users are opensm, osmtest, saquery and ibis.

BTW, why and where do you need this?

Sasha


From sashak at voltaire.com  Sat Feb  7 05:28:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 15:28:01 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
Message-ID: <20090207132753.GS17713@sashak.voltaire.com>

On 07:41 Sat 07 Feb     , Hal Rosenstock wrote:
> 
> I can see cases where rather than getting all port attr, it would be
> useful to get the bound port's attributes without all the rest.

Then it is probably simpler just to use umad_get_port(). Why bother
with all that OpenSM vendor layer junk?

Sasha


From hal.rosenstock at gmail.com  Sat Feb  7 05:40:03 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 08:40:03 -0500
Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change 
	option name for extended counters
In-Reply-To: <20090207120924.GO17713@sashak.voltaire.com>
References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca>
	<20090207120924.GO17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070540ieda9333r9756d90be9aaafd4@mail.gmail.com>

On Sat, Feb 7, 2009 at 7:09 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 17:00 Thu 05 Feb     , Hal Rosenstock wrote:
>> Sasha,
>>
>> Per the RFC, this patch changes the option name for extended counters to
>> to not cover up common errors option. This changes it from -e/--extended
>> to -x/--xtended so -e/--errors can be used to get error information as
>> is common with the IB diags.
>
> To avoid typos this can be done as -x/--extended and -e/--errors:
>
>        { "extended", 'x', ... },

Do you want a revised patch for this ?

-- Hal

>
> getopt*() will handle this properly.
>
> Sasha


From hal.rosenstock at gmail.com  Sat Feb  7 05:42:33 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 08:42:33 -0500
Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <20090207132753.GS17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>

On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:41 Sat 07 Feb     , Hal Rosenstock wrote:
>>
>> I can see cases where rather than getting all port attr, it would be
>> useful to get the bound port's attributes without all the rest.
>
> Then it is probably simpler just to use umad_get_port(). Why to bother
> with all those OpenSM vendor junks?

Is bypassing its vendor layer acceptable for OpenSM unless we are
going to totally remove it and go straight to umad (which I'm not
proposing) ?

-- Hal

> Sasha
>


From sashak at voltaire.com  Sat Feb  7 05:56:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 15:56:18 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on
	index port order incrementation
In-Reply-To: <49896FF7.8060908@ext.bull.net>
References: <4981DC18.9030400@ext.bull.net>
	<49896B9C.8040006@dev.mellanox.co.il>
	<49896FF7.8060908@ext.bull.net>
Message-ID: <20090207135618.GT17713@sashak.voltaire.com>

Yevgeny and Nicolas,

On 11:37 Wed 04 Feb     , Nicolas Morey Chaisemartin wrote:
>
> That seems good.
> I'm going to think a bit more about the case where there are no downports.

I hope an updated version of the patch will eventually be posted to the
list.

Sasha


From sashak at voltaire.com  Sat Feb  7 06:44:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 16:44:26 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
	<f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
Message-ID: <20090207144426.GU17713@sashak.voltaire.com>

On 08:42 Sat 07 Feb     , Hal Rosenstock wrote:
> On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 07:41 Sat 07 Feb     , Hal Rosenstock wrote:
> >>
> >> I can see cases where rather than getting all port attr, it would be
> >> useful to get the bound port's attributes without all the rest.
> >
> > Then it is probably simpler just to use umad_get_port(). Why to bother
> > with all those OpenSM vendor junks?
> 
> Is bypassing it's vendor layer acceptable for OpenSM

Sure, which is why I asked where and for what purpose you need the pkey
table, and why the OpenSM vendor layer was chosen there.

> unless we are
> going to totally remove it and go straight to umad (which I'm not
> proposing) ?

BTW, WinOF now has libibumad implemented too, so switching to it could be
an option.

Sasha


From sashak at voltaire.com  Sat Feb  7 06:46:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 16:46:27 +0200
Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change
	option name for extended counters
In-Reply-To: <f0e08f230902070540ieda9333r9756d90be9aaafd4@mail.gmail.com>
References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca>
	<20090207120924.GO17713@sashak.voltaire.com>
	<f0e08f230902070540ieda9333r9756d90be9aaafd4@mail.gmail.com>
Message-ID: <20090207144627.GV17713@sashak.voltaire.com>

On 08:40 Sat 07 Feb     , Hal Rosenstock wrote:
> On Sat, Feb 7, 2009 at 7:09 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 17:00 Thu 05 Feb     , Hal Rosenstock wrote:
> >> Sasha,
> >>
> >> Per the RFC, this patch changes the option name for extended counters to
> >> to not cover up common errors option. This changes it from -e/--extended
> >> to -x/--xtended so -e/--errors can be used to get error information as
> >> is common with the IB diags.
> >
> > To avoid typos this can be done as -x/--extended and -e/--errors:
> >
> >        { "extended", 'x', ... },
> 
> Do you want a revised patch for this ?

I will fix in my tree.

Sasha


From hal.rosenstock at gmail.com  Sat Feb  7 07:24:04 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 10:24:04 -0500
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090207132019.GR17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902070439u629e2884t12bf90674199aba9@mail.gmail.com>
	<20090207132019.GR17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070724g77b937aft376ceadd391cb29d@mail.gmail.com>

On Sat, Feb 7, 2009 at 8:20 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:39 Sat 07 Feb     , Hal Rosenstock wrote:
>>
>> Right; that's what I was trying to describe. The downside of this
>> approach is that it breaks in and out of tree uses of this API as the
>> passed in structure is uninitialized. I can fix the in tree ones (I
>> know about).
>
> All OpenSM vendor layer users are opensm, osmtest, saquery and ibis.
>
> BTW, why and where do you need this?

For some PerfMgr work I'm doing.

-- Hal

>
> Sasha
>


From hal.rosenstock at gmail.com  Sat Feb  7 07:27:17 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 10:27:17 -0500
Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <20090207144426.GU17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
	<f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
	<20090207144426.GU17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070727u6c6c4b2fxdabf73d53b387026@mail.gmail.com>

On Sat, Feb 7, 2009 at 9:44 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 08:42 Sat 07 Feb     , Hal Rosenstock wrote:
>> On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> > On 07:41 Sat 07 Feb     , Hal Rosenstock wrote:
>> >>
>> >> I can see cases where rather than getting all port attr, it would be
>> >> useful to get the bound port's attributes without all the rest.
>> >
>> > Then it is probably simpler just to use umad_get_port(). Why to bother
>> > with all those OpenSM vendor junks?
>>
>> Is bypassing it's vendor layer acceptable for OpenSM
>
> Sure, so it is why I asked where and for what purpose do you need pkey
> table and why is OpenSM vendor layer chosen there?
>
>> unless we are
>> going to totally remove it and go straight to umad (which I'm not
>> proposing) ?
>
> BTW, WinOF now has libibumad implemented too,

Yes, it seems pretty far along now.

> it could be an option to switch.

Could be but what about the other vendor layers ? Would we orphan those ?

-- Hal

> Sasha
>


From sashak at voltaire.com  Sat Feb  7 09:02:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 19:02:34 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902070727u6c6c4b2fxdabf73d53b387026@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
	<f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
	<20090207144426.GU17713@sashak.voltaire.com>
	<f0e08f230902070727u6c6c4b2fxdabf73d53b387026@mail.gmail.com>
Message-ID: <20090207170234.GX17713@sashak.voltaire.com>

On 10:27 Sat 07 Feb     , Hal Rosenstock wrote:
> >> Is bypassing it's vendor layer acceptable for OpenSM
> >
> > Sure, so it is why I asked where and for what purpose do you need pkey
> > table and why is OpenSM vendor layer chosen there?
> >
> >> unless we are
> >> going to totally remove it and go straight to umad (which I'm not
> >> proposing) ?
> >
> > BTW, WinOF now has libibumad implemented too,
> 
> Yes, it seems pretty far along now.
> 
> > it could be an option to switch.
> 
> Could be but what about the other vendor layers ? Would we orphan those ?

Who really needs these? They have been broken for a long time anyway.

Sasha


From hal.rosenstock at gmail.com  Sat Feb  7 09:02:29 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 7 Feb 2009 12:02:29 -0500
Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <20090207170234.GX17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
	<f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
	<20090207144426.GU17713@sashak.voltaire.com>
	<f0e08f230902070727u6c6c4b2fxdabf73d53b387026@mail.gmail.com>
	<20090207170234.GX17713@sashak.voltaire.com>
Message-ID: <f0e08f230902070902k31a67f06qbaa29e0c531a5e6d@mail.gmail.com>

On Sat, Feb 7, 2009 at 12:02 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 10:27 Sat 07 Feb     , Hal Rosenstock wrote:
>> >> Is bypassing it's vendor layer acceptable for OpenSM
>> >
>> > Sure, so it is why I asked where and for what purpose do you need pkey
>> > table and why is OpenSM vendor layer chosen there?
>> >
>> >> unless we are
>> >> going to totally remove it and go straight to umad (which I'm not
>> >> proposing) ?
>> >
>> > BTW, WinOF now has libibumad implemented too,
>>
>> Yes, it seems pretty far along now.
>>
>> > it could be an option to switch.
>>
>> Could be but what about the other vendor layers ? Would we orphan those ?
>
> Who needs this really,

AFAIK the carrying along of these came from Mellanox. If they no
longer need these and Windows is ready to switch over officially to
umad, then I don't see an issue.

> it is broken long time anyway.

What are you referring to as broken here ?

-- Hal

> Sasha
>


From sashak at voltaire.com  Sat Feb  7 10:37:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 20:37:01 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 1/3] Added io_guid_file options and
	variables in the different structures and functions.
In-Reply-To: <494A5396.5040106@ext.bull.net>
References: <494A5339.9030304@ext.bull.net> <494A5396.5040106@ext.bull.net>
Message-ID: <20090207183701.GA27757@sashak.voltaire.com>

On 14:43 Thu 18 Dec     , Nicolas Morey Chaisemartin wrote:
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  opensm/include/opensm/osm_subnet.h |    5 ++
>  opensm/opensm/main.c               |   13 ++++++
>  opensm/opensm/osm_subnet.c         |    9 ++++
>  opensm/opensm/osm_ucast_ftree.c    |   81 
> ++++++++++++++++++++++++++++++++----
>  4 files changed, 100 insertions(+), 8 deletions(-)

> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index fe456d5..3f3d919 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -190,6 +190,7 @@ typedef struct osm_subn_opt {
>  	char *lfts_file;
>  	char *root_guid_file;
>  	char *cn_guid_file;
> +	char *io_guid_file;
>  	char *ids_guid_file;
>  	char *guid_routing_order_file;
>  	char *sa_db_file;
> @@ -382,6 +383,10 @@ typedef struct osm_subn_opt {
>  *		Name of the file that contains list of compute node guids that
>  *		will be used by fat-tree routing (provided by User)
>  *
> +*	io_guid_file
> +*		Name of the file that contains list of I/O node guids that
> +*		will be used by fat-tree routing (provided by User)
> +*
>  *	ids_guid_file
>  *		Name of the file that contains list of ids which should be
>  *		used by Up/Down algorithm instead of node GUIDs
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 999e92f..3c1bcf2 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -207,6 +207,9 @@ static void show_usage(void)
>  	printf("--cn_guid_file, -u <path to file>\n"
>  	       "          Set the compute nodes for the Fat-Tree routing algorithm\n"
>  	       "          to the guids provided in the given file (one to a line)\n\n");
> +	printf("--io_guid_file, -G <path to file>\n"
> +	       "          Set the I/O nodes for the Fat-Tree routing algorithm\n"
> +	       "          to the guids provided in the given file (one to a line)\n\n");
>  	printf("--ids_guid_file, -m <path to file>\n"
>  	       "          Name of the map file with set of the IDs which will be used\n"
>  	       "          by Up/Down routing algorithm instead of node GUIDs\n"
> @@ -570,6 +573,7 @@ int main(int argc, char *argv[])
>  		{"sadb_file", 1, NULL, 'S'},
>  		{"root_guid_file", 1, NULL, 'a'},
>  		{"cn_guid_file", 1, NULL, 'u'},
> +		{"io_guid_file", 1, NULL, 'G'},

"G:" should be added to short_options too.

>  		{"ids_guid_file", 1, NULL, 'm'},
>  		{"guid_routing_order_file", 1, NULL, 'X'},
>  		{"stay_on_fatal", 0, NULL, 'y'},
> @@ -880,6 +884,15 @@ int main(int argc, char *argv[])
>  			       opt.cn_guid_file);
>  			break;
>  
> +		case 'G':
> +			/*
> +			   Specifies I/O node guids file
> +			 */
> +			opt.io_guid_file = optarg;
> +			printf(" I/O Node Guid File: %s\n",
> +			       opt.io_guid_file);
> +			break;
> +
>  		case 'm':
>  			/* Specifies ids guid file */
>  			opt.ids_guid_file = optarg;
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 9136021..5bfb6ae 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -410,6 +410,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
>  	p_opt->lfts_file = NULL;
>  	p_opt->root_guid_file = NULL;
>  	p_opt->cn_guid_file = NULL;
> +	p_opt->io_guid_file = NULL;
>  	p_opt->ids_guid_file = NULL;
>  	p_opt->guid_routing_order_file = NULL;
>  	p_opt->sa_db_file = NULL;
> @@ -1163,6 +1164,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
>  		opts_unpack_charp("cn_guid_file",
>  				  p_key, p_val, &p_opts->cn_guid_file);
>  
> +		opts_unpack_charp("io_guid_file",
> +				  p_key, p_val, &p_opts->io_guid_file);
> +
>  		opts_unpack_charp("ids_guid_file",
>  				  p_key, p_val, &p_opts->ids_guid_file);
>  
> @@ -1465,6 +1469,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>  		p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str);
>  
>  	fprintf(out,
> +		"# The file holding the fat-tree I/O node guids\n"
> +		"# One guid in each line\nio_guid_file %s\n\n",
> +		p_opts->io_guid_file ? p_opts->io_guid_file : null_str);
> +
> +	fprintf(out,
>  		"# The file holding the node ids which will be used by"
>  		" Up/Down algorithm instead\n# of GUIDs (one guid and"
>  		" id in each line)\nids_guid_file %s\n\n",
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index b7da20b..c24c517 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -155,6 +155,7 @@ typedef struct ftree_port_group_t_ {
>  	ftree_hca_or_sw remote_hca_or_sw;	/* pointer to remote hca/switch */
>  	cl_ptr_vector_t ports;	/* vector of ports to the same lid */
>  	boolean_t is_cn;	/* whether this port is a compute node */
> +	boolean_t is_io;	/* whether this port is an I/O node */
>  	uint32_t counter_down;	/* number of allocated routs downwards */
>  } ftree_port_group_t;
>  
> @@ -205,6 +206,7 @@ typedef struct ftree_fabric_t_ {
>  	cl_qmap_t sw_by_tuple_tbl;
>  	cl_qlist_t root_guid_list;
>  	cl_qmap_t cn_guid_tbl;
> +	cl_qmap_t io_guid_tbl;
>  	unsigned cn_num;
>  	uint8_t leaf_switch_rank;
>  	uint8_t max_switch_rank;
> @@ -392,7 +394,8 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid,
>  			      IN ib_net64_t remote_node_guid,
>  			      IN uint8_t remote_node_type,
>  			      IN void *p_remote_hca_or_sw,
> -			      IN boolean_t is_cn)
> +			      IN boolean_t is_cn,
> +			      IN boolean_t is_io)
>  {
>  	ftree_port_group_t *p_group =
>  	    (ftree_port_group_t *) malloc(sizeof(ftree_port_group_t));
> @@ -440,6 +443,7 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid,
>  	cl_ptr_vector_init(&p_group->ports, 0,	/* min size */
>  			   8);	/* grow size */
>  	p_group->is_cn = is_cn;
> +	p_group->is_io = is_io;
>  	return p_group;
>  }				/* __osm_ftree_port_group_create() */
>  
> @@ -705,7 +709,7 @@ __osm_ftree_sw_add_port(IN ftree_sw_t * p_sw,
>  							remote_node_guid,
>  							remote_node_type,
>  							p_remote_hca_or_sw,
> -							FALSE);
> +							     FALSE,FALSE);

Please don't break indentation.

Also, here and in other places a space is needed after ',' (you can look
at opensm/doc/opensm-coding-style.txt and use opensm/opensm/osm_indent to
get an idea of the desired formatting style).

>  		CL_ASSERT(p_group);
>  
>  		if (direction == FTREE_DIRECTION_UP)
> @@ -836,7 +840,8 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
>  			 IN ib_net64_t remote_port_guid,
>  			 IN ib_net64_t remote_node_guid,
>  			 IN uint8_t remote_node_type,
> -			 IN void *p_remote_hca_or_sw, IN boolean_t is_cn)
> +			 IN void *p_remote_hca_or_sw, IN boolean_t is_cn,
> +			 IN boolean_t is_io)
>  {
>  	ftree_port_group_t *p_group;
>  
> @@ -859,7 +864,7 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
>  							remote_node_guid,
>  							remote_node_type,
>  							p_remote_hca_or_sw,
> -							is_cn);
> +							is_cn,is_io);
>  		p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group;
>  	}
>  	__osm_ftree_port_group_add_port(p_group, port_num, remote_port_num);
> @@ -885,6 +890,7 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
>  	cl_qmap_init(&p_ftree->sw_tbl);
>  	cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
>  	cl_qmap_init(&p_ftree->cn_guid_tbl);
> +	cl_qmap_init(&p_ftree->io_guid_tbl);
>  
>  	cl_qlist_init(&p_ftree->root_guid_list);
>  
> @@ -953,6 +959,18 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
>  	}
>  	cl_qmap_remove_all(&p_ftree->cn_guid_tbl);
>  
> +	/* remove all the elements of io_guid_tbl */
> +	p_next_guid_element =
> +	    (name_map_item_t *) cl_qmap_head(&p_ftree->io_guid_tbl);
> +	while (p_next_guid_element !=
> +	       (name_map_item_t *) cl_qmap_end(&p_ftree->io_guid_tbl)) {
> +		p_guid_element = p_next_guid_element;
> +		p_next_guid_element =
> +		    (name_map_item_t *) cl_qmap_next(&p_guid_element->item);
> +		free(p_guid_element);
> +	}
> +	cl_qmap_remove_all(&p_ftree->io_guid_tbl);
> +
>  	/* remove all the elements of root_guid_list */
>  	while (!cl_is_qlist_empty(&p_ftree->root_guid_list))
>  		free(cl_qlist_remove_head(&p_ftree->root_guid_list));
> @@ -1347,6 +1365,14 @@ static inline boolean_t __osm_ftree_fabric_cns_provided(IN ftree_fabric_t *
>  
>  /***************************************************/
>  
> +static inline boolean_t __osm_ftree_fabric_ios_provided(IN ftree_fabric_t *
> +							p_ftree)
> +{
> +	return (p_ftree->p_osm->subn.opt.io_guid_file != NULL);
> +}
> +
> +/***************************************************/
> +
>  static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree)
>  {
>  	ftree_sw_t *p_sw;
> @@ -2816,6 +2842,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
>  	uint8_t i;
>  	uint8_t remote_port_num;
>  	boolean_t is_cn = FALSE;
> +	boolean_t is_io = FALSE;
>  	int res = 0;
>  
>  	for (i = 0; i < osm_node_get_num_physp(p_node); i++) {
> @@ -2893,9 +2920,27 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
>  				"Marking CN port GUID 0x%016" PRIx64 "\n",
>  				cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
>  		} else {
> -			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> -				"Marking non-CN port GUID 0x%016" PRIx64 "\n",
> -				cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
> +		       if (__osm_ftree_fabric_ios_provided(p_ftree)) {

		} else if (...) {
			....

> +			       name_map_item_t *p_elem =
> +				  (name_map_item_t *) cl_qmap_get(&p_ftree->
> +								      io_guid_tbl,
> +								      cl_ntoh64(osm_physp_get_port_guid
> +										  (p_osm_port)));
> +				if (p_elem !=
> +				    (name_map_item_t *) cl_qmap_end(&p_ftree->
> +									 io_guid_tbl))
> +				    is_io = TRUE;
> +
> +
> +				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> +					 "Marking I/O port GUID 0x%016" PRIx64 "\n",
> +					 cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
> +
> +			} else {
> +			       OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> +					 "Marking non-CN port GUID 0x%016" PRIx64 "\n",
> +					 cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
> +			}
>  		}
>  
>  		__osm_ftree_hca_add_port(p_hca,	/* local ftree_hca object */
> @@ -2908,7 +2953,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
>  					 remote_node_guid,	/* remote node guid */
>  					 remote_node_type,	/* remote node type */
>  					 (void *)p_remote_sw,	/* remote ftree_hca/sw object */
> -					 is_cn);	/* whether this port is compute node */
> +					    is_cn,is_io);	/* whether this port is compute node */
>  	}
>  
>  Exit:
> @@ -3399,6 +3444,26 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
>  		}
>  	}
>  
> +
> +	if (__osm_ftree_fabric_ios_provided(p_ftree)) {
> +		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> +			"Fetching I/O nodes from file %s\n",
> +			p_ftree->p_osm->subn.opt.io_guid_file);
> +
> +		if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file,
> +				   add_guid_item_to_map,
> +				   &p_ftree->io_guid_tbl)) {
> +			status = -1;
> +			goto Exit;
> +		}
> +
> +		if (!cl_qmap_count(&p_ftree->io_guid_tbl)) {
> +			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: "
> +				"I/O node guids file has no valid guids\n");
> +			status = -1;
> +			goto Exit;
> +		}

Should an empty io_guid_file be an error (I don't know)?

Sasha

> +	}
>  Exit:
>  	OSM_LOG_EXIT(&p_ftree->p_osm->log);
>  	return status;
> 


From sashak at voltaire.com  Sat Feb  7 10:39:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 20:39:54 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902070902k31a67f06qbaa29e0c531a5e6d@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090207123830.GQ17713@sashak.voltaire.com>
	<f0e08f230902070441l7688374awe3e203d7e84fa58@mail.gmail.com>
	<20090207132753.GS17713@sashak.voltaire.com>
	<f0e08f230902070542l4699e2b3r3aa5b1bdda468aac@mail.gmail.com>
	<20090207144426.GU17713@sashak.voltaire.com>
	<f0e08f230902070727u6c6c4b2fxdabf73d53b387026@mail.gmail.com>
	<20090207170234.GX17713@sashak.voltaire.com>
	<f0e08f230902070902k31a67f06qbaa29e0c531a5e6d@mail.gmail.com>
Message-ID: <20090207183954.GB27757@sashak.voltaire.com>

On 12:02 Sat 07 Feb     , Hal Rosenstock wrote:
> 
> > it is broken long time anyway.
> 
> What are you referring to as broken here ?

All those unused vendor implementations have not been supported for many
years and likely will not work with the current version of OpenSM.

Sasha


From sashak at voltaire.com  Sat Feb  7 10:47:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 20:47:25 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 3/3] Added possible reverse hops for
	Ftree algorithm.
In-Reply-To: <494A53AE.8080706@ext.bull.net>
References: <494A5339.9030304@ext.bull.net> <494A53AE.8080706@ext.bull.net>
Message-ID: <20090207184725.GC27757@sashak.voltaire.com>

On 14:44 Thu 18 Dec     , Nicolas Morey Chaisemartin wrote:
>
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  opensm/opensm/osm_ucast_ftree.c |  102 ++++++++++++++++++++++++++++++++-------
>  1 files changed, 85 insertions(+), 17 deletions(-)

> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index c24c517..d4d3e70 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -2131,7 +2131,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  					       IN ib_net16_t target_lid,
>  					       IN uint8_t target_rank,
>  					       IN boolean_t is_real_lid,
> -					       IN boolean_t is_main_path)
> +					       IN boolean_t is_main_path,
> +					       IN uint16_t reverse_hop_credit)
>  {
>  	ftree_sw_t *p_remote_sw;
>  	uint16_t ports_num;
> @@ -2155,8 +2156,36 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  						       p_sw->rank);	/* the highest visited point in the tree before going down */
>  
>  	/* recursion stop condition - if it's a root switch, */
> -	if (p_sw->rank == 0)
> -		return;
> +	if (p_sw->rank == 0){
> +              if(reverse_hop_credit>0){

	if (p_sw->rank == 0 && reverse_hop_credit > 0) {
		...

> +                     /* We go up by going down as we have some reverse_hop_credit left*/
> +                     /* We use the index to scatter a bit the reverse up routes */
> +                     p_sw->down_port_groups_idx =
> +                            (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +                     i=p_sw->down_port_groups_idx;
> +                     for (j = 0; j < p_sw->down_port_groups_num; j++) {
> +
> +                            p_group = p_sw->down_port_groups[i];
> +                            i = (i + 1) % p_sw->down_port_groups_num;
> +
> +                            /* Skip this port group unless it points to a switch */
> +                            if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> +                                   continue;
> +                            p_remote_sw = p_group->remote_hca_or_sw.p_sw;
> +
> +                            __osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
> +                                                                           p_sw,	/* this switch - prev. position switch for the function */
> +                                                                           target_lid,	/* LID that we're routing to */
> +                                                                           target_rank,	/* rank of the LID that we're routing to */
> +                                                                           is_real_lid,	/* whether this target LID is real or dummy */
> +                                                                           is_main_path,reverse_hop_credit-1);	/* whether this is path to HCA that should by tracked by counters */
> +                            return;
> +                     }
> +
> +              }
> +              return;
> +	}
> +
>  
>  	/* Find the least loaded upgoing port group */
>  	p_min_group = NULL;
> @@ -2242,14 +2271,17 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  		p_min_group->counter_down++;
>  		p_min_port->counter_down++;
>  		if (is_real_lid) {
> -			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
> -				p_min_port->remote_port_num;
> -			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> -				"Switch %s: set path to CA LID %u through port %u\n",
> -				__osm_ftree_tuple_to_str(p_remote_sw->tuple),
> -				cl_ntoh16(target_lid),
> -				p_min_port->remote_port_num);
> -
> +			/* This LID may already be in the LFT in the reverse_hop feature is used */
> +			/* We update the LFT only if this LID isn't already present. */
> +			if(p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] == OSM_NO_PATH) {
> +				p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
> +					p_min_port->remote_port_num;
> +				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> +					"Switch %s: set path to CA LID %u through port %u\n",
> +					__osm_ftree_tuple_to_str(p_remote_sw->tuple),
> +					cl_ntoh16(target_lid),
> +					p_min_port->remote_port_num);
> +			}
>  			/* On the remote switch that is pointed by the min_group,
>  			   set hops for ALL the ports in the remote group. */
>  
> @@ -2274,7 +2306,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  							       target_lid,	/* LID that we're routing to */
>  							       target_rank,	/* rank of the LID that we're routing to */
>  							       is_real_lid,	/* whether this target LID is real or dummy */
> -							       is_main_path);	/* whether this is path to HCA that should by tracked by counters */
> +                                                               is_main_path,	/* whether this is path to HCA that should by tracked by counters */
> +							       reverse_hop_credit);
>  	}
>  
>  	/* we're done for the third case */
> @@ -2360,9 +2393,39 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  							       target_lid,	/* LID that we're routing to */
>  							       target_rank,	/* rank of the LID that we're routing to */
>  							       TRUE,	/* whether the target LID is real or dummy */
> -							       FALSE);	/* whether this is path to HCA that should by tracked by counters */
> +                                                               FALSE,reverse_hop_credit);	/* whether this is path to HCA that should by tracked by counters */
>  	}
>  
> +
> +       /* If we don't have any reverse hop credits, we are done */
> +       if(reverse_hop_credit==0)
> +              return;
> +
> +       /* We explore all the down group ports */
> +       /* We try to reverse jump for each of them */
> +       /* They already have a route to us from the upgoing_by_going_down started earlier */
> +       /* This is only so it'll continue exploring up, after this step backwards*/
> +	for (i = 0; i < p_sw->down_port_groups_num; i++) {
> +		p_group = p_sw->down_port_groups[i];
> +		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
> +
> +
> +              /* Skip this port group unless it points to a switch */
> +              if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> +                     continue;
> +
> +
> +		/* Recursion step:
> +		   Assign downgoing ports by stepping up, fter doing one step down starting on REMOTE switch. */
> +		__osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
> +							       p_sw,	/* this switch - prev. position switch for the function */
> +							       target_lid,	/* LID that we're routing to */
> +							       target_rank,	/* rank of the LID that we're routing to */
> +							       TRUE,	/* whether the target LID is real or dummy */
> +                                                               TRUE,reverse_hop_credit-1);	/* whether this is path to HCA that should by tracked by counters */
> +	}
> +
> +
>  }				/* ftree_fabric_route_downgoing_by_going_up() */
>  
>  /***************************************************/
> @@ -2448,7 +2511,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
>  								       hca_lid,	/* LID that we're routing to */
>  								       p_sw->rank + 1,	/* rank of the LID that we're routing to */
>  								       TRUE,	/* whether this HCA LID is real or dummy */
> -								       TRUE);	/* whether this path to HCA should by tracked by counters */
> +                                                                       TRUE,0);	/* whether this path to HCA should by tracked by counters */
>  
>  			/* count how many real targets have been routed from this leaf switch */
>  			routed_targets_on_leaf++;
> @@ -2473,7 +2536,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
>  									       0,	/* LID that we're routing to - ignored for dummy HCA */
>  									       0,	/* rank of the LID that we're routing to - ignored for dummy HCA */
>  									       FALSE,	/* whether this HCA LID is real or dummy */
> -									       TRUE);	/* whether this path to HCA should by tracked by counters */
> +                                                                               TRUE,0);	/* whether this path to HCA should by tracked by counters */
>  			}
>  		}
>  	}
> @@ -2558,7 +2621,8 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
>  								       hca_lid,	/* LID that we're routing to */
>  								       p_sw->rank + 1,	/* rank of the LID that we're routing to */
>  								       TRUE,	/* whether this HCA LID is real or dummy */
> -								       TRUE);	/* whether this path to HCA should by tracked by counters */
> +                                                                       TRUE, 	/* whether this path to HCA should by tracked by counters */
> +                                                                       p_hca_port_group->is_io ? p_ftree->p_osm->subn.opt.max_reverse_hops :0  ); /* Number or reverse hops allowed*/
>  		}
>  		/* done with all the port groups of this HCA - go to next HCA */
>  	}
> @@ -2610,7 +2674,7 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
>  							       p_sw->base_lid,	/* LID that we're routing to */
>  							       p_sw->rank,	/* rank of the LID that we're routing to */
>  							       TRUE,	/* whether the target LID is a real or dummy */
> -							       FALSE);	/* whether this path should by tracked by counters */
> +                                                        FALSE,0);	/* whether this path should by tracked by counters */
>  	}
>  
>  	OSM_LOG_EXIT(&p_ftree->p_osm->log);
> @@ -3432,6 +3496,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
>  		if (parse_node_map(p_ftree->p_osm->subn.opt.cn_guid_file,
>  				   add_guid_item_to_map,
>  				   &p_ftree->cn_guid_tbl)) {
> +			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: "
> +				"Problem parsin CN guid file\n");
>  			status = -1;
>  			goto Exit;
>  		}
> @@ -3453,6 +3519,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
>  		if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file,
>  				   add_guid_item_to_map,
>  				   &p_ftree->io_guid_tbl)) {
> +			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: "
> +				"Problem parsin I/O guid file\n");

"ERR AB**" codes should be unique.
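
A minimal sketch of checking this mechanically, assuming the codes have already been collected into an array of strings. first_duplicate_code() is a hypothetical helper, not something that exists in the OpenSM tree:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns the first error code that appears more
 * than once in codes[0..n-1], or NULL if every code is unique.
 */
static const char *first_duplicate_code(const char *const codes[], size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (strcmp(codes[i], codes[j]) == 0)
				return codes[i];
	return NULL;
}
```

In this series "AB23" is used by the empty-file check and by both new parse-failure messages, so the check would report it. Which free codes to use instead is left open here.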

Sasha

>  			status = -1;
>  			goto Exit;
>  		}
> 


From sashak at voltaire.com  Sat Feb  7 10:55:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 20:55:51 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between
	non-CN nodes
In-Reply-To: <494A5339.9030304@ext.bull.net>
References: <494A5339.9030304@ext.bull.net>
Message-ID: <20090207185551.GD27757@sashak.voltaire.com>

Hi Nicolas,

On 14:42 Thu 18 Dec     , Nicolas Morey Chaisemartin wrote:
>
> We are current working on a Ftree topology where IO nodes are connected on 
> spine switches.
> Using the cn_guid_file and root_guid_file works great.
> It is possible to route the whole tree as a fat tree. All the CNs are 
> connected to the other CN and IO nodes.
> However, we are missing some connectivity between IO nodes. This is the 
> expected behavior as the route between those IO nodes would have
> to go down to go back up on another spine switch.
>
> However, we need at least a bit of connectivity between those nodes. There 
> won't be any real traffic but just some "ping" for HA purposes.
>
> Therefore, I have implemented two new options to openSM: io_guid_file and 
> max_reverse_hops.
> The io_guid_file provides a list of all the IO guid (it may differs from 
> the list of non-CN nodes)

"IO" is specific to your setup. Could we find a more generic name for such
nodes?

> The max_reverse_hops gives the number of time IO nodes (described by 
> io_guid_file) are allowed to use a switch backward.

Don't those two options duplicate each other somehow? If we want to
connect IO nodes anyway, why should max_reverse_hops be important?

Or perhaps, instead of having a list of IO node guids, we would prefer to
connect everything N hops from the roots? Then a sort of --connect-roots
extension (--connect-roots=3) could work. No?

>
> According to my tests this has absolutely no effects on regular routing and 
> manages to connect the io nodes together, if max_reverse_hops is big 
> enough.
>
> This is a first draft for this feature. I'd be happy to have some feedback 
> about how to upgrade it and make it as clean as possible, wether it is 
> integrated in the mainstream or not.

Since this functionality is optional, useful and shouldn't change the
default behavior, it can be suitable for mainstream IMO.

Sasha


From devel at morey-chaisemartin.com  Sat Feb  7 11:48:13 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Sat, 07 Feb 2009 20:48:13 +0100
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between
	non-CN nodes
In-Reply-To: <20090207185551.GD27757@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
Message-ID: <498DE57D.4030501@morey-chaisemartin.com>

Sasha Khapyorsky wrote:
> Hi Nicolas,
>
> On 14:42 Thu 18 Dec     , Nicolas Morey Chaisemartin wrote:
>   
>> We are current working on a Ftree topology where IO nodes are connected on 
>> spine switches.
>> Using the cn_guid_file and root_guid_file works great.
>> It is possible to route the whole tree as a fat tree. All the CNs are 
>> connected to the other CN and IO nodes.
>> However, we are missing some connectivity between IO nodes. This is the 
>> expected behavior as the route between those IO nodes would have
>> to go down to go back up on another spine switch.
>>
>> However, we need at least a bit of connectivity between those nodes. There 
>> won't be any real traffic but just some "ping" for HA purposes.
>>
>> Therefore, I have implemented two new options to openSM: io_guid_file and 
>> max_reverse_hops.
>> The io_guid_file provides a list of all the IO guid (it may differs from 
>> the list of non-CN nodes)
>>     
>
> "IO" is specific for your setup. Could we find more generic name for such
> nodes?
>
>   
Sure. Any ideas?
>> The max_reverse_hops gives the number of time IO nodes (described by 
>> io_guid_file) are allowed to use a switch backward.
>>     
>
> Don't those two options duplicate each others somehow? If we want to
> connect io nodes anyway, why max_reverse_hops should be important?
>   
Because we may not want to connect all of them to all the nodes. By
specifying a small max_reverse_hops you can restrict (depending on your
topology) the effect of the io_guid_file, so an "IO" node will only see
the closest "IO" nodes through reverse routes, but not all of them.
As the effect on credit loops is not certain yet, I think the fewer
reverse routes we create, the better.
> Or probably instead of having io nodes guids list we prefer to connect
> everything N hops from roots? Then sort of --connect-roots extension
> (--connect-roots=3) could work. No?
>
>   
That should work too, but it is less flexible than io_guid_file for
tweaking the configuration and getting the best routing scheme.
>> According to my tests this has absolutely no effects on regular routing and 
>> manages to connect the io nodes together, if max_reverse_hops is big 
>> enough.
>>
>> This is a first draft for this feature. I'd be happy to have some feedback 
>> about how to upgrade it and make it as clean as possible, wether it is 
>> integrated in the mainstream or not.
>>     
>
> Since this functionality is optional, useful and shouldn't change a
> default behavior it can be suitable for main stream IMO.
>
> Sasha
Okay, I'll fix the indentation and the few coding style errors. There is
also a bug in the current patch: the hop counters are not set to the right
value when creating a route that has reverse hops; twice the number of
reverse hops should be added.
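
To make the two points above concrete, here is a minimal sketch in C rather than the patch itself. reverse_hop_credit() mirrors how the patch picks the credit for a target (max_reverse_hops for port groups listed in io_guid_file, 0 otherwise). route_hops() is a hypothetical helper showing the hop-count correction, under the assumption that each reverse hop costs one step down plus one extra step back up:

```c
/*
 * Credit given to the routing recursion for one target port group.
 * Only targets marked as I/O nodes may use reverse hops at all.
 */
static unsigned reverse_hop_credit(int is_io, unsigned max_reverse_hops)
{
	return is_io ? max_reverse_hops : 0;
}

/*
 * Hypothetical helper: hop count of a route that used reverse hops.
 * Assumption: hops_without_reverse is the value computed as if the
 * route never went backward, so two hops are added per reverse hop.
 */
static unsigned route_hops(unsigned hops_without_reverse,
			   unsigned reverse_hops_used)
{
	return hops_without_reverse + 2 * reverse_hops_used;
}
```

With max_reverse_hops set to 0, or for targets that are not in io_guid_file, both helpers reduce to the current behavior.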

I'll have to rewrite the patches so they work with the current HEAD.
Especially with the option system changes, they won't merge cleanly.

Nicolas


From sashak at voltaire.com  Sat Feb  7 12:23:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 22:23:19 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing
	between non-CN nodes
In-Reply-To: <498DE57D.4030501@morey-chaisemartin.com>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
Message-ID: <20090207202319.GE27757@sashak.voltaire.com>

On 20:48 Sat 07 Feb     , Nicolas Morey-Chaisemartin wrote:
> >
> > "IO" is specific for your setup. Could we find more generic name for such
> > nodes?
> >
> >   
> Sure. Any ideas?

No, I didn't think about it.

> >> The max_reverse_hops gives the number of time IO nodes (described by 
> >> io_guid_file) are allowed to use a switch backward.
> >>     
> >
> > Don't those two options duplicate each others somehow? If we want to
> > connect io nodes anyway, why max_reverse_hops should be important?
> >   
> Because we may not want to connect all of them to all the nodes. By
> specifying a small max_reverse_hop you can restrain (depending on your
> topology) the effect of the io_guid_file so an "IO" node will only see
> the closests "IO" node through reverse routes but not all of them
> As the effect on credit loop is not certain yet, I think the less
> reverse route we create, the better it is.
> > Or probably instead of having io nodes guids list we prefer to connect
> > everything N hops from roots? Then sort of --connect-roots extension
> > (--connect-roots=3) could work. No?
> >
> >   
> That should work too but it is less flexible than io_guid_file for
> tweaking the configuration and have the best routing scheme.
> >> According to my tests this has absolutely no effects on regular routing and 
> >> manages to connect the io nodes together, if max_reverse_hops is big 
> >> enough.
> >>
> >> This is a first draft for this feature. I'd be happy to have some feedback 
> >> about how to upgrade it and make it as clean as possible, wether it is 
> >> integrated in the mainstream or not.
> >>     
> >
> > Since this functionality is optional, useful and shouldn't change a
> > default behavior it can be suitable for main stream IMO.
> >
> > Sasha
> Okay, I'll fix the indentation and few coding style error. There is also
> a bug in the current patch as the hops counter are not set to the right
> value when creating route which had reverse hops number of reverse
> hops*2 should be added.
> 
> I'll have to rewrite the patches so they work with the current HEAD.
> Specially with the option system changes it won't merge cleanly.

Use 'git rebase master <yourbranch>' - it does the job with only two
trivial conflicts.

Sasha


From ogerlitz at voltaire.com  Sat Feb  7 22:53:23 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 08 Feb 2009 08:53:23 +0200
Subject: [ofa-general] Re: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <498B40F6.7060904@Voltaire.COM>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>
	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>
	<4989E6D6.5030109@Voltaire.COM>
	<3522BA7F49834878A674F2908834D747@amr.corp.intel.com>
	<498B3D7E.6010300@Voltaire.COM>
	<F6F1C8DBB03A4CCB882ED455673DD576@amr.corp.intel.com>
	<498B40F6.7060904@Voltaire.COM>
Message-ID: <498E8163.6090803@voltaire.com>

Yossi Etigin wrote:
>> Have you tested the patch and verified that it works for you?
>>
> Yes I did, with mckey.  When the HCA port is down:  Without the patch, 
> mckey fails on from rdma_resolve_route (except when ipoib is trying to 
> join at the same time - then there will be a join error).  With the 
> patch, mckey fails on rdma_create_qp (again, except when ipoib is 
> trying  to join at the same time).  When the HCA port is up, mckey 
> works normally.
mckey shouldn't be calling rdma_resolve_route, so I assume you were
referring to rdma_resolve_addr.

Or.


From vlad at lists.openfabrics.org  Sun Feb  8 03:11:34 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  8 Feb 2009 03:11:34 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090208-0200 daily build status
Message-ID: <20090208111134.F1EFCE60F20@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From tziporet at mellanox.co.il  Sun Feb  8 07:57:34 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Sun, 8 Feb 2009 17:57:34 +0200
Subject: [ofa-general] OFED (EWG) meeting agenda for tomorrow (Feb 09)
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com>

These are the agenda items for the meeting tomorrow:

1. OFED 1.4.1 release status:
*	New OSes:
	*	RH 5.3 - done, we still have an issue with Itanium
	*	SLES 11 - schedule is OK. RC3 already available - Any volunteers to prepare the backports?
*	Open MPI 1.3 - I heard there are some critical bugs. What is the status of 1.3.1? - Jeff S.
*	RDS with iWARP support - Steve
*	NFS/RDMA backports - Steve
*	Critical bug fixes
	As far as I know these are the critical bugs that should be fixed:
		1383	blo	P3	jackm at mellanox.co.il	Local protection error on transmit from ipoib datagram to...
		1471	cri	P3	amirv at mellanox.co.il	Performance degradation in ofed 1.4
	Please send more bugs that are critical for the release


2. Decide on 1.4.1 schedule:
	Proposal:
*	RC1 - Mar 3
*	RC2 - Mar 17
*	RC3 - Mar 31
*	GA  - Apr 7

3. Sonoma updates (if any) - Bill Boas

Tziporet

From dorfman.eli at gmail.com  Sun Feb  8 09:07:45 2009
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sun, 8 Feb 2009 19:07:45 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c fix parse
	functions for big endian machines
In-Reply-To: <20090205180400.GJ5910@sashak.voltaire.com>
References: <498B038D.4020009@gmail.com>
	<20090205180400.GJ5910@sashak.voltaire.com>
Message-ID: <694d48600902080907u3a6b40f7s7d0d612fd6a793ce@mail.gmail.com>

On Thu, Feb 5, 2009 at 8:04 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 17:19 Thu 05 Feb     , Eli Dorfman (Voltaire) wrote:
>> fix parse functions for big endian machines
>>
>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>
> Applied. Thanks.
>
> I'm fine with this patch - the code looks cleaner than it was before.
>
> But could you please explain what was a problem with original code on
> big endian machines (I don't see)?

The problem was that the setup function called from the uint8 parse
function assumed that its void *p_val argument points to a uint8, but
it actually pointed to a uint32.
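
To illustrate the point above, here is a minimal sketch of why a callback that reads a uint8 through a pointer to a uint32 goes wrong on big endian hosts. All names (setup_u8, parse_u8_old, parse_u8_new, host_is_big_endian) are hypothetical and this is not the OpenSM code; it only assumes that the callback dereferences its void * argument as a uint8_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical setup callback: it receives a void * and assumes that
 * it points at a uint8_t. */
static uint8_t seen_by_setup;

static void setup_u8(const void *p_val)
{
	seen_by_setup = *(const uint8_t *)p_val;
}

/* Old pattern: the parsed value lives in a uint32_t, so the callback
 * reads only the byte stored at the lowest address. On a big endian
 * host that is the most significant byte, which is 0 for val < 0x100. */
static uint8_t parse_u8_old(const char *str)
{
	uint32_t val = (uint32_t)strtoul(str, NULL, 0);

	assert(val < 0x100);
	setup_u8(&val);
	return seen_by_setup;
}

/* Patched pattern: the local variable has the type the callback expects. */
static uint8_t parse_u8_new(const char *str)
{
	uint8_t val = (uint8_t)strtoul(str, NULL, 0);

	setup_u8(&val);
	return seen_by_setup;
}

/* Returns 1 on a big endian host, 0 on a little endian one. */
static int host_is_big_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 0;
}
```

On a little endian host both variants hand the callback the same byte, which is probably why the problem went unnoticed there.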


>
> Also it would be helpful to have more detailed patch comments.
>
> Sasha
>
>> ---
>>  opensm/opensm/osm_subnet.c |   10 +++++-----
>>  1 files changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>> index d6d39a6..7b33659 100644
>> --- a/opensm/opensm/osm_subnet.c
>> +++ b/opensm/opensm/osm_subnet.c
>> @@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn,
>>                 IN void *p_v, IN setup_fn_t pfn)
>>  {
>>       uint16_t *p_val = p_v;
>> -     uint32_t val = strtoul(p_val_str, NULL, 0);
>> +     uint16_t val = strtoul(p_val_str, NULL, 0);
>>
>>       CL_ASSERT(val < 0x10000);
>> -     if (cl_hton32(val) != *p_val) {
>> +     if (cl_hton16(val) != *p_val) {
>>               log_config_value(p_key, "0x%04x", val);
>>               if (pfn)
>>                       pfn(p_subn, &val);
>> -             *p_val = cl_hton16((uint16_t) val);
>> +             *p_val = cl_hton16(val);
>>       }
>>  }
>>
>> @@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn,
>>                 IN void *p_v, IN setup_fn_t pfn)
>>  {
>>       uint8_t *p_val = p_v;
>> -     uint32_t val = strtoul(p_val_str, NULL, 0);
>> +     uint8_t val = strtoul(p_val_str, NULL, 0);
>>
>>       CL_ASSERT(val < 0x100);
>>       if (val != *p_val) {
>>               log_config_value(p_key, "%u", val);
>>               if (pfn)
>>                       pfn(p_subn, &val);
>> -             *p_val = (uint8_t) val;
>> +             *p_val = val;
>>       }
>>  }
>>
>> --
>> 1.5.5
>>
>


From dorfman.eli at gmail.com  Sun Feb  8 11:23:27 2009
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sun, 8 Feb 2009 21:23:27 +0200
Subject: Re: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c
	rescan subnet configuration after SIGHUP
In-Reply-To: <20090205121634.GQ11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com>
	<20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
Message-ID: <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>

On Thu, Feb 5, 2009 at 2:16 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 09:43 Thu 05 Feb     , Eli Dorfman (Voltaire) wrote:
>>
>> ok. Please apply the fixed patch.
>
> Did you test it?

Yes, but wouldn't it be better to separate the heavy sweep from the
config rescan (due to SIGHUP)?
I think the user should know when the configuration is updated and not
have to wait for a heavy sweep.

Eli


From sashak at voltaire.com  Sun Feb  8 13:38:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 8 Feb 2009 23:38:26 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan
	subnet configuration after SIGHUP
In-Reply-To: <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
References: <20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
	<694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
Message-ID: <20090208213826.GA24254@sashak.voltaire.com>

Hi Eli,

On 21:23 Sun 08 Feb     , Eli Dorfman wrote:
> 
> yes, but wouldn't it be better to separate between heavy sweep and
> config rescan (due to SIGHUP).

SIGHUP's main purpose has always been to trigger a heavy sweep.

> I think that user should know when configuration is updated and not
> wait for heavy sweep.

I'm not following - SIGHUP will cause both a heavy sweep and a config
update, so where is the waiting?

Sasha
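
The behavior described above (one SIGHUP leads to both a config update and a heavy sweep) can be sketched roughly as follows. This is only an illustration under assumed names (on_sighup, sweep_once, struct sweep_actions are hypothetical), not the actual OpenSM state manager code.

```c
#include <assert.h>
#include <signal.h>

/* Set from the signal handler, consumed by the main loop. */
static volatile sig_atomic_t sighup_pending;

static void on_sighup(int signo)
{
	(void)signo;
	sighup_pending = 1;
}

/* Hypothetical counters standing in for the real work. */
struct sweep_actions {
	int config_rescanned;
	int heavy_sweeps;
};

static int install_sighup_handler(void)
{
	return signal(SIGHUP, on_sighup) == SIG_ERR ? -1 : 0;
}

/* One pass of the main loop: a pending SIGHUP triggers the config
 * rescan and the heavy sweep together, so neither waits for the other. */
static void sweep_once(struct sweep_actions *act)
{
	if (!sighup_pending)
		return;
	sighup_pending = 0;
	act->config_rescanned++;
	act->heavy_sweeps++;
}
```

Separating the two, as Eli suggests, would amount to using distinct triggers for the two counters instead of a single flag.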


From kliteyn at dev.mellanox.co.il  Sun Feb  8 14:19:59 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Feb 2009 00:19:59 +0200
Subject: [ofa-general] Re: saquery & osm vendor AL - ca_names missing from
 osm_vendor_t ?
In-Reply-To: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
Message-ID: <498F5A8F.2000101@dev.mellanox.co.il>

Hi Stan,

Adding Sasha (OFED management maintainer)
and the openib mailing list.

Stan C. Smith wrote:
> Hello,
>   The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in opensm\user\include\vendor\osm_vendor_al.h) is missing
> the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
> saquery.c expects to find ca_names in osm_vendor_t.
> 
> A couple of observations:
> 1) Windows currently supports a much older version of opensm than what OFED 1.4 tools expect.

Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with a couple of minor fixes.

> 2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad interfaces.

By "OpenSM mad interfaces" do you mean libosmvendor?

>    If libibmad from saquery, then OpenSM would not need libibmad references for Windows.

Not sure what you mean here. Do you mean removing the libibmad dependency from saquery?

> 3) How bad is it to create libibmad dependencies for OpenSM?

Pretty bad. I don't think we should add a new dependency unless there's a
really good reason for it.

> 4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces; the rest use
>    libibmad.
>
> Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'. This cmd-line input is resolved thru
> libibmad interfaces.
> Saquery is no exception in that it expects to match the '-C ca_name' against osm_vendor_t.ca_names[]. 'ibstat -l' lists
> CA names.
> 
> The question becomes how best to resolve the missing ca_names?
> 
> 1) modify saquery to call libibmad interface to get CA names; leaves osm_vendor_t unmodified.
>    So far, saquery is the only diag pgm which uses OSM mad interfaces; expecting ca_names
>    in osm_vendor_t.
> 
> 2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate ca_names
>    from OpenSM code?

I'd say that this option is much better.

>    Creates libibmad dependencies for opensm.

But it doesn't have to. Can IBAL expose some function to get these names,
so that Win osmvendor will use this function instead of libibmad?

Also, the Linux osmvendor doesn't have a libibmad dependency.
It uses the umad function umad_get_cas_names() to obtain the CA names.
I know that there is a Windows version of umad, but I don't know what
its status is. If we *have* to add an additional dependency, then it should
be libibumad and not libibmad.

At some point in the future we would really want to have the new version
of OFED OpenSM ported to WinOF. If the Linux and Windows libraries match,
then the whole vendor concept can be simplified and there won't be a need
for a separate vendor for IBAL. The only differences would be
platform-dependent issues like threads, locks, and syslog, not IB-related code.

-- Yevgeny
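
For reference, the '-C ca_name' lookup under discussion can be sketched as below. The struct, constants and function names are hypothetical stand-ins rather than the real osm_vendor_t or umad definitions; in the Linux vendor the array would be filled by umad_get_cas_names(), while here it is filled by hand so that the example has no library dependency.

```c
#include <assert.h>
#include <string.h>

#define EX_MAX_DEVICES 4   /* hypothetical, stands in for UMAD_MAX_DEVICES */
#define EX_CA_NAME_LEN 20  /* hypothetical, stands in for UMAD_CA_NAME_LEN */

/* Hypothetical vendor object that carries the CA name table. */
struct ex_vendor {
	char ca_names[EX_MAX_DEVICES][EX_CA_NAME_LEN];
	int ca_count;
};

/* Return the index of the CA whose name matches the '-C' argument,
 * or -1 if no such CA is known. */
static int ex_find_ca(const struct ex_vendor *v, const char *name)
{
	int i;

	for (i = 0; i < v->ca_count; i++)
		if (strncmp(v->ca_names[i], name, EX_CA_NAME_LEN) == 0)
			return i;
	return -1;
}
```

Whichever library ends up populating the table, the lookup itself stays the same, which is one argument for keeping ca_names in the vendor struct.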


> Comments?
> 
> Thanks,
> 
> Stan.
> 
> 
> 
> 


From kliteyn at dev.mellanox.co.il  Sun Feb  8 14:36:43 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Feb 2009 00:36:43 +0200
Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor AL - ca_names
 missing from osm_vendor_t ?
In-Reply-To: <498F5A8F.2000101@dev.mellanox.co.il>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
Message-ID: <498F5E7B.6020208@dev.mellanox.co.il>

Yevgeny Kliteynik wrote:
> Hi Stan,

Oops... Looks like I was having a problem with my mail client.
By now my response is partially outdated...

-- Yevgeny

> Adding Sasha (OFED management maintainer)
> and the openib mailing list.
> 
> Stan C. Smith wrote:
>> Hello,
>>   The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in 
>> opensm\user\include\vendor\osm_vendor_al.h) is missing
>> the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
>> saquery.c expects to find ca_names in osm_vendor_t.
>>
>> A couple of observations:
>> 1) Windows currently supports a much older version of opensm than what 
>> OFED 1.4 tools expect.
> 
> Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with couple of 
> minor fixes.
> 
>> 2) saquery uses OpenSM mad interfaces while it 'could' be using 
>> libibmad interfaces.
> 
> By "OpenSM mad interfaces" you mean libosmvendor?
> 
>>    If libibmad from saquery, then OpenSM would not need libibmad 
>> references for Windows.
> 
> Not sure what you mean here. You mean removing libibmad dependency from 
> saquery?
> 
>> 3) How bad is it to create libibmad dependencies for OpenSM?
> 
> Pretty bad. I don't think we should add a new dependency unless there's a
> really good reason for it.
> 
>> 4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD 
>> interfaces; the rest use
>>    libibmad.
>>
>> Most of the OFED diagnostic tools support the cmd-line option '-C 
>> ca_name'. This cmd-line input is resolved thru
>> libibmad interfaces.
>> Saquery is no exception in that it expects to match the '-C ca_name' 
>> against osm_vendor_t.ca_names[]. 'ibstat -l' lists
>> CA names.
>>
>> The question becomes how best to resolve the missing ca_names?
>>
>> 1) modify saquery to call libibmad interface to get CA names; leaves 
>> osm_vendor_t unmodified.
>>    So far, saquery is the only diag pgm which uses OSM mad interfaces; 
>> expecting ca_names
>>    in osm_vendor_t.
>>
>> 2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and 
>> populate ca_names
>>    from OpenSM code?
> 
> I'd say that this option is much better.
> 
>>    Creates libibmad dependencies for opensm.
> 
> But it doesn't have to. Can IBAL expose some function to get these names,
> so that Win osmvendor will use this function instead of libibmad?
> 
> Also, Linux osmvendor doesn't have libibmad dependency.
> It uses umad function umad_get_cas_names() to obtain the CA names.
> I know that there is a Windows version of umad, but I don't know what is
> its status. If we *have* to add an additional dependency, then it should
> be libibumad and not libibmad.
> 
> At some point in the future we would really want to have the new version
> of OFED OpenSM ported to WinOF. If there will be a match between Linux and
> Windows libraries, then the whole vendor concept can be simplified and
> there won't be a need to have a separate vendor for IBAL. The things
> that would be different are platform-dependent issues like threads, locks,
> syslog, but not IB-related code.
> 
> -- Yevgeny
> 
> 
>> Comments?
>>
>> Thanks,
>>
>> Stan.
>>
>>
>>
>>
> 


From jsquyres at cisco.com  Sun Feb  8 14:43:35 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Sun, 8 Feb 2009 14:43:35 -0800
Subject: [ofa-general] OFED (EWG) meeting agenda for tomorrow (Feb 09)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com>
Message-ID: <00B5AD34-1DFF-440E-8BDB-3C9DE98110AE@cisco.com>

On Feb 8, 2009, at 7:57 AM, Tziporet Koren wrote:

> 		• Open MPI 1.3 - I heard there are some critical bugs. What is the  
> status of 1.3.1? - Jeff S.
>

I'm unfortunately unable to make it to the call tomorrow.

What bugs do you want to know about -- are there any in particular  
that you're asking about?  OMPI v1.3.1 is readying for release;  
*possibly* this week (50/50 chance of that).

-- 
Jeff Squyres
Cisco Systems


From sashak at voltaire.com  Sun Feb  8 14:54:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 00:54:12 +0200
Subject: [ofa-general] [PATCH] opensm/qos_config: no invalid option message
	on default values
Message-ID: <20090208225412.GA24514@sashak.voltaire.com>


Don't complain about invalid QoS options when their default values are used.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 3324af9..69937c1 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -911,9 +911,11 @@ static ib_api_status_t osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn
  **********************************************************************/
 static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt)
 {
-	if (!(*max_vls) || *max_vls > 15) {
-		log_report(" Invalid Cached Option: %s_max_vls=%u: "
-			   "Using Default = %u\n", prefix, *max_vls, dflt);
+	if (!*max_vls || *max_vls > 15) {
+		if (*max_vls)
+			log_report(" Invalid Cached Option: %s_max_vls=%u: "
+				   "Using Default = %u\n",
+				   prefix, *max_vls, dflt);
 		*max_vls = dflt;
 	}
 }
@@ -921,8 +923,10 @@ static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned
 static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt)
 {
 	if (*high_limit < 0 || *high_limit > 255) {
-		log_report(" Invalid Cached Option: %s_high_limit=%d: "
-			   "Using Default: %d\n", prefix, *high_limit, dflt);
+		if (*high_limit > 255)
+			log_report(" Invalid Cached Option: %s_high_limit=%d: "
+				   "Using Default: %d\n",
+				   prefix, *high_limit, dflt);
 		*high_limit = dflt;
 	}
 }
@@ -934,8 +938,6 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix,
 	int count = 0;
 
 	if (*vlarb == NULL) {
-		log_report(" Invalid Cached Option: %s_vlarb_%s: "
-		"Using Default\n", prefix, suffix);
 		*vlarb = strdup(dflt);
 		return;
 	}
@@ -1003,8 +1005,6 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt)
 	int count = 0;
 
 	if (*sl2vl == NULL) {
-		log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n",
-			   prefix);
 		*sl2vl = strdup(dflt);
 		return;
 	}
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sun Feb  8 15:01:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 01:01:54 +0200
Subject: [ofa-general] [PATCH] opensm: sort port order for routing by switch
	loads
Message-ID: <20090208230154.GB24514@sashak.voltaire.com>


This follows "port order" routing load balancer improvements
(implemented using "--guid_routing_order_file" command line option).

This patch changes the default behavior: routing paths are balanced in
an order such that the most loaded links enter the balancer first. In
most cases this should provide better performance than plain random
balancing (which is what is done by default now).

The implementation is simple - the endport list for the load balancer is
sorted in descending order of the number of endport links on each leaf
switch.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

Changes from RFC version of this patch are:
- ignore port state during endport_links counting, because links can
  initially be in states other than ACTIVE (INIT, ARMED); the existence of
  a remote port should be a good enough criterion by itself.
- store endport_links value in osm_switch structure and don't recount it
  during qsort()
- minor simplifications

 opensm/include/opensm/osm_switch.h |    1 +
 opensm/opensm/osm_ucast_mgr.c      |   62 +++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index dbc22e5..6279727 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -104,6 +104,7 @@ typedef struct osm_switch {
 	uint8_t *new_lft;
 	osm_mcast_tbl_t mcast_tbl;
 	uint32_t discovery_count;
+	unsigned endport_links;
 	unsigned need_update;
 	void *priv;
 } osm_switch_t;
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 96921a0..7232fbc 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -744,6 +744,65 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx)
 	}
 }
 
+static void add_sw_endports_to_order_list(osm_switch_t *sw, osm_ucast_mgr_t *m)
+{
+	osm_port_t *port;
+	osm_physp_t *p;
+	int i;
+
+	for (i = 1; i < sw->num_ports; i++) {
+		p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) {
+			port = osm_get_port_by_guid(m->p_subn,
+						    p->p_remote_physp->port_guid);
+			cl_qlist_insert_tail(&m->port_order_list,
+					     &port->list_item);
+			port->flag = 1;
+		}
+	}
+}
+
+static void sw_count_endport_links(osm_switch_t *sw)
+{
+	osm_physp_t *p;
+	int i;
+
+	sw->endport_links = 0;
+	for (i = 1; i < sw->num_ports; i++) {
+		p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw)
+			sw->endport_links++;
+	}
+}
+
+static int compar_sw_load(const void *s1, const void *s2)
+{
+#define get_sw_endport_links(s) (*(osm_switch_t **)s)->endport_links
+	return get_sw_endport_links(s2) - get_sw_endport_links(s1);
+}
+
+static void sort_ports_by_switch_load(osm_ucast_mgr_t *m)
+{
+	int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl);
+	void **s = malloc(num * sizeof(*s));
+	if (!s) {
+		OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: "
+			"No memory, skip by switch load sorting.\n");
+		return;
+	}
+	s[0] = cl_qmap_head(&m->p_subn->sw_guid_tbl);
+	for (i = 1; i < num; i++)
+		s[i] = cl_qmap_next(s[i-1]);
+
+	for (i = 0; i < num; i++)
+		sw_count_endport_links(s[i]);
+
+	qsort(s, num, sizeof(*s), compar_sw_load);
+
+	for (i = 0; i < num; i++)
+		add_sw_endports_to_order_list(s[i], m);
+}
+
 static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 {
 	cl_qlist_init(&p_mgr->port_order_list);
@@ -758,7 +817,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : "
 				"cannot parse guid routing order file \'%s\'\n",
 				p_mgr->p_subn->opt.guid_routing_order_file);
-	}
+	} else
+		sort_ports_by_switch_load(p_mgr);
 
 	if (p_mgr->p_subn->opt.port_prof_ignore_file) {
 		cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
-- 
1.6.1.2.319.gbd9e
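
As a standalone illustration of the "reverse sorted by number of endport links" step described in the commit message above, here is a minimal sketch. The struct and function names (struct ex_sw, ex_cmp_sw_load_desc, ex_sort_by_load) are hypothetical and the code is not taken from the patch. It compares the counters instead of subtracting them, which is an assumption on my side about how to avoid unsigned wrap-around, not something the patch does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a switch; only the counter matters here. */
struct ex_sw {
	unsigned endport_links;
};

/* Descending order: switches with more endport links sort first.
 * The array holds pointers, so each argument is a pointer to a pointer. */
static int ex_cmp_sw_load_desc(const void *a, const void *b)
{
	const struct ex_sw *s1 = *(const struct ex_sw *const *)a;
	const struct ex_sw *s2 = *(const struct ex_sw *const *)b;

	return (s2->endport_links > s1->endport_links) -
	       (s2->endport_links < s1->endport_links);
}

/* Sort an array of switch pointers in place, most loaded first. */
static void ex_sort_by_load(struct ex_sw **sws, size_t num)
{
	assert(sws != NULL || num == 0);
	if (num > 1)
		qsort(sws, num, sizeof(*sws), ex_cmp_sw_load_desc);
}
```

A caller that allocates the pointer array with malloc() would also need to free() it once the ordered endport list has been built.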


From sashak at voltaire.com  Sun Feb  8 15:04:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 01:04:06 +0200
Subject: [ofa-general] [PATCH] opensm/ftree: cleanup ftree_sw_tbl_element_t
	use
Message-ID: <20090208230406.GC24514@sashak.voltaire.com>


cl_list() allocates the memory needed for storing an object in the list,
so there is no need for additional wrappers like ftree_sw_tbl_element_t.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_ftree.c |   17 ++++-------------
 1 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 68900d8..10096c7 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1418,7 +1418,6 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
 	ftree_tuple_t new_tuple;
 	uint32_t i;
 	cl_list_t bfs_list;
-	ftree_sw_tbl_element_t *p_sw_tbl_element;
 
 	OSM_LOG_ENTER(&p_ftree->p_osm->log);
 
@@ -1465,14 +1464,10 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
 	 */
 
 	cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
-	cl_list_insert_tail(&bfs_list,
-			    &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
+	cl_list_insert_tail(&bfs_list, p_sw);
 
 	while (!cl_is_list_empty(&bfs_list)) {
-		p_sw_tbl_element =
-		    (ftree_sw_tbl_element_t *) cl_list_remove_head(&bfs_list);
-		p_sw = p_sw_tbl_element->p_sw;
-		__osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
+		p_sw = (ftree_sw_t *) cl_list_remove_head(&bfs_list);
 
 		/* Discover all the nodes from ports that are pointing down */
 
@@ -1509,9 +1504,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
 								new_tuple);
 
 				/* add the newly discovered switch to the BFS queue */
-				cl_list_insert_tail(&bfs_list,
-						    &__osm_ftree_sw_tbl_element_create
-						    (p_remote_sw)->map_item);
+				cl_list_insert_tail(&bfs_list, p_remote_sw);
 			}
 			/* Done assigning indexes to all the remote switches
 			   that are pointed by the downgoing ports.
@@ -1547,9 +1540,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
 								p_remote_sw,
 								new_tuple);
 				/* add the newly discovered switch to the BFS queue */
-				cl_list_insert_tail(&bfs_list,
-						    &__osm_ftree_sw_tbl_element_create
-						    (p_remote_sw)->map_item);
+				cl_list_insert_tail(&bfs_list, p_remote_sw);
 			}
 			/* Done assigning indexes to all the remote switches
 			   that are pointed by the upgoing ports.
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sun Feb  8 15:08:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 01:08:30 +0200
Subject: [ofa-general] [PATCH] opensm/ftree: simplify root guids setup.
Message-ID: <20090208230830.GD24514@sashak.voltaire.com>


Eliminate root_guid_list storage - parse it directly into the BFS list.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_ftree.c |  101 +++++++++++++-------------------------
 1 files changed, 35 insertions(+), 66 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 10096c7..35f2ea1 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -100,11 +100,6 @@ struct ftree_fabric_t_;
 typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN];
 typedef uint64_t ftree_tuple_key_t;
 
-struct guid_list_item {
-	cl_list_item_t list;
-	uint64_t guid;
-};
-
 /***************************************************
  **
  **  ftree_sw_table_element_t definition
@@ -203,7 +198,6 @@ typedef struct ftree_fabric_t_ {
 	cl_qmap_t hca_tbl;
 	cl_qmap_t sw_tbl;
 	cl_qmap_t sw_by_tuple_tbl;
-	cl_qlist_t root_guid_list;
 	cl_qmap_t cn_guid_tbl;
 	unsigned cn_num;
 	uint8_t leaf_switch_rank;
@@ -886,8 +880,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
 	cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
 	cl_qmap_init(&p_ftree->cn_guid_tbl);
 
-	cl_qlist_init(&p_ftree->root_guid_list);
-
 	return p_ftree;
 }
 
@@ -953,10 +945,6 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
 	}
 	cl_qmap_remove_all(&p_ftree->cn_guid_tbl);
 
-	/* remove all the elements of root_guid_list */
-	while (!cl_is_qlist_empty(&p_ftree->root_guid_list))
-		free(cl_qlist_remove_head(&p_ftree->root_guid_list));
-
 	/* free the leaf switches array */
 	if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches))
 		free(p_ftree->leaf_switches);
@@ -3045,16 +3033,41 @@ Exit:
 
 /***************************************************
  ***************************************************/
+struct rank_root_cxt {
+	ftree_fabric_t *fabric;
+	cl_list_t *list;
+};
+
+static int rank_root_sw_by_guid(void *cxt, uint64_t guid, char *p)
+{
+	struct rank_root_cxt *c = cxt;
+	ftree_sw_t *sw;
+
+	sw = __osm_ftree_fabric_get_sw_by_guid(c->fabric, cl_hton64(guid));
+	if (!sw) {
+		/* the specified root guid wasn't found in the fabric */
+		OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_ERROR, "ERR AB24: "
+			"Root switch GUID 0x%" PRIx64 " not found\n", guid);
+		return 0;
+	}
+
+	OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_DEBUG,
+		"Ranking root switch with GUID 0x%" PRIx64 "\n", guid);
+	sw->rank = 0;
+	cl_list_insert_tail(c->list, sw);
+
+	return 0;
+}
 
 static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree)
 {
+	struct rank_root_cxt context;
 	osm_node_t *p_osm_node;
 	osm_node_t *p_remote_osm_node;
 	osm_physp_t *p_osm_physp;
 	ftree_sw_t *p_sw;
 	ftree_sw_t *p_remote_sw;
 	cl_list_t ranking_bfs_list;
-	struct guid_list_item *item;
 	int res = 0;
 	unsigned num_roots;
 	unsigned max_rank = 0;
@@ -3064,25 +3077,16 @@ static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree)
 	cl_list_init(&ranking_bfs_list, 10);
 
 	/* Rank all the roots and add them to list */
-	for (item = (void *)cl_qlist_head(&p_ftree->root_guid_list);
-	     item != (void *)cl_qlist_end(&p_ftree->root_guid_list);
-	     item = (void *)cl_qlist_next(&item->list)) {
-		p_sw =
-		    __osm_ftree_fabric_get_sw_by_guid(p_ftree,
-						      cl_hton64(item->guid));
-		if (!p_sw) {
-			/* the specified root guid wasn't found in the fabric */
-			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB24: "
-				"Root switch GUID 0x%" PRIx64 " not found\n",
-				item->guid);
-			continue;
-		}
+	OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+		"Fetching root nodes from file %s\n",
+		p_ftree->p_osm->subn.opt.root_guid_file);
 
-		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-			"Ranking root switch with GUID 0x%" PRIx64 "\n",
-			item->guid);
-		p_sw->rank = 0;
-		cl_list_insert_tail(&ranking_bfs_list, p_sw);
+	context.fabric = p_ftree;
+	context.list = &ranking_bfs_list;
+	if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file,
+			   rank_root_sw_by_guid, &context)) {
+		res = -1;
+		goto Exit;
 	}
 
 	num_roots = cl_list_count(&ranking_bfs_list);
@@ -3314,21 +3318,6 @@ Exit:
 
 /***************************************************
  ***************************************************/
-static int add_guid_item_to_list(void *cxt, uint64_t guid, char *p)
-{
-	cl_qlist_t *list = cxt;
-	struct guid_list_item *item;
-
-	item = malloc(sizeof(*item));
-	if (!item)
-		return -1;
-
-	item->guid = guid;
-	cl_qlist_insert_tail(list, &item->list);
-
-	return 0;
-}
-
 static int add_guid_item_to_map(void *cxt, uint64_t guid, char *p)
 {
 	cl_qmap_t *map = cxt;
@@ -3350,26 +3339,6 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
 
 	OSM_LOG_ENTER(&p_ftree->p_osm->log);
 
-	if (__osm_ftree_fabric_roots_provided(p_ftree)) {
-		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-			"Fetching root nodes from file %s\n",
-			p_ftree->p_osm->subn.opt.root_guid_file);
-
-		if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file,
-				   add_guid_item_to_list,
-				   &p_ftree->root_guid_list)) {
-			status = -1;
-			goto Exit;
-		}
-
-		if (!cl_qlist_count(&p_ftree->root_guid_list)) {
-			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB22: "
-				"Root guids file has no valid guids\n");
-			status = -1;
-			goto Exit;
-		}
-	}
-
 	if (__osm_ftree_fabric_cns_provided(p_ftree)) {
 		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 			"Fetching compute nodes from file %s\n",
-- 
1.6.1.2.319.gbd9e


From nicolas.morey-chaisemartin at ext.bull.net  Sun Feb  8 23:01:32 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 09 Feb 2009 08:01:32 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: Fixed bug on index
	port incrementation
Message-ID: <498FD4CC.8070900@ext.bull.net>

Here is an updated version of the patch including Yevgeny's feedback.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/opensm/osm_ucast_ftree.c |   39 +++++++++++++++++++++++----------------
  1 files changed, 23 insertions(+), 16 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2f1d358f2bdf67838fe8776438b7757d9dcd6e15.diff
Type: text/x-patch
Size: 3805 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/d68e52fa/attachment.bin>

From ofedrnicuser at yahoo.com  Sun Feb  8 23:36:34 2009
From: ofedrnicuser at yahoo.com (Ofed User)
Date: Sun, 8 Feb 2009 23:36:34 -0800 (PST)
Subject: [ofa-general] non zero lkey in send(),
	write() with  num_sge > 1?
Message-ID: <661509.82751.qm@web111205.mail.gq1.yahoo.com>


Hi,

Can the stack pass num_sge > 1 and lkey != 0 in the sg_list[] elements of a post_send() call?

Regards,
Bill


From nicolas.morey-chaisemartin at ext.bull.net  Sun Feb  8 23:40:05 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 09 Feb 2009 08:40:05 +0100
Subject: [ofa-general] [RFC] Fat-Tree upgrades
Message-ID: <498FDDD5.1090204@ext.bull.net>

Hi everyone,

We have been working quite a lot at Bull lately on the Ftree algorithm 
and we have made some upgrades.
However, as they modify the behavior of the ftree algorithm, we haven't 
pushed them until now.
I'm just going to detail which upgrades we have done and let you decide 
if you are interested, if and how they should be pushed upstream (new 
routing algorithm, option in the ftree, etc.)

Here is a simplified model of the topology we have been working on:

                         L3  L3
       ___________________|__|____________________
      /          /               \               \                <= All the L2 are connected on 2 L3 switches
   L2-1         L2-2            L2-1           L2-2                 <= There are service nodes connected directly on L2 switches
  /    \S1       / \S2       S3/    \       S4/   \                 <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
  L1           L1                 L1             L1
  /|\         /|\                 /|\           /|\
 ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have multiple set. In each set, all L0 lead to all L1 of their set.

   L0           L0                 L0           L0
 /   \        /    \             /    \       /     \
CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN

To detail:
We have a bunch of sets. Each set contains compute nodes, L0 and L1 
switches.
On top of them there is a common layer of L2 and L3 switches.

In each set, there are groups of compute nodes. Each group is connected 
to a single L0 switch.
In a given set, all L0 are connected to all L1.

The Nth L1 of a set is connected to the Nth L2 and only to this one (so 
through an L2, the Nth L1 can only see the Nth L1 of the other sets).
There are service nodes connected to the L2 switches.
All the L2 are connected to a couple of L3.

The problem we have seen when routing on this topology is that most of 
the routes from CN to SN (service nodes) go through the L3 switches. 
With the current algorithm, the least loaded link is chosen when going 
down by going up. Therefore, the primary path goes through an L2, then 
an L3, from where it covers the whole network.
This wasn't acceptable for us, as the L3 switches would be overloaded 
while less loaded/shorter paths to the same HCA were available.

So what we have introduced here is a "balanced min_hop" within the ftree 
algorithm.
Basically, instead of just leaving when we reach an LFT which has already 
been configured for the target lid, we compare the hop count of the 
switch toward this lid with the hop count of the path we came through. 
If we have found a shorter path, we update the LFT and minhop tables to 
use this new path.
This means that the difference between primary path and secondary path 
is not so important anymore.
A secondary path may increment port counters, but only if routes to HCAs 
were created (see "opensm/osm_ucast_ftree.c: Fixed bug on index port 
incrementation", which makes this possible).
I acknowledge that the port counts may be slightly wrong, as a primary 
path that is replaced with a shorter secondary path has already 
incremented counters and they won't be decremented. However, in most 
cases the primary path would have created other routes than the one 
replaced, so the counters are fine.
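
As an illustration only (this is not the Bull patch, and the struct and
function names below are hypothetical, not OpenSM types), the "keep the
shorter path" check described above could be sketched in C like this:

```c
#include <stdint.h>
#include <stdbool.h>

#define TOY_MAX_LID 64
#define TOY_NO_PATH 0xFF

/* Hypothetical, simplified switch state; not the OpenSM ftree_sw_t. */
struct toy_switch {
	uint8_t lft[TOY_MAX_LID];	/* output port per target LID */
	uint8_t hops[TOY_MAX_LID];	/* hop count of installed route */
};

/* Mark every LID as having no route yet. */
void toy_switch_init(struct toy_switch *sw)
{
	for (int lid = 0; lid < TOY_MAX_LID; lid++) {
		sw->lft[lid] = 0;
		sw->hops[lid] = TOY_NO_PATH;
	}
}

/*
 * Install a route to target_lid through out_port if none exists yet,
 * or if the new path is strictly shorter than the installed one.
 * Returns true when the tables were updated.
 */
bool toy_try_route(struct toy_switch *sw, uint16_t target_lid,
		   uint8_t out_port, uint8_t new_hops)
{
	if (target_lid >= TOY_MAX_LID)
		return false;
	if (sw->hops[target_lid] != TOY_NO_PATH &&
	    sw->hops[target_lid] <= new_hops)
		return false;	/* existing route is at least as short */
	sw->lft[target_lid] = out_port;
	sw->hops[target_lid] = new_hops;
	return true;
}
```

A second visit with an equal hop count leaves the table untouched, so only
a strictly shorter secondary path replaces the primary one.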

For all regular ftree topologies, I have seen no change with this update, 
but with topologies where two levels are not fully interconnected, this 
helps a lot!


Another thing we have developed here is better balancing of secondary 
paths.
In the current algorithm, secondary down paths (going_down_by_going_up) 
are created in port_group order.
This means that if the primary path didn't reach the whole network 
(because a switch is broken, for example), all the missing routes will 
be created through the first port group, which unbalances the network 
load a lot.
To solve this, we create the secondary paths by port group load.
The previous patch makes us increment the port/port group counters 
when secondary routes towards HCAs are created, so these counters 
are meaningful even when creating secondary routes.
What our patch does is, at the beginning of the function, sort all the 
port groups from lowest load to highest, pick the first one for the 
primary path, and try secondary paths from the 2nd to the last.
Once again, this seems to have no effect on regular topologies, but it 
made a real impact on our failover tests.
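
A minimal sketch of the sorting step, assuming a hypothetical port group
struct that carries only an id and a load counter (the real ftree port
group has more fields, and the actual patch may sort differently):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical port group: only the fields needed for the sketch. */
struct toy_port_group {
	int id;
	uint32_t load;	/* routes already assigned through this group */
};

/* Order by ascending load; break ties by id to stay deterministic. */
static int toy_cmp_load(const void *a, const void *b)
{
	const struct toy_port_group *ga = a;
	const struct toy_port_group *gb = b;

	if (ga->load != gb->load)
		return ga->load < gb->load ? -1 : 1;
	return (ga->id > gb->id) - (ga->id < gb->id);
}

/* Sort the groups in place from least loaded to most loaded. */
void toy_sort_groups_by_load(struct toy_port_group *groups, size_t n)
{
	qsort(groups, n, sizeof(*groups), toy_cmp_load);
}
```

After the call, groups[0] would be used for the primary path and
groups[1] to groups[n-1] tried in order for the secondary paths.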


Feel free to comment on this, especially on whether and how you would 
want these changes upstream.

Thanks in advance

Nicolas


From nicolas.morey-chaisemartin at ext.bull.net  Sun Feb  8 23:43:27 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 09 Feb 2009 08:43:27 +0100
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing	between
	non-CN nodes
In-Reply-To: <20090207202319.GE27757@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
Message-ID: <498FDE9F.7080604@ext.bull.net>

Sasha Khapyorsky wrote:
> On 20:48 Sat 07 Feb     , Nicolas Morey-Chaisemartin wrote:
>   
>>> "IO" is specific for your setup. Could we find more generic name for such
>>> nodes?
>>>
>>>   
>>>       
>> Sure. Any ideas?
>>     
>
> No, I didn't think about it.
>
>   
>>
>> Okay, I'll fix the indentation and few coding style error. There is also
>> a bug in the current patch as the hops counter are not set to the right
>> value when creating route which had reverse hops number of reverse
>> hops*2 should be added.
>>
>> I'll have to rewrite the patches so they work with the current HEAD.
>> Specially with the option system changes it won't merge cleanly.
>>     
>
> Use 'git-rebase master <yourbranch>' - it does the job with only two
> trivial conflicts.
>
> Sasha
>
>
>   
Well, I still need to rename the option and fix the hop counts, but most 
of all it will conflict (and introduce a bug) with the fix I've just 
reposted about port incrementation (reverse_hop also has to return a 
boolean value, which it doesn't right now).
I have one working in the Bull tree, but there have been too many 
modifications in the surrounding code to merge it cleanly.
I'll rewrite it cleanly as soon as I get some time.

Nicolas


From nicolas.morey-chaisemartin at ext.bull.net  Mon Feb  9 00:26:04 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 09 Feb 2009 09:26:04 +0100
Subject: [ofa-general] [PATCH] opensm/osm_console.c : Added getguid function
 to console to generate a list of guid matching one or more regexps
Message-ID: <498FE89C.2020304@ext.bull.net>

This adds a getguid function to the OpenSM console, which makes it really easy to generate cn_guid_file, root_guid_file and similar files
by dumping into a file all port GUIDs whose node description matches at least one of the provided regexps.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/opensm/osm_console.c |  131 +++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 131 insertions(+), 0 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 006049bce16cd282d40dc9598f4baaa2aa5b0fdf.diff
Type: text/x-patch
Size: 4324 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/ff1fa357/attachment.bin>
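
Since the attachment is scrubbed here, the matching step described in the
commit message can be sketched as follows. This is a standalone
illustration using POSIX regcomp()/regexec() with the same flags as the
patch (REG_NOSUB | REG_EXTENDED); the function name and the plain string
arrays are assumptions, not OpenSM code. Unlike the patch, a pattern that
fails to compile is skipped entirely, and each description is counted at
most once even if several patterns match it.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Count the descriptions matched by at least one pattern.
 * Returns -1 only if memory allocation fails.
 */
int toy_count_matching(const char *const *patterns, size_t n_patterns,
		       const char *const *descs, size_t n_descs)
{
	regex_t *compiled;
	size_t n_ok = 0, i, j;
	int count = 0;

	if (n_patterns == 0)
		return 0;
	compiled = malloc(n_patterns * sizeof(*compiled));
	if (compiled == NULL)
		return -1;

	/* Keep only the patterns that compile successfully. */
	for (i = 0; i < n_patterns; i++)
		if (regcomp(&compiled[n_ok], patterns[i],
			    REG_NOSUB | REG_EXTENDED) == 0)
			n_ok++;

	/* Stop at the first matching pattern for each description. */
	for (i = 0; i < n_descs; i++)
		for (j = 0; j < n_ok; j++)
			if (regexec(&compiled[j], descs[i], 0, NULL, 0) == 0) {
				count++;
				break;
			}

	for (j = 0; j < n_ok; j++)
		regfree(&compiled[j]);
	free(compiled);
	return count;
}
```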

From vlad at lists.openfabrics.org  Mon Feb  9 03:16:53 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon,  9 Feb 2009 03:16:53 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090209-0200 daily build status
Message-ID: <20090209111653.A76E5E60F20@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From dorfman.eli at gmail.com  Mon Feb  9 05:47:54 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Mon, 09 Feb 2009 15:47:54 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <20090208213826.GA24254@sashak.voltaire.com>
References: <20090202205924.GF5910@sashak.voltaire.com>
	<49880E4D.2090107@gmail.com>
	<20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
	<694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
	<20090208213826.GA24254@sashak.voltaire.com>
Message-ID: <4990340A.10004@gmail.com>

Sasha Khapyorsky wrote:
> Hi Eli,
> 
> On 21:23 Sun 08 Feb     , Eli Dorfman wrote:
>> yes, but wouldn't it be better to separate between heavy sweep and
>> config rescan (due to SIGHUP).
> 
> SIGHUP main purpose always was to trigger heavy sweep.
> 
>> I think that user should know when configuration is updated and not
>> wait for heavy sweep.
> 
> I'm not following - SIGHUP will cause heavy sweep and config update,
> where is a waiting?
> 

I meant that if the user is changing the config file and a heavy sweep happens, then
the config may be updated, whereas using a specific flag for config rescan would avoid this case.

Eli


From sashak at voltaire.com  Mon Feb  9 06:17:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 16:17:32 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan
	subnet configuration after SIGHUP
In-Reply-To: <4990340A.10004@gmail.com>
References: <20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
	<694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
	<20090208213826.GA24254@sashak.voltaire.com>
	<4990340A.10004@gmail.com>
Message-ID: <20090209141732.GF26139@sashak.voltaire.com>

On 15:47 Mon 09 Feb     , Eli Dorfman (Voltaire) wrote:
> Sasha Khapyorsky wrote:
> > Hi Eli,
> > 
> > On 21:23 Sun 08 Feb     , Eli Dorfman wrote:
> >> yes, but wouldn't it be better to separate between heavy sweep and
> >> config rescan (due to SIGHUP).
> > 
> > SIGHUP main purpose always was to trigger heavy sweep.
> > 
> >> I think that user should know when configuration is updated and not
> >> wait for heavy sweep.
> > 
> > I'm not following - SIGHUP will cause heavy sweep and config update,
> > where is a waiting?
> > 
> 
> i meant that if the user is changing config file and there is a heavy sweep then
> config may be updated,

Are you talking about a race between file reading (by OpenSM) and writing
(by the user)? Using a write lock on reading would solve that issue.

> while using specific flag for config rescan will avoid this case.

What do you mean by "specific flag"? Using a separate signal? Assuming so,
this will not prevent the read/write race.
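
One possible reading of "write lock on reading", sketched with flock():
the reader holds an exclusive advisory lock for the duration of the read.
This is an assumption about what is meant, not existing OpenSM code; the
function name is hypothetical, and an advisory lock only helps if the
program writing the file takes the same lock.

```c
#define _DEFAULT_SOURCE		/* for flock() and fileno() with glibc */
#include <stdio.h>
#include <sys/file.h>

/*
 * Read up to len - 1 bytes of the file at path into buf while holding
 * an exclusive advisory lock, and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on error.
 */
long toy_read_locked(const char *path, char *buf, size_t len)
{
	FILE *f;
	long n;

	if (len == 0)
		return -1;
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (flock(fileno(f), LOCK_EX) != 0) {
		fclose(f);
		return -1;
	}
	n = (long)fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	flock(fileno(f), LOCK_UN);
	fclose(f);
	return n;
}
```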

Sasha


From sashak at voltaire.com  Mon Feb  9 06:19:15 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 16:19:15 +0200
Subject: [ofa-general] [PATCH] ibsim: fix port initial state
Message-ID: <20090209141915.GG26139@sashak.voltaire.com>


The initial port state was ACTIVE in the PortInfo template for connected
ports. This prevented OpenSM from making the INIT -> ARMED -> ACTIVE
PortInfo transition typical of a real fabric.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 ibsim/sim_net.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c
index ee268e0..7a42cb6 100644
--- a/ibsim/sim_net.c
+++ b/ibsim/sim_net.c
@@ -80,7 +80,7 @@ static const uint8_t swport[] = {
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x03, 0x02,
-	0x14, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08,
+	0x12, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08,
 	0x08, 0x04, 0xFF, 0x10, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -102,7 +102,7 @@ static const uint8_t hcaport[] = {
 	0xFE, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x02, 0x00, 0x01, 0x00, 0x50, 0x02, 0x48,
 	0x00, 0x00, 0x0F, 0xF9, 0x01, 0x03, 0x03, 0x02,
-	0x14, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08,
+	0x12, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08,
 	0x08, 0x04, 0xFF, 0x10, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x20, 0x1F, 0x00, 0x00, 0x00, 0x00,
 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-- 
1.6.1.2.319.gbd9e
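
For readers decoding the changed byte: assuming the standard PortInfo
layout, in which this byte packs LinkSpeedSupported in the high nibble
and PortState in the low nibble, 0x14 reads as ACTIVE and 0x12 as INIT.
The helper names below are hypothetical and not part of ibsim.

```c
#include <stdint.h>

/* PortState encodings, as assumed from the InfiniBand PortInfo attribute. */
enum toy_port_state {
	TOY_PORT_DOWN = 1,
	TOY_PORT_INIT = 2,
	TOY_PORT_ARMED = 3,
	TOY_PORT_ACTIVE = 4
};

/* Low nibble of the byte: PortState. */
static inline uint8_t toy_port_state(uint8_t byte)
{
	return byte & 0x0F;
}

/* High nibble of the byte: LinkSpeedSupported. */
static inline uint8_t toy_link_speed_supported(uint8_t byte)
{
	return byte >> 4;
}
```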


From sashak at voltaire.com  Mon Feb  9 06:41:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 16:41:35 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c: Fixed bug on
	index port incrementation
In-Reply-To: <498FD4CC.8070900@ext.bull.net>
References: <498FD4CC.8070900@ext.bull.net>
Message-ID: <20090209144135.GH26139@sashak.voltaire.com>

Hi Nicolas,

On 08:01 Mon 09 Feb     , Nicolas Morey Chaisemartin wrote:
> Here is an updated version of the patch including Yevgeni's feedback.

Could you provide a more descriptive commit message? This text will be
stored in the OpenSM change history and your current comment doesn't say
a lot.

If you need to put some text in the patch message that should not enter
the change log (such as details about differences from the previous
version of the patch), it should be placed after the '---' below.

>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---

<some optional text unrelated to commit message>

Sasha

>  opensm/opensm/osm_ucast_ftree.c |   39 
> +++++++++++++++++++++++----------------
>  1 files changed, 23 insertions(+), 16 deletions(-)
>
>

> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index 68900d8..3ea61a1 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
>   *        assign-up-going-port-by-descending-down to r-port node (recursion)
>   */
>  
> -static void
> +static boolean_t
>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  					       IN ftree_sw_t * p_sw,
>  					       IN ftree_sw_t * p_prev_sw,
> @@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  	uint16_t i;
>  	uint16_t j;
>  	uint16_t k;
> +	boolean_t created_route = FALSE;
>  
>  	/* we shouldn't enter here if both real_lid and main_path are false */
>  	CL_ASSERT(is_real_lid || is_main_path);
>  
>  	/* if there is no down-going ports */
>  	if (p_sw->down_port_groups_num == 0)
> -		return;
> -
> -	/* promote the index that indicates which group should we
> -	   start with when going through all the downgoing groups */
> -	p_sw->down_port_groups_idx =
> -		(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +		return FALSE;;
>  
>  	/* foreach down-going port group (in indexing order) */
>  	i = p_sw->down_port_groups_idx;
> @@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  		p_group = p_sw->down_port_groups[i];
>  		i = (i + 1) % p_sw->down_port_groups_num;
>  
> -		/* Skip this port group unless it points to a switch */
> -		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> +		/* If this port group doesn't point to a switch, mark
> +		   that the route was created and skip to the next group */
> +		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH){
> +			created_route = TRUE;
>  			continue;
> +		}
>  
>  		if (p_prev_sw
>  		    && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
> @@ -2073,16 +2072,24 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  
>  		/* Recursion step:
>  		   Assign upgoing ports by stepping down, starting on REMOTE switch */
> -		__osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
> -							       NULL,	/* prev. position - NULL to mark that we went down and not up */
> -							       target_lid,	/* LID that we're routing to */
> -							       target_rank,	/* rank of the LID that we're routing to */
> -							       is_real_lid,	/* whether the target LID is real or dummy */
> -							       is_main_path,	/* whether this is path to HCA that should by tracked by counters */
> -							       highest_rank_in_route);	/* highest visited point in the tree before going down */
> +		created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
> +											 NULL,	/* prev. position - NULL to mark that we went down and not up */
> +											 target_lid,	/* LID that we're routing to */
> +											 target_rank,	/* rank of the LID that we're routing to */
> +											 is_real_lid,	/* whether the target LID is real or dummy */
> +											 is_main_path,	/* whether this is path to HCA that should by tracked by counters */
> +											 highest_rank_in_route);	/* highest visited point in the tree before going down */
>  	}
>  	/* done scanning all the down-going port groups */
>  
> +	/* if the route was created, promote the index that
> +	   indicates which group should we start with when
> +	   going through all the downgoing groups */
> +	if (created_route)
> +		p_sw->down_port_groups_idx =
> +			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +	
> +	return created_route; 
>  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */
>  
>  /***************************************************/
> 


From sashak at voltaire.com  Mon Feb  9 07:14:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 17:14:51 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid
	function to
	console to generate a list of guid matching one or more regexps
In-Reply-To: <498FE89C.2020304@ext.bull.net>
References: <498FE89C.2020304@ext.bull.net>
Message-ID: <20090209151451.GI26139@sashak.voltaire.com>

Hi Nicolas,

Some initial comments...

On 09:26 Mon 09 Feb     , Nicolas Morey Chaisemartin wrote:
> This add a getguid functionnality to openSM console which makes it really 
> easy to generate cn_guid_file, root_guid_file and such
> by dumping into a file all port guids whom nodedesc contains at least one 
> of the provided regexps

I see that this specific command is about port guids and not node guids.
What about a better name, such as "dump_portguids"? (Another possibility
would be to implement a single "dump" command with various parameters
such as "config", "portguids", "nodeguids", etc.).

>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  opensm/opensm/osm_console.c |  131 
> +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 131 insertions(+), 0 deletions(-)
>
>

> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> index c6e8e59..e4dc6e9 100644
> --- a/opensm/opensm/osm_console.c
> +++ b/opensm/opensm/osm_console.c
> @@ -42,6 +42,7 @@
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  #include <netdb.h>
> +#include <regex.h>
>  #ifdef ENABLE_OSM_CONSOLE_SOCKET
>  #include <arpa/inet.h>
>  #endif
> @@ -1172,6 +1173,135 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>  	fprintf(out, "%s build %s %s\n", p_osm->osm_version, __DATE__, __TIME__);
>  }
>  
> +typedef struct _regexp_list {
> +       regex_t exp;
> +       struct _regexp_list* next;
> +} regexp_list_t;
> +
> +
> +static void getguid_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
> +{
> +	cl_qmap_t *p_port_guid_tbl;
> +	osm_port_t* p_port;
> +	osm_port_t* p_next_port;
> +
> +	regexp_list_t* p_head_regexp=NULL;
> +	regexp_list_t* p_regexp;
> +	
> +	/* Option variables*/
> +	char* p_cmd=NULL;
> +	FILE* output=out;
> +	int exit_after_run=0;
> +	extern volatile unsigned int osm_exit_flag;
> +
> +	/* Read commande line */
> +
> +	while(1){

Try opensm/osm_indent (many places in the patch will be affected).

> +		p_cmd = next_token(p_last);
> +		if (p_cmd) {
> +			if (strcmp(p_cmd, "exit_after_run") == 0) {
> +				exit_after_run = 1;
> +			} else if (strcmp(p_cmd, "file") == 0) {
> +				p_cmd=next_token(p_last);
> +				if(p_cmd){
> +					output = fopen(p_cmd,"w+");
> +					if(output == NULL){
> +						fprintf(out,"Could not open file %s: %s\n",p_cmd,strerror(errno));
> +						output = out;
> +					}
> +				} else {
> +					/* No file name passed */
> +					fprintf(out,"No file name passed\n");
> +				}
> +			} else {
> +				p_regexp = malloc(sizeof(*p_regexp));
> +				if(regcomp(&(p_regexp->exp),p_cmd,REG_NOSUB|REG_EXTENDED)!=0){
> +					fprintf(out,"Couldn't parse regular expression %s. Skipping it.\n",p_cmd);
> +				}
> +				p_regexp->next = p_head_regexp;
> +				p_head_regexp = p_regexp;
> +			}
> +		} else {
> +			/* No more tokens */
> +			break;
> +		}

Here and in other places - no need for braces around a single statement.

> +	}
> +
> +	/* Check we have at least one expression to match */
> +	if(p_head_regexp == NULL){
> +		fprintf(out,"No valid expression provided. Aborting\n");
> +		return;
> +	}
> +
> +	/* Ensure this SM is master (so we have the LFT) */
> +
> + getguid_wait_init:
> +	if(osm_exit_flag)
> +		return;
> +	cl_spinlock_acquire(&p_osm->sm.state_lock);
> +	/* If the subnet struct is not properly initialized, we exit */
> +	if(p_osm->sm.p_subn == NULL){
> +	  cl_spinlock_release(&p_osm->sm.state_lock);
> +	  sleep(1);
> +	  goto getguid_wait_init;
> +	}

The console is initialized after osm_subnet. When will the case
(p_osm->sm.p_subn == NULL) be valid?

> +	if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){
> +	  cl_spinlock_release(&p_osm->sm.state_lock);
> +	  sleep(1);
> +	  goto getguid_wait_init;
> +	}

This will cause an endless loop when OpenSM is in Standby or Inactive
state.

> +	cl_spinlock_release(&p_osm->sm.state_lock);
> +	if(p_osm->sm.p_subn->need_update != 0){
> +	  sleep(1);
> +	  goto getguid_wait_init;
> +	}

Subnet discovery/setup could take some time. A user may want to use the
console for other things in the meantime. I don't think that sleeping is
suitable here; better to print a "try later" message or something similar.
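
A sketch of the non-blocking alternative suggested here, using a
hypothetical stand-in struct for the two subnet conditions tested in the
patch (this is not the osm_subn_t layout, and the messages are made up):

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for the subnet state checked by the command. */
struct toy_subnet {
	bool is_master;		/* sm_state == IB_SMINFO_STATE_MASTER */
	bool need_update;	/* need_update != 0 */
};

/*
 * Return true when the dump can proceed. Otherwise report why on the
 * console and return at once instead of sleeping and retrying.
 */
bool toy_ready_for_dump(const struct toy_subnet *subn, FILE *out)
{
	if (!subn->is_master) {
		fprintf(out, "SM is not master, nothing to dump\n");
		return false;
	}
	if (subn->need_update) {
		fprintf(out, "Subnet is being updated, try again later\n");
		return false;
	}
	return true;
}
```

The console stays usable for other commands, and a script can simply
retry the command when it gets the "try again later" message.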

> +
> +	/* Subnet doesn't need to be updated so we can carry on */
> +
> +
> +	CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock);
> +	p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl);
> +
> +
> +

No need for more than one empty line as a separator (osm_indent... :)).

> +	p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl);
> +	while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) {
> +
> +		p_port = p_next_port;
> +		p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item);
> +
> +		for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){
> +			if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){
> +				fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid));
> +			}
> +		}
> +	}
> +	
> +CL_PLOCK_RELEASE(p_osm->sm.p_lock);
> +	if(output != out)
> +		fclose(output);
> +	if(exit_after_run)
> +		osm_exit_flag = 1;

Why this 'exit_after_run'?

If you need functionality to exit OpenSM triggered from the console (but
it is not clear to me why), use another command.

> +
> +}
> +
> +
> +
> +

No need for more than one empty line as a separator (osm_indent... :)).

> +static void help_getguid(FILE * out, int detail)
> +{
> +	fprintf(out, "getguid [exit_after_run|file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUID matching a regexp \n");
> +	if (detail) {
> +		fprintf(out,
> +			"getguid -- Dump all the port GUID whom node_desc matches one of the provided regexp\n");
> +		fprintf(out,
> +			"   [file filename] -- Send the port GUID list to the specified file instead of regular output\n");
> +		fprintf(out,
> +			 "   [exit_after_run] -- Quit OpenSM once the port GUID have been displayed\n");
> +	}
> +
> +}
> +
>  /* more parse routines go here */
>  
>  static const struct command console_cmds[] = {
> @@ -1192,6 +1322,7 @@ static const struct command console_cmds[] = {
>  #ifdef ENABLE_OSM_PERF_MGR
>  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
>  #endif				/* ENABLE_OSM_PERF_MGR */
> +	{"getguid", &help_getguid, &getguid_parse},
>  	{NULL, NULL, NULL}	/* end of array */
>  };
>  
> 


From nicolas.morey-chaisemartin at ext.bull.net  Mon Feb  9 07:55:46 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 09 Feb 2009 16:55:46 +0100
Subject: [ofa-general] [PATCH v3] opensm/osm_ucast_ftree.c: Fixed bug on
	index port incrementation
Message-ID: <49905202.3050406@ext.bull.net>

This patch fixes a bug in the port group index incrementation in the fat-tree algorithm.
The problem happens (at least) with a 4-level fat tree such as the one below:


                          L3  L3
        ___________________|__|____________________
       /          /               \               \                <= All the L2 are connected on 2 L3 switches
    L2-1         L2-2            L2-1           L2-2
   /             /                 \              \                 <== The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
   L1           L1                 L1             L1
   /|\         /|\                 /|\           /|\
  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have multiple set. In each set, all L0 lead to all L1 of their set.

    L0           L0                 L0           L0
  /   \        /    \             /    \       /     \
CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN


To detail:
We have a bunch of sets. Each set contains compute nodes, L0 and L1 switches.
On top of them there is a common layer of L2 and L3 switches.

In each set, there are groups of compute nodes. Each group is connected to a single L0 switch.
In a given set, all L0 are connected to all L1.

The Nth L1 of a set is connected to the Nth L2 and only to this one (so through an L2, the Nth L1 can only see the Nth L1 of the other sets).
All the L2 are connected to a couple of L3.


If we don't add the L3 switches, we have a perfectly balanced fat tree and well-balanced routes.
But when we add the L3, it introduces a huge difference. As it is not necessary, no route goes through the L3 (which is fine).
However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 1/4 are used twice as much as in the balanced state.

This comes from down_port_groups_idx, which is incremented each time the algorithm goes down through a node, whether or not it creates routes to an HCA (port != switch).
As a route coming up from an L1 reaches only one L2, the algorithm goes through all the other L2 while going down, incrementing their index.
Our case here is a bit specific, but whenever the L1 switches don't have full connectivity to all the L2 switches and there is another switch rank above, the problem may appear.

To avoid this problem, the __osm_ftree_fabric_route_upgoing_by_going_down function has been changed so that it returns a value indicating whether routes to an HCA (in fact, to a leaf switch) were created.
With this information, we only increase the index when the algorithm has created routes to an HCA.
After applying this patch and measuring the link usage, we are perfectly balanced (L2<->L3 links are still not used, but that is to be expected).

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
   opensm/opensm/osm_ucast_ftree.c |   39 +++++++++++++++++++++++----------------
   1 files changed, 23 insertions(+), 16 deletions(-)


Repost of the patch with Yevgeni's comment and a more complete description :)
Hope it's good this time.
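
The core of the change can be shown in isolation. The field names
down_port_groups_idx and down_port_groups_num and the created_route flag
come from the patch quoted earlier in this thread; the reduced struct and
the helper function are hypothetical and only illustrate the rule.

```c
#include <stdint.h>
#include <stdbool.h>

/* Reduced, hypothetical view of a switch: only the two index fields. */
struct toy_sw {
	uint16_t down_port_groups_idx;
	uint16_t down_port_groups_num;
};

/*
 * Advance the round-robin start index only when the descent through
 * this switch actually created a route; otherwise leave it untouched.
 */
void toy_promote_idx(struct toy_sw *sw, bool created_route)
{
	if (sw->down_port_groups_num == 0 || !created_route)
		return;
	sw->down_port_groups_idx =
	    (sw->down_port_groups_idx + 1) % sw->down_port_groups_num;
}
```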
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2f1d358f2bdf67838fe8776438b7757d9dcd6e15.diff
Type: text/x-patch
Size: 3806 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/276fd8a1/attachment.bin>

From devel at morey-chaisemartin.com  Mon Feb  9 08:04:58 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Mon, 09 Feb 2009 17:04:58 +0100
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid
	function to	console to generate a list of guid matching one or more
	regexps
In-Reply-To: <20090209151451.GI26139@sashak.voltaire.com>
References: <498FE89C.2020304@ext.bull.net>
	<20090209151451.GI26139@sashak.voltaire.com>
Message-ID: <4990542A.5040907@morey-chaisemartin.com>

Sasha Khapyorsky a écrit :
> Hi Nicolas,
>
> Some initial comments...
>
> On 09:26 Mon 09 Feb     , Nicolas Morey Chaisemartin wrote:
>   
>> This add a getguid functionnality to openSM console which makes it really 
>> easy to generate cn_guid_file, root_guid_file and such
>> by dumping into a file all port guids whom nodedesc contains at least one 
>> of the provided regexps
>>     
>
> I see that this specific command is about port guids and not node guids.
> What is about better name such "dump_portguids"? (Another possibility
> would be implementation of single "dump" command with various parameters
> such as "config", "portguids", "nodeguids", etc.).
>
>   
Dumping port guids is especially useful for generating config files. I've
never needed to dump node guids. If people need that, why not make a
global dump.
If not, it may be simpler to rename it to dump_portguids.
>
> Try opensm/osm_indent (many places in the patch will be affected).
>
>   
Last time I tried osm_indent, it introduced a lot of changes to the
code (even in parts I hadn't edited), so I haven't used it on my patches.
I'll fix the indentation.
>> +	/* Ensure this SM is master (so we have the LFT) */
>> +
>> + getguid_wait_init:
>> +	if(osm_exit_flag)
>> +		return;
>> +	cl_spinlock_acquire(&p_osm->sm.state_lock);
>> +	/* If the subnet struct is not properly initialized, we exit */
>> +	if(p_osm->sm.p_subn == NULL){
>> +	  cl_spinlock_release(&p_osm->sm.state_lock);
>> +	  sleep(1);
>> +	  goto getguid_wait_init;
>> +	}
>>     
>
> The console is initialized after osm_subnet. When will the case
> (p_osm->sm.p_subn == NULL) be valid?
>
>   
I didn't know that, I was just checking my pointers to be sure.
>> +	if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){
>> +	  cl_spinlock_release(&p_osm->sm.state_lock);
>> +	  sleep(1);
>> +	  goto getguid_wait_init;
>> +	}
>>     
>
> This will cause to endless loop when OpenSM is in Standby or Inactive
> states.
>
>   
This is some code I used for another function that looks at LFT table.
In the other case, I need the SM to be master.
I'll change it.
>> +	cl_spinlock_release(&p_osm->sm.state_lock);
>> +	if(p_osm->sm.p_subn->need_update != 0){
>> +	  sleep(1);
>> +	  goto getguid_wait_init;
>> +	}
>>     
>
> Subnet discovery/setup could take some time. A user may want to use the
> console for other things during this time. I don't think that sleeping is
> suitable here; better to print a "try later" message or something similar.
>
>   
See comment below
>
>> +	p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl);
>> +	while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) {
>> +
>> +		p_port = p_next_port;
>> +		p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item);
>> +
>> +		for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){
>> +			if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){
>> +				fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid));
>> +			}
>> +		}
>> +	}
>> +	
>> +CL_PLOCK_RELEASE(p_osm->sm.p_lock);
>> +	if(output != out)
>> +		fclose(output);
>> +	if(exit_after_run)
>> +		osm_exit_flag = 1;
>>     
>
> Why this 'exit_after_run'?
>
> If you need functionality to exit OpenSM triggered from the console (but it
> is not clear to me why), use another command.
>
>   

For the last 2 comments, the purpose is to be able to easily script the
configuration file generation. We have netlist generation here and it's
much easier to be able to just do
echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" |
opensm ...


Nicolas


From yosefe at Voltaire.COM  Mon Feb  9 08:49:07 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Mon, 09 Feb 2009 18:49:07 +0200
Subject: [ofa-general] RE: impossibility to bind a device/port with the
 rdma-cm when the port is down
In-Reply-To: <DC4530D43E764B5A90F1D74B196FC595@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902031108310.4470@zuben.voltaire.com>	<F91D1E3103634CD3A1510EC1B00CC754@amr.corp.intel.com>	<49893FAF.3090007@voltaire.com>	<7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com>	<4989E6D6.5030109@Voltaire.COM>
	<DC4530D43E764B5A90F1D74B196FC595@amr.corp.intel.com>
Message-ID: <49905E83.3020508@Voltaire.COM>

  When rdma_resolve_addr() is called while the relevant port is down, the
function fails and the rdma_cm id is not bound to the device. The application
therefore has no device handle and cannot wait for the port to become active.
The function fails because ipoib has not joined the multicast group, so the SA
has no multicast record to take a qkey from.
  The proposed patch makes qkey resolution lazy: cma_set_qkey sets
id_priv->qkey if it is not already set, and is called just before the qkey is
actually required.
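
  As a rough illustration of the lazy-resolution idea only (not the kernel
code): the sketch below uses hypothetical stand-in types and a fake lookup
function. The names conn_id, lookup_mcast_qkey and the constant values are
assumptions made for the example; only the control flow mirrors the
description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's rdma_id_private or SA API. */
enum port_space { PS_UDP, PS_IPOIB };

struct conn_id {
	enum port_space ps;
	uint32_t qkey;     /* 0 means "not resolved yet" */
	int port_active;   /* whether a multicast record can be found */
	int lookups;       /* counts how often the slow path ran */
};

#define UDP_QKEY 0x01234567u /* illustrative constant */

/* Pretend SA lookup: fails while the port is down. */
static int lookup_mcast_qkey(struct conn_id *id, uint32_t *qkey)
{
	id->lookups++;
	if (!id->port_active)
		return -1;
	*qkey = 0x0b1bu; /* made-up value for the example */
	return 0;
}

/* Resolve the qkey only when a caller actually needs it. */
static int set_qkey(struct conn_id *id)
{
	uint32_t qkey;
	int ret = 0;

	assert(id);
	if (id->qkey)
		return 0; /* already resolved, nothing to do */

	switch (id->ps) {
	case PS_UDP:
		id->qkey = UDP_QKEY;
		break;
	case PS_IPOIB:
		ret = lookup_mcast_qkey(id, &qkey);
		if (!ret)
			id->qkey = qkey;
		break;
	}
	return ret;
}
```

  Binding to the device no longer depends on the lookup; a failed lookup
leaves qkey at 0, so a later call simply retries once the port is up.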

Signed-off-by: Yossi Etigin <yosefe at voltaire.com>
Acked-by: Sean Hefty <sean.hefty at intel.com>

---
Fix checkpatch.pl error.

 drivers/infiniband/core/cma.c |   41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

Index: kernel-ib/drivers/infiniband/core/cma.c
===================================================================
--- kernel-ib.orig/drivers/infiniband/core/cma.c	2009-02-04 20:40:20.000000000 +0200
+++ kernel-ib/drivers/infiniband/core/cma.c	2009-02-09 18:45:13.000000000 +0200
@@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r
 	id_priv->cma_dev = NULL;
 }
 
-static int cma_set_qkey(struct ib_device *device, u8 port_num,
-			enum rdma_port_space ps,
-			struct rdma_dev_addr *dev_addr, u32 *qkey)
+static int cma_set_qkey(struct rdma_id_private *id_priv)
 {
 	struct ib_sa_mcmember_rec rec;
 	int ret = 0;
 
-	switch (ps) {
+	if (id_priv->qkey)
+		return 0;
+
+	switch (id_priv->id.ps) {
 	case RDMA_PS_UDP:
-		*qkey = RDMA_UDP_QKEY;
+		id_priv->qkey = RDMA_UDP_QKEY;
 		break;
 	case RDMA_PS_IPOIB:
-		ib_addr_get_mgid(dev_addr, &rec.mgid);
-		ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec);
-		*qkey = be32_to_cpu(rec.qkey);
+		ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid);
+		ret = ib_sa_get_mcmember_rec(id_priv->id.device,
+					     id_priv->id.port_num, &rec.mgid,
+					     &rec);
+		if (!ret)
+			id_priv->qkey = be32_to_cpu(rec.qkey);
 		break;
 	default:
 		break;
@@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i
 		ret = ib_find_cached_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
-			ret = cma_set_qkey(cma_dev->device,
-					   id_priv->id.port_num,
-					   id_priv->id.ps, dev_addr,
-					   &id_priv->qkey);
-			if (!ret)
-				cma_attach_to_dev(id_priv, cma_dev);
+			cma_attach_to_dev(id_priv, cma_dev);
 			break;
 		}
 	}
@@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd
 	*qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT;
 
 	if (cma_is_ud_ps(id_priv->id.ps)) {
+		ret = cma_set_qkey(id_priv);
+		if (ret)
+			return ret;
+
 		qp_attr->qkey = id_priv->qkey;
 		*qp_attr_mask |= IB_QP_QKEY;
 	} else {
@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i
 			event.status = ib_event->param.sidr_rep_rcvd.status;
 			break;
 		}
+		ret = cma_set_qkey(id_priv);
+		if (ret) {
+			event.event = RDMA_CM_EVENT_ADDR_ERROR;
+			event.status = -EINVAL;
+			break;
+		}
 		if (id_priv->qkey != rep->qkey) {
 			event.event = RDMA_CM_EVENT_UNREACHABLE;
 			event.status = -EINVAL;
@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma
 			     const void *private_data, int private_data_len)
 {
 	struct ib_cm_sidr_rep_param rep;
+	int ret;
 
 	memset(&rep, 0, sizeof rep);
 	rep.status = status;
 	if (status == IB_SIDR_SUCCESS) {
+		ret = cma_set_qkey(id_priv);
+		if (ret)
+			return ret;
 		rep.qp_num = id_priv->qp_num;
 		rep.qkey = id_priv->qkey;
 	}

-- 
--Yossi


From bboas at systemfabricworks.com  Mon Feb  9 08:50:31 2009
From: bboas at systemfabricworks.com (Bill Boas)
Date: Mon, 9 Feb 2009 08:50:31 -0800
Subject: [ofa-general] RE: [ewg] OFED (EWG) meeting agenda for tomorrow (Feb
	09)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com>
Message-ID: <C8BF0D1945D84AA0BB272A1673B99E05@BillGWAYLAPTOP>

Tziporet, EWG members and OFA general list readers

 
Attached is the draft agenda as of Friday morning last week, a few changes
since then.

Also attached is HPC wire's re-print of the press release.

These are sent out as background updates for the RWG call today and to
provide information for those considering attending the Sonoma Workshop.

The MWG of OFA, chaired by Wayne Augsburger, welcomes your feedback, input
and comments - and your presence in Sonoma Mar 22-25

I'll be on the call today in 10 mins.

Bill.

Bill Boas
Executive Director and Vice Chair OFA
VP, Business  Development
System Fabric Works
510-375-8840
bboas at systemfabricworks.com
www.systemfabricworks.com

From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren
Sent: Sunday, February 08, 2009 7:58 AM
To: Tziporet Koren; ewg at lists.openfabrics.org
Cc: general at lists.openfabrics.org
Subject: [ewg] OFED (EWG) meeting agenda for tomorrow (Feb 09)

 
These are the agenda items for the meeting tomorrow:

1. OFED 1.4.1 release status:

*	New OSes:
	*	RH 5.3 - done, we still have an issue with Itanium
	*	SLES 11 - schedule is OK. RC3 already available - Any volunteers
		to prepare the backports?
*	Open MPI 1.3 - I heard there are some critical bugs. What is the
	status of 1.3.1? - Jeff S.
*	RDS with iWARP support - Steve
*	NFS/RDMA backports - Steve
*	Critical bug fixes
	As far as I know these are the critical bugs that should be fixed:

	1383    blo     P3      jackm at mellanox.co.il    Local protection error on transmit from ipoib datagram to...
	1471    cri     P3      amirv at mellanox.co.il    Performance degradation in ofed 1.4

	Please send more bugs that are critical for the release

2. Decide on 1.4.1 schedule:

Proposal:

*	RC1 - Mar 3
*	RC2 - Mar 17
*	RC3 - Mar 31
*	GA  - Apr 7

 
3. Sonoma updates (if any) - Bill Boas

Tziporet

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/7c648875/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Draft Sonoma 2009  agenda for Feb 6 MWG review.xls
Type: application/vnd.ms-excel
Size: 101888 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/7c648875/attachment.xls>
-------------- next part --------------
An embedded message was scrubbed...
From: <Saved by Windows Internet Explorer 7>
Subject: HPCwire: OFA to Host 5th Annual International Sonoma Workshop
Date: Mon, 9 Feb 2009 08:44:40 -0800
Size: 871805
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/7c648875/attachment.mht>

From randy.dunlap at oracle.com  Mon Feb  9 08:53:39 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Mon, 09 Feb 2009 08:53:39 -0800
Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband)
In-Reply-To: <20090209193908.1a448944.sfr@canb.auug.org.au>
References: <20090209193908.1a448944.sfr@canb.auug.org.au>
Message-ID: <49905F93.300@oracle.com>

Stephen Rothwell wrote:
> Hi all,
> 
> [I accidentally deleted the merge and quilt-import logs today :-( - I
> wonder if any would have noticed :-).  The merge summary still appears
> below.]
> 
> Changes since 20090206:


allyesconfig build on i386 fails with:

drivers/built-in.o: In function `iwch_sgl2pbl_map':
/usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237: undefined reference to `__umoddi3'
make: *** [.tmp_vmlinux1] Error 1


or allmodconfig on i386 fails with:

ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined!

-- 
~Randy


From swise at opengridcomputing.com  Mon Feb  9 09:00:08 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 09 Feb 2009 11:00:08 -0600
Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband)
In-Reply-To: <49905F93.300@oracle.com>
References: <20090209193908.1a448944.sfr@canb.auug.org.au>
	<49905F93.300@oracle.com>
Message-ID: <49906118.3060801@opengridcomputing.com>

Randy Dunlap wrote:
> Stephen Rothwell wrote:
>   
>> Hi all,
>>
>> [I accidentally deleted the merge and quilt-import logs today :-( - I
>> wonder if any would have noticed :-).  The merge summary still appears
>> below.]
>>
>> Changes since 20090206:
>>     
>
>
> allyesconfig build on i386 fails with:
>
> drivers/built-in.o: In function `iwch_sgl2pbl_map':
> /usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237: undefined reference to `__umoddi3'
> make: *** [.tmp_vmlinux1] Error 1
>
>
> or allmodconfig on i386 fails with:
>
> ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined!
>
>   

Somehow changing offset to a u64 must have caused this.  What is 
__umoddi3?  (it can't be good) :)

Steve


From randy.dunlap at oracle.com  Mon Feb  9 09:01:11 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Mon, 09 Feb 2009 09:01:11 -0800
Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband)
In-Reply-To: <49906118.3060801@opengridcomputing.com>
References: <20090209193908.1a448944.sfr@canb.auug.org.au>
	<49905F93.300@oracle.com> <49906118.3060801@opengridcomputing.com>
Message-ID: <49906157.9090707@oracle.com>

Steve Wise wrote:
> Randy Dunlap wrote:
>> Stephen Rothwell wrote:
>>  
>>> Hi all,
>>>
>>> [I accidentally deleted the merge and quilt-import logs today :-( - I
>>> wonder if any would have noticed :-).  The merge summary still appears
>>> below.]
>>>
>>> Changes since 20090206:
>>>     
>>
>>
>> allyesconfig build on i386 fails with:
>>
>> drivers/built-in.o: In function `iwch_sgl2pbl_map':
>> /usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237:
>> undefined reference to `__umoddi3'
>> make: *** [.tmp_vmlinux1] Error 1
>>
>>
>> or allmodconfig on i386 fails with:
>>
>> ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined!
>>
>>   
> 
> Somehow changing offset to a u64 must have caused this.  What is
> __umoddi3?  (it can't be good) :)

It's some kind of mod operation, like 64-bit % 32-bit or
64-bit % 64-bit.  Should be in a fairly recent change.
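
For background: on 32-bit x86, gcc turns an unsigned 64-bit '%' into a
call to the libgcc helper __umoddi3, and the kernel is not linked against
libgcc, hence the undefined reference. Kernel code usually avoids the
operator (do_div() is the common helper). The userspace sketch below only
illustrates two ways to get a remainder without a 64-bit '%'; the function
names are made up for the example and this is not the cxgb3 fix.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder for a power-of-two divisor: a mask, no division at all. */
static uint32_t mod_pow2(uint64_t value, uint32_t divisor)
{
	assert(divisor != 0 && (divisor & (divisor - 1)) == 0);
	return (uint32_t)(value & (divisor - 1));
}

/* General 64-by-32 remainder by shift-and-subtract, no 64-bit '%'. */
static uint32_t mod_u64_u32(uint64_t value, uint32_t divisor)
{
	uint64_t rem = 0;
	int bit;

	assert(divisor != 0);
	for (bit = 63; bit >= 0; bit--) {
		/* Bring down the next bit of the dividend. */
		rem = (rem << 1) | ((value >> bit) & 1);
		if (rem >= divisor)
			rem -= divisor;
	}
	return (uint32_t)rem;
}
```

If the divisor in iwch_sgl2pbl_map is something like a page size, the mask
form would be enough, but that is a guess without looking at the change.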


-- 
~Randy


From weiny2 at llnl.gov  Mon Feb  9 09:04:01 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 9 Feb 2009 09:04:01 -0800
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
Message-ID: <20090209090401.3eac78a5.weiny2@llnl.gov>

On Fri, 6 Feb 2009 14:47:17 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Fri, Feb 6, 2009 at 2:12 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> > Hi,
> >
> > I'm looking at adding pkey support into the OpenSM vendor layer. The
> > pkey table is a per port structure and is part of ib_port_attr_t. That
> > structure also include num_pkeys. There is only related API:
> > osm_vendor_get_all_port_attr which takes several pointers, the second
> > one is a pointer to a preallocated array of port attributes (memory
> > allocation for that is done by the client). ib_port_attr_t includes a
> > pointer to the pkey table. So the only way this can work is if that
> > allocation is also done by the client which makes that a valid
> > parameter on input (as well as output). Similarly for num_pkeys so the
> > vendor layer doesn't go past the end of the supplied table. So both
> > num_pkeys and p_pkey_table in that struct need to be in/out
> > parameters. num_pkeys could always be returned as the total number of
> > pkeys for the port when num_pkeys is set to 0 on input.
> >
> > Similar thing is true for gid table in ib_port_attr_t.
> >
> > I'm also not sure which vendor layers are important. I don't see how
> > to fix them all (e.g. osm_vendor_al.c is one, there are some others)
> > as some of them appear to do a straight memory to memory copy of the
> > ib_port_attr_t structure (others are OK and fixable).
> >
> > The only other alternative I see is to change this API and possibly
> > this structure which is way more disruptive and risky (especially with
> > the inability to test anything but one of the vendor layers).
> 
> Actually, although more disruptive, it might be cleaner (and safer in
> the long run) to add to the vendor API. There could be additional osm
> vendor APIs for pkeys and gids and these could return some suitable
> IB_ error from ib_types in vendor layers where they are unimplemented.
> IB_UNSUPPORTED looks good to me. I'm likely to head down this approach
> unless I hear otherwise.

This sounds more reasonable to me, better to suffer now than later...
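
To make the idea concrete, here is a minimal sketch of a separate pkey
call that reports "unsupported" when a vendor layer does not implement it.
Everything in it is hypothetical: status_t stands in for the status codes
in ib_types, and vendor_ops / vendor_get_port_pkeys are invented names,
not existing osm_vendor APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the status codes in ib_types; only three values modelled. */
typedef enum { ST_SUCCESS, ST_UNSUPPORTED, ST_INSUFFICIENT_MEMORY } status_t;

/* Hypothetical per-vendor operations table. */
struct vendor_ops {
	/* NULL when the vendor layer has no pkey support. */
	status_t (*get_pkeys)(uint8_t port, uint16_t *tbl, uint16_t *count);
};

/* Common entry point: unimplemented layers report "unsupported". */
static status_t vendor_get_port_pkeys(const struct vendor_ops *ops,
				      uint8_t port, uint16_t *tbl,
				      uint16_t *count)
{
	assert(ops && count);
	if (!ops->get_pkeys)
		return ST_UNSUPPORTED;
	return ops->get_pkeys(port, tbl, count);
}

/* Toy implementation that returns only the default pkey. */
static status_t demo_get_pkeys(uint8_t port, uint16_t *tbl, uint16_t *count)
{
	(void)port;
	if (!tbl || *count < 1)
		return ST_INSUFFICIENT_MEMORY;
	tbl[0] = 0xffff;
	*count = 1;
	return ST_SUCCESS;
}
```

Callers can then tell "no pkey support here" apart from a real failure
without ib_port_attr_t changing at all.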

Ira

> 
> -- Hal
> 
> > Thoughts ?
> >
> > -- Hal
> >


From sashak at voltaire.com  Mon Feb  9 09:16:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 19:16:08 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid
	function to console to generate a list of guid matching one or more
	regexps
In-Reply-To: <4990542A.5040907@morey-chaisemartin.com>
References: <498FE89C.2020304@ext.bull.net>
	<20090209151451.GI26139@sashak.voltaire.com>
	<4990542A.5040907@morey-chaisemartin.com>
Message-ID: <20090209171608.GJ26139@sashak.voltaire.com>

On 17:04 Mon 09 Feb     , Nicolas Morey-Chaisemartin wrote:
> >   
> Dumping port guid is specially useful to generate config files.  I've
> never had the need to dump nodeguid. If people need it, why not make a
> global dump.
> If not, it may be simpler to rename to dump_portguids

Sure, we can start this way.

> >
> > Try opensm/osm_indent (many places in the patch will be affected).
> >
> >   
> Last time I tried osm_indent, it introduced a real lot of changes to the
> code (even the one I didn't edited) so I haven't used it on my patches.

You can extract the relevant changes by editing the diff file. In any case
osm_indent will give you an idea of how the code should be formatted.

> I'll fix the indentation.
> >> +	/* Ensure this SM is master (so we have the LFT) */
> >> +
> >> + getguid_wait_init:
> >> +	if(osm_exit_flag)
> >> +		return;
> >> +	cl_spinlock_acquire(&p_osm->sm.state_lock);
> >> +	/* If the subnet struct is not properly initialized, we exit */
> >> +	if(p_osm->sm.p_subn == NULL){
> >> +	  cl_spinlock_release(&p_osm->sm.state_lock);
> >> +	  sleep(1);
> >> +	  goto getguid_wait_init;
> >> +	}
> >>     
> >
> > The console is initialized after osm_subnet. When will the case
> > (p_osm->sm.p_subn == NULL) be valid?
> >
> >   
> I didn't knew that, I was just checking my pointers to be sure.
> >> +	if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){
> >> +	  cl_spinlock_release(&p_osm->sm.state_lock);
> >> +	  sleep(1);
> >> +	  goto getguid_wait_init;
> >> +	}
> >>     
> >
> > This will cause to endless loop when OpenSM is in Standby or Inactive
> > states.
> >
> >   
> This is some code I used for another function that looks at LFT table.

It is not in the mainline tree, right?

> In the other case, I need the SM to be master.
> I'll change it.
> >> +	cl_spinlock_release(&p_osm->sm.state_lock);
> >> +	if(p_osm->sm.p_subn->need_update != 0){
> >> +	  sleep(1);
> >> +	  goto getguid_wait_init;
> >> +	}
> >>     
> >
> > Subnet discovery/setup could take some time. An user may want to use
> > console for other things in this time. I don't think that sleeping is
> > suitable here, better to print "try later" message or like this.
> >
> >   
> See comment below
> >
> >> +	p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl);
> >> +	while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) {
> >> +
> >> +		p_port = p_next_port;
> >> +		p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item);
> >> +
> >> +		for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){
> >> +			if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){
> >> +				fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid));
> >> +			}
> >> +		}
> >> +	}
> >> +	
> >> +CL_PLOCK_RELEASE(p_osm->sm.p_lock);
> >> +	if(output != out)
> >> +		fclose(output);
> >> +	if(exit_after_run)
> >> +		osm_exit_flag = 1;
> >>     
> >
> > Why this 'exit_after_run'?
> >
> > If you need functionality to exit OpenSM triggered from console (but it
> > is not clear for me why) use another command.
> >
> >   
> 
> For the last 2 comments, the purpose is to be able to easily script the
> configuration file generation. We have netlist generation here and it's
> much easier to be able to just do
> echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" |
> opensm ...

Hmm, OpenSM's main purpose is quite different from just generating fabric
statistics dumps :). If the only thing you need is a list of port guids, you
can parse 'ibnetdiscover' output - it will be much faster and not
destructive (you can even find a trivial script in the ibsim tree -
'tests/get_all_ca_port_guids.sh').

And in any case the "two command approach" can work via a pipe too:

( echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" ; \
  echo "exit_opensm" ) | opensm ...

Sasha


From kliteyn at dev.mellanox.co.il  Mon Feb  9 10:36:01 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Feb 2009 20:36:01 +0200
Subject: [ofa-general] Re: [PATCH] opensm/qos_config: no invalid option
 message on default values
In-Reply-To: <20090208225412.GA24514@sashak.voltaire.com>
References: <20090208225412.GA24514@sashak.voltaire.com>
Message-ID: <49907791.7050905@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Don't complain about invalid QoS options when their default values are used.

Looks good. This also fixes bug #1451:
https://bugs.openfabrics.org/show_bug.cgi?id=1451

-- Yevgeny

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_subnet.c |   18 +++++++++---------
>  1 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 3324af9..69937c1 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -911,9 +911,11 @@ static ib_api_status_t osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn
>   **********************************************************************/
>  static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt)
>  {
> -	if (!(*max_vls) || *max_vls > 15) {
> -		log_report(" Invalid Cached Option: %s_max_vls=%u: "
> -			   "Using Default = %u\n", prefix, *max_vls, dflt);
> +	if (!*max_vls || *max_vls > 15) {
> +		if (*max_vls)
> +			log_report(" Invalid Cached Option: %s_max_vls=%u: "
> +				   "Using Default = %u\n",
> +				   prefix, *max_vls, dflt);
>  		*max_vls = dflt;
>  	}
>  }
> @@ -921,8 +923,10 @@ static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned
>  static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt)
>  {
>  	if (*high_limit < 0 || *high_limit > 255) {
> -		log_report(" Invalid Cached Option: %s_high_limit=%d: "
> -			   "Using Default: %d\n", prefix, *high_limit, dflt);
> +		if (*high_limit > 255)
> +			log_report(" Invalid Cached Option: %s_high_limit=%d: "
> +				   "Using Default: %d\n",
> +				   prefix, *high_limit, dflt);
>  		*high_limit = dflt;
>  	}
>  }
> @@ -934,8 +938,6 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix,
>  	int count = 0;
>  
>  	if (*vlarb == NULL) {
> -		log_report(" Invalid Cached Option: %s_vlarb_%s: "
> -		"Using Default\n", prefix, suffix);
>  		*vlarb = strdup(dflt);
>  		return;
>  	}
> @@ -1003,8 +1005,6 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt)
>  	int count = 0;
>  
>  	if (*sl2vl == NULL) {
> -		log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n",
> -			   prefix);
>  		*sl2vl = strdup(dflt);
>  		return;
>  	}
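
For anyone reading along, the behaviour after the patch can be tried out
with a small standalone sketch. It copies the logic of
subn_verify_max_vls() from the diff above, but log_report() is replaced by
fprintf() plus a counter, and the names verify_max_vls and reports are
made up for the example.

```c
#include <assert.h>
#include <stdio.h>

static int reports; /* counts messages so the behaviour can be observed */

/* 0 means "option not set": use the default without complaining.
 * Values above 15 are really invalid: complain, then use the default. */
static void verify_max_vls(unsigned *max_vls, const char *prefix,
			   unsigned dflt)
{
	assert(max_vls && prefix);
	if (!*max_vls || *max_vls > 15) {
		if (*max_vls) {
			fprintf(stderr,
				" Invalid Cached Option: %s_max_vls=%u: "
				"Using Default = %u\n",
				prefix, *max_vls, dflt);
			reports++;
		}
		*max_vls = dflt;
	}
}
```

An unset value (0) is replaced silently, an out-of-range value still
produces the message, and a valid value is left untouched.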


From kliteyn at dev.mellanox.co.il  Mon Feb  9 10:43:42 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Feb 2009 20:43:42 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ftree: cleanup
	ftree_sw_tbl_element_t use
In-Reply-To: <20090208230406.GC24514@sashak.voltaire.com>
References: <20090208230406.GC24514@sashak.voltaire.com>
Message-ID: <4990795E.3060504@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> cl_list() allocates the memory needed for storing an object in the list -
> no need for additional wrappers like ftree_sw_tbl_element_t.

Looks good, thanks.

-- Yevgeny

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_ucast_ftree.c |   17 ++++-------------
>  1 files changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index 68900d8..10096c7 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1418,7 +1418,6 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
>  	ftree_tuple_t new_tuple;
>  	uint32_t i;
>  	cl_list_t bfs_list;
> -	ftree_sw_tbl_element_t *p_sw_tbl_element;
>  
>  	OSM_LOG_ENTER(&p_ftree->p_osm->log);
>  
> @@ -1465,14 +1464,10 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
>  	 */
>  
>  	cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
> -	cl_list_insert_tail(&bfs_list,
> -			    &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
> +	cl_list_insert_tail(&bfs_list, p_sw);
>  
>  	while (!cl_is_list_empty(&bfs_list)) {
> -		p_sw_tbl_element =
> -		    (ftree_sw_tbl_element_t *) cl_list_remove_head(&bfs_list);
> -		p_sw = p_sw_tbl_element->p_sw;
> -		__osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
> +		p_sw = (ftree_sw_t *) cl_list_remove_head(&bfs_list);
>  
>  		/* Discover all the nodes from ports that are pointing down */
>  
> @@ -1509,9 +1504,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
>  								new_tuple);
>  
>  				/* add the newly discovered switch to the BFS queue */
> -				cl_list_insert_tail(&bfs_list,
> -						    &__osm_ftree_sw_tbl_element_create
> -						    (p_remote_sw)->map_item);
> +				cl_list_insert_tail(&bfs_list, p_remote_sw);
>  			}
>  			/* Done assigning indexes to all the remote switches
>  			   that are pointed by the downgoing ports.
> @@ -1547,9 +1540,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree)
>  								p_remote_sw,
>  								new_tuple);
>  				/* add the newly discovered switch to the BFS queue */
> -				cl_list_insert_tail(&bfs_list,
> -						    &__osm_ftree_sw_tbl_element_create
> -						    (p_remote_sw)->map_item);
> +				cl_list_insert_tail(&bfs_list, p_remote_sw);
>  			}
>  			/* Done assigning indexes to all the remote switches
>  			   that are pointed by the upgoing ports.
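
A generic sketch of the pattern, for illustration only: a breadth-first
walk whose queue stores the object pointers directly, with no wrapper
element per entry. It does not use complib; struct sw, ptr_queue and
bfs_rank are hypothetical names, and the fixed-size array stands in for
whatever storage cl_list manages internally.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

struct sw { /* hypothetical switch object */
	int id;
	int rank;           /* -1 until visited */
	struct sw *down[2]; /* child switches, NULL when absent */
};

/* Fixed-size FIFO that stores struct sw pointers directly. */
struct ptr_queue {
	struct sw *items[MAX_NODES];
	size_t head, tail;
};

static void q_push(struct ptr_queue *q, struct sw *s)
{
	assert(q->tail < MAX_NODES);
	q->items[q->tail++] = s;
}

static struct sw *q_pop(struct ptr_queue *q)
{
	return q->head < q->tail ? q->items[q->head++] : NULL;
}

/* Breadth-first walk assigning ranks; returns the number of switches seen. */
static int bfs_rank(struct sw *root)
{
	struct ptr_queue q = { {0}, 0, 0 };
	struct sw *cur;
	int visited = 0;
	size_t i;

	root->rank = 0;
	q_push(&q, root);
	while ((cur = q_pop(&q)) != NULL) {
		visited++;
		for (i = 0; i < 2; i++) {
			struct sw *child = cur->down[i];

			if (child && child->rank < 0) {
				child->rank = cur->rank + 1;
				/* enqueue the newly found switch, not cur */
				q_push(&q, child);
			}
		}
	}
	return visited;
}
```

Each switch is queued at most once, so the queue never needs more slots
than there are switches.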


From sashak at voltaire.com  Mon Feb  9 11:23:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 21:23:26 +0200
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To: <20090209090401.3eac78a5.weiny2@llnl.gov>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<f0e08f230902061147q17d1f74ev32a4ec221c5e3e5c@mail.gmail.com>
	<20090209090401.3eac78a5.weiny2@llnl.gov>
Message-ID: <20090209192326.GK26139@sashak.voltaire.com>

On 09:04 Mon 09 Feb     , Ira Weiny wrote:
> > 
> > Actually, although more disruptive, it might be cleaner (and safer in
> > the long run) to add to the vendor API. There could be additional osm
> > vendor APIs for pkeys and gids and these could return some suitable
> > IB_ error from ib_types in vendor layers where they are unimplemented.
> > IB_UNSUPPORTED looks good to me. I'm likely to head down this approach
> > unless I hear otherwise.
> 
> This sounds more reasonable to me, better to suffer now than later...

I don't see how it is "safer" in the long run than just extending.

Adding new APIs now will require adding them to the other vendor
implementations as well (without any real possibility of testing them :( ).
Extending osm_vendor_get_all_port_attr() only requires fixing port_array
initializations (I guess it is 3-5 places in total in the opensm and ibutils
trees), and the other vendor implementations will automatically behave as
"unsupported" - no pkey table will be returned.

Not to mention that by following this approach we open the way to
adding various new "doesn't make sense" API calls for each port/whatever
attribute.... :)
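
A small sketch of the "extend the existing structure" variant, under
stated assumptions: port_attr below is a trimmed, hypothetical stand-in
for ib_port_attr_t, fill_pkeys is an invented name, and the rule "capacity
0 on input returns the total count" is taken from Hal's original proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for ib_port_attr_t. */
struct port_attr {
	uint16_t num_pkeys;     /* in: table capacity; out: entries written,
				   or the total when capacity was 0 */
	uint16_t *p_pkey_table; /* in: caller-allocated buffer, may be NULL */
};

/* Made-up pkey table of one port, for the example only. */
static const uint16_t port_pkeys[] = { 0xffff, 0x8001, 0x8002 };
#define PORT_PKEY_COUNT \
	((uint16_t)(sizeof(port_pkeys) / sizeof(port_pkeys[0])))

/* What a vendor layer supporting the in/out fields might do. */
static void fill_pkeys(struct port_attr *attr)
{
	uint16_t n;

	assert(attr);
	if (attr->num_pkeys == 0 || !attr->p_pkey_table) {
		/* Query mode: only report how many pkeys exist. */
		attr->num_pkeys = PORT_PKEY_COUNT;
		return;
	}
	/* Never write past the capacity the caller supplied. */
	n = attr->num_pkeys < PORT_PKEY_COUNT ? attr->num_pkeys
					      : PORT_PKEY_COUNT;
	memcpy(attr->p_pkey_table, port_pkeys, n * sizeof(port_pkeys[0]));
	attr->num_pkeys = n;
}
```

A caller that zero-initializes its port_array and talks to a vendor layer
that never touches these fields simply sees num_pkeys == 0 afterwards.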

Sasha


From dotanba at gmail.com  Mon Feb  9 11:25:32 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 09 Feb 2009 21:25:32 +0200
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** non zero lkey in send(),
	write() with num_sge > 1?
In-Reply-To: <661509.82751.qm@web111205.mail.gq1.yahoo.com>
References: <661509.82751.qm@web111205.mail.gq1.yahoo.com>
Message-ID: <4990832C.5090204@gmail.com>

Ofed User wrote:
> Hi,
>
> Can stack pass num_sge > 1, and lkey !=0 as part of sg_list[] elements, in post_send() call?
>   
What are you trying to achieve?

If num_sge > 1, the HCA will read the blocks pointed to by the sg_list
one by one and validate that each address + size lies inside a valid
Memory Region whose local key is that element's lkey.

So I guess that the answer is: Yes.
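
To illustrate that per-element check, here is a toy model in plain C. The
struct sge below has the same three fields as struct ibv_sge from
libibverbs but is redefined locally so the sketch has no dependencies;
struct mr, sge_is_valid and check_sg_list are hypothetical names, and the
real validation is of course done by the HCA, not by code like this.

```c
#include <assert.h>
#include <stdint.h>

/* Same fields as libibverbs' struct ibv_sge, redefined for the sketch. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Hypothetical model of a registered memory region. */
struct mr {
	uint64_t start;
	uint64_t length;
	uint32_t lkey;
};

/* 1 if the element lies entirely inside a region with a matching lkey. */
static int sge_is_valid(const struct sge *s, const struct mr *mrs,
			int num_mr)
{
	int i;

	for (i = 0; i < num_mr; i++) {
		if (mrs[i].lkey != s->lkey)
			continue;
		if (s->addr >= mrs[i].start &&
		    s->length <= mrs[i].length &&
		    s->addr - mrs[i].start <= mrs[i].length - s->length)
			return 1;
	}
	return 0;
}

/* Walk the list in order; index of the first bad element, or -1. */
static int check_sg_list(const struct sge *sg_list, int num_sge,
			 const struct mr *mrs, int num_mr)
{
	int i;

	assert(sg_list && mrs);
	for (i = 0; i < num_sge; i++)
		if (!sge_is_valid(&sg_list[i], mrs, num_mr))
			return i;
	return -1;
}
```

Note that each element carries its own lkey, so the entries of one
sg_list may point into different Memory Regions.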

Dotan


From sashak at voltaire.com  Mon Feb  9 11:44:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 9 Feb 2009 21:44:19 +0200
Subject: [ofa-general] Re: [PATCH v3] opensm/osm_ucast_ftree.c: Fixed bug on
	index port incrementation
In-Reply-To: <49905202.3050406@ext.bull.net>
References: <49905202.3050406@ext.bull.net>
Message-ID: <20090209194419.GL26139@sashak.voltaire.com>

On 16:55 Mon 09 Feb     , Nicolas Morey Chaisemartin wrote:
> This patch fixes a bug in index port incrementation in the fat-tree 
> algorithm.
> Problem happens (at least) with a 4 level Fat tree as below:
>
>
>                          L3  L3
>        ___________________|__|____________________
>       /          /               \               \                <= All 
> the L2 are connected on 2 L3 switches
>    L2-1         L2-2            L2-1           L2-2
>   /             /                 \              \                 <== The 
> Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
>   L1           L1                 L1             L1
>   /|\         /|\                 /|\           /|\
>  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have 
> multiple set. In each set, all L0 lead to all L1 of their set.
>
>    L0           L0                 L0           L0
>  /   \        /    \             /    \       /     \
> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>
>
> To detail:
> We have a bunch of sets. Each set contains compute node, L0 and L1 
> switches.
> Plus a common top of L2 and L3 switches.
>
> In each set, there are groups of compute nodes. Each group is connected to 
> a single L0 switch.
> In a given set, all L0 are connected to all L1.
>
> The Nth L1 of a set is connected to the Nth L2 and only to this one. (so 
> through a L2, the Nth L1 can only see the Nth L1 of the other sets)
> All the L2 are connected to a couple of L3.
>
>
> If we dont put the L3. We have a perfectly balanced fat tree and well 
> equilibrated routes.
> But when we add the L3, it introduce a huge difference. As it is not 
> necessary, no route is going through L3 (which is fine).
> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 1/4 
> is twice overused (compared to the balanced state).
>
> This comes from the down_port_groups_idx which is incremented each time the 
> algorithm goes down through a node whether it creates routes to HCA (port 
> != switch)
> or not. As route coming up from a L1 reaches only one L2, the algorithm 
> goes through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific but in a case where your L1 doesn't have 
> full connectivity to all your L2, and another switch rank above, the 
> problem may appear.
>
> To avoid this problem,  __osm_ftree_fabric_route_upgoing_by_going_down 
> function has been changed so it returns a value to indicate if routes to 
> HCA (in fact to leaf switch) were created.
> With this information, we only increase the index when the algorithm has 
> created routes to HCA.
> After applying this patch and measuring the link usage, we are perfectly 
> balanced  (L2<->L3 links are still not used but that is to be expected).
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.
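
For the archive, the core of the fix can be sketched in a few lines. This
is a heavily simplified model, not the osm_ucast_ftree.c code: sw_node,
route_by_going_down and pass_through are hypothetical names, and the
"reaches_leaf" flag stands in for the real recursive descent.

```c
#include <assert.h>

/* Hypothetical, simplified switch state. */
struct sw_node {
	int reaches_leaf;         /* 1 if going down from here reaches a leaf */
	unsigned down_port_idx;   /* round-robin index over down port groups */
	unsigned num_down_groups; /* number of down port groups */
};

/* Stand-in for the descent: reports whether any routes were created. */
static int route_by_going_down(const struct sw_node *sw)
{
	return sw->reaches_leaf ? 1 : 0;
}

/* Advance the index only when the descent actually created routes. */
static void pass_through(struct sw_node *sw)
{
	assert(sw && sw->num_down_groups > 0);
	if (route_by_going_down(sw))
		sw->down_port_idx =
		    (sw->down_port_idx + 1) % sw->num_down_groups;
}
```

A switch that is only traversed, without routes being created below it,
keeps its index, so its port groups are not skipped later.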

Sasha


From stan.smith at intel.com  Mon Feb  9 13:16:33 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Mon, 9 Feb 2009 13:16:33 -0800
Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names
 missing from osm_vendor_t ?
In-Reply-To: <498F5E7B.6020208@dev.mellanox.co.il>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>

Hello all,
  My initial query, sadly muddled by my mixing up the mad and umad interfaces, was asking whether it is permissible for the Windows OpenSM vendor-ibal to depend on umad.

In order for the OFED saquery to work correctly in the Windows environment, saquery code expects the osm_vendor_t struct to have the embedded ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] definition.
This definition creates two umad dependencies:
  1) #define UMAD_MAX_DEVICES + #define UMAD_CA_NAME_LEN
  2) a umad_get_cas_names() call to populate the osm_vendor_t.ca_names struct.

The current version of OpenSM vendor_umad already has the umad dependency, so it seemed somewhat reasonable to introduce this dependency in Windows OpenSM.

This OpenSM change is considered temporary until such a time as we find a Windows opensm maintainer who has cycles to move Windows OpenSM forward to the current OFED OpenSM code base; Ishai has stated you are unavailable due to other project responsibilities.
Much of Sean's WinVerbs/WinMAD/libmad/libumad Windows work provides the necessary infrastructure to make porting the latest OFED OpenSM much easier.
I see there is an svn branch where someone is working on OpenSM? Any ideas as to what's going on here?

My proposal for the Windows OpenSM code base is to add ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to the OpenSM vendor-ibal definition of osm_vendor_t, plus a call to umad_get_cas_names() to populate osm_vendor_t.ca_names.
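
A minimal sketch of what this proposal amounts to, under stated assumptions: the struct, the constants and the demo_get_cas_names() stub below are hypothetical stand-ins (the real constants and umad_get_cas_names() come from libibumad's umad.h, and the real osm_vendor_t has many more fields). It only illustrates caching the CA names in the vendor struct and resolving a '-C ca_name' argument against them.

```c
#include <string.h>

/* Stand-in values: the real constants come from libibumad's umad.h. */
#define DEMO_MAX_DEVICES 20
#define DEMO_CA_NAME_LEN 20

/* Hypothetical, trimmed-down vendor struct; only the proposed field is shown. */
typedef struct demo_vendor {
	char ca_names[DEMO_MAX_DEVICES][DEMO_CA_NAME_LEN];
	int ca_count;
} demo_vendor_t;

/*
 * Stub modelled on umad_get_cas_names(): fills up to 'max' names and
 * returns how many were written. The device names are made up.
 */
static int demo_get_cas_names(char cas[][DEMO_CA_NAME_LEN], int max)
{
	static const char *found[] = { "mthca0", "mlx4_0" };
	int n = (int)(sizeof(found) / sizeof(found[0]));
	int i;

	if (n > max)
		n = max;
	for (i = 0; i < n; i++) {
		strncpy(cas[i], found[i], DEMO_CA_NAME_LEN - 1);
		cas[i][DEMO_CA_NAME_LEN - 1] = '\0';
	}
	return n;
}

/* Populate the vendor struct once at open time. Returns 0 on success. */
static int demo_vendor_init(demo_vendor_t *v)
{
	memset(v, 0, sizeof(*v));
	v->ca_count = demo_get_cas_names(v->ca_names, DEMO_MAX_DEVICES);
	return v->ca_count < 0 ? -1 : 0;
}

/* Resolve a '-C ca_name' argument against the cached names; -1 if absent. */
static int demo_find_ca(const demo_vendor_t *v, const char *name)
{
	int i;

	for (i = 0; i < v->ca_count; i++)
		if (strcmp(v->ca_names[i], name) == 0)
			return i;
	return -1;
}
```

With the real library, the stub would be replaced by the umad call, which is where the two umad dependencies listed above come from.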

Comments?

Thanks,

Stan.


Yevgeny Kliteynik wrote:
> Yevgeny Kliteynik wrote:
>> Hi Stan,
>
> Oops... Looks like I was having a problem with my mail client.
> By now my response is partially outdated...
>
> -- Yevgeny
>
>> Adding Sasha (OFED management maintainer)
>> and the openib mailing list.
>>
>> Stan C. Smith wrote:
>>> Hello,
>>>   The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in
>>> opensm\user\include\vendor\osm_vendor_al.h) is missing
>>> the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
>>> saquery.c expects to find ca_names in osm_vendor_t.
>>>
>>> A couple of observations:
>>> 1) Windows currently supports a much older version of opensm than
>>> what OFED 1.4 tools expect.
>>
>> Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with couple
>> of minor fixes.
>>
>>> 2) saquery uses OpenSM mad interfaces while it 'could' be using
>>> libibmad interfaces.
>>
>> By "OpenSM mad interfaces" you mean libosmvendor?
>>
>>>    If libibmad from saquery, then OpenSM would not need libibmad
>>> references for Windows.
>>
>> Not sure what you mean here. You mean removing libibmad dependency
>> from saquery?
>>
>>> 3) How bad is it to create libibmad dependencies for OpenSM?
>>
>> Pretty bad. I don't think we should add a new dependency unless
>> there's a really good reason for it.
>>
>>> 4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD
>>>    interfaces; the rest use libibmad.
>>>
>>> Most of the OFED diagnostic tools support the cmd-line option '-C
>>> ca_name'. This cmd-line input is resolved thru
>>> libibmad interfaces.
>>> Saquery is no exception in that it expects to match the '-C ca_name'
>>> against osm_vendor_t.ca_names[]. 'ibstat -l' lists
>>> CA names.
>>>
>>> The question becomes how best to resolve the missing ca_names?
>>>
>>> 1) modify saquery to call libibmad interface to get CA names;
>>>    leaves osm_vendor_t unmodified. So far, saquery is the only diag
>>>    pgm which uses OSM mad interfaces; expecting ca_names in
>>> osm_vendor_t.
>>>
>>> 2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names
>>>    and populate ca_names from OpenSM code?
>>
>> I'd say that this option is much better.
>>
>>>    Creates libibmad dependencies for opensm.
>>
>> But it doesn't have to. Can IBAL expose some function to get these
>> names, so that Win osmvendor will use this function instead of
>> libibmad?
>>
>> Also, Linux osmvendor doesn't have libibmad dependency.
>> It uses umad function umad_get_cas_names() to obtain the CA names.
>> I know that there is a Windows version of umad, but I don't know
>> what is its status. If we *have* to add an additional dependency,
>> then it should be libibumad and not libibmad.
>>
>> At some point in the future we would really want to have the new
>> version of OFED OpenSM ported to WinOF. If there will be a match
>> between Linux and Windows libraries, then the whole vendor concept
>> can be simplified and there won't be a need to have a separate
>> vendor for IBAL. The things
>> that would be different are platform-dependent issues like threads,
>> locks, syslog, but not IB-related code.
>>
>> -- Yevgeny
>>
>>
>>> Comments?
>>>
>>> Thanks,
>>>
>>> Stan.
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> ofw mailing list
>> ofw at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw


From kliteyn at dev.mellanox.co.il  Mon Feb  9 14:46:39 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 10 Feb 2009 00:46:39 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ftree: simplify root guids setup.
In-Reply-To: <20090208230830.GD24514@sashak.voltaire.com>
References: <20090208230830.GD24514@sashak.voltaire.com>
Message-ID: <4990B24F.2070804@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Eliminate root_guid_list storage - parse it directly to bfs list.

Looks good, thanks.

-- Yevgeny

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_ucast_ftree.c |  101 +++++++++++++-------------------------
>  1 files changed, 35 insertions(+), 66 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index 10096c7..35f2ea1 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -100,11 +100,6 @@ struct ftree_fabric_t_;
>  typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN];
>  typedef uint64_t ftree_tuple_key_t;
>  
> -struct guid_list_item {
> -	cl_list_item_t list;
> -	uint64_t guid;
> -};
> -
>  /***************************************************
>   **
>   **  ftree_sw_table_element_t definition
> @@ -203,7 +198,6 @@ typedef struct ftree_fabric_t_ {
>  	cl_qmap_t hca_tbl;
>  	cl_qmap_t sw_tbl;
>  	cl_qmap_t sw_by_tuple_tbl;
> -	cl_qlist_t root_guid_list;
>  	cl_qmap_t cn_guid_tbl;
>  	unsigned cn_num;
>  	uint8_t leaf_switch_rank;
> @@ -886,8 +880,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
>  	cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
>  	cl_qmap_init(&p_ftree->cn_guid_tbl);
>  
> -	cl_qlist_init(&p_ftree->root_guid_list);
> -
>  	return p_ftree;
>  }
>  
> @@ -953,10 +945,6 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
>  	}
>  	cl_qmap_remove_all(&p_ftree->cn_guid_tbl);
>  
> -	/* remove all the elements of root_guid_list */
> -	while (!cl_is_qlist_empty(&p_ftree->root_guid_list))
> -		free(cl_qlist_remove_head(&p_ftree->root_guid_list));
> -
>  	/* free the leaf switches array */
>  	if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches))
>  		free(p_ftree->leaf_switches);
> @@ -3045,16 +3033,41 @@ Exit:
>  
>  /***************************************************
>   ***************************************************/
> +struct rank_root_cxt {
> +	ftree_fabric_t *fabric;
> +	cl_list_t *list;
> +};
> +
> +static int rank_root_sw_by_guid(void *cxt, uint64_t guid, char *p)
> +{
> +	struct rank_root_cxt *c = cxt;
> +	ftree_sw_t *sw;
> +
> +	sw = __osm_ftree_fabric_get_sw_by_guid(c->fabric, cl_hton64(guid));
> +	if (!sw) {
> +		/* the specified root guid wasn't found in the fabric */
> +		OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_ERROR, "ERR AB24: "
> +			"Root switch GUID 0x%" PRIx64 " not found\n", guid);
> +		return 0;
> +	}
> +
> +	OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_DEBUG,
> +		"Ranking root switch with GUID 0x%" PRIx64 "\n", guid);
> +	sw->rank = 0;
> +	cl_list_insert_tail(c->list, sw);
> +
> +	return 0;
> +}
>  
>  static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree)
>  {
> +	struct rank_root_cxt context;
>  	osm_node_t *p_osm_node;
>  	osm_node_t *p_remote_osm_node;
>  	osm_physp_t *p_osm_physp;
>  	ftree_sw_t *p_sw;
>  	ftree_sw_t *p_remote_sw;
>  	cl_list_t ranking_bfs_list;
> -	struct guid_list_item *item;
>  	int res = 0;
>  	unsigned num_roots;
>  	unsigned max_rank = 0;
> @@ -3064,25 +3077,16 @@ static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree)
>  	cl_list_init(&ranking_bfs_list, 10);
>  
>  	/* Rank all the roots and add them to list */
> -	for (item = (void *)cl_qlist_head(&p_ftree->root_guid_list);
> -	     item != (void *)cl_qlist_end(&p_ftree->root_guid_list);
> -	     item = (void *)cl_qlist_next(&item->list)) {
> -		p_sw =
> -		    __osm_ftree_fabric_get_sw_by_guid(p_ftree,
> -						      cl_hton64(item->guid));
> -		if (!p_sw) {
> -			/* the specified root guid wasn't found in the fabric */
> -			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB24: "
> -				"Root switch GUID 0x%" PRIx64 " not found\n",
> -				item->guid);
> -			continue;
> -		}
> +	OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> +		"Fetching root nodes from file %s\n",
> +		p_ftree->p_osm->subn.opt.root_guid_file);
>  
> -		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> -			"Ranking root switch with GUID 0x%" PRIx64 "\n",
> -			item->guid);
> -		p_sw->rank = 0;
> -		cl_list_insert_tail(&ranking_bfs_list, p_sw);
> +	context.fabric = p_ftree;
> +	context.list = &ranking_bfs_list;
> +	if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file,
> +			   rank_root_sw_by_guid, &context)) {
> +		res = -1;
> +		goto Exit;
>  	}
>  
>  	num_roots = cl_list_count(&ranking_bfs_list);
> @@ -3314,21 +3318,6 @@ Exit:
>  
>  /***************************************************
>   ***************************************************/
> -static int add_guid_item_to_list(void *cxt, uint64_t guid, char *p)
> -{
> -	cl_qlist_t *list = cxt;
> -	struct guid_list_item *item;
> -
> -	item = malloc(sizeof(*item));
> -	if (!item)
> -		return -1;
> -
> -	item->guid = guid;
> -	cl_qlist_insert_tail(list, &item->list);
> -
> -	return 0;
> -}
> -
>  static int add_guid_item_to_map(void *cxt, uint64_t guid, char *p)
>  {
>  	cl_qmap_t *map = cxt;
> @@ -3350,26 +3339,6 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
>  
>  	OSM_LOG_ENTER(&p_ftree->p_osm->log);
>  
> -	if (__osm_ftree_fabric_roots_provided(p_ftree)) {
> -		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> -			"Fetching root nodes from file %s\n",
> -			p_ftree->p_osm->subn.opt.root_guid_file);
> -
> -		if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file,
> -				   add_guid_item_to_list,
> -				   &p_ftree->root_guid_list)) {
> -			status = -1;
> -			goto Exit;
> -		}
> -
> -		if (!cl_qlist_count(&p_ftree->root_guid_list)) {
> -			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB22: "
> -				"Root guids file has no valid guids\n");
> -			status = -1;
> -			goto Exit;
> -		}
> -	}
> -
>  	if (__osm_ftree_fabric_cns_provided(p_ftree)) {
>  		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  			"Fetching compute nodes from file %s\n",
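
The callback-with-context pattern used by the patch above can be sketched as follows. This is only an illustration under assumptions: demo_parse_guids() is a hypothetical stand-in for parse_node_map() that walks an array instead of a file, and the demo_* structs replace ftree_sw_t and cl_list_t. As in the patch, an unknown GUID does not abort the parse.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical switch record; the real code uses ftree_sw_t. */
struct demo_sw {
	uint64_t guid;
	int rank;
};

/* Context handed to the callback: where to look up, where to collect. */
struct demo_rank_cxt {
	struct demo_sw *fabric;   /* known switches */
	size_t fabric_len;
	struct demo_sw **roots;   /* output: switches ranked as roots */
	size_t roots_len;
	size_t roots_cap;
};

typedef int (*demo_guid_cb)(void *cxt, uint64_t guid);

/* Stand-in for a file parser: calls cb once per GUID, stops on non-zero. */
static int demo_parse_guids(const uint64_t *guids, size_t n,
			    demo_guid_cb cb, void *cxt)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cb(cxt, guids[i]))
			return -1;
	return 0;
}

/* Callback: rank a matching switch as root; skip unknown GUIDs. */
static int demo_rank_root(void *cxt, uint64_t guid)
{
	struct demo_rank_cxt *c = cxt;
	size_t i;

	for (i = 0; i < c->fabric_len; i++) {
		if (c->fabric[i].guid != guid)
			continue;
		if (c->roots_len == c->roots_cap)
			return -1;	/* out of room: abort the parse */
		c->fabric[i].rank = 0;
		c->roots[c->roots_len++] = &c->fabric[i];
		return 0;
	}
	return 0;	/* not found: keep parsing, as the patch does */
}
```

Passing both the lookup table and the output list through one context struct is what lets the intermediate GUID list (and its malloc/free handling) disappear.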


From sashak at voltaire.com  Mon Feb  9 15:54:14 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 10 Feb 2009 01:54:14 +0200
Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
Message-ID: <20090209235414.GM26139@sashak.voltaire.com>

Hello Stan,

On 13:16 Mon 09 Feb     , Smith, Stan wrote:
> 
> My proposal for the Windows OpenSM code base is to add ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to the OpenSM vendor-ibal definition of osm_vendor_t, plus a call to umad_get_cas_names() to populate osm_vendor_t.ca_names.
> 
> Comments?

Assuming WinOF already has libibumad implementation with preserved API
would it be reasonable to switch from vendor-ibal to vendor-ibumad in
WinOF?

Sasha


From sean.hefty at intel.com  Mon Feb  9 15:55:13 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 9 Feb 2009 15:55:13 -0800
Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from	osm_vendor_t ?
In-Reply-To: <20090209235414.GM26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>	<498F5A8F.2000101@dev.mellanox.co.il>	<498F5E7B.6020208@dev.mellanox.co.il>	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
Message-ID: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>

>Assuming WinOF already has libibumad implementation with preserved API
>would it be reasonable to switch from vendor-ibal to vendor-ibumad in
>WinOF?

WinOF does have a libibumad implementation, plus libibmad ports between the two
platforms.  The saquery code needs structure definitions for the various
attributes, so using libibmad may be a better choice.  Changing saquery didn't
look that hard to me, but it did look like it would modify a fair portion of the
code.

- Sean


From sashak at voltaire.com  Mon Feb  9 16:19:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 10 Feb 2009 02:19:42 +0200
Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
Message-ID: <20090210001935.GP26139@sashak.voltaire.com>

On 15:55 Mon 09 Feb     , Sean Hefty wrote:
> >Assuming WinOF already has libibumad implementation with preserved API
> >would it be reasonable to switch from vendor-ibal to vendor-ibumad in
> >WinOF?
> 
> WinOF does have a libibumad implementation, plus libibmad ports between the two
> platforms.  The saquery code needs structure definitions for the various
> attributes, so using libibmad may be a better choice.

I agree, for the saquery-specific case it is better to clean up osm_vendor
there (as we already discussed). My question above was about OpenSM
itself, not about serving saquery.

> Changing saquery didn't
> look that hard to me, but it did look like it would modify a fair portion of the
> code.

I guess so.

Sasha


From stan.smith at intel.com  Mon Feb  9 16:34:28 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Mon, 9 Feb 2009 16:34:28 -0800
Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <20090209235414.GM26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>

Sasha Khapyorsky wrote:
> Hello Stan,
>
> On 13:16 Mon 09 Feb     , Smith, Stan wrote:
>>
>> My proposal for the Windows OpenSM code base is to add
>> ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to the OpenSM vendor-ibal
>> definition of osm_vendor_t, plus a call to umad_get_cas_names() to
>> populate osm_vendor_t.ca_names.
>>
>> Comments?
>
> Assuming WinOF already has libibumad implementation with preserved API
> would it be reasonable to switch from vendor-ibal to vendor-ibumad in
> WinOF?
>
> Sasha

Hello,

Path-of-least-resistance thinking points towards not switching, since the move from vendor-ibal to vendor-ibumad would be part of the Windows OpenSM migration to OFED 1.4x OpenSM anyway.
My thinking is that switching to vendor-ibumad would be a good deal more work just to get saquery working.
Since I don't know the Windows OpenSM code base, moving part of it forward looks like a task that could blossom into a good deal more work for the small return of a working saquery.
Frankly, I'd rather see work put into getting OFED OpenSM 1.4 working on Windows.

Just my $0.02 worth.

Stan.


From sumeet.lahorani at oracle.com  Mon Feb  9 16:41:59 2009
From: sumeet.lahorani at oracle.com (Sumeet Lahorani)
Date: Mon, 09 Feb 2009 16:41:59 -0800
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
Message-ID: <4990CD57.3080108@oracle.com>

When we enable IB connected mode and increase MTU to 65520, we see the 
following in /var/log/messages

Feb  6 17:48:32 dadzab01 kernel: ib0: enabling connected mode will cause 
multicast packet drops
Feb  6 17:48:32 dadzab01 kernel: ib0: mtu > 2044 will cause multicast 
packet drops.
Feb  6 17:48:32 dadzab01 kernel: ib1: enabling connected mode will cause 
multicast packet drops
Feb  6 17:48:32 dadzab01 kernel: ib1: mtu > 2044 will cause multicast 
packet drops.

Should we not be doing this? What kind of multicast packets will be 
dropped?

If we are not using multicast, do any OFED drivers (bonding, ipoib etc) 
internally use multicast in a way that will cause them to not work 
correctly in connected mode?

We are using OFED 1.3.1.

- Sumeet


From sean.hefty at intel.com  Mon Feb  9 18:55:05 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 9 Feb 2009 18:55:05 -0800
Subject: [ofa-general] RE: svn.1936 commits
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
Message-ID: <D1354B51917842DEB4A421E69266C625@amr.corp.intel.com>

I don't see that my original post ever went out.

>Ulp\libibmad\include\infiniband\mad.h
>
>Added MAD_EXPORT for xdump & smp_query_via needed by ibtracert & ibroute.

Changes to libibmad need to go through the management.git tree.  The mirror in
SVN will be replaced with all upstream code.

>Signed off by stan.smith at intel.com
>
>diff U3 C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h
>--- C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h    Mon Feb 09 16:36:46 2009
>+++ C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h  Mon Feb 09 15:55:43 2009
>@@ -710,7 +710,7 @@
>                              unsigned mod, unsigned timeout);
> MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
>                            unsigned mod, unsigned timeout);
>-uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
>+MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
>                       unsigned mod, unsigned timeout, const void *srcport);
> uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
>                     unsigned timeout, const void *srcport);
>@@ -837,7 +837,7 @@
>        exit(-1); \
> } while(0)
>
>-void xdump(FILE * file, char *msg, void *p, int size);
>+MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size);
>
> END_C_DECLS
> #endif                         /* _MAD_H_ */

- Sean
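
For readers unfamiliar with why the export annotation matters on Windows, here is a minimal sketch. DEMO_EXPORT and the demo_* functions are hypothetical; libibmad's actual MAD_EXPORT definition may differ. The point is only that a DLL exposes a function to other binaries when it is explicitly marked, which is why tools such as ibtracert and ibroute needed xdump and smp_query_via annotated.

```c
/* Hypothetical macro; libibmad's real MAD_EXPORT definition may differ. */
#if defined(_WIN32)
#define DEMO_EXPORT __declspec(dllexport)
#else
/* ELF builds export non-static symbols unless hidden visibility is used. */
#define DEMO_EXPORT
#endif

/* Exported: visible to tools linking against the DLL. */
DEMO_EXPORT int demo_query(int attr_id);

/* Not annotated: on Windows, callable only from inside the library. */
int demo_internal_helper(int attr_id);

int demo_internal_helper(int attr_id)
{
	return attr_id * 2;
}

DEMO_EXPORT int demo_query(int attr_id)
{
	return demo_internal_helper(attr_id) + 1;
}
```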


From stan.smith at intel.com  Mon Feb  9 19:45:42 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Mon, 9 Feb 2009 19:45:42 -0800
Subject: [ofa-general] RE: svn.1936 commits
In-Reply-To: <D1354B51917842DEB4A421E69266C625@amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
	<D1354B51917842DEB4A421E69266C625@amr.corp.intel.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com>

Hefty, Sean wrote:
> I don't see that my original post ever went out.
>
>> Ulp\libibmad\include\infiniband\mad.h
>>
>> Added MAD_EXPORT for xdump & smp_query_via needed by ibtracert &
>> ibroute.
>
> Changes to libibmad need to go through the management.git tree.  The
> mirror in SVN will be replaced with all upstream code.

Yes I understand this.
The note was to inform you that changes need to be pushed back to the git tree.

>
>> Signed off by stan.smith at intel.com
>>
>> diff U3 C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h
>> --- C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h    Mon Feb 09 16:36:46 2009
>> +++ C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h  Mon Feb 09 15:55:43 2009
>> @@ -710,7 +710,7 @@
>>                               unsigned mod, unsigned timeout);
>>  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
>>                             unsigned mod, unsigned timeout);
>> -uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
>> +MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
>>                        unsigned mod, unsigned timeout, const void *srcport);
>>  uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
>>                      unsigned timeout, const void *srcport);
>> @@ -837,7 +837,7 @@
>>         exit(-1); \
>> } while(0)
>>
>> -void xdump(FILE * file, char *msg, void *p, int size);
>> +MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size);
>>
>> END_C_DECLS
>> #endif                         /* _MAD_H_ */
>
> - Sean


From ofedrnicuser at yahoo.com  Mon Feb  9 21:22:40 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Mon, 9 Feb 2009 21:22:40 -0800 (PST)
Subject: ***SPAM*** Re: [ofa-general] non zero lkey in send(),
	write() with  num_sge > 1?
In-Reply-To: <4990832C.5090204@gmail.com>
Message-ID: <809230.93598.qm@web111213.mail.gq1.yahoo.com>


> > Can stack pass num_sge > 1, and lkey !=0 as part of
> sg_list[] elements, in post_send() call?
> >   
> What are you trying to achieve?
[Bill]
I just wanted to confirm that, even when Stag != 0,
(a) there can be multiple SGEs in the list with different lkeys and TOs,
and
(b) HCAs have to validate each SGE entry against its lkey.

I want to make sure that, as an RDMA ULP, I can invoke post_send() with multiple lkeys and use the allocated MRs, and that HCAs are designed to handle that.

Is there an example ULP we know of that does this?

Regards,
Bill


--- On Mon, 2/9/09, Dotan Barak <dotanba at gmail.com> wrote:

> From: Dotan Barak <dotanba at gmail.com>
> Subject: Re: [ofa-general] ***SPAM*** non zero lkey in send(), write() with  num_sge > 1?
> To: "Ofed User" <ofedrnicuser at yahoo.com>
> Cc: "OFED General" <general at lists.openfabrics.org>
> Date: Monday, February 9, 2009, 7:25 PM
> Ofed User wrote:
> > Hi,
> >


> If num_sge > 1 => the HCA will try to read the blocks
> pointed to by the 
> sg_list one by one and validate that the address + size is
> inside a valid
> Memory Region whose local key is the lkey.
> 
> Then i guess that the answer is: Yes.
> 
> Dotan
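
The per-element check described above can be modelled in plain C. This is a software sketch only, not how any HCA implements it: the demo_* types are hypothetical (the SGE field names follow struct ibv_sge), and a real device also checks things such as access permissions that are left out here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical registered memory region. */
struct demo_mr {
	uint64_t addr;
	uint64_t length;
	uint32_t lkey;
};

/* Scatter/gather element; field names follow struct ibv_sge. */
struct demo_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Return 1 if [addr, addr+length) lies inside the MR that owns sge->lkey. */
static int demo_sge_valid(const struct demo_sge *sge,
			  const struct demo_mr *mrs, size_t num_mrs)
{
	size_t i;

	for (i = 0; i < num_mrs; i++) {
		if (mrs[i].lkey != sge->lkey)
			continue;
		/* Written to avoid overflow in addr + length. */
		return sge->addr >= mrs[i].addr &&
		       sge->length <= mrs[i].length &&
		       sge->addr - mrs[i].addr <= mrs[i].length - sge->length;
	}
	return 0;	/* no MR with this lkey */
}

/* Check every element in order; return index of first bad SGE, or -1. */
static int demo_check_sg_list(const struct demo_sge *sg, int num_sge,
			      const struct demo_mr *mrs, size_t num_mrs)
{
	int i;

	for (i = 0; i < num_sge; i++)
		if (!demo_sge_valid(&sg[i], mrs, num_mrs))
			return i;
	return -1;
}
```

Each element carries its own lkey, so a single work request can gather from several memory regions as long as every element passes its own check.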


From sean.hefty at intel.com  Mon Feb  9 23:02:34 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 9 Feb 2009 23:02:34 -0800
Subject: [ofa-general] RE: svn.1936 commits
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
	<D1354B51917842DEB4A421E69266C625@amr.corp.intel.com>
	<3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com>
Message-ID: <7507A78ACA634E9A9AE3CB694B629246@amr.corp.intel.com>

>> Changes to libibmad need to go through the management.git tree.  The
>> mirror in SVN will be replaced with all upstream code.
>
>Yes I understand this.
>The note was to inform you that changes need to be pushed back to the git tree.

I am asking that all changes be submitted through to the main git tree first,
especially for changes that hit the SVN trunk.  I do not want to try to keep
diverging trees in sync.

As for the patch, commits should at least be reviewed by the maintainer before
they are committed.  WinOF has been very lax about this practice.  And the
subject should be more detailed than 'svn 1936 commits'.

- Sean


From nicolas.morey-chaisemartin at ext.bull.net  Mon Feb  9 23:49:09 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 08:49:09 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value
 for down port index
Message-ID: <49913175.609@ext.bull.net>

We have to add the modulus to the index before actually taking the modulo; otherwise we get a value of -1, which makes OpenSM segfault.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---

I missed this one in my previous patch.  Sorry for that

  opensm/opensm/osm_ucast_ftree.c |    3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: c02f9ea241a7150d1cb1c9846408feeeeb4ef024.diff
Type: text/x-patch
Size: 591 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090210/a90d8fad/attachment.bin>

From nicolas.morey-chaisemartin at ext.bull.net  Tue Feb 10 00:08:01 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 09:08:01 +0100
Subject: [ofa-general] [PATCH v2] opensm/osm_console.c : Added dump_portguid
 function to
 console to generate a list of port guids matching one or more regexps
Message-ID: <499135E1.1080307@ext.bull.net>

This adds a dump_portguid function to the OpenSM console, which makes it really easy to generate cn_guid_file, root_guid_file and similar files
by dumping to a file all port GUIDs whose nodedesc matches at least one of the provided regexps.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---

Reposted without the exit_after_run flag and the active-sleep init loop, and with the code indented.

  opensm/opensm/osm_console.c |  105 +++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 105 insertions(+), 0 deletions(-)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: a72ba8239575ad93b59015c9c4c1a0c8020d0db7.diff
Type: text/x-patch
Size: 3613 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090210/a19dc396/attachment.bin>
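
Since the attachment was scrubbed, here is a rough sketch of the matching step the description talks about, using POSIX regex.h. It is not the submitted patch: the demo_port struct and demo_dump_portguids() are hypothetical, and the real console command walks OpenSM's port table rather than an array.

```c
#include <regex.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Hypothetical port record: GUID plus node description. */
struct demo_port {
	uint64_t guid;
	const char *nodedesc;
};

/*
 * Write the GUID of every port whose description matches at least one
 * pattern. Returns the number of GUIDs written, or -1 on a bad pattern.
 * Patterns are recompiled per port to keep the sketch short.
 */
static int demo_dump_portguids(FILE *out, const struct demo_port *ports,
			       int num_ports, const char **patterns,
			       int num_patterns)
{
	int written = 0;
	int p, i;

	for (i = 0; i < num_ports; i++) {
		for (p = 0; p < num_patterns; p++) {
			regex_t re;
			int hit;

			if (regcomp(&re, patterns[p], REG_EXTENDED | REG_NOSUB))
				return -1;
			hit = regexec(&re, ports[i].nodedesc, 0, NULL, 0) == 0;
			regfree(&re);
			if (hit) {
				fprintf(out, "0x%016" PRIx64 "\n", ports[i].guid);
				written++;
				break;	/* one line per port, even on several matches */
			}
		}
	}
	return written;
}
```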

From kliteyn at dev.mellanox.co.il  Tue Feb 10 00:59:31 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 10 Feb 2009 10:59:31 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init
	value for down port index
In-Reply-To: <49913175.609@ext.bull.net>
References: <49913175.609@ext.bull.net>
Message-ID: <499141F3.9020001@dev.mellanox.co.il>

Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> We have to add the modulus to the index before actually taking the 
> modulo; otherwise we get a value of -1, which makes OpenSM segfault.
> 
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
> 
> I missed this one in my previous patch.  Sorry for that
> 
>  opensm/opensm/osm_ucast_ftree.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> 
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index 4e65c87..c8f5f08 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1921,7 +1921,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  		return FALSE;
>  
>  	/* foreach down-going port group (in indexing order) */
> -	i = p_sw->down_port_groups_idx;
> +	i = (p_sw->down_port_groups_idx +
> +	     p_sw->down_port_groups_num) % p_sw->down_port_groups_num;

Perhaps it would be simpler just to init the down_port_groups_idx to 0 instead of -1?
Something like this:

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 4e65c87..eae1ed8 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree,
  	/* initialize lft buffer */
  	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);

-	p_sw->down_port_groups_idx = -1;
+	p_sw->down_port_groups_idx = 0;

  	return p_sw;
  }				/* __osm_ftree_sw_create() */
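
The arithmetic behind the two fixes can be checked in isolation. A small sketch, assuming the index and the group count are plain signed ints (the helper names are made up): in C99 and later the result of % takes the sign of the dividend, so -1 % n stays -1, while adding n first wraps it to n - 1.

```c
/* Plain C remainder: the sign follows the dividend (C99 and later). */
static int demo_raw_mod(int idx, int num)
{
	return idx % num;
}

/* Add the modulus first so an index of -1 wraps to num - 1. */
static int demo_wrapped_mod(int idx, int num)
{
	return (idx + num) % num;
}
```

Note the two proposals are not equivalent on the first pass: with an initial index of -1 the wrapped form starts at the last port group, whereas initializing the index to 0 starts at the first one. The wrapped form also only covers indexes no smaller than -num.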


From nicolas.morey-chaisemartin at ext.bull.net  Tue Feb 10 01:03:44 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 10:03:44 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init
	value for down port index
In-Reply-To: <499141F3.9020001@dev.mellanox.co.il>
References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il>
Message-ID: <499142F0.8000803@ext.bull.net>

Yevgeny Kliteynik wrote:
> Hi Nicolas,
>
> Nicolas Morey Chaisemartin wrote:
>> We have to add the modulus to the index before actually taking 
>> the modulo; otherwise we get a value of -1, which makes OpenSM segfault.
>>
>> Signed-off-by: Nicolas Morey-Chaisemartin 
>> <nicolas.morey-chaisemartin at ext.bull.net>
>> ---
>>
>> I missed this one in my previous patch.  Sorry for that
>>
>>  opensm/opensm/osm_ucast_ftree.c |    3 ++-
>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>
>>
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c 
>> b/opensm/opensm/osm_ucast_ftree.c
>> index 4e65c87..c8f5f08 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -1921,7 +1921,8 @@ 
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
>> p_ftree,
>>          return FALSE;
>>  
>>      /* foreach down-going port group (in indexing order) */
>> -    i = p_sw->down_port_groups_idx;
>> +    i = (p_sw->down_port_groups_idx +
>> +         p_sw->down_port_groups_num) % p_sw->down_port_groups_num;
>
> Perhaps it would be simpler just to init the down_port_groups_idx to 0 
> instead of -1?
> Something like this:
>
> diff --git a/opensm/opensm/osm_ucast_ftree.c 
> b/opensm/opensm/osm_ucast_ftree.c
> index 4e65c87..eae1ed8 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN 
> ftree_fabric_t * p_ftree,
>      /* initialize lft buffer */
>      memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>
> -    p_sw->down_port_groups_idx = -1;
> +    p_sw->down_port_groups_idx = 0;
>
>      return p_sw;
>  }                /* __osm_ftree_sw_create() */
>
>
>
>

Sure. I wanted to ensure that, whatever happens to the index, it would 
always be in the right interval, but after checking I doubt anything other 
than initialization could set it outside its normal interval.
Do you want me to make the patch and send it, or will you just push yours?

Nicolas


From kliteyn at dev.mellanox.co.il  Tue Feb 10 01:17:49 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 10 Feb 2009 11:17:49 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init
	value for down port index
In-Reply-To: <499142F0.8000803@ext.bull.net>
References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il>
	<499142F0.8000803@ext.bull.net>
Message-ID: <4991463D.6030705@dev.mellanox.co.il>

Nicolas Morey Chaisemartin wrote:
> Yevgeny Kliteynik wrote:
>> Hi Nicolas,
>>
>>>  
>>>      /* foreach down-going port group (in indexing order) */
>>> -    i = p_sw->down_port_groups_idx;
>>> +    i = (p_sw->down_port_groups_idx +
>>> +         p_sw->down_port_groups_num) % p_sw->down_port_groups_num;
>>
>> Perhaps it would be simpler just to init the down_port_groups_idx to 0 
>> instead of -1?
>> Something like this:
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c 
>> b/opensm/opensm/osm_ucast_ftree.c
>> index 4e65c87..eae1ed8 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN 
>> ftree_fabric_t * p_ftree,
>>      /* initialize lft buffer */
>>      memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>>
>> -    p_sw->down_port_groups_idx = -1;
>> +    p_sw->down_port_groups_idx = 0;
>>
>>      return p_sw;
>>  }                /* __osm_ftree_sw_create() */
> 
> Sure. I wanted to ensure that whatever happens to the index it would 
> always be in the right interval but after checking I doubt anything else 
> than initialization could set it outside its normal interval.
> Do you want me to make the patch and send it or will you just push yours?

I'm ok with both options.
I can send a clean patch to Sasha tomorrow (I'm OOO today), or you can do it today.

-- Yevgeny

> Nicolas
> 


From nicolas.morey-chaisemartin at ext.bull.net  Tue Feb 10 01:29:28 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 10:29:28 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init
	value for down port index
In-Reply-To: <4991463D.6030705@dev.mellanox.co.il>
References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il>
	<499142F0.8000803@ext.bull.net>
	<4991463D.6030705@dev.mellanox.co.il>
Message-ID: <499148F8.3000303@ext.bull.net>

Yevgeny Kliteynik wrote:
> Nicolas Morey Chaisemartin wrote:
>> Yevgeny Kliteynik wrote:
>>> Hi Nicolas,
>>>
>>>>  
>>>>      /* foreach down-going port group (in indexing order) */
>>>> -    i = p_sw->down_port_groups_idx;
>>>> +    i = (p_sw->down_port_groups_idx +
>>>> +         p_sw->down_port_groups_num) % p_sw->down_port_groups_num;
>>>
>>> Perhaps it would be simpler just to init the down_port_groups_idx to 
>>> 0 instead of -1?
>>> Something like this:
>>>
>>> diff --git a/opensm/opensm/osm_ucast_ftree.c 
>>> b/opensm/opensm/osm_ucast_ftree.c
>>> index 4e65c87..eae1ed8 100644
>>> --- a/opensm/opensm/osm_ucast_ftree.c
>>> +++ b/opensm/opensm/osm_ucast_ftree.c
>>> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN 
>>> ftree_fabric_t * p_ftree,
>>>      /* initialize lft buffer */
>>>      memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>>>
>>> -    p_sw->down_port_groups_idx = -1;
>>> +    p_sw->down_port_groups_idx = 0;
>>>
>>>      return p_sw;
>>>  }                /* __osm_ftree_sw_create() */
>>
>> Sure. I wanted to ensure that whatever happens to the index it would 
>> always be in the right interval but after checking I doubt anything 
>> else than initialization could set it outside its normal interval.
>> Do you want me to make the patch and send it or will you just push 
>> yours?
>
> I'm ok with both options.
> I can send a clean patch to Sasha tomorrow (I'm OOO today), or you can 
> do it today.
>
> -- Yevgeny
>
>> Nicolas
>>
>
>
>
Yours should be faster, and having rechecked, I see no reason to enforce a 
"check" in the function, so I prefer your solution.
I'll repost the patch today, as this bug is breaking opensm/ftree.

Nicolas
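
The two fixes discussed above can be compared with a minimal sketch. The
struct and function names here are hypothetical stand-ins, not the OpenSM
code; only the two field names come from the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the switch structure; only the two fields
 * discussed in the thread are modelled. */
struct sw_sketch {
	int down_port_groups_idx;	/* index of the next port group to use */
	int down_port_groups_num;	/* number of down-going port groups */
};

/* Option 1 (the first proposal): normalise the index at the point of use.
 * Maps -1 to num - 1 and leaves 0..num-1 unchanged. Requires num > 0. */
int start_idx_normalised(const struct sw_sketch *sw)
{
	return (sw->down_port_groups_idx + sw->down_port_groups_num) %
	       sw->down_port_groups_num;
}

/* Option 2 (the one adopted): initialise the index to 0, so it can be
 * used directly as an array index without normalisation. */
void sw_sketch_init(struct sw_sketch *sw, int num_groups)
{
	sw->down_port_groups_idx = 0;
	sw->down_port_groups_num = num_groups;
}

/* Small self-check that can be called from a test main(). */
void sw_sketch_self_check(void)
{
	struct sw_sketch sw = { -1, 4 };

	assert(start_idx_normalised(&sw) == 3);	/* -1 wraps to the last group */
	sw_sketch_init(&sw, 4);
	assert(start_idx_normalised(&sw) == 0);	/* 0 is already in range */
}
```

Note that in this sketch the two options do not start from the same group
when the index is -1: option 1 begins at the last group, option 2 at the
first.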


From nicolas.morey-chaisemartin at ext.bull.net  Tue Feb 10 01:53:21 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 10:53:21 +0100
Subject: [ofa-general] [PATCH v2] opensm/osm_ucast_ftree.c Fixed bad init
 value for down port index
Message-ID: <49914E91.4090305@ext.bull.net>

Fixes the init value of down_port_groups_idx to 0 so it's in the port group interval.
This way __osm_ftree_fabric_route_upgoing_by_going_down can use the index directly without segfaulting.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/opensm/osm_ucast_ftree.c |    2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 4e65c87..eae1ed8 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree,
  	/* initialize lft buffer */
  	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);

-	p_sw->down_port_groups_idx = -1;
+	p_sw->down_port_groups_idx = 0;

  	return p_sw;
  }				/* __osm_ftree_sw_create() */
-- 
1.6.1


From vlad at lists.openfabrics.org  Tue Feb 10 03:13:09 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 10 Feb 2009 03:13:09 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090210-0200 daily build status
Message-ID: <20090210111309.DB371E61174@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From acceptany at gmail.com  Tue Feb 10 03:56:06 2009
From: acceptany at gmail.com (Jordan)
Date: Tue, 10 Feb 2009 19:56:06 +0800
Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in
	opensm?
Message-ID: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>

How can I add a new routing algorithm in opensm , which files need to be
modified?  If this can be done , is there a simulator to test this new
algorithm and dump some results?

From swise at opengridcomputing.com  Tue Feb 10 10:44:48 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 12:44:48 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
Message-ID: <20090210184448.22891.31130.stgit@dell3.ogc.int>

From: Steve Wise <swise at opengridcomputing.com>

Removes the need for special u64 math on i386 systems.

Fixes i386 build break in linux-next introduced by 
commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 2cf6f13..5bb299a 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
 			return -EINVAL;
 		}
 		offset = sg_list[i].addr - mhp->attr.va_fbo;
-		offset += ((u64) mhp->attr.va_fbo) %
-		          (1UL << (12 + mhp->attr.page_size));
+		offset += mhp->attr.va_fbo &
+			  ((1UL << (12 + mhp->attr.page_size)) - 1);
 		pbl_addr[i] = ((mhp->attr.pbl_addr -
 			        rhp->rdev.rnic_info.pbl_base) >> 3) +
 			      (offset >> (12 + mhp->attr.page_size));
@@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
 		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
 
 		/* to in the WQE == the offset into the page */
-		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
-				(1UL << (12 + page_size[i])));
+		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
+				((1UL << (12 + page_size[i]))-1));
 
 		/* pbl_addr is the adapters address in the PBL */
 		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
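
As an illustration of why the patch above is behaviour-preserving, here is
a minimal sketch of the two ways of computing an offset within a page of
2^(12 + page_size) bytes. The function names are hypothetical, and the
sketch uses a 64-bit constant for the shift where the patch uses 1UL.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr within a page of 2^(12 + page_size) bytes, using modulo.
 * With a 64-bit dividend, a 32-bit target may need a helper call here. */
uint64_t page_offset_mod(uint64_t addr, unsigned page_size)
{
	return addr % ((uint64_t)1 << (12 + page_size));
}

/* Same offset using a mask. This is valid because the divisor is a power
 * of two and all operands are unsigned. */
uint64_t page_offset_mask(uint64_t addr, unsigned page_size)
{
	return addr & (((uint64_t)1 << (12 + page_size)) - 1);
}

/* Small self-check that can be called from a test main(). */
void page_offset_self_check(void)
{
	uint64_t addr = 0xFFFFFFFF00001234ULL;
	unsigned ps;

	for (ps = 0; ps < 8; ps++)
		assert(page_offset_mod(addr, ps) == page_offset_mask(addr, ps));
}
```

One design point, stated as an assumption about the target: where
unsigned long is 32 bits wide, a mask built from 1UL covers only shift
counts below 32, which is enough for the page sizes used here.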


From randy.dunlap at oracle.com  Tue Feb 10 11:04:55 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Tue, 10 Feb 2009 11:04:55 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210184448.22891.31130.stgit@dell3.ogc.int>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID: <4991CFD7.30503@oracle.com>

Steve Wise wrote:
> From: Steve Wise <swise at opengridcomputing.com>
> 
> Removes the need for special u64 math on i386 systems.
> 
> Fixes i386 build break in linux-next introduced by 
> commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>

Yes, that works, thanks.  But this patch should go into 2.6.29, not
just 2.6.30.


> ---
> 
>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
> index 2cf6f13..5bb299a 100644
> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
>  			return -EINVAL;
>  		}
>  		offset = sg_list[i].addr - mhp->attr.va_fbo;
> -		offset += ((u64) mhp->attr.va_fbo) %
> -		          (1UL << (12 + mhp->attr.page_size));
> +		offset += mhp->attr.va_fbo &
> +			  ((1UL << (12 + mhp->attr.page_size)) - 1);
>  		pbl_addr[i] = ((mhp->attr.pbl_addr -
>  			        rhp->rdev.rnic_info.pbl_base) >> 3) +
>  			      (offset >> (12 + mhp->attr.page_size));
> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
>  		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>  
>  		/* to in the WQE == the offset into the page */
> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> -				(1UL << (12 + page_size[i])));
> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> +				((1UL << (12 + page_size[i]))-1));
>  
>  		/* pbl_addr is the adapters address in the PBL */
>  		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);


-- 
~Randy


From swise at opengridcomputing.com  Tue Feb 10 11:10:34 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 13:10:34 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4991CFD7.30503@oracle.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<4991CFD7.30503@oracle.com>
Message-ID: <4991D12A.8090309@opengridcomputing.com>


Randy Dunlap wrote:
> Steve Wise wrote:
>   
>> From: Steve Wise <swise at opengridcomputing.com>
>>
>> Removes the need for special u64 math on i386 systems.
>>
>> Fixes i386 build break in linux-next introduced by 
>> commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
>>
>> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
>>     
>
> Yes, that works, thanks.  But this patch should go into 2.6.29, not
> just 2.6.30.
>
>
>   
I thought the commit that caused this was:

1e27e8cee0698259ccb1fe6abeaf4b48969c0945

And that was going in 2.6.30.  (I thought).


>> ---
>>
>>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> index 2cf6f13..5bb299a 100644
>> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
>>  			return -EINVAL;
>>  		}
>>  		offset = sg_list[i].addr - mhp->attr.va_fbo;
>> -		offset += ((u64) mhp->attr.va_fbo) %
>> -		          (1UL << (12 + mhp->attr.page_size));
>> +		offset += mhp->attr.va_fbo &
>> +			  ((1UL << (12 + mhp->attr.page_size)) - 1);
>>  		pbl_addr[i] = ((mhp->attr.pbl_addr -
>>  			        rhp->rdev.rnic_info.pbl_base) >> 3) +
>>  			      (offset >> (12 + mhp->attr.page_size));
>> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
>>  		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>>  
>>  		/* to in the WQE == the offset into the page */
>> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>> -				(1UL << (12 + page_size[i])));
>> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>> +				((1UL << (12 + page_size[i]))-1));
>>  
>>  		/* pbl_addr is the adapters address in the PBL */
>>  		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
>>     
>
>
>   


From randy.dunlap at oracle.com  Tue Feb 10 11:12:27 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Tue, 10 Feb 2009 11:12:27 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4991D12A.8090309@opengridcomputing.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<4991CFD7.30503@oracle.com>
	<4991D12A.8090309@opengridcomputing.com>
Message-ID: <4991D19B.5050307@oracle.com>

Steve Wise wrote:
> 
> Randy Dunlap wrote:
>> Steve Wise wrote:
>>  
>>> From: Steve Wise <swise at opengridcomputing.com>
>>>
>>> Removes the need for special u64 math on i386 systems.
>>>
>>> Fixes i386 build break in linux-next introduced by commit
>>> 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
>>>
>>> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
>>>     
>>
>> Yes, that works, thanks.  But this patch should go into 2.6.29, not
>> just 2.6.30.
>>
>>
>>   
> I thought the commit that caused this was:
> 
> 1e27e8cee0698259ccb1fe6abeaf4b48969c0945
> 
> And that was going in 2.6.30.  (I thought).

Oh, OK.  If that's the case, then you are obviously correct
about [2.6.30].

Thanks.

>>> ---
>>>
>>>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> index 2cf6f13..5bb299a 100644
>>> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp,
>>> struct ib_sge *sg_list,
>>>              return -EINVAL;
>>>          }
>>>          offset = sg_list[i].addr - mhp->attr.va_fbo;
>>> -        offset += ((u64) mhp->attr.va_fbo) %
>>> -                  (1UL << (12 + mhp->attr.page_size));
>>> +        offset += mhp->attr.va_fbo &
>>> +              ((1UL << (12 + mhp->attr.page_size)) - 1);
>>>          pbl_addr[i] = ((mhp->attr.pbl_addr -
>>>                      rhp->rdev.rnic_info.pbl_base) >> 3) +
>>>                    (offset >> (12 + mhp->attr.page_size));
>>> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp,
>>> union t3_wr *wqe,
>>>          wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>>>  
>>>          /* to in the WQE == the offset into the page */
>>> -        wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>>> -                (1UL << (12 + page_size[i])));
>>> +        wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>>> +                ((1UL << (12 + page_size[i]))-1));
>>>  
>>>          /* pbl_addr is the adapters address in the PBL */
>>>          wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);


-- 
~Randy


From rdreier at cisco.com  Tue Feb 10 16:38:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 16:38:03 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210184448.22891.31130.stgit@dell3.ogc.int> (Steve Wise's
	message of "Tue, 10 Feb 2009 12:44:48 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID: <adamyctajv8.fsf@cisco.com>

I'll roll this into the offending patch (that is in -next).

But:

 > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
 > -				(1UL << (12 + page_size[i])));
 > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
 > +				((1UL << (12 + page_size[i]))-1));

Is this required?  Strength reduction optimization should do this
automatically (and the code has been there for quite a while, so
obviously it isn't causing problems)

 - R.


From swise at opengridcomputing.com  Tue Feb 10 17:03:52 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 19:03:52 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <adamyctajv8.fsf@cisco.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com>
Message-ID: <499223F8.1010204@opengridcomputing.com>

Roland Dreier wrote:
> I'll roll this into the offending patch (that is in -next).
>
> But:
>
>  > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>  > -				(1UL << (12 + page_size[i])));
>  > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>  > +				((1UL << (12 + page_size[i]))-1));
>
> Is this required?  Strength reduction optimization should do this
> automatically (and the code has been there for quite a while, so
> obviously it isn't causing problems)
>
>  - R.
>   
Ok.


From davem at davemloft.net  Tue Feb 10 17:07:40 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 10 Feb 2009 17:07:40 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <499223F8.1010204@opengridcomputing.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com>
	<499223F8.1010204@opengridcomputing.com>
Message-ID: <20090210.170740.208470781.davem@davemloft.net>

From: Steve Wise <swise at opengridcomputing.com>
Date: Tue, 10 Feb 2009 19:03:52 -0600

> Roland Dreier wrote:
> > I'll roll this into the offending patch (that is in -next).
> >
> > But:
> >
> >  > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> >  > -				(1UL << (12 + page_size[i])));
> >  > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> >  > +				((1UL << (12 + page_size[i]))-1));
> >
> > Is this required?  Strength reduction optimization should do this
> > automatically (and the code has been there for quite a while, so
> > obviously it isn't causing problems)
> >
> >  - R.
> >   
> Ok.

GCC won't optimize that modulus the way you expect, try for yourself
and look at the assembler if you don't believe me. :-)


From rdreier at cisco.com  Tue Feb 10 17:18:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 17:18:49 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210.170740.208470781.davem@davemloft.net> (David Miller's
	message of "Tue, 10 Feb 2009 17:07:40 -0800 (PST)")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com> <499223F8.1010204@opengridcomputing.com>
	<20090210.170740.208470781.davem@davemloft.net>
Message-ID: <adaeiy5ahza.fsf@cisco.com>

> > Is this required?  Strength reduction optimization should do this
> > automatically (and the code has been there for quite a while, so
> > obviously it isn't causing problems)

> GCC won't optimize that modulus the way you expect, try for yourself
> and look at the assembler if you don't believe me. :-)

Are you thinking of the case when there are signed integers involved and
so "% modulus" might produce a different result than "& (modulus - 1)"
(because the compiler can't know that things are never negative)?
Because in this case the compiler seems to do what I thought it would;
the relevant part of the i386 assembly for

		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
				(1UL << (12 + page_size[i])));

is

        movl    %eax, 28(%edi,%ebx)     # <variable>.length, <variable>.len
        movzbl  28(%esp,%esi), %ecx     # page_size, tmp89
        movl    $1, %eax        #, tmp92
        addl    $12, %ecx       #, tmp90
        sall    %cl, %eax       # tmp90, tmp92
        movl    (%esp), %ecx    # wr,
        decl    %eax    # tmp93
        movl    12(%ecx), %edx  # <variable>.sg_list, <variable>.sg_list
        andl    (%edx,%ebx), %eax       # <variable>.addr, tmp93

i.e. the compiler computes the modulus, then does decl to compute
modulus-1, and then ANDs with it.

Or am I misunderstanding your point?

 - R.
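
The signed case mentioned above can be shown with a minimal sketch. The
function names are hypothetical, and the signed mask result assumes a
two's-complement representation.

```c
#include <assert.h>

/* Signed remainder: since C99 the result takes the sign of the dividend,
 * so a negative x gives a negative (or zero) remainder. */
int rem_signed(int x, int m)
{
	return x % m;
}

/* Mask form on a signed value: for power-of-two m this yields a value in
 * 0..m-1 (assuming two's complement), so it can differ from x % m. */
int mask_signed(int x, int m)
{
	return x & (m - 1);
}

/* Unsigned operands: the two forms agree whenever m is a power of two,
 * which is what allows a compiler to substitute one for the other. */
unsigned rem_unsigned(unsigned x, unsigned m)
{
	return x % m;
}

unsigned mask_unsigned(unsigned x, unsigned m)
{
	return x & (m - 1);
}

/* Small self-check that can be called from a test main(). */
void rem_vs_mask_self_check(void)
{
	assert(rem_signed(-5, 8) != mask_signed(-5, 8));
	assert(rem_unsigned(13u, 8u) == mask_unsigned(13u, 8u));
}
```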


From acceptany at gmail.com  Tue Feb 10 17:23:50 2009
From: acceptany at gmail.com (Jordan)
Date: Wed, 11 Feb 2009 09:23:50 +0800
Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in
	opensm?
In-Reply-To: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>
References: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>
Message-ID: <91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>

How can I add a new routing algorithm in opensm , which files need to be
modified?  If this can be done , is there a simulator to test this new
algorithm and dump some results?

From davem at davemloft.net  Tue Feb 10 17:23:47 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 10 Feb 2009 17:23:47 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <adaeiy5ahza.fsf@cisco.com>
References: <499223F8.1010204@opengridcomputing.com>
	<20090210.170740.208470781.davem@davemloft.net>
	<adaeiy5ahza.fsf@cisco.com>
Message-ID: <20090210.172347.189515015.davem@davemloft.net>

From: Roland Dreier <rdreier at cisco.com>
Date: Tue, 10 Feb 2009 17:18:49 -0800

> > > Is this required?  Strength reduction optimization should do this
> > > automatically (and the code has been there for quite a while, so
> > > obviously it isn't causing problems)
> 
> > GCC won't optimize that modulus the way you expect, try for yourself
> > and look at the assembler if you don't believe me. :-)
> 
> Are you thinking of the case when there are signed integers involved and
> so "% modulus" might produce a different result than "& (modulus - 1)"
> (because the compiler can't know that things are never negative)?
> Because in this case the compiler seems to do what I thought it would;
> the relevant part of the i386 assembly for
> 
> 		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 				(1UL << (12 + page_size[i])));
> 
> is
> 
>         movl    %eax, 28(%edi,%ebx)     # <variable>.length,
>         <variable>.len
>         movzbl  28(%esp,%esi), %ecx     # page_size, tmp89
>         movl    $1, %eax        #, tmp92
>         addl    $12, %ecx       #, tmp90
>         sall    %cl, %eax       # tmp90, tmp92
>         movl    (%esp), %ecx    # wr,
>         decl    %eax    # tmp93
>         movl    12(%ecx), %edx  # <variable>.sg_list, <variable>.sg_list
>         andl    (%edx,%ebx), %eax       # <variable>.addr, tmp93
> 
> ie the compiler computes the modulus, then does decl to compute
> modulus-1 and then &s with it.
> 
> Or am I misunderstanding your point?

It must be compiler- and platform-specific, because with gcc-4.1.3 on
sparc with -O2, for the test program:

unsigned long page_size[4];

int main(int argc)
{
        unsigned long long x = argc;

        return x % (1UL << (12 + page_size[argc]));
}

I get a call to __umoddi3:

main:
        save    %sp, -112, %sp
        sethi   %hi(page_size), %g1
        sll     %i0, 2, %g3
        or      %g1, %lo(page_size), %g1
        mov     1, %o2
        ld      [%g1+%g3], %g2
        add     %g2, 12, %g2
        sll     %o2, %g2, %o2
        mov     %i0, %o1
        mov     %o2, %o3
        sra     %i0, 31, %o0
        call    __umoddi3, 0
         mov    0, %o2
        jmp     %i7+8
         restore %g0, %o1, %o0

I get the same with gcc-4.3.0 and -O2 on 32-bit x86:

main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	movl	$1, %eax
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$20, %esp
	movl	(%ecx), %edx
	movl	page_size(,%edx,4), %ecx
	movl	$0, 12(%esp)
	movl	%edx, (%esp)
	addl	$12, %ecx
	sall	%cl, %eax
	movl	%eax, 8(%esp)
	movl	%edx, %eax
	sarl	$31, %eax
	movl	%eax, 4(%esp)
	call	__umoddi3
	addl	$20, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
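
One difference between the two experiments that is visible in the thread
is the width of the dividend: the expression compiled earlier casts the
address to u32, while the test program above uses unsigned long long.
Whether that fully accounts for the different compiler output is not
established here. The sketch below, with hypothetical function names, only
shows that the two widths give the same remainder for power-of-two
divisors below 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit dividend, mirroring the (u32) cast in the driver expression. */
uint32_t rem_u32(uint32_t x, unsigned shift)
{
	return x % ((uint32_t)1 << shift);	/* requires shift < 32 */
}

/* 64-bit dividend, mirroring the unsigned long long in the test program.
 * On a 32-bit target this is the form that may call __umoddi3. */
uint64_t rem_u64(uint64_t x, unsigned shift)
{
	return x % ((uint64_t)1 << shift);	/* requires shift < 64 */
}

/* Small self-check that can be called from a test main(): for shift < 32
 * the remainder depends only on the low 32 bits of x. */
void rem_width_self_check(void)
{
	uint64_t x = 0x1234567890ABCDEFULL;
	unsigned shift;

	for (shift = 12; shift < 32; shift++)
		assert(rem_u64(x, shift) == rem_u32((uint32_t)x, shift));
}
```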


From sashak at voltaire.com  Tue Feb 10 17:34:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 03:34:41 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/saquery: remove osm vendor
	layer
In-Reply-To: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
Message-ID: <20090211013441.GR26139@sashak.voltaire.com>


Replace OSM Vendor layer by libibumad and libibmad (rpc) calls.

This patch follows the "minimum changes" rule to demonstrate osm
vendor replacement. Many subsequent improvements and simplifications can
be made. All current saquery functionality is preserved.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

On 15:55 Mon 09 Feb     , Sean Hefty wrote:
> Changing saquery didn't
> look that hard to me, but it did look like it would modify a fair portion of the
> code.

Cannot resist... :)

Sasha

 infiniband-diags/configure.in  |    4 -
 infiniband-diags/src/saquery.c |  266 +++++++++++++++++++---------------------
 2 files changed, 127 insertions(+), 143 deletions(-)

diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in
index 58eea0a..7d277b2 100644
--- a/infiniband-diags/configure.in
+++ b/infiniband-diags/configure.in
@@ -40,10 +40,6 @@ AC_CHECK_LIB(ibmad, port_performance_ext_query, [],
 	AC_MSG_ERROR([port_performance_ext_query() not found. diags require more recent libibmad.]))
 AC_CHECK_LIB(osmcomp, cl_thread_init, [],
 	AC_MSG_ERROR([cl_thread_init() not found. diags require libosmcomp.]))
-AC_CHECK_LIB(osmvendor, osmv_query_sa, [],
-	AC_MSG_ERROR([osmv_query_sa() not found. diags require libosmvendor.]), [-lopensm])
-AC_CHECK_LIB(opensm, osm_log_init_v2, [],
-	AC_MSG_ERROR([osm_log_init_v2() not found. diags require libopensm.]))
 fi
 
 dnl Checks for header files.
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 5361184..0a997cf 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -42,20 +42,33 @@
 #include <arpa/inet.h>
 #include <ctype.h>
 #include <string.h>
+#include <errno.h>
 
 #define _GNU_SOURCE
 #include <getopt.h>
 
+#include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/opensm/osm_log.h>
-#include <infiniband/vendor/osm_vendor_api.h>
-#include <infiniband/vendor/osm_vendor_sa_api.h>
-#include <infiniband/opensm/osm_mad_pool.h>
+#include <infiniband/iba/ib_types.h>
 #include <infiniband/complib/cl_debug.h>
 #include <infiniband/complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
+struct sa_bind_handle {
+	int fd, agent;
+	ib_portid_t dport;
+};
+
+struct sa_result {
+	int status;
+	unsigned result_cnt;
+	void *p_result_madw;
+};
+
+#define osmv_query_res_t struct sa_result
+#define osm_bind_handle_t struct sa_bind_handle *
+
 struct query_params {
 	ib_gid_t sgid, dgid, gid, mgid;
 	uint16_t slid, dlid, mlid;
@@ -82,7 +95,7 @@ struct query_cmd {
 
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
-static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;
+static ib_net64_t smkey = CL_HTON64(1);
 
 /**
  * Declare some globals because I don't want this to be too complex.
@@ -90,11 +103,6 @@ static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;
 #define MAX_PORTS (8)
 #define DEFAULT_SA_TIMEOUT_MS (1000)
 osmv_query_res_t result;
-osm_log_t log_osm;
-osm_mad_pool_t mad_pool;
-osm_vendor_t *vendor = NULL;
-char *sa_hca_name = NULL;
-uint32_t sa_port_num = 0;
 
 enum {
 	ALL,
@@ -112,6 +120,81 @@ int requested_lid_flag = 0;
 ib_net64_t requested_guid = 0;
 int requested_guid_flag = 0;
 
+static int sa_query(struct sa_bind_handle *h, uint8_t method,
+		    ib_net16_t attr, ib_net32_t mod, ib_net64_t comp_mask,
+		    ib_net64_t sm_key, void *data)
+{
+	ib_rpc_t rpc;
+	void *umad, *mad;
+	int ret, offset, len = 256;
+
+	memset(&rpc, 0, sizeof(rpc));
+	rpc.mgtclass = IB_SA_CLASS;
+	rpc.method = method;
+	rpc.attr.id = cl_ntoh16(attr);
+	rpc.attr.mod = cl_ntoh32(mod);
+	rpc.mask = cl_ntoh64(comp_mask);
+	rpc.datasz = IB_SA_DATA_SIZE;
+	rpc.dataoffs = IB_SA_DATA_OFFS;
+
+	umad = calloc(1, len + umad_size());
+	if (!umad)
+		IBPANIC("cannot alloc mem for umad: %s\n", strerror(errno));
+
+	mad_build_pkt(umad, &rpc, &h->dport, NULL, data);
+
+	/* SA SM_Key (36/8) - temporary done using IB_MAD_MKEY_F */
+	mad_set_field64(umad_get_mad(umad), 12, IB_MAD_MKEY_F, cl_hton64(sm_key));
+
+	if (ibdebug > 1)
+		xdump(stdout, "SA Request:\n", umad_get_mad(umad), len);
+
+	ret = umad_send(h->fd, h->agent, umad, len, ibd_timeout, 0);
+	if (ret < 0)
+		IBPANIC("umad_send failed: attr %u: %s\n",
+			attr, strerror(errno));
+
+recv_mad:
+	ret = umad_recv(h->fd, umad, &len, ibd_timeout);
+	if (ret < 0) {
+		if (errno == ENOSPC) {
+			umad = realloc(umad, umad_size() + len);
+			goto recv_mad;
+		}
+		IBPANIC("umad_recv failed: attr %u: %s\n", attr,
+			strerror(errno));
+	}
+
+	if ((ret = umad_status(umad)))
+		return ret;
+
+	mad = umad_get_mad(umad);
+
+	if (ibdebug > 1)
+		xdump(stdout, "SA Response:\n", mad, len);
+
+	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
+	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
+	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
+	result.p_result_madw = mad;
+	if (result.status || !offset)
+		result.result_cnt = 0;
+	else if (method != IB_MAD_METHOD_GET_TABLE)
+		result.result_cnt = 1;
+	else
+		result.result_cnt = (len - IB_SA_DATA_OFFS) / (offset << 3);
+
+	return 0;
+}
+
+static void *osmv_get_query_result(void *mad, unsigned i)
+{
+	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
+	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
+}
+
+#define osmv_get_query_node_rec(mad, i) osmv_get_query_result(mad, i)
+
 static unsigned valid_gid(ib_gid_t *gid)
 {
 	ib_gid_t zero_gid = { };
@@ -132,14 +215,6 @@ static void format_buf(char *in, char *out, unsigned size)
 	*out = '\0';
 }
 
-/**
- * Call back for the various record requests.
- */
-static void query_res_cb(osmv_query_res_t * res)
-{
-	result = *res;
-}
-
 static void print_node_desc(ib_node_record_t * node_record)
 {
 	ib_node_info_t *p_ni = &(node_record->node_info);
@@ -683,6 +758,7 @@ static void dump_one_mft_record(void *data)
 		       cl_ntoh16(mftr->mft[i]));
 	printf("\n");
 }
+
 static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *))
 {
 	int i;
@@ -694,11 +770,8 @@ static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *))
 
 static void return_mad(void)
 {
-	/*
-	 * Return the IB query MAD to the pool as necessary.
-	 */
-	if (result.p_result_madw != NULL) {
-		osm_mad_pool_put(&mad_pool, result.p_result_madw);
+	if (result.p_result_madw) {
+		free(result.p_result_madw - umad_size());
 		result.p_result_madw = NULL;
 	}
 }
@@ -711,32 +784,11 @@ get_any_records(osm_bind_handle_t h,
 		ib_net16_t attr_id, ib_net32_t attr_mod, ib_net64_t comp_mask,
 		void *attr, ib_net16_t attr_offset, ib_net64_t sm_key)
 {
-	ib_api_status_t status;
-	osmv_query_req_t req;
-	osmv_user_query_t user;
-
-	memset(&req, 0, sizeof(req));
-	memset(&user, 0, sizeof(user));
-
-	user.attr_id = attr_id;
-	user.attr_offset = attr_offset;
-	user.attr_mod = attr_mod;
-	user.comp_mask = comp_mask;
-	user.p_attr = attr;
-
-	req.query_type = OSMV_QUERY_USER_DEFINED;
-	req.timeout_ms = ibd_timeout;
-	req.retry_cnt = 1;
-	req.flags = OSM_SA_FLAGS_SYNC;
-	req.query_context = NULL;
-	req.pfn_query_cb = query_res_cb;
-	req.p_query_input = &user;
-	req.sm_key = sm_key;
-
-	if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) {
-		fprintf(stderr, "Query SA failed: %s\n",
-			ib_get_err_str(status));
-		return status;
+	int ret = sa_query(h, IB_MAD_METHOD_GET_TABLE, attr_id, attr_mod,
+			   comp_mask, sm_key, attr);
+	if (ret) {
+		fprintf(stderr, "Query SA failed: %s\n", ib_get_err_str(ret));
+		return ret;
 	}
 
 	if (result.status != IB_SUCCESS) {
@@ -745,7 +797,7 @@ get_any_records(osm_bind_handle_t h,
 		return result.status;
 	}
 
-	return status;
+	return ret;
 }
 
 /**
@@ -928,34 +980,21 @@ static ib_api_status_t print_node_records(osm_bind_handle_t h)
 
 static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h)
 {
-	osmv_query_req_t req;
-	ib_api_status_t status;
-
-	memset(&req, 0, sizeof(req));
-
-	req.query_type = OSMV_QUERY_CLASS_PORT_INFO;
-	req.timeout_ms = ibd_timeout;
-	req.retry_cnt = 1;
-	req.flags = OSM_SA_FLAGS_SYNC;
-	req.query_context = NULL;
-	req.pfn_query_cb = query_res_cb;
-	req.p_query_input = NULL;
-	req.sm_key = 0;
-
-	if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) {
+	int ret = sa_query(h, IB_MAD_METHOD_GET, IB_MAD_ATTR_CLASS_PORT_INFO,
+			   0, 0, 0, NULL);
+	if (ret) {
 		fprintf(stderr, "ERROR: Query SA failed: %s\n",
-			ib_get_err_str(status));
-		return (status);
+			ib_get_err_str(ret));
+		return ret;
 	}
 	if (result.status != IB_SUCCESS) {
 		fprintf(stderr, "ERROR: Query result returned: %s\n",
 			ib_get_err_str(result.status));
 		return (result.status);
 	}
-	status = result.status;
 	dump_results(&result, dump_class_port_info);
 	return_mad();
-	return (status);
+	return ret;
 }
 
 static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h,
@@ -1046,11 +1085,8 @@ static ib_api_status_t print_multicast_member_records(osm_bind_handle_t h)
 	return_mad();
 
 return_mc:
-	/* return_mad for the mc_group_result */
-	if (mc_group_result.p_result_madw != NULL) {
-		osm_mad_pool_put(&mad_pool, mc_group_result.p_result_madw);
-		mc_group_result.p_result_madw = NULL;
-	}
+	if (mc_group_result.p_result_madw)
+		free(mc_group_result.p_result_madw - umad_size());
 
 	return (status);
 }
@@ -1366,78 +1402,30 @@ static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h,
 
 static osm_bind_handle_t get_bind_handle(void)
 {
-	uint32_t i = 0;
-	uint64_t port_guid = (uint64_t) - 1;
-	osm_bind_handle_t h;
-	ib_api_status_t status;
-	ib_port_attr_t attr_array[MAX_PORTS];
-	uint32_t num_ports = MAX_PORTS;
-	uint32_t ca_name_index = 0;
-
-	complib_init();
-
-	osm_log_construct(&log_osm);
-	if ((status = osm_log_init_v2(&log_osm, TRUE, 0x0001, NULL,
-				      0, TRUE)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to init osm_log: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
-	osm_log_set_level(&log_osm, OSM_LOG_NONE);
-	if (ibdebug)
-		osm_log_set_level(&log_osm, OSM_LOG_DEFAULT_LEVEL);
-
-	vendor = osm_vendor_new(&log_osm, ibd_timeout);
-	osm_mad_pool_construct(&mad_pool);
-	if ((status = osm_mad_pool_init(&mad_pool)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to init mad pool: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
+	static struct sa_bind_handle handle;
+	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
 
-	if ((status =
-	     osm_vendor_get_all_port_attr(vendor, attr_array,
-					  &num_ports)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to get port attributes: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 
-	for (i = 0; i < num_ports; i++) {
-		if (i > 1 && cl_ntoh64(attr_array[i].port_guid)
-		    != (cl_ntoh64(attr_array[i - 1].port_guid) + 1))
-			ca_name_index++;
-		if (sa_port_num && sa_port_num != attr_array[i].port_num)
-			continue;
-		if (sa_hca_name
-		    && strcmp(sa_hca_name,
-			      vendor->ca_names[ca_name_index]) != 0)
-			continue;
-		if (attr_array[i].link_state == IB_LINK_ACTIVE) {
-			port_guid = attr_array[i].port_guid;
-			break;
-		}
-	}
+	ib_resolve_smlid(&handle.dport, ibd_timeout);
+	if (!handle.dport.lid)
+		IBPANIC("No SM found.");
 
-	if (port_guid == (uint64_t) - 1) {
-		fprintf(stderr,
-			"Failed to find active port, check port status with \"ibstat\"\n");
-		exit(-1);
-	}
+	handle.dport.qp = 1;
+	if (!handle.dport.qkey)
+		handle.dport.qkey = IB_DEFAULT_QP1_QKEY;
 
-	h = osmv_bind_sa(vendor, &mad_pool, port_guid);
+	handle.fd = madrpc_portid();
+	handle.agent = umad_register(handle.fd, IB_SA_CLASS, 2, 1, NULL);
 
-	if (h == OSM_BIND_INVALID_HANDLE) {
-		fprintf(stderr, "Failed to bind to SA\n");
-		exit(-1);
-	}
-	return h;
+	return &handle;
 }
 
-static void clean_up(void)
+static void clean_up(struct sa_bind_handle *h)
 {
-	osm_mad_pool_destroy(&mad_pool);
-	osm_vendor_delete(&vendor);
+	umad_unregister(h->fd, h->agent);
+	umad_close_port(h->fd);
+	umad_done();
 }
 
 static const struct query_cmd query_cmds[] = {
@@ -1847,7 +1835,7 @@ int main(int argc, char **argv)
 
 	if (src_lid)
 		free(src_lid);
-	clean_up();
+	clean_up(h);
 	close_node_name_map(node_name_map);
 	return (status);
 }
-- 
1.6.1.2.319.gbd9e


From sean.hefty at intel.com  Tue Feb 10 17:38:02 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 10 Feb 2009 17:38:02 -0800
Subject: [ofa-general] RE: [PATCH] infiniband-diags/saquery: remove osm
	vendor layer
In-Reply-To: <20090211013441.GR26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
	<20090211013441.GR26139@sashak.voltaire.com>
Message-ID: <B8B0DC7003D54B889391A9045A9BAA65@amr.corp.intel.com>

>Replace OSM Vendor layer by libibumad and libibmad (rpc) calls.
>
>This patch is done following "minimum changes" rule to demonstrate osm
>vendor replacement. Many subsequent improvements and simplification can
>be done. All current saquery functionality is preserved.
>
>Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>---
>
>On 15:55 Mon 09 Feb     , Sean Hefty wrote:
>> Changing saquery didn't
>> look that hard to me, but it did look like it would modify a fair portion of
>the
>> code.
>
>Cannot resist... :)

Excellent! - thanks Sasha!  It even reduced the codebase too.

- Sean


From sashak at voltaire.com  Tue Feb 10 17:46:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 03:46:35 +0200
Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>
Message-ID: <20090211014635.GS26139@sashak.voltaire.com>

On 16:34 Mon 09 Feb     , Smith, Stan wrote:
> 
> Path of least resistance thinking would point towards not doing a switch as the vendor-ibal to vendor-ibumad would be part of the Windows OpenSM migration to OFED 1.4x OpenSM.
> My thinking is that making a switch to vendor-ibumad would be a good deal more work/involved just to get saquery working.

For just saquery it would be overkill. (BTW I posted a patch which removes
the osm vendor calls from saquery - hopefully the need to extend
vendor-ibal will be eliminated soon).

I was thinking about vendor switching in the context of OpenSM itself - in
order to unify the OpenSM/umad access layer between different systems (and
eventually to clean up all that osm vendor mess).

> Not knowing the Windows OpenSM code base, moving part of it forward seems like a task 'which' could blossom into a good deal more work for the small return of saquery working?
> Frankly, I'd rather see work put into getting OFED OpenSM 1.4 working on Windows.

Sure, it could be done as part of the WinOF OpenSM upgrade process (doing
this just for fun against an outdated OpenSM codebase doesn't buy much).

Sasha


From Minoru.Hamakawa at Sun.COM  Tue Feb 10 18:51:09 2009
From: Minoru.Hamakawa at Sun.COM (Minoru Hamakawa)
Date: Wed, 11 Feb 2009 11:51:09 +0900
Subject: [ofa-general] Unable to handle kernel NULL pointer dereference
Message-ID: <49923D1D.4090202@Sun.COM>

Hi experts,

Does anyone recognize the following panic?
--
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
 [<ffffffff8003686e>] kref_get+0x1/0x3d
...
--

It occurs when we remove the IB cable from the HCA and then reinsert it.
The HCA is an X4217A-Z (Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe
2.0 2.5GT/s] (rev a0)).
OFED is 1.3.1.
The kernel is 2.6.18-92.1.10.el5_lustre.1.6.6.20081218100335smp
(a Lustre-patched kernel).

Thank you in advance for your kind attention.
Should you have any queries please feel free to contact me.
I would appreciate hearing from you at your earliest convenience.

I'm not subscribed to this list, so please reply directly to me.

Best regards,
Minoru Hamakawa


ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
 [<ffffffff8003686e>] kref_get+0x1/0x3d
PGD 40a158067 PUD 40a364067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:0b:00.1/irq
CPU 0
Pid: 3752, comm: ib_mad1 Tainted: GF 2.6.18-92.1.10.el5_lustre.1.6.6.20081218100335smp #1
RIP: 0010:[<ffffffff8003686e>]  [<ffffffff8003686e>] kref_get+0x1/0x3d
RSP: 0018:ffff8104184f5cf0  EFLAGS: 00010002
RAX: ffff81040dcf3000 RBX: ffff81040dcf3000 RCX: 0000000000000000
RDX: 0000000000000100 RSI: ffff8104189f4dc0 RDI: 0000000000000008
RBP: ffff81040dcf3130 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810416036828
R13: ffff8104189f4c18 R14: ffff8104189f4c00 R15: ffff8104184f7280
FS:  0000000000000000(0000) GS:ffffffff803ea000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000040a2f1000 CR4: 00000000000006e0
Process ib_mad1 (pid: 3752, threadinfo ffff8104184f4000, task ffff81041b993100)
Stack:  ffff81040dcf3000 ffffffff88585668 040000001d420301 032801001dbe4000
 ae64ffff88432100 0000c0fe00007fff 00ba030001000000 0000001ac9cc0001
 4580a0d000000000 000000d000000000 fc89ef3000000000 0000000000002ad7
Call Trace:
 [<ffffffff88585668>] :ib_sa:notice_handler+0xaf/0x10b
 [<ffffffff883f8fd1>] :ib_mad:ib_mad_completion_handler+0x433/0x5e0
 [<ffffffff883f8b9e>] :ib_mad:ib_mad_completion_handler+0x0/0x5e0
 [<ffffffff8004cd60>] run_workqueue+0x94/0xe4
 [<ffffffff8004966b>] worker_thread+0x0/0x122
 [<ffffffff8009dcac>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004975b>] worker_thread+0xf0/0x122
 [<ffffffff8008acce>] default_wake_function+0x0/0xe
 [<ffffffff8009dcac>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009dcac>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003243b>] kthread+0xfe/0x132
 [<ffffffff8009dcac>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dcac>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003233d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


From sashak at voltaire.com  Tue Feb 10 19:13:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 05:13:38 +0200
Subject: [ofa-general] [PATCH] libibmad/mad.h: define more SA attributes
Message-ID: <20090211031338.GT26139@sashak.voltaire.com>


Define some more SA attributes.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 libibmad/include/infiniband/mad.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 3095f34..bd62ec7 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -127,12 +127,23 @@ enum SMI_ATTR_ID {
 enum SA_ATTR_ID {
 	IB_SA_ATTR_NOTICE = 0x02,
 	IB_SA_ATTR_INFORMINFO = 0x03,
+	IB_SA_ATTR_NODERECORD = 0x11,
 	IB_SA_ATTR_PORTINFORECORD = 0x12,
+	IB_SA_ATTR_SL2VLTABLERECORD = 0x13,
+	IB_SA_ATTR_SWITCHINFORECORD = 0x14,
+	IB_SA_ATTR_LFTRECORD = 0x15,
+	IB_SA_ATTR_RFTRECORD = 0x16,
+	IB_SA_ATTR_MFTRECORD = 0x17,
+	IB_SA_ATTR_SMINFORECORD = 0x18,
 	IB_SA_ATTR_LINKRECORD = 0x20,
+	IB_SA_ATTR_GUIDINFORECORD = 0x30,
 	IB_SA_ATTR_SERVICERECORD = 0x31,
+	IB_SA_ATTR_PKEYTABLERECORD = 0x33,
 	IB_SA_ATTR_PATHRECORD = 0x35,
+	IB_SA_ATTR_VLARBTABLERECORD = 0x36,
 	IB_SA_ATTR_MCRECORD = 0x38,
 	IB_SA_ATTR_MULTIPATH = 0x3a,
+	IB_SA_ATTR_INFORMINFORECORD = 0xf3,
 
 	IB_SA_ATTR_LAST
 };
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Tue Feb 10 19:14:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 05:14:13 +0200
Subject: [ofa-general] [PATCH] libibmad/fields.c: define SA SM_Key field
	details
Message-ID: <20090211031413.GU26139@sashak.voltaire.com>


Define SA SM_Key field details (offset, length, name, dump_function).
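
As a rough illustration of what the new table entry describes: a bit offset
of 288 with a length of 64 bits corresponds to bytes 36-43 of the MAD, in
network (big-endian) byte order. The helper below is a hypothetical,
standalone sketch with made-up names; it is not the libibmad implementation
and only handles byte-aligned fields of up to 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	SA_SMKEY_BIT_OFFSET = 288,	/* values taken from the patch */
	SA_SMKEY_BIT_LEN = 64,
};

/*
 * Hypothetical helper: read a byte-aligned, big-endian field of up to
 * 64 bits from a raw MAD buffer and return it in host byte order.
 */
uint64_t get_be_field(const uint8_t *mad, unsigned bit_offset, unsigned bit_len)
{
	uint64_t v = 0;
	size_t first = bit_offset / 8;	/* 288 / 8 = byte 36 */
	size_t n = bit_len / 8;		/* 64 / 8 = 8 bytes  */
	size_t i;

	assert(bit_offset % 8 == 0 && bit_len % 8 == 0 && bit_len <= 64);

	/* Most significant byte comes first on the wire. */
	for (i = 0; i < n; i++)
		v = (v << 8) | mad[first + i];
	return v;
}
```

Calling get_be_field(mad, SA_SMKEY_BIT_OFFSET, SA_SMKEY_BIT_LEN) on a
256-byte MAD buffer would return the SM_Key stored at bytes 36-43.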

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 libibmad/src/fields.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index 08d0ccb..89581dc 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -95,7 +95,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex},
 
 	/* word 10,11 (36-43 bytes) */
-	{0, 0},			/* IB_SA_MKEY_F - reserved as invalid */
+	{288, 64, "SaSMkey", mad_dump_hex},
 
 	/* word 12 (44-47 bytes) */
 	{BE_OFFS(46 * 8, 16), "SaAttrOffs", mad_dump_uint},
-- 
1.6.1.2.319.gbd9e


From Jie.Cai at cs.anu.edu.au  Tue Feb 10 23:12:19 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Wed, 11 Feb 2009 18:12:19 +1100
Subject: [ofa-general] uDAPL multi-rail (multi-IAs) sample program??
Message-ID: <49927A53.1020403@cs.anu.edu.au>


Is there any sample program for utilizing multi-rail to do RDMA 
communications?

At each node, multiple IAs are opened, one per HCA port, and
RDMA writes then go from one side to the other through both rails.

If anyone has experience with this or has some sample code, please let me
know.

Big thanks.

-- 
Mr. Jie Cai


From rdreier at cisco.com  Tue Feb 10 23:20:39 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 23:20:39 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210.172347.189515015.davem@davemloft.net> (David Miller's
	message of "Tue, 10 Feb 2009 17:23:47 -0800 (PST)")
References: <499223F8.1010204@opengridcomputing.com>
	<20090210.170740.208470781.davem@davemloft.net>
	<adaeiy5ahza.fsf@cisco.com>
	<20090210.172347.189515015.davem@davemloft.net>
Message-ID: <ada4oz1a188.fsf@cisco.com>

 > Must be compiler and platform specific because with gcc-4.1.3 on
 > sparc with -O2, for the test program:
 > 
 > unsigned long page_size[4];
 > 
 > int main(int argc)
 > {
 >         unsigned long long x = argc;
 > 
 >         return x % (1UL << (12 + page_size[argc]));
 > }
 > 
 > I get a call to __umoddi3:

You're not testing the same thing.  The original code was:

		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
				(1UL << (12 + page_size[i])));

and it's not that easy to see with all the parentheses, but the
expression being done is (u32) % (unsigned long).  So rather than
unsigned long long in your program, you should have just done unsigned
(u32 is unsigned int on all Linux architectures).  In that case gcc does
not generate a call to any library function in all the versions I have
handy, although gcc 4.1 does do a div instead of an and.  (And I don't
think any 32-bit architectures require a library function for (unsigned)
% (unsigned), so the code should be OK)

Your example shows that gcc is missing a strength reduction opportunity
in not handling (u64) % (unsigned long) on 32 bit architectures, but I
guess it is a more difficult optimization to do, since gcc has to know
that it can simply zero the top 32 bits.
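
To make the power-of-two point concrete, here is a minimal sketch (the
function names are made up for illustration, and it assumes the shift is
smaller than the width of unsigned long). Because the divisor is
1UL << shift, the remainder of a 32-bit value is just its low bits, so a
mask gives the same result as '%' without any divide:

```c
#include <assert.h>
#include <stdint.h>

/* Remainder via '%', the same shape as the expression quoted above. */
uint32_t offset_mod(uint32_t addr, unsigned shift)
{
	return (uint32_t)(addr % (1UL << shift));
}

/* Same remainder via a mask; valid only because the divisor is 2^shift. */
uint32_t offset_mask(uint32_t addr, unsigned shift)
{
	return (uint32_t)(addr & ((1UL << shift) - 1));
}

/* Compare both forms on a few sample addresses and example shifts. */
void check_equivalence(void)
{
	static const uint32_t samples[] = {
		0u, 1u, 0xfffu, 0x1000u, 0x12345678u, 0xffffffffu
	};
	unsigned shift, i;

	for (shift = 12; shift <= 15; shift++)
		for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
			assert(offset_mod(samples[i], shift) ==
			       offset_mask(samples[i], shift));
}
```

Whether a given compiler turns the '%' form into the mask form by itself
depends on the compiler version and target, as the thread shows; writing
the mask explicitly removes that dependency.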

 - R.


From davem at davemloft.net  Wed Feb 11 00:00:49 2009
From: davem at davemloft.net (David Miller)
Date: Wed, 11 Feb 2009 00:00:49 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <ada4oz1a188.fsf@cisco.com>
References: <adaeiy5ahza.fsf@cisco.com>
	<20090210.172347.189515015.davem@davemloft.net>
	<ada4oz1a188.fsf@cisco.com>
Message-ID: <20090211.000049.193727089.davem@davemloft.net>

From: Roland Dreier <rdreier at cisco.com>
Date: Tue, 10 Feb 2009 23:20:39 -0800

>  > unsigned long page_size[4];
>  > 
>  > int main(int argc)
>  > {
>  >         unsigned long long x = argc;
>  > 
>  >         return x % (1UL << (12 + page_size[argc]));
>  > }
>  > 
>  > I get a call to __umoddi3:
> 
> You're not testing the same thing.  The original code was:
> 
> 		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 				(1UL << (12 + page_size[i])));
> 
> and it's not that easy to see with all the parentheses, but the
> expression being done is (u32) % (unsigned long).  So rather than
> unsigned long long in your program, you should have just done unsigned
> (u32 is unsigned int on all Linux architectures).  In that case gcc does
> not generate a call to any library function in all the versions I have
> handy, although gcc 4.1 does do a div instead of an and.  (And I don't
> think any 32-bit architectures require a library function for (unsigned)
> % (unsigned), so the code should be OK)
> 
> Your example shows that gcc is missing a strength reduction opportunity
> in not handling (u64) % (unsigned long) on 32 bit architectures, but I
> guess it is a more difficult optimization to do, since gcc has to know
> that it can simply zero the top 32 bits.

Indeed, I get the divide if I use "unsigned int" for "x".

I still think you should make this change, as many systems out
there are getting the expensive divide.

main:
	sethi	%hi(page_size), %g1
	or	%g1, %lo(page_size), %g1
	mov	%o0, %g3
	sll	%o0, 2, %g4
	ld	[%g1+%g4], %g2
	mov	1, %g1
	add	%g2, 12, %g2
	sll	%g1, %g2, %g1
	wr	%g0, %g0, %y
	nop
	nop
	nop
	udiv	%o0, %g1, %o0
	smul	%o0, %g1, %o0
	jmp	%o7+8
	 sub	%g3, %o0, %o0


From dorfman.eli at gmail.com  Wed Feb 11 01:22:24 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 11 Feb 2009 11:22:24 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet
	configuration after SIGHUP
In-Reply-To: <20090209141732.GF26139@sashak.voltaire.com>
References: <20090203124407.GE11874@sashak.voltaire.com>
	<49884962.5070601@gmail.com>
	<20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
	<694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
	<20090208213826.GA24254@sashak.voltaire.com>
	<4990340A.10004@gmail.com>
	<20090209141732.GF26139@sashak.voltaire.com>
Message-ID: <499298D0.5060804@gmail.com>

Sasha Khapyorsky wrote:
> On 15:47 Mon 09 Feb     , Eli Dorfman (Voltaire) wrote:
>> Sasha Khapyorsky wrote:
>>> Hi Eli,
>>>
>>> On 21:23 Sun 08 Feb     , Eli Dorfman wrote:
>>>> yes, but wouldn't it be better to separate between heavy sweep and
>>>> config rescan (due to SIGHUP).
>>> SIGHUP main purpose always was to trigger heavy sweep.
>>>
>>>> I think that user should know when configuration is updated and not
>>>> wait for heavy sweep.
>>> I'm not following - SIGHUP will cause heavy sweep and config update,
>>> where is a waiting?
>>>
>> i meant that if the user is changing config file and there is a heavy sweep then
>> config may be updated,
> 
> Are you about race between file reading (by OpenSM) and writing (by
> user)? Using write lock on reading would solve an issue.
> 
>> while using specific flag for config rescan will avoid this case.
> 
> What do you mean by "specific flag"? Using separate signal? Assuming so,
> this will not prevent read/write race.
> 

At the moment force_heavy_sweep is set in many places, including after SIGHUP.
OpenSM rescans the configuration file whenever this flag is set, so if a link changes
in the subnet while the user is modifying the file, OpenSM may pick up the configuration
before the user has finished updating it.
Using another flag (e.g. rescan_config_file) that is set only after SIGHUP would
ensure that OpenSM updates the subnet configuration only once the user has finished updating the file.
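
A minimal sketch of the two-flag idea, just to show the intended behaviour.
This is not the code in osm_state_mgr.c: the flag names come from the text
above, while the handler and helper names are hypothetical.

```c
#include <signal.h>

/* Set by anything that wants a heavy sweep (link change, SIGHUP, ...). */
static volatile sig_atomic_t force_heavy_sweep;
/* Set only by SIGHUP, i.e. only when the user asks for a config rescan. */
static volatile sig_atomic_t rescan_config_file;

/* Hypothetical SIGHUP handler: request both a rescan and a heavy sweep. */
void on_sighup(int signo)
{
	(void)signo;
	rescan_config_file = 1;
	force_heavy_sweep = 1;
}

/* Hypothetical hook for a subnet event: heavy sweep only, no rescan. */
void on_link_change(void)
{
	force_heavy_sweep = 1;
}

/*
 * Hypothetical start-of-sweep helper: clears both flags and returns 1
 * only if the config file should be re-read during this sweep.
 */
int sweep_should_rescan(void)
{
	int rescan = rescan_config_file;

	rescan_config_file = 0;
	force_heavy_sweep = 0;
	return rescan;
}
```

With this split, a link change during editing triggers a heavy sweep but
leaves the half-edited file alone; the file is read only after the user
sends SIGHUP.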


From nicolas.morey-chaisemartin at ext.bull.net  Wed Feb 11 01:25:26 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Wed, 11 Feb 2009 10:25:26 +0100
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing	between
	non-CN nodes
In-Reply-To: <20090207202319.GE27757@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
Message-ID: <49929986.40106@ext.bull.net>

Sasha Khapyorsky wrote:
> On 20:48 Sat 07 Feb     , Nicolas Morey-Chaisemartin wrote:
>   
>>> "IO" is specific for your setup. Could we find more generic name for such
>>> nodes?
>>>
>>>   
>>>       
>> Sure. Any ideas?
>>     
>
> No, I didn't think about it.
>
>   
I've rebased and fixed the patches against master. I just need a name for
the configuration.
What about high nodes (HN) as it concerns only nodes which are not at 
the bottom of the fat tree?

Nicolas


From acceptany at gmail.com  Wed Feb 11 03:03:04 2009
From: acceptany at gmail.com (Jordan)
Date: Wed, 11 Feb 2009 19:03:04 +0800
Subject: [ofa-general] ***SPAM*** problem about adding a new routing
	algorithm in opensm
Message-ID: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com>

I want to add a new routing algorithm to opensm. Can this be supported
by opensm?

From vlad at lists.openfabrics.org  Wed Feb 11 03:14:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 11 Feb 2009 03:14:15 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090211-0200 daily build status
Message-ID: <20090211111415.C22DBE60E5A@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Wed Feb 11 03:43:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 13:43:47 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing
	between non-CN nodes
In-Reply-To: <49929986.40106@ext.bull.net>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
	<49929986.40106@ext.bull.net>
Message-ID: <20090211114347.GA27920@sashak.voltaire.com>

On 10:25 Wed 11 Feb     , Nicolas Morey Chaisemartin wrote:
> What about high nodes (HN) as it concerns only nodes which are not at the 
> bottom of the fat tree?

Could be fine. Let's ask Yevgeny too... :)

Yevgeny! Any idea for a more generic name for io_nodes?

Sasha


From sashak at voltaire.com  Wed Feb 11 03:52:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 13:52:55 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan
	subnet configuration after SIGHUP
In-Reply-To: <499298D0.5060804@gmail.com>
References: <20090203134831.GI11874@sashak.voltaire.com>
	<498850A2.8090701@gmail.com>
	<20090205000323.GN11874@sashak.voltaire.com>
	<498A9888.5010003@gmail.com>
	<20090205121634.GQ11874@sashak.voltaire.com>
	<694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com>
	<20090208213826.GA24254@sashak.voltaire.com>
	<4990340A.10004@gmail.com>
	<20090209141732.GF26139@sashak.voltaire.com>
	<499298D0.5060804@gmail.com>
Message-ID: <20090211115247.GB27920@sashak.voltaire.com>

On 11:22 Wed 11 Feb     , Eli Dorfman (Voltaire) wrote:
> 
> At the moment force_heavy_sweep is set in many places and also after SIGHUP.
> opensm rescans the configuration file when this flag is set, so if there is link change
> in the subnet while the user is modifying the file, the opensm may update the configuration
> even if the user didn't finish updating it.

So what is your concern here? That OpenSM rescans an unmodified file or
that the file is potentially broken?

> Using another flag (e.g. rescan_config_file) that will be set only after SIGHUP will 
> assure that opensm updates subnet configuration when user finished updating the file.

Send SIGHUP. OpenSM will rescan the config again and do a heavy sweep.
Where is the problem?

Sasha


From sashak at voltaire.com  Wed Feb 11 04:39:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 14:39:34 +0200
Subject: [ofa-general] ***SPAM*** problem about adding a new routing
	algorithm in opensm
In-Reply-To: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com>
References: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com>
Message-ID: <20090211123926.GE27920@sashak.voltaire.com>

On 19:03 Wed 11 Feb     , Jordan wrote:
> I want to add a new routing algorithm in opensm ,

What is this algorithm and how is it different from existing ones?

> can this idea be supported
> by opensm ?

This idea is already supported by OpenSM - look at 'struct
osm_routing_engine' (in osm_opensm.h) and how it is used.

Sasha


From sashak at voltaire.com  Wed Feb 11 04:40:43 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 14:40:43 +0200
Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in
	opensm?
In-Reply-To: <91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>
References: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>
	<91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>
Message-ID: <20090211124043.GF27920@sashak.voltaire.com>

On 09:23 Wed 11 Feb     , Jordan wrote:
> If this can be done , is there a simulator to test this new
> algorithm and dump some results?

Yes, ibsim.

Sasha


From kliteyn at dev.mellanox.co.il  Wed Feb 11 05:26:31 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 11 Feb 2009 15:26:31 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing	between
	non-CN nodes
In-Reply-To: <20090211114347.GA27920@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
	<49929986.40106@ext.bull.net>
	<20090211114347.GA27920@sashak.voltaire.com>
Message-ID: <4992D207.6010701@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 10:25 Wed 11 Feb     , Nicolas Morey Chaisemartin wrote:
>> What about high nodes (HN) as it concerns only nodes which are not at the 
>> bottom of the fat tree?
> 
> Could be fine. Let's ask Yevgeny too... :)
> 
> Yevgeny! Any idea about io_nodes more generic name?

Ugh...

"IO nodes":
Pros: the name is closer to the reality, since in most cases
the nodes that would need special treatment are indeed IO nodes.
Cons: the name is not "general"...

"High nodes"
Pros: general name with kinda "hint" to the special treatment.
Cons: the "hint" is rather vague...

Bottom line - I'm OK with both options (slightly leaning toward IO),
as long as it is described well enough in the help message and in man :)

-- Yevgeny

> Sasha
> 


From sashak at voltaire.com  Wed Feb 11 06:04:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 16:04:13 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_ucast_ftree.c Fixed bad init
	value for down port index
In-Reply-To: <49914E91.4090305@ext.bull.net>
References: <49914E91.4090305@ext.bull.net>
Message-ID: <20090211140413.GJ27920@sashak.voltaire.com>

On 10:53 Tue 10 Feb     , Nicolas Morey Chaisemartin wrote:
> Fixes the init value of down_port_groups_idx to 0 so it's in the port group 
> interval.
> This way __osm_ftree_fabric_route_upgoing_by_going_down can use the index 
> directly without segfaulting.
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

> ---
>  opensm/opensm/osm_ucast_ftree.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/opensm/opensm/osm_ucast_ftree.c 
> b/opensm/opensm/osm_ucast_ftree.c
> index 4e65c87..eae1ed8 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN 
> ftree_fabric_t * p_ftree,
>  	/* initialize lft buffer */
>  	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>
> -	p_sw->down_port_groups_idx = -1;
> +	p_sw->down_port_groups_idx = 0;

I made it 'unsigned int' (instead of 'int') after all.

Sasha

>
>  	return p_sw;
>  }				/* __osm_ftree_sw_create() */
> -- 
> 1.6.1
>


From sashak at voltaire.com  Wed Feb 11 06:07:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 16:07:09 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing
	between non-CN nodes
In-Reply-To: <4992D207.6010701@dev.mellanox.co.il>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
	<49929986.40106@ext.bull.net>
	<20090211114347.GA27920@sashak.voltaire.com>
	<4992D207.6010701@dev.mellanox.co.il>
Message-ID: <20090211140703.GK27920@sashak.voltaire.com>

On 15:26 Wed 11 Feb     , Yevgeny Kliteynik wrote:
>
> Bottom line - I'm OK with both options (slightly leaning toward IO),
> as long as it is described well enough in the help message and in man :)

Ok, no clear opinions. Nicolas, it is your decision about name :)

Sasha


From subbukl at gmail.com  Wed Feb 11 06:18:11 2009
From: subbukl at gmail.com (subbu kl)
Date: Wed, 11 Feb 2009 19:48:11 +0530
Subject: ***SPAM*** Re: [ofa-general] Fwd: pciback module not working
In-Reply-To: <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
Message-ID: <f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>

I am getting the same QUERY_FW failure on RHEL5.2 with a xen paravirtualized
guest using the pciback module.

No one seems to have tried answering this question on the list, so let me ping
xen-devel and ofed people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual
slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1
(x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to
a read-only configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing
list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16
(level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0
disabled


some more details - [root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root at p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:

> Okay so my question to the openfabrics guys is, why would the OFED
> drivers fail to read the firmware?
>
> Any thoughts?
>
> Thanks,
> - David Brown
>
>
> ---------- Forwarded message ----------
> From: David Brown <dmlb2000 at gmail.com>
> Date: Thu, Sep 11, 2008 at 2:24 PM
> Subject: pciback module not working
> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>
>
> This issue was brought up about a year and a half ago. So I'll bring
> it up again and see if anything happens.
>
> I've got an infiniband network and am attempting to pass the
> infiniband card through the host and give it to the guest.
> I'm working with standard CentOS 5.2 on both guest and host with their
> provided xen (3.0.3 ish). I've also attempted to install the newest
> Xen 3.3 and use their standard host kernel and that did the same
> thing. The guest dmesg output in the guest is similar on both
> permissive and normal mode.
>
> I'm getting issues with detecting the firmware on the card for some
> reason...
>
> Any help would be appreciated.
>
> Thanks,
> - David Brown
>
> === GUEST dmesg output ===
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
> ib_mthca: Initializing 0000:00:00.0
> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
> PCI: Setting latency timer of device 0000:00:00.0 to 64
> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
> ib_mthca: probe of 0000:00:00.0 failed with error -11
> =======================
>
> === Host modprobe.conf ===
> alias eth0 bnx2
> alias eth1 bnx2
> alias scsi_hostadapter cciss
> options pciback hide=(41:00.0)
> =====================
>
> === Host lspci output ===
> # lspci -vs 41:00.0
> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
>       Subsystem: Hewlett-Packard Company Unknown device 170a
>       Flags: fast devsel, IRQ 16
>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>       Capabilities: [40] Power Management version 2
>       Capabilities: [48] Vital Product Data
>       Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
> Enable-
>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>       Capabilities: [60] Express Endpoint IRQ 0
> =====================
>
> This makes sure it get loaded first off before anything else.
> === Host mkinitrd cmd ===
> # mkinitrd -f --with=pciback --preload pciback
> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
> ====================
>
> === Host pciback dmesg ===
> pciback 0000:41:00.0: Driver tried to write to a read-only
> configuration space field at offset 0x44, size 2. This may be
> harmless, but if you have problems with your device:
> 1) see permissive attribute in sysfs
> 2) report problems to the xen-devel mailing list along with details of
> your device obtained from lspci.
> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> PCI: Setting latency timer of device 0000:41:00.0 to 64
> ACPI: PCI interrupt for device 0000:41:00.0 disabled
> ======================
>
> === Host pciback dmesg (after setting it permissive) ===
> pciback 0000:41:00.0: enabling permissive mode configuration space
> accesses!
> pciback 0000:41:00.0: permissive mode is potentially unsafe!
> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
> device vif1.0 entered promiscuous mode
> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> PCI: Setting latency timer of device 0000:41:00.0 to 64
> ACPI: PCI interrupt for device 0000:41:00.0 disabled
> =========================================
>
> === Guest lspci output ===
> # lspci -v
> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
>       Subsystem: Hewlett-Packard Company Unknown device 170a
>       Flags: fast devsel, IRQ 16
>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>       Capabilities: [40] Power Management version 2
>       Capabilities: [48] Vital Product Data
>       Capabilities: [90] Message Signalled Interrupts: 64bit+
> Queue=0/5 Enable-
>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>       Capabilities: [60] Express Endpoint IRQ 0
> =====================
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"

From olga.shern at gmail.com  Wed Feb 11 06:34:32 2009
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Wed, 11 Feb 2009 16:34:32 +0200
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4990CD57.3080108@oracle.com>
References: <4990CD57.3080108@oracle.com>
Message-ID: <bc457d660902110634n5e24f99cv720872ea2b8f60fb@mail.gmail.com>

Hi Sumeet,

You can read from the ipoib release notes:
"If IPoIB connected mode is enabled, it uses a large MTU for connected mode
   messages and a small MTU for datagram (in particular, multicast) messages,
   and relies on path MTU discovery to adjust MTU appropriately. Packets sent
   in the window before MTU discovery automatically reduces the MTU for a
   specific destination will be dropped, producing the following message in the
   system log:
   "packet len <actual length> (> <max allowed length>) too long to
send, dropping"

   To warn about this, a message is produced in the system log each time MTU is
   set to a value higher than 2K."


Olga

On Tue, Feb 10, 2009 at 2:41 AM, Sumeet Lahorani
<sumeet.lahorani at oracle.com> wrote:
> When we enable IB connected mode and increase MTU to 65520, we see the
> following in /var/log/messages
>
> Feb  6 17:48:32 dadzab01 kernel: ib0: enabling connected mode will cause
> multicast packet drops
> Feb  6 17:48:32 dadzab01 kernel: ib0: mtu > 2044 will cause multicast packet
> drops.
> Feb  6 17:48:32 dadzab01 kernel: ib1: enabling connected mode will cause
> multicast packet drops
> Feb  6 17:48:32 dadzab01 kernel: ib1: mtu > 2044 will cause multicast packet
> drops.
>
> Should we not be doing this? What kind of multicast packets will be dropped?
>
> If we are not using multicast, do any OFED drivers (bonding, ipoib etc)
> internally use multicast in a way that will cause them to not work correctly
> in connected mode?
>
> We are using OFED 1.3.1.
>
> - Sumeet


From tziporet at mellanox.co.il  Wed Feb 11 07:09:35 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 11 Feb 2009 17:09:35 +0200
Subject: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>


These are the OFED (EWG) meeting minutes for Feb 09 on OFED 1.4.1
release

Meeting Summary:
==============
1. Agreed on 1.4.1 release schedule - GA is planned for April 7
2. Reviewed 1.4.1 status
3. Reviewed Sonoma agenda

Details:
======
1. OFED 1.4.1 schedule:
*	RC1 - Mar 3
*	RC2 - Mar 17
*	RC3 - Mar 31
*	GA  - Apr 7

2. OFED 1.4.1 release status:
*	New OSes:
	*	RH 5.3 - done
	*	SLES 11 - schedule is OK. RC3 already available, need to create
		backports
		Tziporet to check with Novell if we can place the sources on the
		OFA server
*	Open MPI - we will take 1.3.1
*	RDS with iWARP support - good progress
*	NFS/RDMA backports - RHEL 5.2 should be ready in 2 weeks
*	Critical bug fixes
	As far as I know these are the critical bugs that should be fixed:
	1383	blo	jackm at mellanox.co.il	Local protection error on
		transmit from ipoib datagram to... - in progress
	1287	maj	jackm at mellanox.co.il	IPoIB datagram mode initial
		packet loss - we will check if we can fix this
*	Need to add 1.4.1 to bugzilla

3. Sonoma updates from Bill Boas:
Bill sent the agenda - and we reviewed it in the meeting
Comments and suggestions should be sent to Bill.
There is a need for more PR - if companies are willing to put the Press
release on their web site


Tziporet


From ogerlitz at Voltaire.com  Wed Feb 11 07:11:54 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 11 Feb 2009 17:11:54 +0200
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4990CD57.3080108@oracle.com>
References: <4990CD57.3080108@oracle.com>
Message-ID: <4992EABA.9090605@Voltaire.com>

Sumeet Lahorani wrote:
> When we enable IB connected mode and increase MTU to 65520, we see the following
> kernel: ib0: enabling connected mode will cause multicast packet drops
> kernel: ib0: mtu > 2044 will cause multicast packet drops.

> Should we not be doing this? What kind of multicast packets will be dropped?
> If we are not using multicast, do any drivers (bonding, ipoib etc) internally use 
> multicast in a way that will cause them to not work correctly in connected mode? 

Connected mode is supported only for unicast traffic; multicast traffic keeps going over the IB UD QP, whose MTU is much lower (e.g. 2-4K). To close the gap between the MTU seen by the network stack and the MTU used by the UD QP, IPoIB emulates receiving an ICMP packet that tells the OS stack to use a different path MTU for this multicast neighbour, see

ipoib_start_xmit --> 
  ipoib_send --> 
   ipoib_cm_skb_too_long(mcast_mtu) --> 
    skb->dst->ops->update_pmtu(skb->dst, mtu)

When IP multicast is not used, the network stack and bonding use multicast only to send ARPs on the broadcast group, plus IGMP, and both are well below the IB MTU in size.
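The drop-then-adjust behaviour described above can be modelled in a few lines. This is a userspace sketch under stated assumptions, not the IPoIB driver: `struct toy_dst` and `toy_mcast_xmit` are hypothetical names, and the MTU values used in the usage note (65520 and 2044) are simply the ones from the log messages quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-destination state; not the kernel's struct dst_entry. */
struct toy_dst {
	unsigned int pmtu;	/* path MTU the stack currently believes in */
};

/*
 * Returns 1 if the packet may be sent, 0 if it is dropped.
 * On a drop the path MTU is lowered, so that later packets to this
 * destination are sized to fit the multicast (UD) MTU.
 */
static int toy_mcast_xmit(struct toy_dst *dst, unsigned int len,
			  unsigned int mcast_mtu)
{
	assert(dst != NULL);
	if (len > mcast_mtu) {
		dst->pmtu = mcast_mtu;	/* emulate the "use a smaller path MTU" hint */
		return 0;		/* this packet is dropped */
	}
	return 1;
}
```

Starting with a path MTU of 65520, a 4000-byte multicast packet is dropped and the path MTU falls to 2044; a following 1500-byte packet goes through. That first dropped packet is the loss the kernel warning refers to.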

Or.


From swise at opengridcomputing.com  Wed Feb 11 07:44:42 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 09:44:42 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <adamyctajv8.fsf@cisco.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com>
Message-ID: <4992F26A.4030800@opengridcomputing.com>

Roland Dreier wrote:
> I'll roll this into the offending patch (that is in -next).
>
> But:
>
>  > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>  > -				(1UL << (12 + page_size[i])));
>  > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>  > +				((1UL << (12 + page_size[i]))-1));
>
> Is this required?  Strength reduction optimization should do this
> automatically (and the code has been there for quite a while, so
> obviously it isn't causing problems)
>
>  - R.
>   

Note that wr->sg_list[i].addr was being cast to a u32.  That was wrong.


From hal.rosenstock at gmail.com  Wed Feb 11 08:16:25 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 11 Feb 2009 11:16:25 -0500
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between 
	non-CN nodes
In-Reply-To: <4992D207.6010701@dev.mellanox.co.il>
References: <494A5339.9030304@ext.bull.net>
	<20090207185551.GD27757@sashak.voltaire.com>
	<498DE57D.4030501@morey-chaisemartin.com>
	<20090207202319.GE27757@sashak.voltaire.com>
	<49929986.40106@ext.bull.net>
	<20090211114347.GA27920@sashak.voltaire.com>
	<4992D207.6010701@dev.mellanox.co.il>
Message-ID: <f0e08f230902110816l6be2a58bgd3ff171eebf8db35@mail.gmail.com>

On Wed, Feb 11, 2009 at 8:26 AM, Yevgeny Kliteynik
<kliteyn at dev.mellanox.co.il> wrote:
> Sasha Khapyorsky wrote:
>>
>> On 10:25 Wed 11 Feb     , Nicolas Morey Chaisemartin wrote:
>>>
>>> What about high nodes (HN) as it concerns only nodes which are not at the
>>> bottom of the fat tree?
>>
>> Could be fine. Let's ask Yevgeny too... :)
>>
>> Yevgeny! Any idea about io_nodes more generic name?
>
> Ugh...
>
> "IO nodes":
> Pros: the name is closer to the reality, since in most cases
> the nodes that would need special treatment are indeed IO nodes.
> Cons: the name is not "general"...
>
> "High nodes"
> Pros: general name with kinda "hint" to the special treatment.
> Cons: the "hint" is rather vague...
>
> Bottom line - I'm OK with both options (slightly leaning toward IO),
> as long as it is described well enough in the help message and in man :)

Maybe consistency is the hobgoblin of small minds, but don't we now have:

high nodes which is a topology based name
and
compute nodes which is a functional based name.

Is it worth having them consistent ?

-- Hal

> -- Yevgeny
>
>> Sasha
>>
>


From jean-vincent.ficet at bull.net  Wed Feb 11 08:21:13 2009
From: jean-vincent.ficet at bull.net (Vincent Ficet)
Date: Wed, 11 Feb 2009 17:21:13 +0100
Subject: [ofa-general] 2.6.16.46-0.12-SLERT-10-15: scheduling while atomic ?
Message-ID: <4992FAF9.9070305@bull.net>

Hello,

On a Suse real time kernel (2.6.16.46-0.12-SLERT-10-15), we get the 
following kernel stack trace while running SDP traffic:

scheduling while atomic: ib_cm/4/0x00000001/18293

Call Trace:
       <ffffffff80324eed>{__sched_text_start+125}
       <ffffffff801375ca>{lock_timer_base+27}
       <ffffffff80327a03>{_spin_unlock_irqrestore+53}
       <ffffffff80137dc3>{__mod_timer+439}
       <ffffffff80326870>{schedule_timeout+208}
       <ffffffff80137fe5>{process_timeout+0}
       <ffffffff80327a43>{_spin_unlock_irq+52}
       <ffffffff80326232>{wait_for_completion_timeout+127}
       <ffffffff80126fd4>{default_wake_function+0}
       <ffffffff88515af6>{:mlx4_core:__mlx4_cmd+318}
       <ffffffff8851c148>{:mlx4_core:mlx4_mr_free+73}
       <ffffffff8852f0b8>{:mlx4_ib:mlx4_ib_dereg_mr+23}
       <ffffffff884cfb9b>{:ib_core:ib_dereg_mr+26}
       <ffffffff88633592>{:ib_sdp:sdp_destroy_qp+161}
       <ffffffff88633c6d>{:ib_sdp:sdp_reset_sk+276}
       <ffffffff88637f43>{:ib_sdp:sdp_cma_handler+2008}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff886284f4>{:rdma_cm:cma_modify_qp_err+72}
       <ffffffff80125876>{__wake_up_common+62}
       <ffffffff80327a03>{_spin_unlock_irqrestore+53}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff88629c25>{:rdma_cm:cma_ib_handler+369}
       <ffffffff885f7e52>{:ib_cm:cm_process_work+26}
       <ffffffff885f95fe>{:ib_cm:cm_work_handler+986}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff8013f91e>{run_workqueue+154}
       <ffffffff80324e76>{__sched_text_start+6}
       <ffffffff8013ffb2>{worker_thread+0}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff801400ae>{worker_thread+252}
       <ffffffff80126fd4>{default_wake_function+0}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff8014190a>{kthread+212}
       <ffffffff8015865c>{hracct_exit_syscall+22}
       <ffffffff8010bd5e>{child_rip+8}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff80141836>{kthread+0}
       <ffffffff8010bd56>{child_rip+0}

The OFA kernel package in place is:

git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 88ab7955605c5e769e760f6bec980e0c2e72aa5c

Looking for the "scheduling while atomic" message in the latest kernel, we see that it was printed out by __schedule_bug in 
this function:

/*
 * Various schedule()-time debugging checks and statistics:
 */
static inline void schedule_debug(struct task_struct *prev)
{
	/*
	 * Test if we are atomic. Since do_exit() needs to call into
	 * schedule() atomically, we ignore that path for now.
	 * Otherwise, whine if we are scheduling when we should not be.
	 */
	if (unlikely(in_atomic_preempt_off() && !prev->exit_state))
		__schedule_bug(prev);

	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

	schedstat_inc(this_rq(), sched_count);
#ifdef CONFIG_SCHEDSTATS
	if (unlikely(prev->lock_depth >= 0)) {
		schedstat_inc(this_rq(), bkl_count);
		schedstat_inc(prev, sched_info.bkl_count);
	}
#endif
}

Any idea as to what is going wrong here ?

Thanks for your help,

Vincent


From devel at morey-chaisemartin.com  Wed Feb 11 09:31:35 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 11 Feb 2009 18:31:35 +0100
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing	between
	non-CN nodes
In-Reply-To: <20090211140703.GK27920@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net>	<20090207185551.GD27757@sashak.voltaire.com>	<498DE57D.4030501@morey-chaisemartin.com>	<20090207202319.GE27757@sashak.voltaire.com>	<49929986.40106@ext.bull.net>	<20090211114347.GA27920@sashak.voltaire.com>	<4992D207.6010701@dev.mellanox.co.il>
	<20090211140703.GK27920@sashak.voltaire.com>
Message-ID: <49930B77.5020803@morey-chaisemartin.com>

Sasha Khapyorsky a écrit :
> On 15:26 Wed 11 Feb     , Yevgeny Kliteynik wrote:
>   
>> Bottom line - I'm OK with both options (slightly leaning toward IO),
>> as long as it is described well enough in the help message and in man :)
>>     
>
> Ok, no clear opinions. Nicolas, it is your decision about name :)
>
> Sasha
>   

My laziness would say IO is better because it's what has been written in
my code and documentation, but it's not too much work to change anyway.
If by 10am tomorrow (GMT+1) I don't have clearer opinions, I'll
repost them with io_guid_file. Feel free to have any idea before then.

Nicolas


From Jeffrey.C.Becker at nasa.gov  Wed Feb 11 09:58:12 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Wed, 11 Feb 2009 09:58:12 -0800
Subject: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>
Message-ID: <499311B4.4090607@nasa.gov>

Hi Tziporet

Tziporet Koren wrote:
>
> These are the OFED (EWG) meeting minutes for Feb 09 on OFED 1.4.1 release
>
> Meeting Summary:
>
> ==============
>
> 1. Agreed on 1.4.1 release schedule - GA is planned for April 7
>
> 2. Reviewed 1.4.1 status
>
> 3. Reviewed Sonoma agenda
>
> Details:
>
> ======
>
> 1. OFED 1.4.1 schedule:
>
>           o RC1 - Mar 3
>           o RC2 - Mar 17
>           o RC3 - Mar 31
>           o GA  - Apr 7
>
> 2. OFED 1.4.1 release status:
>
>           o New OSes:
>                 + RH 5.3 - done
>                 + SLES 11 - schedule is OK. RC3 already available,
>                   need to create backports
>                   Tziporet to check with Novell if we can place the
>                   sources on the OFA server
>

Thanks to NASA's developing relationship with Novell, I got access to
SLES11 rc3 iso's. I'm downloading them now, and will start on the
backports when that's done.

-jeff

>           o Open MPI - we will take 1.3.1
>           o RDS with iWARP support - good progress
>           o NFS/RDMA backports - RHEL 5.2 should be ready in 2 weeks
>           o Critical bug fixes
>             As far as I know these are the critical bugs that should
>             be fixed:
>
>             1383            blo     jackm at mellanox.co.il    Local
>             protection error on transmit from ipoib datagram to… - on work
>
>             1287    maj     jackm at mellanox.co.il    IPoIB datagram
>             mode initial packet loss - we will check if we can fix this
>
>           o Need to add 1.4.1 to bugzilla
>
> 3. Sonoma updates from Bill Boas:
>
> Bill sent the agenda - and we reviewed it in the meeting
>
> Comments and suggestions should be sent to Bill.
>
> There is a need for more PR - if companies are willing to put the
> Press release on their web site
>
>
> Tziporet
>


From rdreier at cisco.com  Wed Feb 11 10:12:09 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Feb 2009 10:12:09 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4992F26A.4030800@opengridcomputing.com> (Steve Wise's message of
	"Wed, 11 Feb 2009 09:44:42 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com> <4992F26A.4030800@opengridcomputing.com>
Message-ID: <adatz70972e.fsf@cisco.com>

 > Note that wr->sg_list[i].addr was being cast to a u32.  That was wrong.

Is it possible for the page to be bigger than 4GB?  If so then yes you
might be chopping off high-order bits or something.

Anyway please send me this change as a separate patch with a changelog
explaining that you're avoiding the div etc.... I don't want to roll it
in with the other unrelated fix (which changes code that was never
upstream anyway).


From swise at opengridcomputing.com  Wed Feb 11 10:32:47 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 12:32:47 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <adatz70972e.fsf@cisco.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>	<adamyctajv8.fsf@cisco.com>
	<4992F26A.4030800@opengridcomputing.com>
	<adatz70972e.fsf@cisco.com>
Message-ID: <499319CF.6050204@opengridcomputing.com>

Roland Dreier wrote:
>  > Note that wr->sg_list[i].addr was being cast to a u32.  That was wrong.
>
> Is it possible for the page to be bigger than 4GB?  If so then yes you
> might be chopping off high-order bits or something.
>   
Yes it is possible.

A MR can be created with an iov_base of say 0xffffffff00000000.

Then any sge.addr entries would be the iov_base + any offset.

> Anyway please send me this change as a separate patch with a changelog
> explaining that you're avoiding the div etc.... I don't want to roll it
> in with the other unrelated fix (which changes code that was never
> upstream anyway).
>   

will do. 

So you are handling the offset patch that will make it u64 and remove 
the mod usage, correct?

I will post a new patch with just this send change.

Steve.


From rdreier at cisco.com  Wed Feb 11 10:36:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Feb 2009 10:36:01 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <499319CF.6050204@opengridcomputing.com> (Steve Wise's message of
	"Wed, 11 Feb 2009 12:32:47 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<adamyctajv8.fsf@cisco.com> <4992F26A.4030800@opengridcomputing.com>
	<adatz70972e.fsf@cisco.com> <499319CF.6050204@opengridcomputing.com>
Message-ID: <adak57w95ym.fsf@cisco.com>

 > > Is it possible for the page to be bigger than 4GB?  If so then yes you
 > > might be chopping off high-order bits or something.

 > Yes it is possible.
 > 
 > A MR can be created with an iov_base of say 0xffffffff00000000.
 > 
 > Then any sge.addr entries would be the iov_base + any offset.

But the code we're talking about is:

		/* to in the WQE == the offset into the page */
		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
				(1UL << (12 + page_size[i])));

so it seems the top address bits don't matter unless page_size[i] is at
least 20 -- in which case using 1UL to shift overflows on 32 bits anyway...
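The interplay between the u32 cast, the modulo and the shift width can be checked with a small standalone sketch. The helper names below (`off_mask`, `off_mod`, `off_mod_u32`) are hypothetical and this is not the cxgb3 code; it uses `UINT64_C(1)` rather than `1UL` so that the shift is 64 bits wide on every platform, which is an assumption of the sketch and not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr within a page of 2^shift bytes, using a mask. */
static uint64_t off_mask(uint64_t addr, unsigned int shift)
{
	assert(shift < 64);
	return addr & ((UINT64_C(1) << shift) - 1);
}

/* Same offset using modulo; equal to the mask form for power-of-two sizes. */
static uint64_t off_mod(uint64_t addr, unsigned int shift)
{
	assert(shift < 64);
	return addr % (UINT64_C(1) << shift);
}

/* Variant that first truncates the address to 32 bits, as the old cast did. */
static uint64_t off_mod_u32(uint64_t addr, unsigned int shift)
{
	assert(shift < 64);
	return (uint64_t)(uint32_t)addr % (UINT64_C(1) << shift);
}
```

For an address such as 0xffffffff00001234, all three helpers agree for a 4K page (shift 12). They diverge only once the page size exceeds 4GB (shift above 32), where the truncated variant loses the upper address bits.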

 > So you are handling the offset patch that will make it u64 and remove
 > the mod usage, correct?

Yeah, I rolled the fix into the "offset needs to be u64" patch, it
should be in linux-next by now (or at least in my for-next branch).

 - R.


From sashak at voltaire.com  Wed Feb 11 10:47:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 20:47:17 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_console.c : Added
	dump_portguid function
	to console to generate a list of port guids matching one or more
	regexps
In-Reply-To: <499135E1.1080307@ext.bull.net>
References: <499135E1.1080307@ext.bull.net>
Message-ID: <20090211184717.GO5910@sashak.voltaire.com>

Hi Nicolas,

On 09:08 Tue 10 Feb     , Nicolas Morey Chaisemartin wrote:
> This adds a dump_portguid function to the OpenSM console, which makes it
> really easy to generate cn_guid_file, root_guid_file and such
> by dumping into a file all port guids whose nodedesc matches at least one
> of the provided regexps
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>
> Repost without exit_after_run flag, active sleep init loop and indented.
>
>  opensm/opensm/osm_console.c |  105 
> +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 105 insertions(+), 0 deletions(-)
>
>

> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> index c6e8e59..5fbcd43 100644
> --- a/opensm/opensm/osm_console.c
> +++ b/opensm/opensm/osm_console.c
> @@ -42,6 +42,7 @@
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  #include <netdb.h>
> +#include <regex.h>
>  #ifdef ENABLE_OSM_CONSOLE_SOCKET
>  #include <arpa/inet.h>
>  #endif
> @@ -1173,6 +1174,109 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>  }
>  
>  /* more parse routines go here */
> +typedef struct _regexp_list {
> +	regex_t exp;
> +	struct _regexp_list *next;
> +} regexp_list_t;
> +
> +static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> +{
> +	cl_qmap_t *p_port_guid_tbl;
> +	osm_port_t *p_port;
> +	osm_port_t *p_next_port;
> +
> +	regexp_list_t *p_head_regexp = NULL;
> +	regexp_list_t *p_regexp;
> +
> +	/* Option variables */
> +	char *p_cmd = NULL;
> +	FILE *output = out;
> +
> +	/* Read command line */
> +
> +	while (1) {
> +		p_cmd = next_token(p_last);
> +		if (p_cmd) {
> +			if (strcmp(p_cmd, "file") == 0) {
> +				p_cmd = next_token(p_last);
> +				if (p_cmd) {
> +					output = fopen(p_cmd, "w+");
> +					if (output == NULL) {
> +						fprintf(out,
> +							"Could not open file %s: %s\n",
> +							p_cmd, strerror(errno));
> +						output = out;
> +					}
> +				} else
> +					fprintf(out, "No file name passed\n");
> +			} else {
> +				p_regexp = malloc(sizeof(*p_regexp));
> +				if (regcomp
> +				    (&(p_regexp->exp), p_cmd,
> +				     REG_NOSUB | REG_EXTENDED) != 0) {
> +					fprintf(out,
> +						"Couldn't parse regular expression %s. Skipping it.\n",
> +						p_cmd);
> +				}
> +				p_regexp->next = p_head_regexp;
> +				p_head_regexp = p_regexp;
> +			}
> +		} else
> +			break;	/* No more tokens */
> +
> +	}
> +
> +	/* Check we have at least one expression to match */
> +	if (p_head_regexp == NULL) {
> +		fprintf(out, "No valid expression provided. Aborting\n");
> +		return;
> +	}
> +
> +	cl_spinlock_release(&p_osm->sm.state_lock);

What is this cl_spinlock_release()? Typo?

> +	if (p_osm->sm.p_subn->need_update != 0) {
> +		fprintf(out, "Subnet is not ready yet. Try again later.\n");
> +		return;
> +	}
> +
> +	/* Subnet doesn't need to be updated so we can carry on */
> +
> +	CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock);
> +	p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl);

Do we really need exclusive locking here? The port_guid_tbl content is
only read in this loop, so I guess a shared ("read-only") lock
(CL_PLOCK_ACQUIRE()) should be enough.

The rest looks fine to me.

Sasha

> +
> +	p_next_port = (osm_port_t *) cl_qmap_head(p_port_guid_tbl);
> +	while (p_next_port != (osm_port_t *) cl_qmap_end(p_port_guid_tbl)) {
> +
> +		p_port = p_next_port;
> +		p_next_port =
> +		    (osm_port_t *) cl_qmap_next(&p_next_port->map_item);
> +
> +		for (p_regexp = p_head_regexp; p_regexp != NULL;
> +		     p_regexp = p_regexp->next)
> +			if (regexec
> +			    (&(p_regexp->exp), p_port->p_node->print_desc, 0,
> +			     NULL, 0) == 0)
> +				fprintf(output, "0x%" PRIxLEAST64 "\n",
> +					cl_ntoh64(p_port->p_physp->port_guid));
> +	}
> +
> +	CL_PLOCK_RELEASE(p_osm->sm.p_lock);
> +	if (output != out)
> +		fclose(output);
> +
> +}
> +
> +static void help_dump_portguid(FILE * out, int detail)
> +{
> +	fprintf(out,
> +		"dump_portguid [file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUID matching a regexp \n");
> +	if (detail) {
> +		fprintf(out,
> +			"getguidgetguid  -- Dump all the port GUID whom node_desc matches one of the provided regexp\n");
> +		fprintf(out,
> +			"   [file filename] -- Send the port GUID list to the specified file instead of regular output\n");
> +	}
> +
> +}
>  
>  static const struct command console_cmds[] = {
>  	{"help", &help_command, &help_parse},
> @@ -1192,6 +1296,7 @@ static const struct command console_cmds[] = {
>  #ifdef ENABLE_OSM_PERF_MGR
>  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
>  #endif				/* ENABLE_OSM_PERF_MGR */
> +	{"dump_portguid", &help_dump_portguid, &dump_portguid_parse},
>  	{NULL, NULL, NULL}	/* end of array */
>  };
>  
> 


From swise at opengridcomputing.com  Wed Feb 11 10:44:45 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 12:44:45 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <adak57w95ym.fsf@cisco.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>	<adamyctajv8.fsf@cisco.com>
	<4992F26A.4030800@opengridcomputing.com>	<adatz70972e.fsf@cisco.com>
	<499319CF.6050204@opengridcomputing.com>
	<adak57w95ym.fsf@cisco.com>
Message-ID: <49931C9D.2090604@opengridcomputing.com>

Roland Dreier wrote:
>  > > Is it possible for the page to be bigger than 4GB?  If so then yes you
>  > > might be chopping off high-order bits or something.
>
>  > Yes it is possible.
>  > 
>  > A MR can be created with an iov_base of say 0xffffffff00000000.
>  > 
>  > Then any sge.addr entries would be the iov_base + any offset.
>
> But the code we're talking about is:
>
> 		/* to in the WQE == the offset into the page */
> 		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 				(1UL << (12 + page_size[i])));
>
> so it seems the top address bits don't matter unless page_size[i] is at
> least 20 -- in which case using 1UL to shift overflows on 32 bits anyway...
>
>   

Yes, you're right. This code is really just saving the offset within
a page.

I'll send a new patch.
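
For illustration only, the offset within a power-of-two page can be taken
with a mask instead of a modulo. This is a minimal sketch and not the cxgb3
code or the follow-up patch; page_offset() and its page_shift parameter
(log2 of the page size in bytes, i.e. 12 + page_size[i] in the quoted code)
are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset of a 64-bit address within a page of 2^page_shift bytes.
 * The mask is built from a 64-bit constant, so the shift does not
 * overflow on 32-bit builds the way "1UL << shift" can when shift >= 32.
 */
static uint64_t page_offset(uint64_t addr, unsigned page_shift)
{
	assert(page_shift < 64);
	return addr & ((UINT64_C(1) << page_shift) - 1);
}
```

With a 4KB page (page_shift of 12) the high-order address bits never reach
the result, which matches the observation above that the code only keeps
the offset within a page.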


From sashak at voltaire.com  Wed Feb 11 11:54:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 21:54:42 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/saquery: fix types and some
	cleanup
Message-ID: <20090211195442.GP5910@sashak.voltaire.com>


Fix types - mostly ib_net*_t -> uint*_t conversion. Use host byte order
SA attributes from mad.h (instead of ib_types.h). Fix function
prototypes and return value types. Remove osm* stubs. Remove unused
'offset' argument in get_any_records() and get_all_records() functions.
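
The idea behind keeping values in host byte order is that the conversion to
wire (big-endian) order happens once, where the field is packed into the
MAD. A minimal sketch of that boundary conversion follows; put_be64() and
get_be64() are hypothetical helpers written for this note, not libibmad
functions.

```c
#include <assert.h>
#include <stdint.h>

/* Store a host-order 64-bit value as 8 big-endian (network order) bytes. */
static void put_be64(uint8_t *buf, uint64_t v)
{
	int i;

	assert(buf != 0);
	for (i = 7; i >= 0; i--) {
		buf[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Read 8 big-endian bytes back into a host-order 64-bit value. */
static uint64_t get_be64(const uint8_t *buf)
{
	uint64_t v = 0;
	int i;

	assert(buf != 0);
	for (i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}
```

Callers then pass plain uint64_t values such as an SM_Key of 1 and never
need a cl_hton64()/cl_ntoh64() pair of their own.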

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/src/saquery.c |  388 ++++++++++++++++++----------------------
 1 files changed, 171 insertions(+), 217 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 5b66f93..a94a015 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -50,24 +50,22 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 #include <infiniband/iba/ib_types.h>
-#include <infiniband/complib/cl_debug.h>
 #include <infiniband/complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
-struct sa_bind_handle {
+struct bind_handle {
 	int fd, agent;
 	ib_portid_t dport;
 };
 
-struct sa_result {
+struct query_res {
 	int status;
 	unsigned result_cnt;
 	void *p_result_madw;
 };
 
-#define osmv_query_res_t struct sa_result
-#define osm_bind_handle_t struct sa_bind_handle *
+typedef struct bind_handle * bind_handle_t;
 
 struct query_params {
 	ib_gid_t sgid, dgid, gid, mgid;
@@ -87,22 +85,22 @@ struct query_params {
 
 struct query_cmd {
 	const char *name, *alias;
-	ib_net16_t query_type;
+	uint16_t query_type;
 	const char *usage;
-	int (*handler) (const struct query_cmd * q, osm_bind_handle_t h,
+	int (*handler) (const struct query_cmd * q, bind_handle_t h,
 			struct query_params *p, int argc, char *argv[]);
 };
 
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
-static ib_net64_t smkey = CL_HTON64(1);
+static uint64_t smkey = 1;
 
 /**
  * Declare some globals because I don't want this to be too complex.
  */
 #define MAX_PORTS (8)
 #define DEFAULT_SA_TIMEOUT_MS (1000)
-osmv_query_res_t result;
+static struct query_res result;
 
 enum {
 	ALL,
@@ -115,14 +113,14 @@ enum {
 } node_print_desc = ALL;
 
 char *requested_name = NULL;
-ib_net16_t requested_lid = 0;
+uint16_t requested_lid = 0;
 int requested_lid_flag = 0;
-ib_net64_t requested_guid = 0;
+uint64_t requested_guid = 0;
 int requested_guid_flag = 0;
 
-static int sa_query(struct sa_bind_handle *h, uint8_t method,
-		    ib_net16_t attr, ib_net32_t mod, ib_net64_t comp_mask,
-		    ib_net64_t sm_key, void *data)
+static int sa_query(struct bind_handle *h, uint8_t method,
+		    uint16_t attr, uint32_t mod, uint64_t comp_mask,
+		    uint64_t sm_key, void *data)
 {
 	ib_rpc_t rpc;
 	void *umad, *mad;
@@ -131,9 +129,9 @@ static int sa_query(struct sa_bind_handle *h, uint8_t method,
 	memset(&rpc, 0, sizeof(rpc));
 	rpc.mgtclass = IB_SA_CLASS;
 	rpc.method = method;
-	rpc.attr.id = cl_ntoh16(attr);
-	rpc.attr.mod = cl_ntoh32(mod);
-	rpc.mask = cl_ntoh64(comp_mask);
+	rpc.attr.id = attr;
+	rpc.attr.mod = mod;
+	rpc.mask = comp_mask;
 	rpc.datasz = IB_SA_DATA_SIZE;
 	rpc.dataoffs = IB_SA_DATA_OFFS;
 
@@ -143,8 +141,7 @@ static int sa_query(struct sa_bind_handle *h, uint8_t method,
 
 	mad_build_pkt(umad, &rpc, &h->dport, NULL, data);
 
-	/* SA SM_Key (36/8) - temporary done using IB_MAD_MKEY_F */
-	mad_set_field64(umad_get_mad(umad), 12, IB_MAD_MKEY_F, cl_hton64(sm_key));
+	mad_set_field64(umad_get_mad(umad), 0, IB_SA_MKEY_F, sm_key);
 
 	if (ibdebug > 1)
 		xdump(stdout, "SA Request:\n", umad_get_mad(umad), len);
@@ -189,14 +186,12 @@ recv_mad:
 	return 0;
 }
 
-static void *osmv_get_query_result(void *mad, unsigned i)
+static void *get_query_rec(void *mad, unsigned i)
 {
 	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
 	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
 }
 
-#define osmv_get_query_node_rec(mad, i) osmv_get_query_result(mad, i)
-
 static unsigned valid_gid(ib_gid_t *gid)
 {
 	ib_gid_t zero_gid = { };
@@ -456,7 +451,7 @@ static void dump_multicast_member_record(void *data)
 	 */
 	for (i = 0; i < result.result_cnt; i++) {
 		ib_node_record_t *nr =
-		    osmv_get_query_node_rec(result.p_result_madw, i);
+		    get_query_rec(result.p_result_madw, i);
 		if (nr->node_info.port_guid ==
 		    p_mcmr->port_gid.unicast.interface_id) {
 			node_name =
@@ -761,11 +756,11 @@ static void dump_one_mft_record(void *data)
 	printf("\n");
 }
 
-static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *))
+static void dump_results(struct query_res *r, void (*dump_func) (void *))
 {
 	int i;
 	for (i = 0; i < r->result_cnt; i++) {
-		void *data = osmv_get_query_result(r->p_result_madw, i);
+		void *data = get_query_rec(r->p_result_madw, i);
 		dump_func(data);
 	}
 }
@@ -781,13 +776,12 @@ static void return_mad(void)
 /**
  * Get any record(s)
  */
-static ib_api_status_t
-get_any_records(osm_bind_handle_t h,
-		ib_net16_t attr_id, ib_net32_t attr_mod, ib_net64_t comp_mask,
-		void *attr, ib_net16_t attr_offset, ib_net64_t sm_key)
+static int get_any_records(bind_handle_t h,
+			   uint16_t attr_id, uint32_t attr_mod,
+			   ib_net64_t comp_mask, void *attr, uint64_t sm_key)
 {
 	int ret = sa_query(h, IB_MAD_METHOD_GET_TABLE, attr_id, attr_mod,
-			   comp_mask, sm_key, attr);
+			   cl_ntoh64(comp_mask), sm_key, attr);
 	if (ret) {
 		fprintf(stderr, "Query SA failed: %s\n", ib_get_err_str(ret));
 		return ret;
@@ -805,30 +799,27 @@ get_any_records(osm_bind_handle_t h,
 /**
  * Get all the records available for requested query type.
  */
-static ib_api_status_t get_all_records(osm_bind_handle_t h, ib_net16_t query_id,				       ib_net16_t attr_offset, int trusted)
+static int get_all_records(bind_handle_t h, uint16_t attr_id, int trusted)
 {
-	return get_any_records(h, query_id, 0, 0, NULL, attr_offset,
-			       trusted ? smkey : 0);
+	return get_any_records(h, attr_id, 0, 0, NULL, trusted ? smkey : 0);
 }
 
 /**
  * return the lid from the node descriptor (name) supplied
  */
-static ib_api_status_t
-get_lid_from_name(osm_bind_handle_t h, const char *name, ib_net16_t * lid)
+static int
+get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid)
 {
-	int i = 0;
 	ib_node_record_t *node_record = NULL;
 	ib_node_info_t *p_ni = NULL;
-	ib_net16_t attr_offset = ib_get_attr_offset(sizeof(*node_record));
-	ib_api_status_t status;
+	int i = 0, ret;
 
-	status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD, attr_offset, 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
+	if (ret)
+		return ret;
 
 	for (i = 0; i < result.result_cnt; i++) {
-		node_record = osmv_get_query_node_rec(result.p_result_madw, i);
+		node_record = get_query_rec(result.p_result_madw, i);
 		p_ni = &(node_record->node_info);
 		if (name
 		    && strncmp(name, (char *)node_record->node_desc.description,
@@ -839,25 +830,25 @@ get_lid_from_name(osm_bind_handle_t h, const char *name, ib_net16_t * lid)
 		}
 	}
 	return_mad();
-	return (status);
+	return 0;
 }
 
-static ib_net16_t get_lid(osm_bind_handle_t h, const char *name)
+static uint16_t get_lid(bind_handle_t h, const char *name)
 {
-	ib_net16_t rc_lid = 0;
+	uint16_t rc_lid = 0;
 
 	if (!name)
-		return (0);
+		return 0;
 	if (isalpha(name[0]))
 		assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS);
 	else
 		rc_lid = atoi(name);
 	if (rc_lid == 0)
 		fprintf(stderr, "Failed to find lid for \"%s\"\n", name);
-	return (rc_lid);
+	return rc_lid;
 }
 
-static int parse_lid_and_ports(osm_bind_handle_t h,
+static int parse_lid_and_ports(bind_handle_t h,
 			       char *str, int *lid, int *port1, int *port2)
 {
 	char *p, *e;
@@ -920,38 +911,32 @@ static int parse_lid_and_ports(osm_bind_handle_t h,
 /*
  * Get the portinfo records available with IsSM or IsSMdisabled CapabilityMask bit on.
  */
-static ib_api_status_t get_issm_records(osm_bind_handle_t h,
-					ib_net32_t capability_mask)
+static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask)
 {
 	ib_portinfo_record_t attr;
 
 	memset(&attr, 0, sizeof(attr));
 	attr.port_info.capability_mask = capability_mask;
 
-	return get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD,
-			       cl_hton32(1 << 31), IB_PIR_COMPMASK_CAPMASK,
-			       &attr,
-			       ib_get_attr_offset(sizeof(ib_portinfo_record_t)),
-			       0);
+	return get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 1 << 31,
+			       IB_PIR_COMPMASK_CAPMASK, &attr, 0);
 }
 
-static ib_api_status_t print_node_records(osm_bind_handle_t h)
+static int print_node_records(bind_handle_t h)
 {
-	int i = 0;
-	ib_node_record_t *node_record = NULL;
-	ib_net16_t attr_offset = ib_get_attr_offset(sizeof(*node_record));
-	ib_api_status_t status;
+	int i = 0, ret;
 
-	status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD, attr_offset, 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
+	if (ret)
+		return ret;
 
 	if (node_print_desc == ALL_DESC) {
 		printf("   LID \"name\"\n");
 		printf("================\n");
 	}
 	for (i = 0; i < result.result_cnt; i++) {
-		node_record = osmv_get_query_node_rec(result.p_result_madw, i);
+		ib_node_record_t *node_record;
+		node_record = get_query_rec(result.p_result_madw, i);
 		if (node_print_desc == ALL_DESC) {
 			print_node_desc(node_record);
 		} else if (node_print_desc == NAME_OF_LID) {
@@ -977,13 +962,13 @@ static ib_api_status_t print_node_records(osm_bind_handle_t h)
 		}
 	}
 	return_mad();
-	return (status);
+	return ret;
 }
 
-static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h)
+static int get_print_class_port_info(bind_handle_t h)
 {
-	int ret = sa_query(h, IB_MAD_METHOD_GET, IB_MAD_ATTR_CLASS_PORT_INFO,
-			   0, 0, 0, NULL);
+	int ret = sa_query(h, IB_MAD_METHOD_GET, CLASS_PORT_INFO, 0, 0,
+			   0, NULL);
 	if (ret) {
 		fprintf(stderr, "ERROR: Query SA failed: %s\n",
 			ib_get_err_str(ret));
@@ -999,12 +984,12 @@ static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h)
 	return ret;
 }
 
-static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_path_records(const struct query_cmd *q, bind_handle_t h,
 			      struct query_params *p, int argc, char *argv[])
 {
 	ib_path_rec_t pr;
 	ib_net64_t comp_mask = 0;
-	ib_api_status_t status;
+	int ret;
 	uint32_t flow = 0;
 	uint16_t qos_class = 0;
 	uint8_t reversible = 0;
@@ -1029,17 +1014,16 @@ static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL_AND_SEL(p->rate, pr.rate, PR, RATE, SELEC);
 	CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, pr.pkt_life, PR, PKTLIFETIME, SELEC);
 
-	status = get_any_records(h, IB_MAD_ATTR_PATH_RECORD, 0, comp_mask,
-				 &pr, ib_get_attr_offset(sizeof(pr)), 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	ret = get_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask, &pr, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_path_record);
 	return_mad();
-	return (status);
+	return ret;
 }
 
-static ib_api_status_t print_issm_records(osm_bind_handle_t h)
+static ib_api_status_t print_issm_records(bind_handle_t h)
 {
 	ib_api_status_t status;
 
@@ -1064,23 +1048,19 @@ static ib_api_status_t print_issm_records(osm_bind_handle_t h)
 	return (status);
 }
 
-static ib_api_status_t print_multicast_member_records(osm_bind_handle_t h)
+static int print_multicast_member_records(bind_handle_t h)
 {
-	osmv_query_res_t mc_group_result;
-	ib_api_status_t status;
+	struct query_res mc_group_result;
+	int ret;
 
-	status = get_all_records(h, IB_MAD_ATTR_MCMEMBER_RECORD,
-				 ib_get_attr_offset(sizeof(ib_member_rec_t)),
-				 1);
-	if (status != IB_SUCCESS)
-		return (status);
+	ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 1);
+	if (ret)
+		return ret;
 
 	mc_group_result = result;
 
-	status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD,
-				 ib_get_attr_offset(sizeof(ib_node_record_t)),
-				 0);
-	if (status != IB_SUCCESS)
+	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
+	if (ret)
 		goto return_mc;
 
 	dump_results(&mc_group_result, dump_multicast_member_record);
@@ -1090,37 +1070,32 @@ return_mc:
 	if (mc_group_result.p_result_madw)
 		free(mc_group_result.p_result_madw - umad_size());
 
-	return (status);
+	return ret;
 }
 
-static ib_api_status_t print_multicast_group_records(osm_bind_handle_t h)
+static int print_multicast_group_records(bind_handle_t h)
 {
-	ib_api_status_t status;
-
-	status = get_all_records(h, IB_MAD_ATTR_MCMEMBER_RECORD,
-				 ib_get_attr_offset(sizeof(ib_member_rec_t)),
-				 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	int ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_multicast_group_record);
 	return_mad();
-	return (status);
+	return ret;
 }
 
-static int query_class_port_info(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_class_port_info(const struct query_cmd *q, bind_handle_t h,
 				 struct query_params *p, int argc, char *argv[])
 {
 	return get_print_class_port_info(h);
 }
 
-static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_node_records(const struct query_cmd *q, bind_handle_t h,
 			      struct query_params *p, int argc, char *argv[])
 {
 	ib_node_record_t nr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0;
-	ib_api_status_t status;
+	int lid = 0, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, NULL, NULL);
@@ -1128,10 +1103,9 @@ static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h,
 	memset(&nr, 0, sizeof(nr));
 	CHECK_AND_SET_VAL(lid, 16, 0, nr.lid, NR, LID);
 
-	status = get_any_records(h, IB_MAD_ATTR_NODE_RECORD, 0, comp_mask,
-				 &nr, ib_get_attr_offset(sizeof(nr)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask, &nr, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_node_record);
 	return_mad();
@@ -1140,13 +1114,12 @@ static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h,
 }
 
 static int query_portinfo_records(const struct query_cmd *q,
-				  osm_bind_handle_t h, struct query_params *p,
+				  bind_handle_t h, struct query_params *p,
 				  int argc, char *argv[])
 {
 	ib_portinfo_record_t pir;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1;
-	ib_api_status_t status;
+	int lid = 0, port = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, NULL);
@@ -1155,10 +1128,10 @@ static int query_portinfo_records(const struct query_cmd *q,
 	CHECK_AND_SET_VAL(lid, 16, 0, pir.lid, PIR, LID);
 	CHECK_AND_SET_VAL(port, 8, -1, pir.port_num, PIR, PORTNUM);
 
-	status = get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD, 0, comp_mask,
-				 &pir, ib_get_attr_offset(sizeof(pir)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0, comp_mask,
+			      &pir, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_portinfo_record);
 	return_mad();
@@ -1167,12 +1140,12 @@ static int query_portinfo_records(const struct query_cmd *q,
 }
 
 static int query_mcmember_records(const struct query_cmd *q,
-				  osm_bind_handle_t h, struct query_params *p,
+				  bind_handle_t h, struct query_params *p,
 				  int argc, char *argv[])
 {
 	ib_member_rec_t mr;
 	ib_net64_t comp_mask = 0;
-	ib_api_status_t status;
+	int ret;
 	uint32_t flow = 0;
 	uint8_t sl = 0, hop = 0, scope = 0;
 
@@ -1195,57 +1168,46 @@ static int query_mcmember_records(const struct query_cmd *q,
 	mr.scope_state |= scope << 4;
 	CHECK_AND_SET_VAL(p->proxy_join, 8, -1, mr.proxy_join, MCR, PROXY);
 
-	status = get_any_records(h, IB_MAD_ATTR_MCMEMBER_RECORD, 0, comp_mask,
-				 &mr, ib_get_attr_offset(sizeof(mr)), smkey);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask, &mr, smkey);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_mcmember_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static int query_service_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_service_records(const struct query_cmd *q, bind_handle_t h,
 				 struct query_params *p, int argc, char *argv[])
 {
-	ib_net16_t attr_offset =
-	    ib_get_attr_offset(sizeof(ib_service_record_t));
-	ib_api_status_t status;
-
-	status = get_all_records(h, IB_MAD_ATTR_SERVICE_RECORD, attr_offset, 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	int ret = get_all_records(h, IB_SA_ATTR_SERVICERECORD, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_service_record);
 	return_mad();
-	return (status);
+	return 0;
 }
 
 static int query_informinfo_records(const struct query_cmd *q,
-				    osm_bind_handle_t h, struct query_params *p,
+				    bind_handle_t h, struct query_params *p,
 				    int argc, char *argv[])
 {
-	ib_net16_t attr_offset =
-	    ib_get_attr_offset(sizeof(ib_inform_info_record_t));
-	ib_api_status_t status;
-
-	status =
-	    get_all_records(h, IB_MAD_ATTR_INFORM_INFO_RECORD, attr_offset, 0);
-	if (status != IB_SUCCESS)
-		return (status);
+	int ret = get_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_inform_info_record);
 	return_mad();
-	return (status);
+	return 0;
 }
 
-static int query_link_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_link_records(const struct query_cmd *q, bind_handle_t h,
 			      struct query_params *p, int argc, char *argv[])
 {
 	ib_link_record_t lr;
 	ib_net64_t comp_mask = 0;
-	int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1;
-	ib_api_status_t status;
+	int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &from_lid, &from_port, NULL);
@@ -1259,23 +1221,21 @@ static int query_link_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL(to_lid, 16, 0, lr.to_lid, LR, TO_LID);
 	CHECK_AND_SET_VAL(to_port, 8, -1, lr.to_port_num, LR, TO_PORT);
 
-	status = get_any_records(h, IB_MAD_ATTR_LINK_RECORD, 0, comp_mask,
-				 &lr, ib_get_attr_offset(sizeof(lr)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask, &lr, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_link_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static int query_sl2vl_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h,
 			       struct query_params *p, int argc, char *argv[])
 {
 	ib_slvl_table_record_t slvl;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, in_port = -1, out_port = -1;
-	ib_api_status_t status;
+	int lid = 0, in_port = -1, out_port = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &in_port, &out_port);
@@ -1285,23 +1245,22 @@ static int query_sl2vl_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL(in_port, 8, -1, slvl.in_port_num, SLVL, IN_PORT);
 	CHECK_AND_SET_VAL(out_port, 8, -1, slvl.out_port_num, SLVL, OUT_PORT);
 
-	status = get_any_records(h, IB_MAD_ATTR_SLVL_RECORD, 0, comp_mask,
-				 &slvl, ib_get_attr_offset(sizeof(slvl)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0, comp_mask,
+			      &slvl, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_slvl_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static int query_vlarb_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h,
 			       struct query_params *p, int argc, char *argv[])
 {
 	ib_vl_arb_table_record_t vlarb;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1, block = -1;
-	ib_api_status_t status;
+	int lid = 0, port = -1, block = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, &block);
@@ -1311,24 +1270,23 @@ static int query_vlarb_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL(port, 8, -1, vlarb.port_num, VLA, OUT_PORT);
 	CHECK_AND_SET_VAL(block, 8, -1, vlarb.block_num, VLA, BLOCK);
 
-	status = get_any_records(h, IB_MAD_ATTR_VLARB_RECORD, 0, comp_mask,
-				 &vlarb, ib_get_attr_offset(sizeof(vlarb)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_VLARBTABLERECORD, 0, comp_mask,
+			      &vlarb, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_vlarb_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
 static int query_pkey_tbl_records(const struct query_cmd *q,
-				  osm_bind_handle_t h, struct query_params *p,
+				  bind_handle_t h, struct query_params *p,
 				  int argc, char *argv[])
 {
 	ib_pkey_table_record_t pktr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1, block = -1;
-	ib_api_status_t status;
+	int lid = 0, port = -1, block = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, &block);
@@ -1338,23 +1296,22 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
 	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
 	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
 
-	status = get_any_records(h, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask,
-				 &pktr, ib_get_attr_offset(sizeof(pktr)), smkey);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask,
+			      &pktr, smkey);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_pkey_tbl_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static int query_lft_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_lft_records(const struct query_cmd *q, bind_handle_t h,
 			     struct query_params *p, int argc, char *argv[])
 {
 	ib_lft_record_t lftr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, block = -1;
-	ib_api_status_t status;
+	int lid = 0, block = -1, ret;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &block, NULL);
@@ -1363,24 +1320,22 @@ static int query_lft_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL(lid, 16, 0, lftr.lid, LFTR, LID);
 	CHECK_AND_SET_VAL(block, 16, -1, lftr.block_num, LFTR, BLOCK);
 
-	status = get_any_records(h, IB_MAD_ATTR_LFT_RECORD, 0, comp_mask,
-				 &lftr, ib_get_attr_offset(sizeof(lftr)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_LFTRECORD, 0, comp_mask, &lftr, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_lft_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h,
+static int query_mft_records(const struct query_cmd *q, bind_handle_t h,
 			     struct query_params *p, int argc, char *argv[])
 {
 	ib_mft_record_t mftr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, block = -1, position = -1;
+	int lid = 0, block = -1, position = -1, ret;
 	uint16_t pos = 0;
-	ib_api_status_t status;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &position, &block);
@@ -1392,19 +1347,18 @@ static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h,
 	CHECK_AND_SET_VAL(position, 8, -1, pos, MFTR, POSITION);
 	mftr.position_block_num |= cl_hton16(pos << 12);
 
-	status = get_any_records(h, IB_MAD_ATTR_MFT_RECORD, 0, comp_mask,
-				 &mftr, ib_get_attr_offset(sizeof(mftr)), 0);
-	if (status != IB_SUCCESS)
-		return status;
+	ret = get_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask, &mftr, 0);
+	if (ret)
+		return ret;
 
 	dump_results(&result, dump_one_mft_record);
 	return_mad();
-	return status;
+	return 0;
 }
 
-static osm_bind_handle_t get_bind_handle(void)
+static bind_handle_t get_bind_handle(void)
 {
-	static struct sa_bind_handle handle;
+	static struct bind_handle handle;
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
 
 	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
@@ -1423,7 +1377,7 @@ static osm_bind_handle_t get_bind_handle(void)
 	return &handle;
 }
 
-static void clean_up(struct sa_bind_handle *h)
+static void clean_up(struct bind_handle *h)
 {
 	umad_unregister(h->fd, h->agent);
 	umad_close_port(h->fd);
@@ -1431,31 +1385,31 @@ static void clean_up(struct sa_bind_handle *h)
 }
 
 static const struct query_cmd query_cmds[] = {
-	{"ClassPortInfo", "CPI", IB_MAD_ATTR_CLASS_PORT_INFO,
+	{"ClassPortInfo", "CPI", CLASS_PORT_INFO,
 	 NULL, query_class_port_info},
-	{"NodeRecord", "NR", IB_MAD_ATTR_NODE_RECORD,
+	{"NodeRecord", "NR", IB_SA_ATTR_NODERECORD,
 	 "[lid]", query_node_records},
-	{"PortInfoRecord", "PIR", IB_MAD_ATTR_PORTINFO_RECORD,
+	{"PortInfoRecord", "PIR", IB_SA_ATTR_PORTINFORECORD,
 	 "[[lid]/[port]]", query_portinfo_records},
-	{"SL2VLTableRecord", "SL2VL", IB_MAD_ATTR_SLVL_RECORD,
+	{"SL2VLTableRecord", "SL2VL", IB_SA_ATTR_SL2VLTABLERECORD,
 	 "[[lid]/[in_port]/[out_port]]", query_sl2vl_records},
-	{"PKeyTableRecord", "PKTR", IB_MAD_ATTR_PKEY_TBL_RECORD,
+	{"PKeyTableRecord", "PKTR", IB_SA_ATTR_PKEYTABLERECORD,
 	 "[[lid]/[port]/[block]]", query_pkey_tbl_records},
-	{"VLArbitrationTableRecord", "VLAR", IB_MAD_ATTR_VLARB_RECORD,
+	{"VLArbitrationTableRecord", "VLAR", IB_SA_ATTR_VLARBTABLERECORD,
 	 "[[lid]/[port]/[block]]", query_vlarb_records},
-	{"InformInfoRecord", "IIR", IB_MAD_ATTR_INFORM_INFO_RECORD,
+	{"InformInfoRecord", "IIR", IB_SA_ATTR_INFORMINFORECORD,
 	 NULL, query_informinfo_records},
-	{"LinkRecord", "LR", IB_MAD_ATTR_LINK_RECORD,
+	{"LinkRecord", "LR", IB_SA_ATTR_LINKRECORD,
 	 "[[from_lid]/[from_port]] [[to_lid]/[to_port]]", query_link_records},
-	{"ServiceRecord", "SR", IB_MAD_ATTR_SERVICE_RECORD,
+	{"ServiceRecord", "SR", IB_SA_ATTR_SERVICERECORD,
 	 NULL, query_service_records},
-	{"PathRecord", "PR", IB_MAD_ATTR_PATH_RECORD,
+	{"PathRecord", "PR", IB_SA_ATTR_PATHRECORD,
 	 NULL, query_path_records},
-	{"MCMemberRecord", "MCMR", IB_MAD_ATTR_MCMEMBER_RECORD,
+	{"MCMemberRecord", "MCMR", IB_SA_ATTR_MCRECORD,
 	 NULL, query_mcmember_records},
-	{"LFTRecord", "LFTR", IB_MAD_ATTR_LFT_RECORD,
+	{"LFTRecord", "LFTR", IB_SA_ATTR_LFTRECORD,
 	 "[[lid]/[block]]", query_lft_records},
-	{"MFTRecord", "MFTR", IB_MAD_ATTR_MFT_RECORD,
+	{"MFTRecord", "MFTR", IB_SA_ATTR_MFTRECORD,
 	 "[[mlid]/[position]/[block]]", query_mft_records},
 	{0}
 };
@@ -1473,7 +1427,7 @@ static const struct query_cmd *find_query(const char *name)
 	return NULL;
 }
 
-static const struct query_cmd *find_query_by_type(ib_net16_t type)
+static const struct query_cmd *find_query_by_type(uint16_t type)
 {
 	const struct query_cmd *q;
 
@@ -1494,7 +1448,7 @@ enum saquery_command {
 };
 
 static enum saquery_command command = SAQUERY_CMD_QUERY;
-static ib_net16_t query_type;
+static uint16_t query_type;
 static char *src_lid, *dst_lid;
 
 static int process_opt(void *context, int ch, char *optarg)
@@ -1511,7 +1465,7 @@ static int process_opt(void *context, int ch, char *optarg)
 			*dst_lid++ = '\0';
 		}
 		p->numb_path = 0x7f;
-		query_type = IB_MAD_ATTR_PATH_RECORD;
+		query_type = IB_SA_ATTR_PATHRECORD;
 		break;
 	case 2:
 		{
@@ -1527,7 +1481,7 @@ static int process_opt(void *context, int ch, char *optarg)
 			free(src_addr);
 		}
 		p->numb_path = 0x7f;
-		query_type = IB_MAD_ATTR_PATH_RECORD;
+		query_type = IB_SA_ATTR_PATHRECORD;
 		break;
 	case 3:
 		node_name_map_file = strdup(optarg);
@@ -1538,22 +1492,22 @@ static int process_opt(void *context, int ch, char *optarg)
 			fprintf(stderr, "cannot get SM_Key\n");
 			ibdiag_show_usage();
 		}
-		smkey = cl_hton64(strtoull(optarg, NULL, 0));
+		smkey = strtoull(optarg, NULL, 0);
 		break;
 	case 'p':
-		query_type = IB_MAD_ATTR_PATH_RECORD;
+		query_type = IB_SA_ATTR_PATHRECORD;
 		break;
 	case 'D':
 		node_print_desc = ALL_DESC;
 		break;
 	case 'c':
-		command = SAQUERY_CMD_CLASS_PORT_INFO;
+		command = CLASS_PORT_INFO;
 		break;
 	case 'S':
-		query_type = IB_MAD_ATTR_SERVICE_RECORD;
+		query_type = IB_SA_ATTR_SERVICERECORD;
 		break;
 	case 'I':
-		query_type = IB_MAD_ATTR_INFORM_INFO_RECORD;
+		query_type = IB_SA_ATTR_INFORMINFORECORD;
 		break;
 	case 'N':
 		command = SAQUERY_CMD_NODE_RECORD;
@@ -1588,7 +1542,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		command = SAQUERY_CMD_MCMEMBERS;
 		break;
 	case 'x':
-		query_type = IB_MAD_ATTR_LINK_RECORD;
+		query_type = IB_SA_ATTR_LINKRECORD;
 		break;
 	case 5:
 		p->slid = strtoul(optarg, NULL, 0);
@@ -1669,7 +1623,7 @@ static int process_opt(void *context, int ch, char *optarg)
 int main(int argc, char **argv)
 {
 	char usage_args[1024];
-	osm_bind_handle_t h;
+	bind_handle_t h;
 	struct query_params params = {
 		.hop_limit = -1,
 		.reversible = -1,
@@ -1758,7 +1712,7 @@ int main(int argc, char **argv)
 
 	if (!query_type && command == SAQUERY_CMD_QUERY) {
 		if (!argc || !(q = find_query(argv[0])))
-			query_type = IB_MAD_ATTR_NODE_RECORD;
+			query_type = IB_SA_ATTR_NODERECORD;
 		else {
 			query_type = q->query_type;
 			argc--;
@@ -1768,10 +1722,10 @@ int main(int argc, char **argv)
 
 	if (argc) {
 		if (node_print_desc == NAME_OF_LID) {
-			requested_lid = (ib_net16_t) strtoul(argv[0], NULL, 0);
+			requested_lid = strtoul(argv[0], NULL, 0);
 			requested_lid_flag++;
 		} else if (node_print_desc == NAME_OF_GUID) {
-			requested_guid = (ib_net64_t) strtoul(argv[0], NULL, 0);
+			requested_guid = strtoul(argv[0], NULL, 0);
 			requested_guid_flag++;
 		} else
 			requested_name = argv[0];
-- 
1.6.1.rc1.45.g123ed


From sashak at voltaire.com  Wed Feb 11 11:55:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 21:55:25 +0200
Subject: [ofa-general] [PATCH] infiniband-diags: some code consolidation
In-Reply-To: <20090211195442.GP5910@sashak.voltaire.com>
References: <20090211195442.GP5910@sashak.voltaire.com>
Message-ID: <20090211195525.GQ5910@sashak.voltaire.com>


Consolidate repeated code using helper functions
get_and_dump_any_records() and get_and_dump_all_records().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/src/saquery.c |  172 +++++++++++++++-------------------------
 1 files changed, 65 insertions(+), 107 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index a94a015..9726d22 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -796,6 +796,21 @@ static int get_any_records(bind_handle_t h,
 	return ret;
 }
 
+static int get_and_dump_any_records(bind_handle_t h, uint16_t attr_id,
+				    uint32_t attr_mod, ib_net64_t comp_mask,
+				    void *attr, uint64_t sm_key,
+				    void (*dump_func) (void *))
+{
+	int ret = get_any_records(h, attr_id, attr_mod, comp_mask, attr,
+				  sm_key);
+	if (ret)
+		return ret;
+
+	dump_results(&result, dump_func);
+
+	return 0;
+}
+
 /**
  * Get all the records available for requested query type.
  */
@@ -804,6 +819,18 @@ static int get_all_records(bind_handle_t h, uint16_t attr_id, int trusted)
 	return get_any_records(h, attr_id, 0, 0, NULL, trusted ? smkey : 0);
 }
 
+static int get_and_dump_all_records(bind_handle_t h, uint16_t attr_id,
+				    int trusted, void (*dump_func) (void *))
+{
+	int ret = get_all_records(h, attr_id, 0);
+	if (ret)
+		return ret;
+
+	dump_results(&result, dump_func);
+	return_mad();
+	return ret;
+}
+
 /**
  * return the lid from the node descriptor (name) supplied
  */
@@ -989,7 +1016,6 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_path_rec_t pr;
 	ib_net64_t comp_mask = 0;
-	int ret;
 	uint32_t flow = 0;
 	uint16_t qos_class = 0;
 	uint8_t reversible = 0;
@@ -1014,13 +1040,8 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL_AND_SEL(p->rate, pr.rate, PR, RATE, SELEC);
 	CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, pr.pkt_life, PR, PKTLIFETIME, SELEC);
 
-	ret = get_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask, &pr, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_path_record);
-	return_mad();
-	return ret;
+	return get_and_dump_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask,
+					&pr, 0, dump_path_record);
 }
 
 static ib_api_status_t print_issm_records(bind_handle_t h)
@@ -1075,13 +1096,8 @@ return_mc:
 
 static int print_multicast_group_records(bind_handle_t h)
 {
-	int ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_multicast_group_record);
-	return_mad();
-	return ret;
+	return get_and_dump_all_records(h, IB_SA_ATTR_MCRECORD, 0,
+					dump_multicast_group_record);
 }
 
 static int query_class_port_info(const struct query_cmd *q, bind_handle_t h,
@@ -1095,7 +1111,7 @@ static int query_node_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_node_record_t nr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, ret;
+	int lid = 0;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, NULL, NULL);
@@ -1103,14 +1119,8 @@ static int query_node_records(const struct query_cmd *q, bind_handle_t h,
 	memset(&nr, 0, sizeof(nr));
 	CHECK_AND_SET_VAL(lid, 16, 0, nr.lid, NR, LID);
 
-	ret = get_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask, &nr, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_node_record);
-	return_mad();
-
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask,
+					&nr, 0, dump_node_record);
 }
 
 static int query_portinfo_records(const struct query_cmd *q,
@@ -1119,7 +1129,7 @@ static int query_portinfo_records(const struct query_cmd *q,
 {
 	ib_portinfo_record_t pir;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1, ret;
+	int lid = 0, port = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, NULL);
@@ -1128,15 +1138,9 @@ static int query_portinfo_records(const struct query_cmd *q,
 	CHECK_AND_SET_VAL(lid, 16, 0, pir.lid, PIR, LID);
 	CHECK_AND_SET_VAL(port, 8, -1, pir.port_num, PIR, PORTNUM);
 
-	ret = get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0, comp_mask,
-			      &pir, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_portinfo_record);
-	return_mad();
-
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0,
+					comp_mask, &pir, 0,
+					dump_one_portinfo_record);
 }
 
 static int query_mcmember_records(const struct query_cmd *q,
@@ -1145,7 +1149,6 @@ static int query_mcmember_records(const struct query_cmd *q,
 {
 	ib_member_rec_t mr;
 	ib_net64_t comp_mask = 0;
-	int ret;
 	uint32_t flow = 0;
 	uint8_t sl = 0, hop = 0, scope = 0;
 
@@ -1168,38 +1171,23 @@ static int query_mcmember_records(const struct query_cmd *q,
 	mr.scope_state |= scope << 4;
 	CHECK_AND_SET_VAL(p->proxy_join, 8, -1, mr.proxy_join, MCR, PROXY);
 
-	ret = get_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask, &mr, smkey);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_mcmember_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask,
+					&mr, smkey, dump_one_mcmember_record);
 }
 
 static int query_service_records(const struct query_cmd *q, bind_handle_t h,
 				 struct query_params *p, int argc, char *argv[])
 {
-	int ret = get_all_records(h, IB_SA_ATTR_SERVICERECORD, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_service_record);
-	return_mad();
-	return 0;
+	return get_and_dump_all_records(h, IB_SA_ATTR_SERVICERECORD, 0,
+					dump_service_record);
 }
 
 static int query_informinfo_records(const struct query_cmd *q,
 				    bind_handle_t h, struct query_params *p,
 				    int argc, char *argv[])
 {
-	int ret = get_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_inform_info_record);
-	return_mad();
-	return 0;
+	return get_and_dump_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0,
+					dump_inform_info_record);
 }
 
 static int query_link_records(const struct query_cmd *q, bind_handle_t h,
@@ -1207,7 +1195,7 @@ static int query_link_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_link_record_t lr;
 	ib_net64_t comp_mask = 0;
-	int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1, ret;
+	int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &from_lid, &from_port, NULL);
@@ -1221,13 +1209,8 @@ static int query_link_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(to_lid, 16, 0, lr.to_lid, LR, TO_LID);
 	CHECK_AND_SET_VAL(to_port, 8, -1, lr.to_port_num, LR, TO_PORT);
 
-	ret = get_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask, &lr, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_link_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask,
+					&lr, 0, dump_one_link_record);
 }
 
 static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h,
@@ -1235,7 +1218,7 @@ static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_slvl_table_record_t slvl;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, in_port = -1, out_port = -1, ret;
+	int lid = 0, in_port = -1, out_port = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &in_port, &out_port);
@@ -1245,14 +1228,9 @@ static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(in_port, 8, -1, slvl.in_port_num, SLVL, IN_PORT);
 	CHECK_AND_SET_VAL(out_port, 8, -1, slvl.out_port_num, SLVL, OUT_PORT);
 
-	ret = get_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0, comp_mask,
-			      &slvl, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_slvl_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0,
+					comp_mask, &slvl, 0,
+					dump_one_slvl_record);
 }
 
 static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h,
@@ -1260,7 +1238,7 @@ static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_vl_arb_table_record_t vlarb;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1, block = -1, ret;
+	int lid = 0, port = -1, block = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, &block);
@@ -1270,14 +1248,9 @@ static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(port, 8, -1, vlarb.port_num, VLA, OUT_PORT);
 	CHECK_AND_SET_VAL(block, 8, -1, vlarb.block_num, VLA, BLOCK);
 
-	ret = get_any_records(h, IB_SA_ATTR_VLARBTABLERECORD, 0, comp_mask,
-			      &vlarb, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_vlarb_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_VLARBTABLERECORD, 0,
+					comp_mask, &vlarb, 0,
+					dump_one_vlarb_record);
 }
 
 static int query_pkey_tbl_records(const struct query_cmd *q,
@@ -1286,7 +1259,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
 {
 	ib_pkey_table_record_t pktr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, port = -1, block = -1, ret;
+	int lid = 0, port = -1, block = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &port, &block);
@@ -1296,14 +1269,9 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
 	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
 	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
 
-	ret = get_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask,
-			      &pktr, smkey);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_pkey_tbl_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0,
+					comp_mask, &pktr, smkey,
+					dump_one_pkey_tbl_record);
 }
 
 static int query_lft_records(const struct query_cmd *q, bind_handle_t h,
@@ -1311,7 +1279,7 @@ static int query_lft_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_lft_record_t lftr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, block = -1, ret;
+	int lid = 0, block = -1;
 
 	if (argc > 0)
 		parse_lid_and_ports(h, argv[0], &lid, &block, NULL);
@@ -1320,13 +1288,8 @@ static int query_lft_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(lid, 16, 0, lftr.lid, LFTR, LID);
 	CHECK_AND_SET_VAL(block, 16, -1, lftr.block_num, LFTR, BLOCK);
 
-	ret = get_any_records(h, IB_SA_ATTR_LFTRECORD, 0, comp_mask, &lftr, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_lft_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_LFTRECORD, 0, comp_mask,
+					&lftr, 0, dump_one_lft_record);
 }
 
 static int query_mft_records(const struct query_cmd *q, bind_handle_t h,
@@ -1334,7 +1297,7 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h,
 {
 	ib_mft_record_t mftr;
 	ib_net64_t comp_mask = 0;
-	int lid = 0, block = -1, position = -1, ret;
+	int lid = 0, block = -1, position = -1;
 	uint16_t pos = 0;
 
 	if (argc > 0)
@@ -1347,13 +1310,8 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(position, 8, -1, pos, MFTR, POSITION);
 	mftr.position_block_num |= cl_hton16(pos << 12);
 
-	ret = get_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask, &mftr, 0);
-	if (ret)
-		return ret;
-
-	dump_results(&result, dump_one_mft_record);
-	return_mad();
-	return 0;
+	return get_and_dump_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask,
+					&mftr, 0, dump_one_mft_record);
 }
 
 static bind_handle_t get_bind_handle(void)
-- 
1.6.1.rc1.45.g123ed
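
The consolidation above follows a common C pattern: one helper runs the query, passes each record to a caller-supplied dump callback, and releases the response buffer, so every call site shrinks to a single return statement. Below is a minimal, self-contained sketch of that shape. The names fetch_records(), release_result(), fetch_and_dump(), dump_plain() and struct result are hypothetical stand-ins, not the saquery.c functions. Each call site removed by the patch called return_mad() after dump_results(), so a shared helper presumably has to do that release itself on the success path; the sketch does so.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the SA query result buffer. */
struct result {
	const char *records[4];
	int count;
	int released;
};

static struct result result;
static int dumped;	/* number of records passed to a dump callback */

/* Stand-in for get_any_records(): fills the global result, 0 on success. */
static int fetch_records(int attr_id)
{
	if (attr_id < 0)
		return -1;	/* simulated query failure */
	memset(&result, 0, sizeof(result));
	result.records[0] = "record A";
	result.records[1] = "record B";
	result.count = 2;
	return 0;
}

/* Stand-in for return_mad(): hands the response buffer back. */
static void release_result(void)
{
	result.count = 0;
	result.released = 1;
}

/* Example dump callback: prints one record. */
static void dump_plain(void *rec)
{
	printf("%s\n", (const char *)rec);
	dumped++;
}

/* Query, dump every record through the callback, then release the buffer. */
static int fetch_and_dump(int attr_id, void (*dump_func)(void *))
{
	int i, ret = fetch_records(attr_id);

	if (ret)
		return ret;	/* nothing was fetched, nothing to release */

	for (i = 0; i < result.count; i++)
		dump_func((void *)result.records[i]);

	release_result();
	return 0;
}
```

With a helper of this shape, a query function reduces to something like "return fetch_and_dump(attr_id, dump_plain);".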


From devel at morey-chaisemartin.com  Wed Feb 11 12:05:53 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 11 Feb 2009 21:05:53 +0100
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_console.c :
	Added	dump_portguid
	function	to console to generate a list of port guids matching one or
	more regexps
In-Reply-To: <20090211184717.GO5910@sashak.voltaire.com>
References: <499135E1.1080307@ext.bull.net>
	<20090211184717.GO5910@sashak.voltaire.com>
Message-ID: <49932FA1.9050500@morey-chaisemartin.com>

Sasha Khapyorsky wrote:
> Hi Nicolas,
>
> On 09:08 Tue 10 Feb     , Nicolas Morey Chaisemartin wrote:
>   
>> This adds dump_portguid functionality to the OpenSM console, which makes it 
>> really easy to generate cn_guid_file, root_guid_file and such
>> by dumping into a file all port GUIDs whose nodedesc matches at least one 
>> of the provided regexps
>>
>> Signed-off-by: Nicolas Morey-Chaisemartin 
>> <nicolas.morey-chaisemartin at ext.bull.net>
>> ---
>>
>> Repost without exit_after_run flag, active sleep init loop and indented.
>>
>>  opensm/opensm/osm_console.c |  105 
>> +++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 105 insertions(+), 0 deletions(-)
>>
>>
>>     
>
>   
>> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
>> index c6e8e59..5fbcd43 100644
>> --- a/opensm/opensm/osm_console.c
>> +++ b/opensm/opensm/osm_console.c
>> @@ -42,6 +42,7 @@
>>  #include <sys/types.h>
>>  #include <sys/socket.h>
>>  #include <netdb.h>
>> +#include <regex.h>
>>  #ifdef ENABLE_OSM_CONSOLE_SOCKET
>>  #include <arpa/inet.h>
>>  #endif
>> @@ -1173,6 +1174,109 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>>  }
>>  
>>  /* more parse routines go here */
>> +typedef struct _regexp_list {
>> +	regex_t exp;
>> +	struct _regexp_list *next;
>> +} regexp_list_t;
>> +
>> +static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>> +{
>> +	cl_qmap_t *p_port_guid_tbl;
>> +	osm_port_t *p_port;
>> +	osm_port_t *p_next_port;
>> +
>> +	regexp_list_t *p_head_regexp = NULL;
>> +	regexp_list_t *p_regexp;
>> +
>> +	/* Option variables */
>> +	char *p_cmd = NULL;
>> +	FILE *output = out;
>> +
>> +	/* Read command line */
>> +
>> +	while (1) {
>> +		p_cmd = next_token(p_last);
>> +		if (p_cmd) {
>> +			if (strcmp(p_cmd, "file") == 0) {
>> +				p_cmd = next_token(p_last);
>> +				if (p_cmd) {
>> +					output = fopen(p_cmd, "w+");
>> +					if (output == NULL) {
>> +						fprintf(out,
>> +							"Could not open file %s: %s\n",
>> +							p_cmd, strerror(errno));
>> +						output = out;
>> +					}
>> +				} else
>> +					fprintf(out, "No file name passed\n");
>> +			} else {
>> +				p_regexp = malloc(sizeof(*p_regexp));
>> +				if (regcomp
>> +				    (&(p_regexp->exp), p_cmd,
>> +				     REG_NOSUB | REG_EXTENDED) != 0) {
>> +					fprintf(out,
>> +						"Couldn't parse regular expression %s. Skipping it.\n",
>> +						p_cmd);
>> +				}
>> +				p_regexp->next = p_head_regexp;
>> +				p_head_regexp = p_regexp;
>> +			}
>> +		} else
>> +			break;	/* No more tokens */
>> +
>> +	}
>> +
>> +	/* Check we have at least one expression to match */
>> +	if (p_head_regexp == NULL) {
>> +		fprintf(out, "No valid expression provided. Aborting\n");
>> +		return;
>> +	}
>> +
>> +	cl_spinlock_release(&p_osm->sm.state_lock);
>>     
>
> What is this cl_spinlock_release()? Typo?
>
>   
>> +	if (p_osm->sm.p_subn->need_update != 0) {
>> +		fprintf(out, "Subnet is not ready yet. Try again later.\n");
>> +		return;
>> +	}
>> +
>> +	/* Subnet doesn't need to be updated so we can carry on */
>> +
>> +	CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock);
>> +	p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl);
>>     
>
> Do we really need exclusive locking here? port_guid_table content is
> read-only here, so I guess a "read-only" lock (CL_PLOCK_ACQUIRE()) should be enough.
>
> The rest looks fine to me.
>
> Sasha
>
>   

Read-only is fine. I didn't know complib provided different kinds of locks.

Nicolas


From kliteyn at dev.mellanox.co.il  Wed Feb 11 12:13:43 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 11 Feb 2009 22:13:43 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between
	non-CN nodes
In-Reply-To: <f0e08f230902110816l6be2a58bgd3ff171eebf8db35@mail.gmail.com>
References: <494A5339.9030304@ext.bull.net>	
	<20090207185551.GD27757@sashak.voltaire.com>	
	<498DE57D.4030501@morey-chaisemartin.com>	
	<20090207202319.GE27757@sashak.voltaire.com>	
	<49929986.40106@ext.bull.net>	
	<20090211114347.GA27920@sashak.voltaire.com>	
	<4992D207.6010701@dev.mellanox.co.il>
	<f0e08f230902110816l6be2a58bgd3ff171eebf8db35@mail.gmail.com>
Message-ID: <49933177.8010206@dev.mellanox.co.il>

Hal Rosenstock wrote:
> On Wed, Feb 11, 2009 at 8:26 AM, Yevgeny Kliteynik
> <kliteyn at dev.mellanox.co.il> wrote:
>> Sasha Khapyorsky wrote:
>>> On 10:25 Wed 11 Feb     , Nicolas Morey Chaisemartin wrote:
>>>> What about high nodes (HN) as it concerns only nodes which are not at the
>>>> bottom of the fat tree?
>>> Could be fine. Let's ask Yevgeny too... :)
>>>
>>> Yevgeny! Any idea about io_nodes more generic name?
>> Ugh...
>>
>> "IO nodes":
>> Pros: the name is closer to the reality, since in most cases
>> the nodes that would need special treatment are indeed IO nodes.
>> Cons: the name is not "general"...
>>
>> "High nodes"
>> Pros: general name with kinda "hint" to the special treatment.
>> Cons: the "hint" is rather vague...
>>
>> Bottom line - I'm OK with both options (slightly leaning toward IO),
>> as long as it is described well enough in the help message and in man :)
> 
> Maybe consistency is the hobgoblin of small minds, but don't we now have:
> 
> high nodes which is a topology based name
> and
> compute nodes which is a functional based name.
> 
> Is it worth having them consistent ?

Good point. IO nodes will be consistent with CNs.

-- Yevgeny

> -- Hal
> 
>> -- Yevgeny
>>
>>> Sasha
>>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
> 


From frankose at ifi.uio.no  Wed Feb 11 12:17:55 2009
From: frankose at ifi.uio.no (Frank Olaf Sem-Jacobsen)
Date: Wed, 11 Feb 2009 21:17:55 +0100
Subject: [ofa-general] fat-tree CN nodes?
Message-ID: <49933273.1010504@ifi.uio.no>

Hi,

I have been looking into the fat tree code, and I was wondering about 
the definition of a compute node (CN). Are these part of the leaf 
switches at the bottom of the fat tree, or are they extra switches that 
are connected to the fat tree, e.g. the switch in a rack of blades which 
is again connected to the fat tree?

Appreciate the help,
-- 
Frank Olaf Sem-Jacobsen


From cameron at harr.org  Wed Feb 11 12:25:41 2009
From: cameron at harr.org (Cameron Harr)
Date: Wed, 11 Feb 2009 13:25:41 -0700
Subject: [ofa-general] fat-tree CN nodes?
In-Reply-To: <49933273.1010504@ifi.uio.no>
References: <49933273.1010504@ifi.uio.no>
Message-ID: <49933445.5030203@harr.org>

Hi Frank,
A compute node is a computer/server that is generally dedicated to doing 
computational work in a cluster or group of computers.
Cameron

Frank Olaf Sem-Jacobsen wrote:
> Hi,
>
> I have been looking into the fat tree code, and I was wondering about 
> the definition of a compute node (CN). Are these part of the leaf 
> switches at the bottom of the fat tree, or are they extra switches 
> that are connected to the fat tree, e.g. the switch in a rack of 
> blades which is again connected to the fat tree?
>
> Appreciate the help,


From frankose at ifi.uio.no  Wed Feb 11 12:31:04 2009
From: frankose at ifi.uio.no (Frank Olaf Sem-Jacobsen)
Date: Wed, 11 Feb 2009 21:31:04 +0100
Subject: [ofa-general] fat-tree CN nodes?
In-Reply-To: <49933445.5030203@harr.org>
References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org>
Message-ID: <49933588.6050607@ifi.uio.no>

Right, so it has no connection with any topological properties of the fat 
tree? Which again means that the definition of compute nodes is only 
necessary for the ability to balance these separately in the tree?

Thanks for your answer,

Cameron Harr wrote:
> Hi Frank,
> A compute node is a computer/server that is generally dedicated to doing 
> computational work in a cluster or group of computers.
> Cameron
> 
> Frank Olaf Sem-Jacobsen wrote:
>> Hi,
>>
>> I have been looking into the fat tree code, and I was wondering about 
>> the definition of a compute node (CN). Are these part of the leaf 
>> switches at the bottom of the fat tree, or are they extra switches 
>> that are connected to the fat tree, e.g. the switch in a rack of 
>> blades which is again connected to the fat tree?
>>
>> Appreciate the help,


-- 
Frank Olaf Sem-Jacobsen


From cameron at harr.org  Wed Feb 11 12:47:13 2009
From: cameron at harr.org (Cameron Harr)
Date: Wed, 11 Feb 2009 13:47:13 -0700
Subject: [ofa-general] fat-tree CN nodes?
In-Reply-To: <49933588.6050607@ifi.uio.no>
References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org>
	<49933588.6050607@ifi.uio.no>
Message-ID: <49933951.2030504@harr.org>

Frank,
I'm going to step out of the discussion because I'm no authority in the 
code. My understanding is that the CN is there as a GUID (from the HCA) 
on the very bottom of the fabric - connected to the leaf switch. Someone 
who knows the code will have to give you a real answer. Sorry.
Cameron

Frank Olaf Sem-Jacobsen wrote:
> Right,so it has no connection with any topological properties of the 
> fat tree? Which again means that the definition of compute nodes is 
> only necessary for the ability to balance these separately in the tree?
>
> Thanks for your answer,
>
> Cameron Harr wrote:
>> Hi Frank,
>> A compute node is a computer/server that is generally dedicated to 
>> doing computational work in a cluster or group of computers.
>> Cameron
>>
>> Frank Olaf Sem-Jacobsen wrote:
>>> Hi,
>>>
>>> I have been looking into the fat tree code, and I was wondering 
>>> about the definition of a compute node (CN). Are these part of the 
>>> leaf switches at the bottom of the fat tree, or are they extra 
>>> switches that are connected to the fat tree, e.g. the switch in a 
>>> rack of blades which is again connected to the fat tree?
>>>
>>> Appreciate the help,
>
>


From or.gerlitz at gmail.com  Wed Feb 11 12:52:26 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 11 Feb 2009 22:52:26 +0200
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for 
	bind
In-Reply-To: <Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
Message-ID: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>

On Thu, Feb 5, 2009 at 1:44 PM, Or Gerlitz <ogerlitz at voltaire.com> wrote:

> It seems that even when the rdma-cm consumer binds to a specific address,
> the rdma-cm address resolution code follows the order of the devices/rules
> in routing table. So the user can't really dictate an outgoing interface
> based on the src address provided to rdma_resolve_addr.


Hi Sean,

Did you have a chance to look into that?

Or.

From sean.hefty at intel.com  Wed Feb 11 13:14:46 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 11 Feb 2009 13:14:46 -0800
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
Message-ID: <798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com>

>Did you have a chance to look into that?

Not yet - but should be able to look into it by the end of the week.  From what
Jason said, it sounds like ip_dev_find() doesn't behave like I was expecting. 

- Sean


From swise at opengridcomputing.com  Wed Feb 11 14:29:15 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 16:29:15 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from
	build_rdma_recv().
Message-ID: <20090211222915.19520.22647.stgit@dell3.ogc.int>

From: Steve Wise <swise at opengridcomputing.com>

- remove modulo usage

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index c2b3cf7..bf549ed 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
 		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
 
 		/* to in the WQE == the offset into the page */
-		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
-				(1UL << (12 + page_size[i])));
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32)wr->sg_list[i].addr) &
+				((1UL << (12 + page_size[i])) - 1));
 
 		/* pbl_addr is the adapters address in the PBL */
 		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
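
The substitution relies on the identity x % 2^n == x & (2^n - 1) for unsigned x. Here is a small sketch of that identity. The helpers page_offset_mod() and page_offset_mask() are illustrative only and are not part of the driver. Because binary "-" binds tighter than "<<" in C, the shift needs its own parentheses before the 1 is subtracted.

```c
#include <stdint.h>

/* Offset of addr within a page of 2^(12 + page_size) bytes, via modulo. */
static uint64_t page_offset_mod(uint32_t addr, unsigned int page_size)
{
	return addr % (1ULL << (12 + page_size));
}

/* Same offset via a bit mask: 2^n - 1 has exactly the low n bits set. */
static uint64_t page_offset_mask(uint32_t addr, unsigned int page_size)
{
	return addr & ((1ULL << (12 + page_size)) - 1);
}
```

Both helpers return the same value for any addr and any page_size small enough that the shift stays below 64 bits.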


From davem at davemloft.net  Wed Feb 11 15:00:53 2009
From: davem at davemloft.net (David Miller)
Date: Wed, 11 Feb 2009 15:00:53 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from
 build_rdma_recv().
In-Reply-To: <20090211222915.19520.22647.stgit@dell3.ogc.int>
References: <20090211222915.19520.22647.stgit@dell3.ogc.int>
Message-ID: <20090211.150053.02539000.davem@davemloft.net>

From: Steve Wise <swise at opengridcomputing.com>
Date: Wed, 11 Feb 2009 16:29:15 -0600

> From: Steve Wise <swise at opengridcomputing.com>
> 
> - remove modulo usage
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>

Acked-by: David S. Miller <davem at davemloft.net>


From Jie.Cai at cs.anu.edu.au  Wed Feb 11 18:30:19 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Thu, 12 Feb 2009 13:30:19 +1100
Subject: [ofa-general] Question on dat_ep_post_rdma_write with
	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <49927A53.1020403@cs.anu.edu.au>
References: <49927A53.1020403@cs.anu.edu.au>
Message-ID: <499389BB.6060806@cs.anu.edu.au>

I am a bit confused by the description of DAT_COMPLETION_SUPPRESS_FLAG.

It looks like it suppresses notification after DTO operations. Is that always true?

I have found that when I use dat_ep_post_rdma_write to transfer more than
128k of data (within 1 iov), an event is delivered on the server side
(verified with the cookie), and on the client side an event with
Invalid_DAT_EVENT_NUMBER is received.

What's the problem?

Thanks

-- 
Mr. Jie Cai


From yunhong.jiang at intel.com  Wed Feb 11 17:18:21 2009
From: yunhong.jiang at intel.com (Jiang, Yunhong)
Date: Thu, 12 Feb 2009 09:18:21 +0800
Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
Message-ID: <E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>

It seems this happens because the PCI frontend tries to write to a configuration space field for which PCIback has no config_field entry.
I think you can first try what dom0's dmesg suggests: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think).

BTW, where did you get the following log? It seems to suggest that no config space access function was found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

-- Yunhong Jiang
________________________________
From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl
Sent: February 11, 2009 22:18
To: David Brown
Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen paravirtualized guest using the pciback module.

No one seems to have tried answering this question on the list, so let me ping the xen-devel and OFED people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled


Some more details:
[root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root@p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
Okay so my question to the openfabrics guys is, why would the OFED
drivers fail to read the firmware?

Any thoughts?

Thanks,
- David Brown


---------- Forwarded message ----------
From: David Brown <dmlb2000 at gmail.com>
Date: Thu, Sep 11, 2008 at 2:24 PM
Subject: pciback module not working
To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com


This issue was brought up about a year and a half ago. So I'll bring
it up again and see if anything happens.

I've got an infiniband network and am attempting to pass the
infiniband card through the host and give it to the guest.
I'm working with standard CentOS 5.2 on both guest and host with their
provided xen (3.0.3 ish). I've also attempted to install the newest
Xen 3.3 and use their standard host kernel and that did the same
thing. The guest dmesg output is similar in both
permissive and normal mode.

I'm getting issues with detecting the firmware on the card for some reason...

Any help would be appreciated.

Thanks,
- David Brown

=== GUEST dmesg output ===
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11
=======================

=== Host modprobe.conf ===
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
options pciback hide=(41:00.0)
=====================

=== Host lspci output ===
# lspci -vs 41:00.0
41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================

This makes sure it gets loaded first, before anything else.
=== Host mkinitrd cmd ===
# mkinitrd -f --with=pciback --preload pciback
/boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
====================

=== Host pciback dmesg ===
pciback 0000:41:00.0: Driver tried to write to a read-only
configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of
your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
======================

=== Host pciback dmesg (after setting it permissive) ===
pciback 0000:41:00.0: enabling permissive mode configuration space accesses!
pciback 0000:41:00.0: permissive mode is potentially unsafe!
pciback: vpci: 0000:41:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
=========================================

=== Guest lspci output ===
# lspci -v
00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"

From subbukl at gmail.com  Wed Feb 11 21:52:25 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 12 Feb 2009 11:22:25 +0530
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
Message-ID: <f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>

No luck!
dmesg in the Xen PV guest shows:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
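
For reference, the dom0 hand-off sequence quoted in this thread (unbind from ib_mthca, then new_slot and bind under pciback, then permissive) can be collected into one small script. This is only a sketch: the sysfs attribute paths are the ones used in the commands in this thread, while the function name, the `sysfs_root` parameter and the Python wrapper itself are illustrative assumptions, not part of any Xen tool.

```python
import os

# Order matters: release the device from its native driver first, then
# hand it to pciback, then (optionally) relax config-space filtering.
# The attribute paths are the ones quoted in this thread; the function
# name and the sysfs_root parameter are hypothetical.
HANDOFF_STEPS = [
    "bus/pci/drivers/ib_mthca/unbind",
    "bus/pci/drivers/pciback/new_slot",
    "bus/pci/drivers/pciback/bind",
    "bus/pci/drivers/pciback/permissive",
]


def hand_device_to_pciback(bdf, sysfs_root="/sys", permissive=True):
    """Write a PCI address such as '0000:0e:00.0' to each attribute in order.

    Returns the list of paths written. Needs root privileges in dom0.
    """
    steps = HANDOFF_STEPS if permissive else HANDOFF_STEPS[:-1]
    written = []
    for rel in steps:
        path = os.path.join(sysfs_root, rel)
        # Same effect as `echo -n BDF > path`: no trailing newline.
        with open(path, "w") as attr:
            attr.write(bdf)
        written.append(path)
    return written
```

Calling `hand_device_to_pciback("0000:0e:00.0")` before `xm create` reproduces the order used above; whether permissive mode changes the QUERY_FW outcome is exactly what is still open in this thread.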

I am getting the following messages on the console as part of the initial
bootup messages of the guest:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

after executing the following in dom0:
#xm create -c rhel52_64_3


So the problem persists.

~subbu


2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>

>  It seems the PCI frontend tries to write to a configuration space field
> that PCIback has no config_field entry for.
> I think you can first try what dom0's dmesg suggests: "see
> permissive attribute in sysfs" (it should be "set permissive attribute...",
> I think).
>
> BTW, where did you get the following log? It seems to suggest that no config
> space access function was found.
>
> PCI: Fatal: No PCI config space access function found
> rtc: IRQ 8 is not free.
> i8042.c: No controller found.
>
> -- Yunhong Jiang
>
>  ------------------------------
> *From:* xen-devel-bounces at lists.xensource.com [mailto:
> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
> *Sent:* February 11, 2009 22:18
> *To:* David Brown
> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not working
>
> I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen
> paravirtualized guest using the pciback module.
>
> No one seems to have tried answering this question on the list, so let me
> ping the xen-devel and OFED people again.
>
> After executing in dom0:
> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>
> #dmesg
> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
> tap tap-1-51712: 2 getting info
> tap tap-2-51712: 2 getting info
> pciback 0000:0e:00.0: seizing device
> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>
> #xm create -c rhel52_64_3
>
> PCI: Fatal: No PCI config space access function found
> rtc: IRQ 8 is not free.
> i8042.c: No controller found.
>
>
> GUEST dmesg:
>
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> ib_mthca: Initializing 0000:00:00.0
> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
> PCI: Setting latency timer of device 0000:00:00.0 to 64
> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
> ib_mthca: probe of 0000:00:00.0 failed with error -11
>
> in dom0:
> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual
> slot 0
> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol
> 1 (x86_64-abi)
> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to
> a read-only configuration space field at offset 0x44, size 2. This may be
> harmless, but if you have problems with your device:
> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing
> list along with details of your device obtained from lspci.
> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
> 0002)
> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16
> (level, low) -> IRQ 16
> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0
> disabled
>
>
>
> Some more details:
>
> [root@p128 ~]# rpm -qa | grep xen
> kernel-xen-2.6.18-92.1.22.el5
> xen-3.0.3-64.el5_2.9
> xen-libs-3.0.3-64.el5_2.9
> xen-libs-3.0.3-64.el5_2.9
>
> [root@p128 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.3.0
>         node_guid:                      0002:c902:0022:cd48
>         sys_image_guid:                 0002:c902:0022:cd4b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0x20
>         board_id:                       MT_0370130002
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>
>
> any help greatly appreciated.
>
> ~subbu
>
> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
>
>> Okay so my question to the openfabrics guys is, why would the OFED
>> drivers fail to read the firmware?
>>
>> Any thoughts?
>>
>> Thanks,
>> - David Brown
>>
>>
>> ---------- Forwarded message ----------
>> From: David Brown <dmlb2000 at gmail.com>
>> Date: Thu, Sep 11, 2008 at 2:24 PM
>> Subject: pciback module not working
>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>
>>
>> This issue was brought up about a year and a half ago. So I'll bring
>> it up again and see if anything happens.
>>
>> I've got an infiniband network and am attempting to pass the
>> infiniband card through the host and give it to the guest.
>> I'm working with standard CentOS 5.2 on both guest and host with their
>> provided xen (3.0.3 ish). I've also attempted to install the newest
>> Xen 3.3 and use their standard host kernel and that did the same
>> thing. The guest dmesg output is similar in both
>> permissive and normal mode.
>>
>> I'm getting issues with detecting the firmware on the card for some
>> reason...
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> - David Brown
>>
>> === GUEST dmesg output ===
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>> ib_mthca: Initializing 0000:00:00.0
>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>> =======================
>>
>> === Host modprobe.conf ===
>> alias eth0 bnx2
>> alias eth1 bnx2
>> alias scsi_hostadapter cciss
>> options pciback hide=(41:00.0)
>> =====================
>>
>> === Host lspci output ===
>> # lspci -vs 41:00.0
>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>> HCA] (rev 20)
>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>       Flags: fast devsel, IRQ 16
>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>       Capabilities: [40] Power Management version 2
>>       Capabilities: [48] Vital Product Data
>>       Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
>> Enable-
>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>       Capabilities: [60] Express Endpoint IRQ 0
>> =====================
>>
>> This makes sure it gets loaded first, before anything else.
>> === Host mkinitrd cmd ===
>> # mkinitrd -f --with=pciback --preload pciback
>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>> ====================
>>
>> === Host pciback dmesg ===
>> pciback 0000:41:00.0: Driver tried to write to a read-only
>> configuration space field at offset 0x44, size 2. This may be
>> harmless, but if you have problems with your device:
>> 1) see permissive attribute in sysfs
>> 2) report problems to the xen-devel mailing list along with details of
>> your device obtained from lspci.
>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>> ======================
>>
>> === Host pciback dmesg (after setting it permissive) ===
>> pciback 0000:41:00.0: enabling permissive mode configuration space
>> accesses!
>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>> device vif1.0 entered promiscuous mode
>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>> =========================================
>>
>> === Guest lspci output ===
>> # lspci -v
>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>> HCA] (rev 20)
>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>       Flags: fast devsel, IRQ 16
>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>       Capabilities: [40] Power Management version 2
>>       Capabilities: [48] Vital Product Data
>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>> Queue=0/5 Enable-
>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>       Capabilities: [60] Express Endpoint IRQ 0
>> =====================
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>
>
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what do
> they need you for?"
>
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"

From yunhong.jiang at intel.com  Wed Feb 11 22:20:55 2009
From: yunhong.jiang at intel.com (Jiang, Yunhong)
Date: Thu, 12 Feb 2009 14:20:55 +0800
Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
Message-ID: <E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>

So any changes in dom0's dmesg?


________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 13:52
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

No luck!
dmesg in the Xen PV guest shows:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive

I am getting the following messages on the console as part of the initial bootup messages of the guest:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

after executing the following in dom0:
#xm create -c rhel52_64_3


So the problem persists.

~subbu


2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
It seems the PCI frontend tries to write to a configuration space field that PCIback has no config_field entry for.
I think you can first try what dom0's dmesg suggests: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think).
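
To see which configuration-space writes pciback refused, the dom0 log can be scanned for the warning quoted in this thread. The sketch below is an assumption-laden helper, not part of pciback: the regular expression is built only from the warning text as it appears in the logs here, and the function name is made up for illustration.

```python
import re

# Pattern derived from the pciback warning text quoted in this thread.
# \s+ after "read-only" tolerates the line wrap seen in some pasted logs.
PCIBACK_RO_WRITE = re.compile(
    r"pciback (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Driver tried to write to a read-only\s+configuration space field "
    r"at offset (?P<offset>0x[0-9a-f]+), size (?P<size>\d+)",
    re.IGNORECASE,
)


def find_blocked_config_writes(dmesg_text):
    """Return (device, offset, size) for each refused config-space write."""
    return [
        (m.group("bdf"), int(m.group("offset"), 16), int(m.group("size")))
        for m in PCIBACK_RO_WRITE.finditer(dmesg_text)
    ]
```

An empty result after setting permissive would be consistent with pciback no longer filtering the write. For the 41:00.0 device, whose lspci output lists Power Management at [40], offset 0x44 with size 2 would appear to be the power management control/status register, assuming the standard capability layout.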

BTW, where did you get the following log? It seems to suggest that no config space access function was found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

-- Yunhong Jiang
________________________________
From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl
Sent: February 11, 2009 22:18
To: David Brown
Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen paravirtualized guest using the pciback module.

No one seems to have tried answering this question on the list, so let me ping the xen-devel and OFED people again.

After executing in dom0:
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled


Some more details:

[root@p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root@p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
Okay so my question to the openfabrics guys is, why would the OFED
drivers fail to read the firmware?

Any thoughts?

Thanks,
- David Brown


---------- Forwarded message ----------
From: David Brown <dmlb2000 at gmail.com>
Date: Thu, Sep 11, 2008 at 2:24 PM
Subject: pciback module not working
To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com


This issue was brought up about a year and a half ago. So I'll bring
it up again and see if anything happens.

I've got an infiniband network and am attempting to pass the
infiniband card through the host and give it to the guest.
I'm working with standard CentOS 5.2 on both guest and host with their
provided xen (3.0.3 ish). I've also attempted to install the newest
Xen 3.3 and use their standard host kernel and that did the same
thing. The guest dmesg output is similar in both
permissive and normal mode.

I'm getting issues with detecting the firmware on the card for some reason...

Any help would be appreciated.

Thanks,
- David Brown

=== GUEST dmesg output ===
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11
=======================

=== Host modprobe.conf ===
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
options pciback hide=(41:00.0)
=====================

=== Host lspci output ===
# lspci -vs 41:00.0
41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================

This makes sure it gets loaded first, before anything else.
=== Host mkinitrd cmd ===
# mkinitrd -f --with=pciback --preload pciback
/boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
====================

=== Host pciback dmesg ===
pciback 0000:41:00.0: Driver tried to write to a read-only
configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of
your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
======================

=== Host pciback dmesg (after setting it permissive) ===
pciback 0000:41:00.0: enabling permissive mode configuration space accesses!
pciback 0000:41:00.0: permissive mode is potentially unsafe!
pciback: vpci: 0000:41:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
=========================================

=== Guest lspci output ===
# lspci -v
00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"

From Sumeet.Lahorani at oracle.com  Wed Feb 11 22:31:42 2009
From: Sumeet.Lahorani at oracle.com (Sumeet Lahorani)
Date: Wed, 11 Feb 2009 22:31:42 -0800
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4992EABA.9090605@Voltaire.com>
References: <4990CD57.3080108@oracle.com> <4992EABA.9090605@Voltaire.com>
Message-ID: <4993C24E.504@oracle.com>


Olga, Or,

Thanks for the pointers.

Does this packet drop always occur at the host or could it also occur in 
the switches (Voltaire ISR 9024)?

Also, besides the "packet len too long ..." message, is the "dropped" 
statistic in ifconfig ib0 a good way to find out if such packet drops 
are happening?

- Sumeet

Or Gerlitz wrote:
> Sumeet Lahorani wrote:
>   
>> When we enable IB connected mode and increase MTU to 65520, we see the following
>> kernel: ib0: enabling connected mode will cause multicast packet drops
>> kernel: ib0: mtu > 2044 will cause multicast packet drops.
>>     
>
>   
>> Should we not be doing this? What kind of multicast packets will be dropped?
>> If we are not using multicast, do any drivers (bonding, ipoib etc) internally use 
>> multicast in a way that will cause them to not work correctly in connected mode? 
>>     
>
> Connected mode is supported only for unicast traffic; multicast traffic keeps going over the IB UD QP, whose MTU is much lower (e.g. 2-4K). To close the gap between the MTU seen by the network stack and the MTU used by the UD QP, IPoIB emulates receiving an ICMP packet that tells the OS stack to use a different path MTU for this multicast neighbour, see
>
> ipoib_start_xmit --> 
>   ipoib_send --> 
>    ipoib_cm_skb_too_long(mcast_mtu) --> 
>     skb->dst->ops->update_pmtu(skb->dst, mtu)
>
> When IP multicast is not used, the network stack and bonding use multicast only to send ARPs on the broadcast group, and for IGMP; both are well below the IB MTU in size.
>
> Or.
>   
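
As a rough way to watch for the condition discussed above, the interface mode, MTU and drop counters can be read from sysfs. This is a sketch under assumptions: the 2044 threshold is taken from the kernel warning quoted above, the function name and `sysfs_root` parameter are made up for illustration, and whether the generic `tx_dropped`/`rx_dropped` counters actually include these multicast drops is the open question in this message, not something the code confirms.

```python
import os

UD_MTU_LIMIT = 2044  # threshold quoted in the kernel warning above


def ipoib_status(ifname="ib0", sysfs_root="/sys"):
    """Collect mode, MTU and drop counters for an IPoIB interface."""
    base = os.path.join(sysfs_root, "class", "net", ifname)

    def read(rel):
        with open(os.path.join(base, rel)) as f:
            return f.read().strip()

    mode = read("mode")  # IPoIB reports "connected" or "datagram"
    mtu = int(read("mtu"))
    return {
        "mode": mode,
        "mtu": mtu,
        "tx_dropped": int(read("statistics/tx_dropped")),
        "rx_dropped": int(read("statistics/rx_dropped")),
        # Combines the two kernel warnings: connected mode with an MTU
        # above what the UD QP can carry for multicast.
        "mcast_drop_risk": mode == "connected" and mtu > UD_MTU_LIMIT,
    }
```

Sampling `ipoib_status()` before and after a multicast test and comparing the counters would show whether the "dropped" statistic moves at all on a given setup.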


From subbukl at gmail.com  Wed Feb 11 22:42:47 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 12 Feb 2009 12:12:47 +0530
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
Message-ID: <f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>

Oops, I missed that.

Well, now I don't see that "enable permissive..." message. Here are the
messages I got in dom0 while booting domU:

tap tap-1-51712: 2 getting info
pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:0e:00.0 to 64
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state

Do any of these messages look suspicious?
Any idea why I get the following in the domU bootup messages?

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.

~subbu

On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com>wrote:

>  So any changes in dom0's dmesg?
>
>
>  ------------------------------
> *From:* subbu kl [mailto:subbukl at gmail.com]
> *Sent:* February 12, 2009 13:52
> *To:* Jiang, Yunhong
> *Cc:* David Brown; xen-devel at lists.xensource.com;
> general at lists.openfabrics.org
> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
> working
>
>  no luck !
>  dmesg in XEN PV guest shows :
>
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> ib_mthca: Initializing 0000:00:00.0
> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
> PCI: Setting latency timer of device 0000:00:00.0 to 64
> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
> ib_mthca: probe of 0000:00:00.0 failed with error -11
>
> even after executing the following in dom0:
>
> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
>
> I am getting the following messages on the console as part of the initial
> bootup messages of the guest:
>
> Started domain rhel52_64_3
> PCI: Fatal: No PCI config space access function found
> rtc: IRQ 8 is not free.
> i8042.c: No controller found.
>
> after executing the following in dom0 :
> #xm create -c rhel52_64_3
>
>
> so the problem persists,
>
> ~subbu
>
>
> 2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
>
>>  It seems the PCI frontend tries to write to a configuration space field
>> for which PCIback has no config_field entry.
>> I think you can first try what dom0's dmesg suggests: "see
>> permissive attribute in sysfs" (it should be "set permissive attribute...",
>> I think).
>>
>> BTW, where did you get the following log? It seems to suggest that the
>> config space access function was not found.
>>
>> PCI: Fatal: No PCI config space access function found
>> rtc: IRQ 8 is not free.
>> i8042.c: No controller found."
>>
>> -- Yunhong Jiang
>>
>>  ------------------------------
>> *From:* xen-devel-bounces at lists.xensource.com [mailto:
>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
>> *Sent:* February 11, 2009 22:18
>> *To:* David Brown
>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not working
>>
>>   I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen
>> paravirtualized guest and the pciback module.
>>
>> No one seems to have tried answering this question on the list, let me
>> ping xen-devel and ofed people again.
>>
>> after executing in dom0
>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>>
>> #dmesg
>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>> tap tap-1-51712: 2 getting info
>> tap tap-2-51712: 2 getting info
>> pciback 0000:0e:00.0: seizing device
>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>
>> #xm create -c rhel52_64_3
>>
>> PCI: Fatal: No PCI config space access function found
>> rtc: IRQ 8 is not free.
>> i8042.c: No controller found.
>>
>>
>> GUEST dmesg:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> ib_mthca: Initializing 0000:00:00.0
>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>
>> in dom0:
>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to
>> virtual slot 0
>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not
>> ready
>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol
>> 1 (x86_64-abi)
>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write
>> to a read-only configuration space field at offset 0x44, size 2. This may be
>> harmless, but if you have problems with your device:
>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing
>> list along with details of your device obtained from lspci.
>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
>> 0002)
>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16
>> (level, low) -> IRQ 16
>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0
>> disabled
>>
>>
>>
>> some more details - [root at p128 ~]# rpm -qa | grep xen
>> kernel-xen-2.6.18-92.1.22.el5
>> xen-3.0.3-64.el5_2.9
>> xen-libs-3.0.3-64.el5_2.9
>> xen-libs-3.0.3-64.el5_2.9
>>
>> [root at p128 ~]# ibv_devinfo
>> hca_id: mthca0
>>         fw_ver:                         5.3.0
>>         node_guid:                      0002:c902:0022:cd48
>>         sys_image_guid:                 0002:c902:0022:cd4b
>>         vendor_id:                      0x02c9
>>         vendor_part_id:                 25218
>>         hw_ver:                         0x20
>>         board_id:                       MT_0370130002
>>         phys_port_cnt:                  2
>>                 port:   1
>>                         state:                  PORT_INIT (2)
>>                         max_mtu:                2048 (4)
>>                         active_mtu:             512 (2)
>>                         sm_lid:                 0
>>                         port_lid:               0
>>                         port_lmc:               0x00
>>
>>                 port:   2
>>                         state:                  PORT_DOWN (1)
>>                         max_mtu:                2048 (4)
>>                         active_mtu:             512 (2)
>>                         sm_lid:                 0
>>                         port_lid:               0
>>                         port_lmc:               0x00
>>
>>
>> any help greatly appreciated.
>>
>> ~subbu
>>
>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
>>
>>> Okay so my question to the openfabrics guys is, why would the OFED
>>> drivers fail to read the firmware?
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> - David Brown
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: David Brown <dmlb2000 at gmail.com>
>>> Date: Thu, Sep 11, 2008 at 2:24 PM
>>> Subject: pciback module not working
>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>>
>>>
>>> This issue was brought up about a year and a half ago. So I'll bring
>>> it up again and see if anything happens.
>>>
>>> I've got an infiniband network and am attempting to pass the
>>> infiniband card through the host and give it to the guest.
>>> I'm working with standard CentOS 5.2 on both guest and host with their
>>> provided xen (3.0.3 ish). I've also attempted to install the newest
>>> Xen 3.3 and use their standard host kernel and that did the same
>>> thing. The guest dmesg output in the guest is similar on both
>>> permissive and normal mode.
>>>
>>> I'm getting issues with detecting the firmware on the card for some
>>> reason...
>>>
>>> Any help would be appreciated.
>>>
>>> Thanks,
>>> - David Brown
>>>
>>> === GUEST dmesg output ===
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>> ib_mthca: Initializing 0000:00:00.0
>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>> =======================
>>>
>>> === Host modprobe.conf ===
>>> alias eth0 bnx2
>>> alias eth1 bnx2
>>> alias scsi_hostadapter cciss
>>> options pciback hide=(41:00.0)
>>> =====================
>>>
>>> === Host lspci output ===
>>> # lspci -vs 41:00.0
>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>> HCA] (rev 20)
>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>       Flags: fast devsel, IRQ 16
>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>       Capabilities: [40] Power Management version 2
>>>       Capabilities: [48] Vital Product Data
>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
>>> Enable-
>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>       Capabilities: [60] Express Endpoint IRQ 0
>>> =====================
>>>
>>> This makes sure it get loaded first off before anything else.
>>> === Host mkinitrd cmd ===
>>> # mkinitrd -f --with=pciback --preload pciback
>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>>> ====================
>>>
>>> === Host pciback dmesg ===
>>> pciback 0000:41:00.0: Driver tried to write to a read-only
>>> configuration space field at offset 0x44, size 2. This may be
>>> harmless, but if you have problems with your device:
>>> 1) see permissive attribute in sysfs
>>> 2) report problems to the xen-devel mailing list along with details of
>>> your device obtained from lspci.
>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>> ======================
>>>
>>> === Host pciback dmesg (after setting it permissive) ===
>>> pciback 0000:41:00.0: enabling permissive mode configuration space
>>> accesses!
>>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>>> device vif1.0 entered promiscuous mode
>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>> =========================================
>>>
>>> === Guest lspci output ===
>>> # lspci -v
>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>> HCA] (rev 20)
>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>       Flags: fast devsel, IRQ 16
>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>       Capabilities: [40] Power Management version 2
>>>       Capabilities: [48] Vital Product Data
>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>> Queue=0/5 Enable-
>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>       Capabilities: [60] Express Endpoint IRQ 0
>>> =====================
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>
>>
>>
>> --
>> . . . s u b b u
>> "You've got to be original, because if you're like someone else, what do
>> they need you for?"
>>
>>
>
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what do
> they need you for?"
>
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"

From nicolas.morey-chaisemartin at ext.bull.net  Wed Feb 11 22:46:27 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 07:46:27 +0100
Subject: [ofa-general] [PATCH v3] opensm/osm_console.c : Added dump_portguid
 function to
 console to generate a list of port guids matching one or more regexps
Message-ID: <4993C5C3.6020700@ext.bull.net>

This adds a dump_portguid function to the OpenSM console, which makes it easy to generate cn_guid_file, root_guid_file and similar files
by dumping into a file all port GUIDs whose node description matches at least one of the provided regexps.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
Diff from v2:
- Changed lock to read-only instead of exclusive
- Removed useless cl_spinlock_release (remains from 1st patch)

  opensm/opensm/osm_console.c |  104 +++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index c6e8e59..5bc1079 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -42,6 +42,7 @@
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netdb.h>
+#include <regex.h>
  #ifdef ENABLE_OSM_CONSOLE_SOCKET
  #include <arpa/inet.h>
  #endif
@@ -1173,6 +1174,108 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
  }

  /* more parse routines go here */
+typedef struct _regexp_list {
+	regex_t exp;
+	struct _regexp_list *next;
+} regexp_list_t;
+
+static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
+{
+	cl_qmap_t *p_port_guid_tbl;
+	osm_port_t *p_port;
+	osm_port_t *p_next_port;
+
+	regexp_list_t *p_head_regexp = NULL;
+	regexp_list_t *p_regexp;
+
+	/* Option variables */
+	char *p_cmd = NULL;
+	FILE *output = out;
+
+	/* Read command line */
+
+	while (1) {
+		p_cmd = next_token(p_last);
+		if (p_cmd) {
+			if (strcmp(p_cmd, "file") == 0) {
+				p_cmd = next_token(p_last);
+				if (p_cmd) {
+					output = fopen(p_cmd, "w+");
+					if (output == NULL) {
+						fprintf(out,
+							"Could not open file %s: %s\n",
+							p_cmd, strerror(errno));
+						output = out;
+					}
+				} else
+					fprintf(out, "No file name passed\n");
+			} else {
+				p_regexp = malloc(sizeof(*p_regexp));
+				if (p_regexp == NULL ||
+				    regcomp(&(p_regexp->exp), p_cmd,
+					    REG_NOSUB | REG_EXTENDED) != 0) {
+					fprintf(out, "Couldn't use regular expression %s. Skipping it.\n", p_cmd);
+					free(p_regexp);	/* don't link an uncompiled regex */
+					continue;
+				}
+				p_regexp->next = p_head_regexp;
+				p_head_regexp = p_regexp;
+			}
+		} else
+			break;	/* No more tokens */
+
+	}
+
+	/* Check we have at least one expression to match */
+	if (p_head_regexp == NULL) {
+		fprintf(out, "No valid expression provided. Aborting\n");
+		return;
+	}
+
+	if (p_osm->sm.p_subn->need_update != 0) {
+		fprintf(out, "Subnet is not ready yet. Try again later.\n");
+		return;
+	}
+
+	/* Subnet doesn't need to be updated so we can carry on */
+
+	CL_PLOCK_ACQUIRE(p_osm->sm.p_lock);
+	p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl);
+
+	p_next_port = (osm_port_t *) cl_qmap_head(p_port_guid_tbl);
+	while (p_next_port != (osm_port_t *) cl_qmap_end(p_port_guid_tbl)) {
+
+		p_port = p_next_port;
+		p_next_port =
+		    (osm_port_t *) cl_qmap_next(&p_next_port->map_item);
+
+		for (p_regexp = p_head_regexp; p_regexp != NULL;
+		     p_regexp = p_regexp->next)
+			if (regexec
+			    (&(p_regexp->exp), p_port->p_node->print_desc, 0,
+			     NULL, 0) == 0)
+				fprintf(output, "0x%" PRIxLEAST64 "\n",
+					cl_ntoh64(p_port->p_physp->port_guid));
+	}
+
+	CL_PLOCK_RELEASE(p_osm->sm.p_lock);
+	if (output != out)
+		fclose(output);
+
+}
+
+static void help_dump_portguid(FILE * out, int detail)
+{
+	fprintf(out,
+		"dump_portguid [file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUIDs matching a regexp\n");
+	if (detail) {
+		fprintf(out,
+			"dump_portguid  -- Dump all the port GUIDs whose node_desc matches one of the provided regexps\n");
+		fprintf(out,
+			"   [file filename] -- Send the port GUID list to the specified file instead of regular output\n");
+	}
+
+}

  static const struct command console_cmds[] = {
  	{"help", &help_command, &help_parse},
@@ -1192,6 +1295,7 @@ static const struct command console_cmds[] = {
  #ifdef ENABLE_OSM_PERF_MGR
  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
  #endif				/* ENABLE_OSM_PERF_MGR */
+	{"dump_portguid", &help_dump_portguid, &dump_portguid_parse},
  	{NULL, NULL, NULL}	/* end of array */
  };

-- 
1.6.1
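
The regexp handling in the patch above can be tried outside OpenSM. Below is a minimal standalone sketch using only POSIX <regex.h>; the names pattern_list_add, pattern_list_match and pattern_list_free are hypothetical and not part of OpenSM. It assumes that a pattern which fails to compile should be dropped rather than linked into the list, and that the list should be released once matching is done.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical list node, mirroring regexp_list_t from the patch. */
struct pattern {
	regex_t exp;
	struct pattern *next;
};

/*
 * Compile `text` and push it on the list. Returns false and leaves the
 * list untouched if allocation or compilation fails.
 */
static bool pattern_list_add(struct pattern **head, const char *text)
{
	struct pattern *p = malloc(sizeof(*p));

	if (p == NULL)
		return false;
	if (regcomp(&p->exp, text, REG_NOSUB | REG_EXTENDED) != 0) {
		free(p);	/* never link an uncompiled regex_t */
		return false;
	}
	p->next = *head;
	*head = p;
	return true;
}

/* True if `desc` matches at least one pattern in the list. */
static bool pattern_list_match(const struct pattern *head, const char *desc)
{
	for (; head != NULL; head = head->next)
		if (regexec(&head->exp, desc, 0, NULL, 0) == 0)
			return true;
	return false;
}

/* Release every compiled pattern and its list node. */
static void pattern_list_free(struct pattern *head)
{
	while (head != NULL) {
		struct pattern *next = head->next;

		regfree(&head->exp);
		free(head);
		head = next;
	}
}
```

Returning at the first match also means a description that matches several patterns is reported once, which may or may not be the behaviour wanted for the GUID dump.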


From yunhong.jiang at intel.com  Wed Feb 11 22:56:11 2009
From: yunhong.jiang at intel.com (Jiang, Yunhong)
Date: Thu, 12 Feb 2009 14:56:11 +0800
Subject: ***SPAM*** RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
Message-ID: <E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>

Sorry, it seems the original mail had already tried the permissive setting :$
So how does the card execute the QUERY_FW command? Through config space or through MMIO? The following information is strange: why are all the MMIO ranges disabled?

      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]

As for the following message, I think it should be harmless, since domU has no config space access method.
 PCI: Fatal: No PCI config space access function found

Thanks
Yunhong Jiang
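
As far as I can tell, the "(0000 -> 0002)" and "(0140 -> 0142)" pairs in the "PCI: Enabling device" lines are the PCI command register before and after the device is enabled, and lspci marks a memory region "[disabled]" when the memory-space bit is clear at the time it reads the register. A minimal C sketch for decoding those values follows; the bit positions come from the PCI specification, while the macro and function names are my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits of the 16-bit PCI command register (config space offset 0x04). */
#define PCI_CMD_IO_SPACE      0x0001u	/* respond to I/O space accesses */
#define PCI_CMD_MEMORY_SPACE  0x0002u	/* respond to memory (MMIO) accesses */
#define PCI_CMD_BUS_MASTER    0x0004u	/* device may initiate DMA */
#define PCI_CMD_PARITY        0x0040u	/* parity error response */
#define PCI_CMD_SERR          0x0100u	/* SERR# enable */

/* True when the device decodes accesses to its memory BARs. */
static bool pci_cmd_mmio_enabled(uint16_t command)
{
	return (command & PCI_CMD_MEMORY_SPACE) != 0;
}

/* True when the device is allowed to act as a bus master. */
static bool pci_cmd_bus_master(uint16_t command)
{
	return (command & PCI_CMD_BUS_MASTER) != 0;
}
```

Read this way, 0x0002 has only the memory-space bit set, and 0x0142 adds the parity and SERR# bits; the bus-master bit is not set in either value shown in the logs.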

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:43
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

oops missed it,

Well, now I don't see that "enable permissive..." message. Here are the messages I got in dom0 while booting domU:

tap tap-1-51712: 2 getting info
pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:0e:00.0 to 64
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state

Do any of these messages look suspicious?
Any idea why I get the following in the domU bootup messages?

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.

~subbu

On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
So any changes in dom0's dmesg?


________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 13:52
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

no luck !
 dmesg in XEN PV guest shows :

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive

I am getting the following messages on the console as part of the initial bootup messages of the guest:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

after executing the following in dom0 :
#xm create -c rhel52_64_3


so the problem persists,

~subbu


2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
It seems the PCI frontend tries to write to a configuration space field for which PCIback has no config_field entry.
I think you can first try what dom0's dmesg suggests: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think).

BTW, where did you get the following log? It seems to suggest that the config space access function was not found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found."

-- Yunhong Jiang
________________________________
From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl
Sent: February 11, 2009 22:18
To: David Brown
Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen paravirtualized guest and the pciback module.

No one seems to have tried answering this question on the list, let me ping xen-devel and ofed people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
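
The three echo commands above each write the PCI address, without a trailing newline, to a driver attribute under /sys/bus/pci/drivers. The same sequence could be scripted from C roughly as follows. This is only a sketch: write_attr and pci_driver_attr are hypothetical helper names, and the caller is assumed to pass the same driver and attribute names used in the commands above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write `value` with no trailing newline (like `echo -n`) to an
 * attribute file. Returns 0 on success, -1 on failure.
 */
static int write_attr(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	size_t len = strlen(value);
	int ok;

	if (f == NULL)
		return -1;
	ok = fwrite(value, 1, len, f) == len;
	if (fclose(f) != 0)	/* sysfs reports write errors at flush time */
		ok = 0;
	return ok ? 0 : -1;
}

/* Build "<drivers_root>/<driver>/<attr>" and write the PCI address to it. */
static int pci_driver_attr(const char *drivers_root, const char *driver,
			   const char *attr, const char *bdf)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/%s/%s",
			 drivers_root, driver, attr);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path would have been truncated */
	return write_attr(path, bdf);
}
```

For example, pci_driver_attr("/sys/bus/pci/drivers", "ib_mthca", "unbind", "0000:0e:00.0") corresponds to the first command; checking each return value makes it obvious which step of the rebinding failed.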

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled


some more details - [root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root at p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
Okay so my question to the openfabrics guys is, why would the OFED
drivers fail to read the firmware?

Any thoughts?

Thanks,
- David Brown


---------- Forwarded message ----------
From: David Brown <dmlb2000 at gmail.com>
Date: Thu, Sep 11, 2008 at 2:24 PM
Subject: pciback module not working
To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com


This issue was brought up about a year and a half ago. So I'll bring
it up again and see if anything happens.

I've got an infiniband network and am attempting to pass the
infiniband card through the host and give it to the guest.
I'm working with standard CentOS 5.2 on both guest and host with their
provided xen (3.0.3 ish). I've also attempted to install the newest
Xen 3.3 and use their standard host kernel and that did the same
thing. The guest dmesg output in the guest is similar on both
permissive and normal mode.

I'm getting issues with detecting the firmware on the card for some reason...

Any help would be appreciated.

Thanks,
- David Brown

=== GUEST dmesg output ===
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11
=======================

=== Host modprobe.conf ===
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
options pciback hide=(41:00.0)
=====================

=== Host lspci output ===
# lspci -vs 41:00.0
41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================

This makes sure it get loaded first off before anything else.
=== Host mkinitrd cmd ===
# mkinitrd -f --with=pciback --preload pciback
/boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
====================

=== Host pciback dmesg ===
pciback 0000:41:00.0: Driver tried to write to a read-only
configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of
your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
======================

=== Host pciback dmesg (after setting it permissive) ===
pciback 0000:41:00.0: enabling permissive mode configuration space accesses!
pciback 0000:41:00.0: permissive mode is potentially unsafe!
pciback: vpci: 0000:41:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
=========================================

=== Guest lspci output ===
# lspci -v
00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"



From subbukl at gmail.com  Wed Feb 11 22:58:59 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 12 Feb 2009 12:28:59 +0530
Subject: ***SPAM*** Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
Message-ID: <f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>

So will getting PCI config space access in domU solve the problem? If so,
how can I achieve that?

~subbu

On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:

>  Sorry, it seems the original mail has already tried permissive mode :$
> So how will the card do the QUERY_FW command? Through config space
> or through MMIO? The following information is somewhat strange: why are all
> the MMIO ranges disabled?
>
>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>
> As for the following message, I think it should be harmless, since domU
> has no config space access method of its own.
>  PCI: Fatal: No PCI config space access function found
>
> Thanks
> Yunhong Jiang
>
>  ------------------------------
> *From:* subbu kl [mailto:subbukl at gmail.com]
> *Sent:* February 12, 2009 14:43
>
> *To:* Jiang, Yunhong
> *Cc:* David Brown; xen-devel at lists.xensource.com;
> general at lists.openfabrics.org
> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
> working
>
> Oops, missed it.
>
> Well, now I don't see that "enable permissive..." message. Here are the
> messages I got in dom0 while booting domU:
>
> tap tap-1-51712: 2 getting info
> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
> device vif1.0 entered promiscuous mode
> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> PCI: Setting latency timer of device 0000:0e:00.0 to 64
> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
> xenbr0: topology change detected, propagating
> xenbr0: port 3(vif1.0) entering forwarding state
>
> Any suspicious message?
> Any idea why I get these:
>  PCI: Fatal: No PCI config space access function found
> rtc: IRQ 8 is not free.
>
> messages during domU bootup?
>
> ~subbu
>
> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
>
>>  So any changes in dom0's dmesg?
>>
>>
>>  ------------------------------
>> *From:* subbu kl [mailto:subbukl at gmail.com]
>> *Sent:* February 12, 2009 13:52
>> *To:* Jiang, Yunhong
>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>> general at lists.openfabrics.org
>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>> working
>>
>>   No luck!
>>  dmesg in the Xen PV guest shows:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> ib_mthca: Initializing 0000:00:00.0
>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>
>> even after executing the following in dom0:
>>
>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
>>
>> I am getting the following messages on the console as part of the initial
>> bootup messages of the guest:
>>
>> Started domain rhel52_64_3
>> PCI: Fatal: No PCI config space access function found
>> rtc: IRQ 8 is not free.
>> i8042.c: No controller found.
>>
>> after executing the following in dom0 :
>> #xm create -c rhel52_64_3
>>
>>
>> So the problem persists.
>>
>> ~subbu
>>
>>
>> 2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
>>
>>>  It seems the PCI frontend tries to write a configuration space field
>>> for which PCIback has no config_field entry.
>>> I think you can first try what dom0's dmesg suggests: "see
>>> permissive attribute in sysfs" (it should be "set permissive attribute...",
>>> I think).
>>>
>>> BTW, where did you get the following log? It seems to suggest that the
>>> config space access function was not found.
>>>
>>> PCI: Fatal: No PCI config space access function found
>>> rtc: IRQ 8 is not free.
>>> i8042.c: No controller found."
>>>
>>> -- Yunhong Jiang
>>>
>>>  ------------------------------
>>> *From:* xen-devel-bounces at lists.xensource.com [mailto:
>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
>>> *Sent:* February 11, 2009 22:18
>>> *To:* David Brown
>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not working
>>>
>>>   I am getting the same QUERY_FW failure on RHEL5.2 with a Xen
>>> paravirtualized guest using the pciback module.
>>>
>>> No one seems to have tried answering this question on the list, let me
>>> ping xen-devel and ofed people again.
>>>
>>> after executing in dom0
>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>>>
>>> #dmesg
>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>> tap tap-1-51712: 2 getting info
>>> tap tap-2-51712: 2 getting info
>>> pciback 0000:0e:00.0: seizing device
>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>
>>> #xm create -c rhel52_64_3
>>>
>>> PCI: Fatal: No PCI config space access function found
>>> rtc: IRQ 8 is not free.
>>> i8042.c: No controller found.
>>>
>>>
>>> GUEST dmesg:
>>>
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>> ib_mthca: Initializing 0000:00:00.0
>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>
>>> in dom0:
>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to
>>> virtual slot 0
>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not
>>> ready
>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9,
>>> protocol 1 (x86_64-abi)
>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write
>>> to a read-only configuration space field at offset 0x44, size 2. This may be
>>> harmless, but if you have problems with your device:
>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing
>>> list along with details of your device obtained from lspci.
>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
>>> 0002)
>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI
>>> 16 (level, low) -> IRQ 16
>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0
>>> disabled
>>>
>>>
>>>
>>> some more details - [root at p128 ~]# rpm -qa | grep xen
>>> kernel-xen-2.6.18-92.1.22.el5
>>> xen-3.0.3-64.el5_2.9
>>> xen-libs-3.0.3-64.el5_2.9
>>> xen-libs-3.0.3-64.el5_2.9
>>>
>>> [root at p128 ~]# ibv_devinfo
>>> hca_id: mthca0
>>>         fw_ver:                         5.3.0
>>>         node_guid:                      0002:c902:0022:cd48
>>>         sys_image_guid:                 0002:c902:0022:cd4b
>>>         vendor_id:                      0x02c9
>>>         vendor_part_id:                 25218
>>>         hw_ver:                         0x20
>>>         board_id:                       MT_0370130002
>>>         phys_port_cnt:                  2
>>>                 port:   1
>>>                         state:                  PORT_INIT (2)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>                 port:   2
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>
>>> any help greatly appreciated.
>>>
>>> ~subbu
>>>
>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
>>>
>>>> Okay so my question to the openfabrics guys is, why would the OFED
>>>> drivers fail to read the firmware?
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> - David Brown
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: David Brown <dmlb2000 at gmail.com>
>>>> Date: Thu, Sep 11, 2008 at 2:24 PM
>>>> Subject: pciback module not working
>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>>>
>>>>
>>>> This issue was brought up about a year and a half ago. So I'll bring
>>>> it up again and see if anything happens.
>>>>
>>>> I've got an infiniband network and am attempting to pass the
>>>> infiniband card through the host and give it to the guest.
>>>> I'm working with standard CentOS 5.2 on both guest and host with their
>>>> provided xen (3.0.3 ish). I've also attempted to install the newest
>>>> Xen 3.3 and use their standard host kernel and that did the same
>>>> thing. The guest dmesg output in the guest is similar on both
>>>> permissive and normal mode.
>>>>
>>>> I'm getting issues with detecting the firmware on the card for some
>>>> reason...
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks,
>>>> - David Brown
>>>>
>>>> === GUEST dmesg output ===
>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>> ib_mthca: Initializing 0000:00:00.0
>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>> =======================
>>>>
>>>> === Host modprobe.conf ===
>>>> alias eth0 bnx2
>>>> alias eth1 bnx2
>>>> alias scsi_hostadapter cciss
>>>> options pciback hide=(41:00.0)
>>>> =====================
>>>>
>>>> === Host lspci output ===
>>>> # lspci -vs 41:00.0
>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>> HCA] (rev 20)
>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>       Flags: fast devsel, IRQ 16
>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>       Capabilities: [40] Power Management version 2
>>>>       Capabilities: [48] Vital Product Data
>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
>>>> Enable-
>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>> =====================
>>>>
>>>> This makes sure it get loaded first off before anything else.
>>>> === Host mkinitrd cmd ===
>>>> # mkinitrd -f --with=pciback --preload pciback
>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>>>> ====================
>>>>
>>>> === Host pciback dmesg ===
>>>> pciback 0000:41:00.0: Driver tried to write to a read-only
>>>> configuration space field at offset 0x44, size 2. This may be
>>>> harmless, but if you have problems with your device:
>>>> 1) see permissive attribute in sysfs
>>>> 2) report problems to the xen-devel mailing list along with details of
>>>> your device obtained from lspci.
>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>> ======================
>>>>
>>>> === Host pciback dmesg (after setting it permissive) ===
>>>> pciback 0000:41:00.0: enabling permissive mode configuration space
>>>> accesses!
>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>>>> device vif1.0 entered promiscuous mode
>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>> =========================================
>>>>
>>>> === Guest lspci output ===
>>>> # lspci -v
>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>> HCA] (rev 20)
>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>       Flags: fast devsel, IRQ 16
>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>       Capabilities: [40] Power Management version 2
>>>>       Capabilities: [48] Vital Product Data
>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>> Queue=0/5 Enable-
>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>> =====================
>>>> _______________________________________________
>>>> general mailing list
>>>> general at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>
>>>> To unsubscribe, please visit
>>>> http://openib.org/mailman/listinfo/openib-general
>>>>
>>>
>>>
>>>
>>> --
>>> . . . s u b b u
>>> "You've got to be original, because if you're like someone else, what do
>>> they need you for?"
>>>
>>>
>>
>>
>> --
>> . . . s u b b u
>> "You've got to be original, because if you're like someone else, what do
>> they need you for?"
>>
>>
>
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what do
> they need you for?"
>
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090212/2bef1e87/attachment.html>

From yunhong.jiang at intel.com  Wed Feb 11 23:00:31 2009
From: yunhong.jiang at intel.com (Jiang, Yunhong)
Date: Thu, 12 Feb 2009 15:00:31 +0800
Subject: ***SPAM*** RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
Message-ID: <E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>

DomU accesses config space through the PCI backend (pciback), so that message is OK.
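
For reference, here is a minimal sketch of the dom0-side sequence discussed in this thread, assuming the sysfs paths quoted in the earlier messages (ib_mthca/unbind, pciback/new_slot, pciback/bind, pciback/permissive). The helper names passthrough_steps and apply_steps are hypothetical, and writing permissive after bind is an assumption taken from the order in which the commands were run above. It only prints the writes unless dry_run is disabled, and it does not address the QUERY_FW failure itself.

```python
# Sketch only: builds the ordered sysfs writes used in this thread to hand a
# PCI device to pciback in dom0. Helper names are hypothetical.
DRIVERS = "/sys/bus/pci/drivers"


def passthrough_steps(bdf, old_driver="ib_mthca", permissive=True):
    """Return (sysfs_file, value) pairs in the order they would be written."""
    steps = [
        (f"{DRIVERS}/{old_driver}/unbind", bdf),   # detach the native driver
        (f"{DRIVERS}/pciback/new_slot", bdf),      # tell pciback about the slot
        (f"{DRIVERS}/pciback/bind", bdf),          # let pciback seize the device
    ]
    if permissive:
        # Assumption: permissive is set once pciback owns the device.
        steps.append((f"{DRIVERS}/pciback/permissive", bdf))
    return steps


def apply_steps(steps, dry_run=True):
    """Print the equivalent echo commands, or perform the writes as root."""
    for path, value in steps:
        if dry_run:
            print(f"echo -n {value} > {path}")
        else:
            with open(path, "w") as handle:
                handle.write(value)


if __name__ == "__main__":
    apply_steps(passthrough_steps("0000:0e:00.0"))
```

Running it in dry-run mode reproduces the echo commands from the earlier mail, which makes it easy to compare against what was actually typed in dom0.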

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:59
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

So will getting PCI config space access in domU solve the problem? If so, how can I achieve that?

~subbu

On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>> wrote:
Sorry, it seems the original mail has already tried permissive mode :$
So how will the card do the QUERY_FW command? Through config space or through MMIO? The following information is somewhat strange: why are all the MMIO ranges disabled?

      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]

As for the following message, I think it should be harmless, since domU has no config space access method of its own.
 PCI: Fatal: No PCI config space access function found

Thanks
Yunhong Jiang

________________________________
From: subbu kl [mailto:subbukl at gmail.com<mailto:subbukl at gmail.com>]
Sent: February 12, 2009 14:43

To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com<mailto:xen-devel at lists.xensource.com>; general at lists.openfabrics.org<mailto:general at lists.openfabrics.org>
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

Oops, missed it.

Well, now I don't see that "enable permissive..." message. Here are the messages I got in dom0 while booting domU:

tap tap-1-51712: 2 getting info
pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:0e:00.0 to 64
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state

Any suspicious message?
Any idea why I get these:
 PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.

messages during domU bootup?

~subbu

On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>> wrote:
So any changes in dom0's dmesg?


________________________________
From: subbu kl [mailto:subbukl at gmail.com<mailto:subbukl at gmail.com>]
Sent: February 12, 2009 13:52
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com<mailto:xen-devel at lists.xensource.com>; general at lists.openfabrics.org<mailto:general at lists.openfabrics.org>
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

No luck!
 dmesg in the Xen PV guest shows:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive

I am getting the following messages on the console as part of the initial bootup messages of the guest:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

after executing the following in dom0 :
#xm create -c rhel52_64_3


So the problem persists.

~subbu


2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>>
It seems the PCI frontend tries to write a configuration space field for which PCIback has no config_field entry.
I think you can first try what dom0's dmesg suggests: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think).

BTW, where did you get the following log? It seems to suggest that the config space access function was not found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found."

-- Yunhong Jiang
________________________________
From: xen-devel-bounces at lists.xensource.com<mailto:xen-devel-bounces at lists.xensource.com> [mailto:xen-devel-bounces at lists.xensource.com<mailto:xen-devel-bounces at lists.xensource.com>] On Behalf Of subbu kl
Sent: February 11, 2009 22:18
To: David Brown
Cc: xen-devel at lists.xensource.com<mailto:xen-devel at lists.xensource.com>; general at lists.openfabrics.org<mailto:general at lists.openfabrics.org>
Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

I am getting the same QUERY_FW failure on RHEL5.2 with a Xen paravirtualized guest using the pciback module.

No one seems to have tried answering this question on the list, let me ping xen-devel and ofed people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled


some more details - [root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root at p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com<mailto:dmlb2000 at gmail.com>> wrote:
Okay so my question to the openfabrics guys is, why would the OFED
drivers fail to read the firmware?

Any thoughts?

Thanks,
- David Brown


---------- Forwarded message ----------
From: David Brown <dmlb2000 at gmail.com<mailto:dmlb2000 at gmail.com>>
Date: Thu, Sep 11, 2008 at 2:24 PM
Subject: pciback module not working
To: xen-users at lists.xensource.com<mailto:xen-users at lists.xensource.com>, xen-devel at lists.xensource.com<mailto:xen-devel at lists.xensource.com>


This issue was brought up about a year and a half ago. So I'll bring
it up again and see if anything happens.

I've got an infiniband network and am attempting to pass the
infiniband card through the host and give it to the guest.
I'm working with standard CentOS 5.2 on both guest and host with their
provided xen (3.0.3 ish). I've also attempted to install the newest
Xen 3.3 and use their standard host kernel and that did the same
thing. The guest dmesg output in the guest is similar on both
permissive and normal mode.

I'm getting issues with detecting the firmware on the card for some reason...

Any help would be appreciated.

Thanks,
- David Brown

=== GUEST dmesg output ===
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11
=======================

=== Host modprobe.conf ===
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
options pciback hide=(41:00.0)
=====================

=== Host lspci output ===
# lspci -vs 41:00.0
41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================

This makes sure it gets loaded first, before anything else.
=== Host mkinitrd cmd ===
# mkinitrd -f --with=pciback --preload pciback
/boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
====================

=== Host pciback dmesg ===
pciback 0000:41:00.0: Driver tried to write to a read-only
configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of
your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
======================

=== Host pciback dmesg (after setting it permissive) ===
pciback 0000:41:00.0: enabling permissive mode configuration space accesses!
pciback 0000:41:00.0: permissive mode is potentially unsafe!
pciback: vpci: 0000:41:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
=========================================

=== Guest lspci output ===
# lspci -v
00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090212/1f1af14e/attachment.html>

From ogerlitz at voltaire.com  Wed Feb 11 23:16:07 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Feb 2009 09:16:07 +0200
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used
	for 	bind
In-Reply-To: <798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
	<798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com>
Message-ID: <4993CCB7.6070203@voltaire.com>

Sean Hefty wrote:
> Not yet - but should be able to look into it by the end of the week.  From what
> Jason said, it sounds like ip_dev_find() doesn't behave like I was expecting. 
>   
OK, thanks for the update.

Or.


From wangwhao at cn.ibm.com  Wed Feb 11 23:37:17 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Thu, 12 Feb 2009 15:37:17 +0800
Subject: [ofa-general] sminfo report iberror in the first configuration on
	RHEL5.3
Message-ID: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>


Hi all:

I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in
the RHEL5.3 image) with "yum groupinstall". Then I loaded some drivers and wrote
the network interface configuration file ifcfg-ib0. "ifup ib0" also succeeded,
but the IB utilities report "Connection timed out".


[root at xblade06 network-scripts]# sminfo
ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
sminfo: iberror: failed: query

I had to reboot the blade and rerun "openibd start". After that, sminfo reported
correct contents. I would not expect this reboot to be required. Did I miss a
configuration step?

Moreover, "openibd start" reports one warning message about hwconf. Does anyone
have comments about this?

[root at xblade07 ~]# /etc/init.d/openibd start
Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or directory
                                                           [  OK  ]

Thanks a lot!

Wen Hao Wang
Email: wangwhao at cn.ibm.com

From kliteyn at dev.mellanox.co.il  Wed Feb 11 23:42:39 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 12 Feb 2009 09:42:39 +0200
Subject: [ofa-general] fat-tree CN nodes?
In-Reply-To: <49933588.6050607@ifi.uio.no>
References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org>
	<49933588.6050607@ifi.uio.no>
Message-ID: <4993D2EF.8030009@dev.mellanox.co.il>

Hi Frank,

Frank Olaf Sem-Jacobsen wrote:
> Right, so it has no connection with any topological properties of the fat 
> tree?

You're right, the term "Compute Node" by itself has no connection
to the topological properties of the fat tree. However, fat-tree
routing has some constraints on the topology, and one of these
constraints is that all the compute nodes are required to be
located at the same topological level of the tree (same rank).
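
To illustrate the rank constraint, here is a minimal sketch in Python. It is
not OpenSM code: the function name, the rank_of mapping, and the node names
are all hypothetical, and it only checks that every node marked as a compute
node carries the same rank value.

```python
def compute_nodes_share_rank(rank_of, compute_nodes):
    """Return True if every listed compute node has the same rank.

    rank_of is a hypothetical mapping from node name to its level in the
    tree; compute_nodes is the list of names treated as compute nodes.
    """
    ranks = {rank_of[name] for name in compute_nodes}
    return len(ranks) <= 1


# Hypothetical topology used only for illustration.
rank_of = {
    "sw-root": 0,
    "sw-leaf-a": 1,
    "sw-leaf-b": 1,
    "cn-1": 2,
    "cn-2": 2,
    "io-1": 1,  # a host hanging directly off the root switch
}

print(compute_nodes_share_rank(rank_of, ["cn-1", "cn-2"]))  # same level
print(compute_nodes_share_rank(rank_of, ["cn-1", "io-1"]))  # mixed levels
```

With this toy data, marking only cn-1 and cn-2 as compute nodes satisfies
the check, while adding io-1 to the list does not.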

> Which again means that the definition of compute nodes is only 
> necessary for the ability to balance these separately in the tree?

Right again, this is what the fat-tree routing does.

-- Yevgeny

> Thanks for your answer,
> 
> Cameron Harr wrote:
>> Hi Frank,
>> A compute node is a computer/server that is generally dedicated to 
>> doing computational work in a cluster or group of computers.
>> Cameron
>>
>> Frank Olaf Sem-Jacobsen wrote:
>>> Hi,
>>>
>>> I have been looking into the fat tree code, and I was wondering about 
>>> the definition of a compute node (CN). Are these part of the leaf 
>>> switches at the bottom of the fat tree, or are they extra switches 
>>> that are connected to the fat tree, e.g. the switch in a rack of 
>>> blades which is again connected to the fat tree?
>>>
>>> Appreciate the help,
> 
> 


From subbukl at gmail.com  Wed Feb 11 23:45:57 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 12 Feb 2009 13:15:57 +0530
Subject: ***SPAM*** Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>
Message-ID: <f3b32c250902112345v2e46dc93g9ff086d8159ceb6@mail.gmail.com>

So, back to square one?
Why should QUERY_FW fail in domU?

~subbu

On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:

>  DomU accesses config space through the PCI backend, so that message is OK.
>
>  ------------------------------
> *From:* subbu kl [mailto:subbukl at gmail.com]
> *Sent:* February 12, 2009 14:59
>
> *To:* Jiang, Yunhong
> *Cc:* David Brown; xen-devel at lists.xensource.com;
> general at lists.openfabrics.org
> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
> working
>
> So will getting PCI config space access in domU solve the problem? If so,
> how can I achieve that?
>
> ~subbu
>
> On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
>
>>  Sorry, it seems the original mail had already tried permissive mode.
>> So how does the card issue the QUERY_FW command: through config
>> space or through MMIO? The following information is strange: why are all
>> the MMIO ranges disabled?
>>
>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>
>> As for the following message, I think it should be harmless, since domU
>> has no method of config space access.
>>   PCI: Fatal: No PCI config space access function found
>>
>> Thanks
>> Yunhong Jiang
>>
>>  ------------------------------
>>  *From:* subbu kl [mailto:subbukl at gmail.com]
>> *Sent:* February 12, 2009 14:43
>>
>> *To:* Jiang, Yunhong
>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>> general at lists.openfabrics.org
>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>> working
>>
>>   Oops, missed it.
>>
>> Well, now I don't see that "enabling permissive mode..." message. Here are
>> the messages I got in dom0 while booting domU:
>>
>> tap tap-1-51712: 2 getting info
>> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
>> device vif1.0 entered promiscuous mode
>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
>> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>> PCI: Setting latency timer of device 0000:0e:00.0 to 64
>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
>> xenbr0: topology change detected, propagating
>> xenbr0: port 3(vif1.0) entering forwarding state
>>
>> Any suspicious messages?
>> Any idea why I get these:
>>  PCI: Fatal: No PCI config space access function found
>> rtc: IRQ 8 is not free.
>>
>> messages during domU bootup?
>>
>> ~subbu
>>
>> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
>>
>>>  So any changes in dom0's dmesg?
>>>
>>>
>>>  ------------------------------
>>> *From:* subbu kl [mailto:subbukl at gmail.com]
>>> *Sent:* February 12, 2009 13:52
>>> *To:* Jiang, Yunhong
>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>> general at lists.openfabrics.org
>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>> working
>>>
>>>   No luck!
>>>  dmesg in the Xen PV guest shows:
>>>
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>> ib_mthca: Initializing 0000:00:00.0
>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>
>>> even after executing the following in dom0:
>>>
>>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
>>>
>>> I am getting the following messages on the console as part of the initial
>>> bootup messages of the guest:
>>>
>>> Started domain rhel52_64_3
>>> PCI: Fatal: No PCI config space access function found
>>> rtc: IRQ 8 is not free.
>>> i8042.c: No controller found.
>>>
>>> after executing the following in dom0 :
>>> #xm create -c rhel52_64_3
>>>
>>>
>>> So the problem persists.
>>>
>>> ~subbu
>>>
>>>
>>> 2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
>>>
>>>>  It seems the PCI frontend tries to write a configuration space field
>>>> for which pciback has no config_field entry.
>>>> I think you can first try what dom0's dmesg suggests: "see
>>>> permissive attribute in sysfs" (it should say "set permissive attribute...",
>>>> I think).
>>>>
>>>> BTW, where did you get the following log? It seems to suggest that no
>>>> config space access function was found.
>>>>
>>>> PCI: Fatal: No PCI config space access function found
>>>> rtc: IRQ 8 is not free.
>>>> i8042.c: No controller found.
>>>>
>>>> -- Yunhong Jiang
>>>>
>>>>  ------------------------------
>>>> *From:* xen-devel-bounces at lists.xensource.com [mailto:
>>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
>>>> *Sent:* February 11, 2009 22:18
>>>> *To:* David Brown
>>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
>>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>> working
>>>>
>>>>   I am getting the same QUERY_FW failure on RHEL5.2 with a Xen
>>>> paravirtualized guest and the pciback module.
>>>>
>>>> No one seems to have tried answering this question on the list, so let me
>>>> ping the xen-devel and OFED people again.
>>>>
>>>> after executing in dom0
>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>>>>
>>>> #dmesg
>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>> tap tap-1-51712: 2 getting info
>>>> tap tap-2-51712: 2 getting info
>>>> pciback 0000:0e:00.0: seizing device
>>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>>
>>>> #xm create -c rhel52_64_3
>>>>
>>>> PCI: Fatal: No PCI config space access function found
>>>> rtc: IRQ 8 is not free.
>>>> i8042.c: No controller found.
>>>>
>>>>
>>>> GUEST dmesg:
>>>>
>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>>> ib_mthca: Initializing 0000:00:00.0
>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>
>>>> in dom0:
>>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
>>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to
>>>> virtual slot 0
>>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
>>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not
>>>> ready
>>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9,
>>>> protocol 1 (x86_64-abi)
>>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write
>>>> to a read-only configuration space field at offset 0x44, size 2. This may be
>>>> harmless, but if you have problems with your device:
>>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
>>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing
>>>> list along with details of your device obtained from lspci.
>>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
>>>> 0002)
>>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI
>>>> 16 (level, low) -> IRQ 16
>>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0
>>>> disabled
>>>>
>>>>
>>>>
>>>> some more details - [root at p128 ~]# rpm -qa | grep xen
>>>> kernel-xen-2.6.18-92.1.22.el5
>>>> xen-3.0.3-64.el5_2.9
>>>> xen-libs-3.0.3-64.el5_2.9
>>>> xen-libs-3.0.3-64.el5_2.9
>>>>
>>>> [root at p128 ~]# ibv_devinfo
>>>> hca_id: mthca0
>>>>         fw_ver:                         5.3.0
>>>>         node_guid:                      0002:c902:0022:cd48
>>>>         sys_image_guid:                 0002:c902:0022:cd4b
>>>>         vendor_id:                      0x02c9
>>>>         vendor_part_id:                 25218
>>>>         hw_ver:                         0x20
>>>>         board_id:                       MT_0370130002
>>>>         phys_port_cnt:                  2
>>>>                 port:   1
>>>>                         state:                  PORT_INIT (2)
>>>>                         max_mtu:                2048 (4)
>>>>                         active_mtu:             512 (2)
>>>>                         sm_lid:                 0
>>>>                         port_lid:               0
>>>>                         port_lmc:               0x00
>>>>
>>>>                 port:   2
>>>>                         state:                  PORT_DOWN (1)
>>>>                         max_mtu:                2048 (4)
>>>>                         active_mtu:             512 (2)
>>>>                         sm_lid:                 0
>>>>                         port_lid:               0
>>>>                         port_lmc:               0x00
>>>>
>>>>
>>>> any help greatly appreciated.
>>>>
>>>> ~subbu
>>>>
>>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
>>>>
>>>>> Okay so my question to the openfabrics guys is, why would the OFED
>>>>> drivers fail to read the firmware?
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Thanks,
>>>>> - David Brown
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: David Brown <dmlb2000 at gmail.com>
>>>>> Date: Thu, Sep 11, 2008 at 2:24 PM
>>>>> Subject: pciback module not working
>>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>>>>
>>>>>
>>>>> This issue was brought up about a year and a half ago. So I'll bring
>>>>> it up again and see if anything happens.
>>>>>
>>>>> I've got an infiniband network and am attempting to pass the
>>>>> infiniband card through the host and give it to the guest.
>>>>> I'm working with standard CentOS 5.2 on both guest and host with their
>>>>> provided xen (3.0.3 ish). I've also attempted to install the newest
>>>>> Xen 3.3 and use their standard host kernel and that did the same
>>>>> thing. The guest dmesg output in the guest is similar on both
>>>>> permissive and normal mode.
>>>>>
>>>>> I'm getting issues with detecting the firmware on the card for some
>>>>> reason...
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks,
>>>>> - David Brown
>>>>>
>>>>> === GUEST dmesg output ===
>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>> =======================
>>>>>
>>>>> === Host modprobe.conf ===
>>>>> alias eth0 bnx2
>>>>> alias eth1 bnx2
>>>>> alias scsi_hostadapter cciss
>>>>> options pciback hide=(41:00.0)
>>>>> =====================
>>>>>
>>>>> === Host lspci output ===
>>>>> # lspci -vs 41:00.0
>>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>> HCA] (rev 20)
>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>       Flags: fast devsel, IRQ 16
>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>> [size=1M]
>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>       Capabilities: [40] Power Management version 2
>>>>>       Capabilities: [48] Vital Product Data
>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
>>>>> Enable-
>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>> =====================
>>>>>
>>>>> This makes sure it get loaded first off before anything else.
>>>>> === Host mkinitrd cmd ===
>>>>> # mkinitrd -f --with=pciback --preload pciback
>>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>>>>> ====================
>>>>>
>>>>> === Host pciback dmesg ===
>>>>> pciback 0000:41:00.0: Driver tried to write to a read-only
>>>>> configuration space field at offset 0x44, size 2. This may be
>>>>> harmless, but if you have problems with your device:
>>>>> 1) see permissive attribute in sysfs
>>>>> 2) report problems to the xen-devel mailing list along with details of
>>>>> your device obtained from lspci.
>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>> ======================
>>>>>
>>>>> === Host pciback dmesg (after setting it permissive) ===
>>>>> pciback 0000:41:00.0: enabling permissive mode configuration space
>>>>> accesses!
>>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>>>>> device vif1.0 entered promiscuous mode
>>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>> =========================================
>>>>>
>>>>> === Guest lspci output ===
>>>>> # lspci -v
>>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>> HCA] (rev 20)
>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>       Flags: fast devsel, IRQ 16
>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>> [size=1M]
>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>       Capabilities: [40] Power Management version 2
>>>>>       Capabilities: [48] Vital Product Data
>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>>> Queue=0/5 Enable-
>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>> =====================
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> general at lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> . . . s u b b u
>>>> "You've got to be original, because if you're like someone else, what do
>>>> they need you for?"
>>>>
>>>>
>>>
>>>
>>> --
>>> . . . s u b b u
>>> "You've got to be original, because if you're like someone else, what do
>>> they need you for?"
>>>
>>>
>>
>>
>> --
>> . . . s u b b u
>> "You've got to be original, because if you're like someone else, what do
>> they need you for?"
>>
>>
>
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what do
> they need you for?"
>
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"

From yunhong.jiang at intel.com  Thu Feb 12 00:01:27 2009
From: yunhong.jiang at intel.com (Jiang, Yunhong)
Date: Thu, 12 Feb 2009 16:01:27 +0800
Subject: ***SPAM*** RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902112345v2e46dc93g9ff086d8159ceb6@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
	<f3b32c250902110618u5edb80b7xaf31e3acdf8d6709@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112345v2e46dc93g9ff086d8159ceb6@mail.gmail.com>
Message-ID: <E2263E4A5B2284449EEBD0AAB751098401C7969B96@PDSMSX501.ccr.corp.intel.com>

Can you please share more information on how ib_mthca performs QUERY_FW? Through config space access or through MMIO access? I think more information would be helpful. The only thing that seems strange to me is that, judging from "Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]", the MMIO seems to be disabled.
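
One way to read the logs in this thread: the pair of hex values in the "PCI: Enabling device ... (0000 -> 0002)" lines is the PCI command register before and after the enable, and bit 1 (0x0002) is the Memory Space enable bit. As far as I know, lspci marks a memory region as [disabled] when it reads that bit as clear. Below is a small Python sketch that decodes those values. The helper names are made up for illustration, and the bit table covers only some of the command register bits.

```python
import re

# Selected bits of the 16-bit PCI command register.
PCI_COMMAND_BITS = {
    0x0001: "io_space",
    0x0002: "memory_space",
    0x0004: "bus_master",
    0x0040: "parity_error_response",
    0x0100: "serr_enable",
    0x0400: "intx_disable",
}

# Matches kernel lines such as:
#   PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ENABLE_RE = re.compile(
    r"PCI: Enabling device (\S+) \(([0-9a-fA-F]{4}) -> ([0-9a-fA-F]{4})\)"
)


def decode_command(value):
    """Return the sorted names of the known bits set in a command value."""
    return sorted(name for mask, name in PCI_COMMAND_BITS.items() if value & mask)


def parse_enable_line(line):
    """Return (device, old_bits, new_bits), or None if the line does not match."""
    match = ENABLE_RE.search(line)
    if match is None:
        return None
    device = match.group(1)
    old_value = int(match.group(2), 16)
    new_value = int(match.group(3), 16)
    return device, decode_command(old_value), decode_command(new_value)


print(parse_enable_line("PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)"))
```

For the dom0 line "(0140 -> 0142)" this shows Memory Space being switched on while Parity Error Response and SERR# stay set; for "(0000 -> 0002)" only Memory Space ends up set.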

Thanks
Yunhong Jiang

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 15:46
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

So, back to square one?
Why should QUERY_FW fail in domU?

~subbu

On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
DomU accesses config space through the PCI backend, so that message is OK.

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:59

To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

So will getting PCI config space access in domU solve the problem? If so, how can I achieve that?

~subbu

On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
Sorry, it seems the original mail had already tried permissive mode.
So how does the card issue the QUERY_FW command: through config space or through MMIO? The following information is strange: why are all the MMIO ranges disabled?

      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]

As for the following message, I think it should be harmless, since domU has no method of config space access.
 PCI: Fatal: No PCI config space access function found

Thanks
Yunhong Jiang

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:43

To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

Oops, missed it.

Well, now I don't see that "enabling permissive mode..." message. Here are the messages I got in dom0 while booting domU:

tap tap-1-51712: 2 getting info
pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:0e:00.0 to 64
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state

Any suspicious messages?
Any idea why I get these:
 PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.

messages during domU bootup?

~subbu

On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
So, any changes in dom0's dmesg?


________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 13:52
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

No luck!
dmesg in the Xen PV guest shows:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive

I am getting the following messages on the console as part of the initial bootup messages of the guest:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

after executing the following in dom0 :
#xm create -c rhel52_64_3


So the problem persists.

~subbu


2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
It seems the PCI frontend tries to write a configuration space field for which pciback has no config_field entry.
I think you can first try what dom0's dmesg suggests: "see permissive attribute in sysfs" (it should say "set permissive attribute...", I think).

BTW, where did you get the following log? It seems to suggest that no config space access function was found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

-- Yunhong Jiang
________________________________
From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl
Sent: February 11, 2009 22:18
To: David Brown
Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

I am getting the same QUERY_FW failure on RHEL5.2 with a Xen paravirtualized guest and the pciback module.

No one seems to have tried answering this question on the list, so let me ping the xen-devel and OFED people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
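
The three writes above can also be scripted. A rough Python sketch follows. The function names are invented here, the sysfs paths are the ones used in this thread (including the permissive attribute mentioned elsewhere in it), and actually applying the writes assumes root in dom0 with the pciback module loaded.

```python
PCI_DRIVERS = "/sys/bus/pci/drivers"


def pciback_rebind_writes(bdf, current_driver, permissive=False):
    """List the (sysfs_path, value) writes, in order, that hand bdf to pciback.

    bdf is the full PCI address, for example "0000:0e:00.0";
    current_driver is the driver that owns the device now, such as ib_mthca.
    """
    writes = [
        (f"{PCI_DRIVERS}/{current_driver}/unbind", bdf),
        (f"{PCI_DRIVERS}/pciback/new_slot", bdf),
        (f"{PCI_DRIVERS}/pciback/bind", bdf),
    ]
    if permissive:
        # Optional step, as suggested by the pciback warning in dom0's dmesg.
        writes.append((f"{PCI_DRIVERS}/pciback/permissive", bdf))
    return writes


def apply_writes(writes):
    """Perform the writes without a trailing newline, like `echo -n`."""
    for path, value in writes:
        with open(path, "w") as handle:
            handle.write(value)


# Dry run: only print what would be written.
for path, value in pciback_rebind_writes("0000:0e:00.0", "ib_mthca"):
    print(value, ">", path)
```

Keeping the list of writes separate from apply_writes makes it possible to review the sequence before touching sysfs.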

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.


GUEST dmesg:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled


some more details - [root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9

[root at p128 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.3.0
        node_guid:                      0002:c902:0022:cd48
        sys_image_guid:                 0002:c902:0022:cd4b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0x20
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
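
For reference, the port states in output like the above can be pulled out with a few lines of Python. This is only a sketch that parses text in the layout shown here; the function name is hypothetical and it does not call any verbs library.

```python
import re

PORT_RE = re.compile(r"^\s*port:\s+(\d+)\s*$")
STATE_RE = re.compile(r"^\s*state:\s+(\w+)\s+\((\d+)\)\s*$")


def port_states(devinfo_text):
    """Map port number to state name from ibv_devinfo-style text."""
    states = {}
    current_port = None
    for line in devinfo_text.splitlines():
        port_match = PORT_RE.match(line)
        if port_match:
            current_port = int(port_match.group(1))
            continue
        state_match = STATE_RE.match(line)
        if state_match and current_port is not None:
            states[current_port] = state_match.group(1)
    return states


sample = """hca_id: mthca0
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_INIT (2)
                port:   2
                        state:                  PORT_DOWN (1)
"""
print(port_states(sample))
```

In the output quoted in this mail, neither port is in the active state: port 1 is PORT_INIT and port 2 is PORT_DOWN.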


any help greatly appreciated.

~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com> wrote:
Okay so my question to the openfabrics guys is, why would the OFED
drivers fail to read the firmware?

Any thoughts?

Thanks,
- David Brown


---------- Forwarded message ----------
From: David Brown <dmlb2000 at gmail.com>
Date: Thu, Sep 11, 2008 at 2:24 PM
Subject: pciback module not working
To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com


This issue was brought up about a year and a half ago. So I'll bring
it up again and see if anything happens.

I've got an infiniband network and am attempting to pass the
infiniband card through the host and give it to the guest.
I'm working with standard CentOS 5.2 on both guest and host with their
provided xen (3.0.3 ish). I've also attempted to install the newest
Xen 3.3 and use their standard host kernel and that did the same
thing. The dmesg output in the guest is similar in both
permissive and normal mode.

I'm getting issues with detecting the firmware on the card for some reason...

Any help would be appreciated.

Thanks,
- David Brown

=== GUEST dmesg output ===
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11
=======================

=== Host modprobe.conf ===
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
options pciback hide=(41:00.0)
=====================

=== Host lspci output ===
# lspci -vs 41:00.0
41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================

This makes sure pciback gets loaded before anything else.
=== Host mkinitrd cmd ===
# mkinitrd -f --with=pciback --preload pciback
/boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
====================

=== Host pciback dmesg ===
pciback 0000:41:00.0: Driver tried to write to a read-only
configuration space field at offset 0x44, size 2. This may be
harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of
your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
======================

=== Host pciback dmesg (after setting it permissive) ===
pciback 0000:41:00.0: enabling permissive mode configuration space accesses!
pciback 0000:41:00.0: permissive mode is potentially unsafe!
pciback: vpci: 0000:41:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:41:00.0 to 64
ACPI: PCI interrupt for device 0000:41:00.0 disabled
=========================================

=== Guest lspci output ===
# lspci -v
00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
      Subsystem: Hewlett-Packard Company Unknown device 170a
      Flags: fast devsel, IRQ 16
      Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
      Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
      Capabilities: [40] Power Management version 2
      Capabilities: [48] Vital Product Data
      Capabilities: [90] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
      Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
      Capabilities: [60] Express Endpoint IRQ 0
=====================
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090212/9f4e8808/attachment.html>

From subbukl at gmail.com  Thu Feb 12 00:20:13 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 12 Feb 2009 13:50:13 +0530
Subject: ***SPAM*** Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <E2263E4A5B2284449EEBD0AAB751098401C7969B96@PDSMSX501.ccr.corp.intel.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C78D73E7@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112345v2e46dc93g9ff086d8159ceb6@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969B96@PDSMSX501.ccr.corp.intel.com>
Message-ID: <f3b32c250902120020y5d73f054nd38d00e3063f67b3@mail.gmail.com>

I did a quick search, and I believe it is MMIO, since

in the file http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_main.c
mthca_QUERY_FW() <http://www.cs.fsu.edu/%7Ebaker/devices/lxr/http/ident?i=mthca_QUERY_FW>
is called,

which in turn leads to
mthca_cmd_post_dbell()/mthca_cmd_post_hcr(), which in turn does
__raw_writel((__force u32) cpu_to_be32(in_param >> 32), ptr + offs[0]);

in the file http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_cmd.c

It would help if the OFED people could comment on whether I have missed
something. Roland, any clue?
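
For reference when reading the logs quoted in this thread: the pair printed
by "PCI: Enabling device ... (0000 -> 0002)" is the PCI command register
before and after the enable, and lspci appears to print "[disabled]" on a
memory BAR when the memory-space bit of that register is clear. The sketch
below only decodes those bits. The bit values are the standard PCI ones; the
struct and function names are hypothetical and do not come from pciback or
mthca.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Low bits of the 16-bit PCI command register (config space offset 0x04). */
#define CMD_IO_SPACE   0x0001u	/* device responds to I/O port accesses */
#define CMD_MEM_SPACE  0x0002u	/* device responds to MMIO (BAR) accesses */
#define CMD_BUS_MASTER 0x0004u	/* device may initiate DMA */

/* Hypothetical helper type: decoded view of the three bits above. */
struct pci_cmd_state {
	int io_enabled;
	int mem_enabled;
	int bus_master;
};

/* Decode a raw command register value such as the "0002" in the logs. */
struct pci_cmd_state pci_cmd_decode(uint16_t cmd)
{
	struct pci_cmd_state s;

	s.io_enabled = (cmd & CMD_IO_SPACE) != 0;
	s.mem_enabled = (cmd & CMD_MEM_SPACE) != 0;
	s.bus_master = (cmd & CMD_BUS_MASTER) != 0;
	return s;
}

/* Print a one-line summary; returns what fprintf() returns. */
int pci_cmd_print(FILE *out, uint16_t cmd)
{
	struct pci_cmd_state s = pci_cmd_decode(cmd);

	assert(out != NULL);
	return fprintf(out, "cmd=0x%04x io=%d mem=%d master=%d\n",
		       (unsigned)cmd, s.io_enabled, s.mem_enabled,
		       s.bus_master);
}
```

Calling pci_cmd_print(stdout, 0x0002) reports memory decoding on and bus
mastering off, which matches the guest value in the logs. Whether the guest
can actually reach the BAR through MMIO is a separate question that this
decode does not answer.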

~subbu

On Thu, Feb 12, 2009 at 1:31 PM, Jiang, Yunhong <yunhong.jiang at intel.com>wrote:

>  Can you please share more information how will the ib_mthca do QUERY_FW?
> Through config space access? Through MMIO access? I think more information
> will be helpful. The only thing seems strange to me is, from "Memory at
> fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]" , seems the MMIO
> is disabled?
>
> Thanks
> Yunhong Jiang
>
>  ------------------------------
> *From:* subbu kl [mailto:subbukl at gmail.com]
> *Sent:* February 12, 2009 15:46
>
> *To:* Jiang, Yunhong
> *Cc:* David Brown; xen-devel at lists.xensource.com;
> general at lists.openfabrics.org
> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
> working
>
> so back to square one ?
> Why QUERY_FW should fail in domU ?
>
> ~subbu
>
> On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong <yunhong.jiang at intel.com>wrote:
>
>>  DomU access config space through pcibackend, so that message is ok.
>>
>>  ------------------------------
>>  *From:* subbu kl [mailto:subbukl at gmail.com]
>> *Sent:* February 12, 2009 14:59
>>
>> *To:* Jiang, Yunhong
>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>> general at lists.openfabrics.org
>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>> working
>>
>>   So getting PCI config space access in domU will solve the problem ? if
>> so how can I achieve that ?
>>
>> ~subbu
>>
>> On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <yunhong.jiang at intel.com
>> > wrote:
>>
>>>  Sorry, it seems the original mail had already tried permissive mode :$
>>> So how does the card perform the QUERY_FW command? Through config
>>> space or through MMIO? The following information is strange: why are all
>>> the MMIO ranges disabled?
>>>
>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>
>>> As for the following message, I think it should be harmless, since
>>> domU has no direct config space access method.
>>>   PCI: Fatal: No PCI config space access function found
>>>
>>> Thanks
>>> Yunhong Jiang
>>>
>>>  ------------------------------
>>>  *From:* subbu kl [mailto:subbukl at gmail.com]
>>> *Sent:* February 12, 2009 14:43
>>>
>>> *To:* Jiang, Yunhong
>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>> general at lists.openfabrics.org
>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>> working
>>>
>>>   oops missed it,
>>>
>>> well now I dont see that enable permissive...message. here goes the
>>> messages what I got in dom0 while booting domU
>>>
>>> tap tap-1-51712: 2 getting info
>>> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
>>> device vif1.0 entered promiscuous mode
>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
>>> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>> PCI: Setting latency timer of device 0000:0e:00.0 to 64
>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
>>> xenbr0: topology change detected, propagating
>>> xenbr0: port 3(vif1.0) entering forwarding state
>>>
>>> any suspicious message ?
>>> any Idea why I get that :
>>>  PCI: Fatal: No PCI config space access function found
>>> rtc: IRQ 8 is not free.
>>>
>>> message in domU bootup message ?
>>>
>>> ~subbu
>>>
>>> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <
>>> yunhong.jiang at intel.com> wrote:
>>>
>>>>  So any changes in dom0's dmesg?
>>>>
>>>>
>>>>  ------------------------------
>>>> *From:* subbu kl [mailto:subbukl at gmail.com]
>>>> *Sent:* February 12, 2009 13:52
>>>> *To:* Jiang, Yunhong
>>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>>> general at lists.openfabrics.org
>>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>> working
>>>>
>>>>   no luck !
>>>>  dmesg in XEN PV guest shows :
>>>>
>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>>> ib_mthca: Initializing 0000:00:00.0
>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>
>>>> even after executing the following in dom0:
>>>>
>>>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
>>>>
>>>> I am getting the following messages on the console as part of the initial
>>>> bootup messages of the guest:
>>>>
>>>> Started domain rhel52_64_3
>>>> PCI: Fatal: No PCI config space access function found
>>>> rtc: IRQ 8 is not free.
>>>> i8042.c: No controller found.
>>>>
>>>> after executing the following in dom0 :
>>>> #xm create -c rhel52_64_3
>>>>
>>>>
>>>> so, the problem persists,
>>>>
>>>> ~subbu
>>>>
>>>>
>>>> 2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
>>>>
>>>>>  Seems it is because PCI frontend try to write some configuration
>>>>> space that PCIback has no config_field entry to support it.
>>>>> I think you can firstly try to do as dom0's dmesg suggested: "see
>>>>> permissive attribute in sysfs" (it should be "set permissive attribute...",
>>>>> I think).
>>>>>
>>>>> BTW, where you got following log? That seems suggest config space
>>>>> function not found.
>>>>>
>>>>> PCI: Fatal: No PCI config space access function found
>>>>> rtc: IRQ 8 is not free.
>>>>> i8042.c: No controller found."
>>>>>
>>>>> -- Yunhong Jiang
>>>>>
>>>>>  ------------------------------
>>>>> *From:* xen-devel-bounces at lists.xensource.com [mailto:
>>>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
>>>>> *Sent:* February 11, 2009 22:18
>>>>> *To:* David Brown
>>>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
>>>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>>> working
>>>>>
>>>>>   I am getting the same QUERY_FW failure on RHEL 5.2 with a Xen
>>>>> paravirtualized guest with the pciback module.
>>>>>
>>>>> No one seems to have tried answering this question on the list, let me
>>>>> ping xen-devel and ofed people again.
>>>>>
>>>>> after executing in dom0
>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>>>>>
>>>>> #dmesg
>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>>> tap tap-1-51712: 2 getting info
>>>>> tap tap-2-51712: 2 getting info
>>>>> pciback 0000:0e:00.0: seizing device
>>>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
>>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>>>
>>>>> #xm create -c rhel52_64_3
>>>>>
>>>>> PCI: Fatal: No PCI config space access function found
>>>>> rtc: IRQ 8 is not free.
>>>>> i8042.c: No controller found.
>>>>>
>>>>>
>>>>> GUEST dmesg:
>>>>>
>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>>
>>>>> in dom0:
>>>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
>>>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to
>>>>> virtual slot 0
>>>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
>>>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not
>>>>> ready
>>>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9,
>>>>> protocol 1 (x86_64-abi)
>>>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to
>>>>> write to a read-only configuration space field at offset 0x44, size 2. This
>>>>> may be harmless, but if you have problems with your device:
>>>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
>>>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel
>>>>> mailing list along with details of your device obtained from lspci.
>>>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 ->
>>>>> 0002)
>>>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI
>>>>> 16 (level, low) -> IRQ 16
>>>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device
>>>>> 0000:0e:00.0 disabled
>>>>>
>>>>>
>>>>>
>>>>> some more details - [root at p128 ~]# rpm -qa | grep xen
>>>>> kernel-xen-2.6.18-92.1.22.el5
>>>>> xen-3.0.3-64.el5_2.9
>>>>> xen-libs-3.0.3-64.el5_2.9
>>>>> xen-libs-3.0.3-64.el5_2.9
>>>>>
>>>>> [root at p128 ~]# ibv_devinfo
>>>>> hca_id: mthca0
>>>>>         fw_ver:                         5.3.0
>>>>>         node_guid:                      0002:c902:0022:cd48
>>>>>         sys_image_guid:                 0002:c902:0022:cd4b
>>>>>         vendor_id:                      0x02c9
>>>>>         vendor_part_id:                 25218
>>>>>         hw_ver:                         0x20
>>>>>         board_id:                       MT_0370130002
>>>>>         phys_port_cnt:                  2
>>>>>                 port:   1
>>>>>                         state:                  PORT_INIT (2)
>>>>>                         max_mtu:                2048 (4)
>>>>>                         active_mtu:             512 (2)
>>>>>                         sm_lid:                 0
>>>>>                         port_lid:               0
>>>>>                         port_lmc:               0x00
>>>>>
>>>>>                 port:   2
>>>>>                         state:                  PORT_DOWN (1)
>>>>>                         max_mtu:                2048 (4)
>>>>>                         active_mtu:             512 (2)
>>>>>                         sm_lid:                 0
>>>>>                         port_lid:               0
>>>>>                         port_lmc:               0x00
>>>>>
>>>>>
>>>>> any help greatly appreciated.
>>>>>
>>>>> ~subbu
>>>>>
>>>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com>wrote:
>>>>>
>>>>>> Okay so my question to the openfabrics guys is, why would the OFED
>>>>>> drivers fail to read the firmware?
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Thanks,
>>>>>> - David Brown
>>>>>>
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: David Brown <dmlb2000 at gmail.com>
>>>>>> Date: Thu, Sep 11, 2008 at 2:24 PM
>>>>>> Subject: pciback module not working
>>>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>>>>>
>>>>>>
>>>>>> This issue was brought up about a year and a half ago. So I'll bring
>>>>>> it up again and see if anything happens.
>>>>>>
>>>>>> I've got an infiniband network and am attempting to pass the
>>>>>> infiniband card through the host and give it to the guest.
>>>>>> I'm working with standard CentOS 5.2 on both guest and host with their
>>>>>> provided xen (3.0.3 ish). I've also attempted to install the newest
>>>>>> Xen 3.3 and use their standard host kernel and that did the same
>>>>>> thing. The guest dmesg output in the guest is similar on both
>>>>>> permissive and normal mode.
>>>>>>
>>>>>> I'm getting issues with detecting the firmware on the card for some
>>>>>> reason...
>>>>>>
>>>>>> Any help would be appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> - David Brown
>>>>>>
>>>>>> === GUEST dmesg output ===
>>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>>> =======================
>>>>>>
>>>>>> === Host modprobe.conf ===
>>>>>> alias eth0 bnx2
>>>>>> alias eth1 bnx2
>>>>>> alias scsi_hostadapter cciss
>>>>>> options pciback hide=(41:00.0)
>>>>>> =====================
>>>>>>
>>>>>> === Host lspci output ===
>>>>>> # lspci -vs 41:00.0
>>>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>>> HCA] (rev 20)
>>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>>       Flags: fast devsel, IRQ 16
>>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>>> [size=1M]
>>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>>       Capabilities: [40] Power Management version 2
>>>>>>       Capabilities: [48] Vital Product Data
>>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>>>> Queue=0/5 Enable-
>>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>>> =====================
>>>>>>
>>>>>> This makes sure it get loaded first off before anything else.
>>>>>> === Host mkinitrd cmd ===
>>>>>> # mkinitrd -f --with=pciback --preload pciback
>>>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>>>>>> ====================
>>>>>>
>>>>>> === Host pciback dmesg ===
>>>>>> pciback 0000:41:00.0: Driver tried to write to a read-only
>>>>>> configuration space field at offset 0x44, size 2. This may be
>>>>>> harmless, but if you have problems with your device:
>>>>>> 1) see permissive attribute in sysfs
>>>>>> 2) report problems to the xen-devel mailing list along with details of
>>>>>> your device obtained from lspci.
>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>>> ======================
>>>>>>
>>>>>> === Host pciback dmesg (after setting it permissive) ===
>>>>>> pciback 0000:41:00.0: enabling permissive mode configuration space
>>>>>> accesses!
>>>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>>>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>>>>>> device vif1.0 entered promiscuous mode
>>>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>>> =========================================
>>>>>>
>>>>>> === Guest lspci output ===
>>>>>> # lspci -v
>>>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>>> HCA] (rev 20)
>>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>>       Flags: fast devsel, IRQ 16
>>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>>> [size=1M]
>>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>>       Capabilities: [40] Power Management version 2
>>>>>>       Capabilities: [48] Vital Product Data
>>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>>>> Queue=0/5 Enable-
>>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>>> =====================


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090212/0cf12867/attachment.html>

From ogerlitz at Voltaire.com  Thu Feb 12 00:25:59 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Thu, 12 Feb 2009 10:25:59 +0200
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4993C24E.504@oracle.com>
References: <4990CD57.3080108@oracle.com> <4992EABA.9090605@Voltaire.com>
	<4993C24E.504@oracle.com>
Message-ID: <4993DD17.4020205@Voltaire.com>

Sumeet Lahorani wrote:
> Does this packet drop always occur at the host or could it also occur in 
> the switches (Voltaire ISR 9024)?

The drop happens at the host, here's the relevant ipoib code snippet from drivers/infiniband/ulp/ipoib/ipoib_ib.c :: ipoib_send()

> 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
> 			ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
> 				   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
> 			++dev->stats.tx_dropped;
> 			++dev->stats.tx_errors;
> 			ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
> 			return;

 
> Also, besides the "packet len too long ..." message, is the "dropped"
> statistic in ifconfig ib0 a good way to find out if such packet drops
> are happening?

yes, see the code above.
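
As a standalone illustration of the check quoted above, here is a minimal
userspace sketch. It assumes a 4-byte IPoIB encapsulation header for
IPOIB_ENCAP_LEN; the struct and function names are hypothetical and this is
not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: IPoIB encapsulation header length in bytes. */
#define ENCAP_LEN 4u

/* Hypothetical stand-in for the two counters the kernel snippet bumps. */
struct tx_stats {
	unsigned long tx_dropped;
	unsigned long tx_errors;
};

/*
 * Returns 1 and bumps both counters when a datagram of skb_len bytes
 * exceeds mcast_mtu plus the encapsulation header, 0 otherwise.
 */
int would_drop_datagram(unsigned int skb_len, unsigned int mcast_mtu,
			struct tx_stats *stats)
{
	assert(stats != NULL);
	if (skb_len > mcast_mtu + ENCAP_LEN) {
		++stats->tx_dropped;
		++stats->tx_errors;
		return 1;
	}
	return 0;
}
```

With mcast_mtu = 2044, a 2048-byte packet passes and a 2049-byte packet is
counted as both dropped and errored, so either counter can be watched.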

Or.


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb 12 01:11:24 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 10:11:24 +0100
Subject: [ofa-general] [PATCH 0/3] Fat Tree - Routing between non-CN	nodes
Message-ID: <4993E7BC.4000105@ext.bull.net>

Repost of the previous set of patches:

Hi,

We are currently working on a Ftree topology where IO nodes are connected to spine switches.
Using the cn_guid_file and root_guid_file works great.
It is possible to route the whole tree as a fat tree. All the CNs are connected to the other CNs and to the IO nodes.
However, we are missing some connectivity between IO nodes. This is the expected behavior, as a route between two IO nodes would have
to go down and then back up to another spine switch.

However, we need at least a bit of connectivity between those nodes. There won't be any real traffic, just some "ping" for HA purposes.

Therefore, I have implemented two new OpenSM options: io_guid_file and max_reverse_hops.
The io_guid_file provides a list of all the IO node GUIDs (it may differ from the list of non-CN nodes).
The max_reverse_hops option gives the number of times IO nodes (described by io_guid_file) are allowed to traverse a switch backward.

According to my tests, this has no effect on regular routing and manages to connect the IO nodes together, provided max_reverse_hops is big enough.
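
To make the io_guid_file input concrete, here is a minimal sketch of reading
such a file. It assumes one GUID per line, written in hex with an optional
"0x" prefix; that format and both function names are assumptions made for
illustration, and this is not OpenSM's own parser.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one line holding a single hex GUID.
 * Returns 0 and stores the value on success, -1 on a blank,
 * malformed or out-of-range line.
 */
int parse_guid_line(const char *line, uint64_t *guid)
{
	char *end;
	unsigned long long v;

	assert(line != NULL && guid != NULL);
	errno = 0;
	v = strtoull(line, &end, 16);
	if (end == line || errno == ERANGE)
		return -1;
	/* Only trailing whitespace may follow the number. */
	while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
		end++;
	if (*end != '\0')
		return -1;
	*guid = (uint64_t)v;
	return 0;
}

/* Read up to max GUIDs from an open stream; returns how many were stored. */
size_t read_guid_file(FILE *in, uint64_t *guids, size_t max)
{
	char buf[128];
	size_t n = 0;

	assert(in != NULL && guids != NULL);
	while (n < max && fgets(buf, sizeof(buf), in) != NULL) {
		/* Blank and malformed lines are skipped silently. */
		if (parse_guid_line(buf, &guids[n]) == 0)
			n++;
	}
	return n;
}
```

Skipping bad lines silently keeps the sketch short; a real tool would
probably want to report them.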


Regards

Nicolas Morey-Chaisemartin
____

Nicolas Morey-Chaisemartin (3):
   opensm:   Added io_guid_file and max_reverse_hops options
   opensm/osm_ucast_ftree.c: Added possible reverse hops for Ftree
     algorithm.
   Added documentation for io_guid_file and max_reverse_hop feature

  opensm/doc/current-routing.txt     |   32 +++++
  opensm/include/opensm/osm_subnet.h |    6 +
  opensm/man/opensm.8.in             |   27 ++++
  opensm/opensm/main.c               |   26 ++++-
  opensm/opensm/osm_subnet.c         |   12 ++
  opensm/opensm/osm_ucast_ftree.c    |  244 +++++++++++++++++++++++++++++-------
  6 files changed, 303 insertions(+), 44 deletions(-)


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb 12 01:11:34 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 10:11:34 +0100
Subject: [ofa-general] [PATCH 1/3] opensm: Added io_guid_file and
	max_reverse_hops options
In-Reply-To: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
References: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
Message-ID: <4993E7C6.8020501@ext.bull.net>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/include/opensm/osm_subnet.h |    6 ++++++
  opensm/opensm/main.c               |   26 +++++++++++++++++++++++++-
  opensm/opensm/osm_subnet.c         |   12 ++++++++++++
  3 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 8863e47..671b51f 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -190,6 +190,8 @@ typedef struct osm_subn_opt {
  	char *lfts_file;
  	char *root_guid_file;
  	char *cn_guid_file;
+	char *io_guid_file;
+	uint16_t max_reverse_hops;
  	char *ids_guid_file;
  	char *guid_routing_order_file;
  	char *sa_db_file;
@@ -383,6 +385,10 @@ typedef struct osm_subn_opt {
  *		Name of the file that contains list of compute node guids that
  *		will be used by fat-tree routing (provided by User)
  *
+*	io_guid_file
+*		Name of the file that contains list of I/O node guids that
+*		will be used by fat-tree routing (provided by User)
+*
  *	ids_guid_file
  *		Name of the file that contains list of ids which should be
  *		used by Up/Down algorithm instead of node GUIDs
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index a8dc9e6..b5e3337 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -212,6 +212,12 @@ static void show_usage(void)
  	printf("--cn_guid_file, -u <path to file>\n"
  	       "          Set the compute nodes for the Fat-Tree routing algorithm\n"
  	       "          to the guids provided in the given file (one to a line)\n\n");
+	printf("--io_guid_file, -G <path to file>\n"
+	       "          Set the I/O nodes for the Fat-Tree routing algorithm\n"
+	       "          to the guids provided in the given file (one to a line)\n\n");
+	printf("--max_reverse_hops, -H <hop_count>\n"
+	       "          Set the max number of hops the wrong way around\n"
+	       "          an I/O node is allowed to do (connectivity for I/O nodes on top switches)\n\n");
  	printf("--ids_guid_file, -m <path to file>\n"
  	       "          Name of the map file with set of the IDs which will be used\n"
  	       "          by Up/Down routing algorithm instead of node GUIDs\n"
@@ -526,7 +532,7 @@ int main(int argc, char *argv[])
  	uint32_t val;
  	unsigned config_file_done = 0;
  	const char *const short_option =
-	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:";
+	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:";

  	/*
  	   In the array below, the 2nd parameter specifies the number
@@ -570,6 +576,8 @@ int main(int argc, char *argv[])
  		{"sadb_file", 1, NULL, 'S'},
  		{"root_guid_file", 1, NULL, 'a'},
  		{"cn_guid_file", 1, NULL, 'u'},
+		{"io_guid_file", 1, NULL, 'G'},
+		{"max_reverse_hops", 1, NULL, 'H'},
  		{"ids_guid_file", 1, NULL, 'm'},
  		{"guid_routing_order_file", 1, NULL, 'X'},
  		{"stay_on_fatal", 0, NULL, 'y'},
@@ -880,6 +888,22 @@ int main(int argc, char *argv[])
  			       opt.cn_guid_file);
  			break;

+		case 'G':
+			/*
+			   Specifies I/O node guids file
+			 */
+			opt.io_guid_file = optarg;
+			printf(" I/O Node Guid File: %s\n",
+			       opt.io_guid_file);
+			break;
+		case 'H':
+			/*
+			   Specifies I/O max reverted hops
+			 */
+			opt.max_reverse_hops =  atoi(optarg);
+			printf(" Max Reverse Hops: %d\n",
+			       opt.max_reverse_hops);
+			break;
  		case 'm':
  			/* Specifies ids guid file */
  			SET_STR_OPT(opt.ids_guid_file, optarg);
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 69937c1..b356d33 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -578,6 +578,8 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
  	p_opt->lfts_file = NULL;
  	p_opt->root_guid_file = NULL;
  	p_opt->cn_guid_file = NULL;
+	p_opt->io_guid_file = NULL;
+	p_opt->max_reverse_hops = 0;
  	p_opt->ids_guid_file = NULL;
  	p_opt->guid_routing_order_file = NULL;
  	p_opt->sa_db_file = NULL;
@@ -1393,6 +1395,16 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
  		p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str);

  	fprintf(out,
+		"# The file holding the fat-tree I/O node guids\n"
+		"# One guid in each line\nio_guid_file %s\n\n",
+		p_opts->io_guid_file ? p_opts->io_guid_file : null_str);
+
+	fprintf(out,
+		"# Number of reverse hops allowed for I/O nodes \n"
+		"# Used for connectivity between I/O nodes connected to Top Switches\nmax_reverse_hops %d\n\n",
+		p_opts->max_reverse_hops);
+
+	fprintf(out,
  		"# The file holding the node ids which will be used by"
  		" Up/Down algorithm instead\n# of GUIDs (one guid and"
  		" id in each line)\nids_guid_file %s\n\n",
-- 
1.6.1
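
Review note: the 'H' case in the patch above stores atoi(optarg) into a
uint16_t, so a negative or very large argument would be truncated without
any warning. A minimal sketch of a stricter conversion follows; the function
name is hypothetical and is not part of OpenSM.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Convert a decimal option argument to a uint16_t hop count.
 * Returns 0 on success, -1 if the text is empty, not a number,
 * has trailing characters, is negative or exceeds UINT16_MAX.
 */
int parse_hop_count(const char *arg, uint16_t *hops)
{
	char *end;
	long v;

	assert(arg != NULL && hops != NULL);
	errno = 0;
	v = strtol(arg, &end, 10);
	if (end == arg || *end != '\0' || errno == ERANGE)
		return -1;
	if (v < 0 || v > UINT16_MAX)
		return -1;
	*hops = (uint16_t)v;
	return 0;
}
```

In main() this could replace the bare atoi() call, printing the usage text
when it returns -1.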


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb 12 01:11:38 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 10:11:38 +0100
Subject: [ofa-general] [PATCH 2/3] opensm/osm_ucast_ftree.c: Added possible
 reverse hops for Ftree algorithm.
In-Reply-To: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
References: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
Message-ID: <4993E7CA.60103@ext.bull.net>

     This allows connectivity between nodes declared in the io_guid_file when they had none with the regular algorithm
     and it can be solved by doing fewer than max_reverse_hops reverse hops in the tree.
     This is meant to be used for I/O and service nodes connected to the Top Switches of a Fat Tree that need connectivity
     but no real bandwidth.

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/opensm/osm_ucast_ftree.c |  244 ++++++++++++++++++++++++++++++++-------
  1 files changed, 201 insertions(+), 43 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 53218d1..d92265b 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -150,6 +150,7 @@ typedef struct ftree_port_group_t_ {
  	ftree_hca_or_sw remote_hca_or_sw;	/* pointer to remote hca/switch */
  	cl_ptr_vector_t ports;	/* vector of ports to the same lid */
  	boolean_t is_cn;	/* whether this port is a compute node */
+	boolean_t is_io;	/* whether this port is an I/O node */
  	uint32_t counter_down;	/* number of allocated routs downwards */
  } ftree_port_group_t;

@@ -199,6 +200,7 @@ typedef struct ftree_fabric_t_ {
  	cl_qmap_t sw_tbl;
  	cl_qmap_t sw_by_tuple_tbl;
  	cl_qmap_t cn_guid_tbl;
+	cl_qmap_t io_guid_tbl;
  	unsigned cn_num;
  	uint8_t leaf_switch_rank;
  	uint8_t max_switch_rank;
@@ -386,7 +388,8 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid,
  			      IN ib_net64_t remote_node_guid,
  			      IN uint8_t remote_node_type,
  			      IN void *p_remote_hca_or_sw,
-			      IN boolean_t is_cn)
+			      IN boolean_t is_cn,
+			      IN boolean_t is_io)
  {
  	ftree_port_group_t *p_group =
  	    (ftree_port_group_t *) malloc(sizeof(ftree_port_group_t));
@@ -434,6 +437,7 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid,
  	cl_ptr_vector_init(&p_group->ports, 0,	/* min size */
  			   8);	/* grow size */
  	p_group->is_cn = is_cn;
+	p_group->is_io = is_io;
  	return p_group;
  }				/* __osm_ftree_port_group_create() */

@@ -699,7 +703,7 @@ __osm_ftree_sw_add_port(IN ftree_sw_t * p_sw,
  							remote_node_guid,
  							remote_node_type,
  							p_remote_hca_or_sw,
-							FALSE);
+							FALSE, FALSE);
  		CL_ASSERT(p_group);

  		if (direction == FTREE_DIRECTION_UP)
@@ -830,7 +834,8 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
  			 IN ib_net64_t remote_port_guid,
  			 IN ib_net64_t remote_node_guid,
  			 IN uint8_t remote_node_type,
-			 IN void *p_remote_hca_or_sw, IN boolean_t is_cn)
+			 IN void *p_remote_hca_or_sw, IN boolean_t is_cn,
+			 IN boolean_t is_io)
  {
  	ftree_port_group_t *p_group;

@@ -853,7 +858,7 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
  							remote_node_guid,
  							remote_node_type,
  							p_remote_hca_or_sw,
-							is_cn);
+							is_cn, is_io);
  		p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group;
  	}
  	__osm_ftree_port_group_add_port(p_group, port_num, remote_port_num);
@@ -879,6 +884,7 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
  	cl_qmap_init(&p_ftree->sw_tbl);
  	cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
  	cl_qmap_init(&p_ftree->cn_guid_tbl);
+	cl_qmap_init(&p_ftree->io_guid_tbl);

  	return p_ftree;
  }
@@ -945,6 +951,18 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
  	}
  	cl_qmap_remove_all(&p_ftree->cn_guid_tbl);

+	/* remove all the elements of io_guid_tbl */
+	p_next_guid_element =
+	    (name_map_item_t *) cl_qmap_head(&p_ftree->io_guid_tbl);
+	while (p_next_guid_element !=
+	       (name_map_item_t *) cl_qmap_end(&p_ftree->io_guid_tbl)) {
+		p_guid_element = p_next_guid_element;
+		p_next_guid_element =
+		    (name_map_item_t *) cl_qmap_next(&p_guid_element->item);
+		free(p_guid_element);
+	}
+	cl_qmap_remove_all(&p_ftree->io_guid_tbl);
+
  	/* free the leaf switches array */
  	if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches))
  		free(p_ftree->leaf_switches);
@@ -1335,6 +1353,14 @@ static inline boolean_t __osm_ftree_fabric_cns_provided(IN ftree_fabric_t *

  /***************************************************/

+static inline boolean_t __osm_ftree_fabric_ios_provided(IN ftree_fabric_t *
+							p_ftree)
+{
+	return (p_ftree->p_osm->subn.opt.io_guid_file != NULL);
+}
+
+/***************************************************/
+
  static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree)
  {
  	ftree_sw_t *p_sw;
@@ -1901,7 +1927,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  					       IN uint8_t target_rank,
  					       IN boolean_t is_real_lid,
  					       IN boolean_t is_main_path,
-					       IN uint8_t highest_rank_in_route)
+					       IN uint8_t highest_rank_in_route,
+					       IN uint16_t reverse_hops)
  {
  	ftree_sw_t *p_remote_sw;
  	uint16_t ports_num;
@@ -2008,13 +2035,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  		/* second case: skip the port group if the remote (lower)
  		   switch has been already configured for this target LID */
  		if (is_real_lid && !is_main_path &&
-		    p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
+		    p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] !=
+		    OSM_NO_PATH)
  			continue;

  		/* setting fwd tbl port only if this is real LID */
  		if (is_real_lid) {
  			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
-				p_min_port->remote_port_num;
+			    p_min_port->remote_port_num;
  			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
  				"Switch %s: set path to CA LID %u through port %u\n",
  				__osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -2034,7 +2062,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  							((target_rank -
  							  highest_rank_in_route)
  							 + (p_remote_sw->rank -
-							    highest_rank_in_route)));
+							    highest_rank_in_route)
+							 + reverse_hops * 2));
  			}

  		}
@@ -2049,15 +2078,13 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,

  		/* Recursion step:
  		   Assign upgoing ports by stepping down, starting on REMOTE switch */
-		created_route |=
-		    __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
-								   p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
-								   NULL,	/* prev. position - NULL to mark that we went down and not up */
-								   target_lid,	/* LID that we're routing to */
-								   target_rank,	/* rank of the LID that we're routing to */
-								   is_real_lid,	/* whether the target LID is real or dummy */
-								   is_main_path,	/* whether this is path to HCA that should by tracked by counters */
-								   highest_rank_in_route);	/* highest visited point in the tree before going down */
+		created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
+										NULL,	/* prev. position - NULL to mark that we went down and not up */
+										target_lid,	/* LID that we're routing to */
+										target_rank,	/* rank of the LID that we're routing to */
+										is_real_lid,	/* whether the target LID is real or dummy */
+										is_main_path,	/* whether this is path to HCA that should by tracked by counters */
+										highest_rank_in_route, reverse_hops);	/* highest visited point in the tree before going down */
  	}
  	/* done scanning all the down-going port groups */

@@ -2066,7 +2093,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  	   going through all the downgoing groups */
  	if (created_route)
  		p_sw->down_port_groups_idx =
-			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+		    (p_sw->down_port_groups_idx +
+		     1) % p_sw->down_port_groups_num;

  	return created_route;
  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */
@@ -2091,7 +2119,9 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  					       IN ib_net16_t target_lid,
  					       IN uint8_t target_rank,
  					       IN boolean_t is_real_lid,
-					       IN boolean_t is_main_path)
+					       IN boolean_t is_main_path,
+					       IN uint16_t reverse_hop_credit,
+					       IN uint16_t reverse_hops)
  {
  	ftree_sw_t *p_remote_sw;
  	uint16_t ports_num;
@@ -2112,11 +2142,42 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  						       target_rank,	/* rank of the LID that we're routing to */
  						       is_real_lid,	/* whether this target LID is real or dummy */
  						       is_main_path,	/* whether this path to HCA should by tracked by counters */
-						       p_sw->rank);	/* the highest visited point in the tree before going down */
+						       p_sw->rank,	/* the highest visited point in the tree before going down */
+						       reverse_hops);	/* Number of reverse_hops done up to this point */

  	/* recursion stop condition - if it's a root switch, */
-	if (p_sw->rank == 0)
+	if (p_sw->rank == 0) {
+		if (reverse_hop_credit > 0) {
+			/* We go up by going down as we have some reverse_hop_credit left */
+			/* We use the index to spread the reverse up routes a bit */
+			p_sw->down_port_groups_idx =
+			    (p_sw->down_port_groups_idx +
+			     1) % p_sw->down_port_groups_num;
+			i = p_sw->down_port_groups_idx;
+			for (j = 0; j < p_sw->down_port_groups_num; j++) {
+
+				p_group = p_sw->down_port_groups[i];
+				i = (i + 1) % p_sw->down_port_groups_num;
+
+				/* Skip this port group unless it points to a switch */
+				if (p_group->remote_node_type !=
+				    IB_NODE_TYPE_SWITCH)
+					continue;
+				p_remote_sw = p_group->remote_hca_or_sw.p_sw;
+
+				__osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
+									       p_sw,	/* this switch - prev. position switch for the function */
+									       target_lid,	/* LID that we're routing to */
+									       target_rank,	/* rank of the LID that we're routing to */
+									       is_real_lid,	/* whether this target LID is real or dummy */
+									       is_main_path,	/* whether this is path to HCA that should by tracked by counters */
+									       reverse_hop_credit - 1,	/* Remaining reverse_hops allowed */
+									       reverse_hops + 1);	/* Number of reverse_hops done up to this point */
+			}
+
+		}
  		return;
+	}

  	/* Find the least loaded upgoing port group */
  	p_min_group = NULL;
@@ -2202,14 +2263,20 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  		p_min_group->counter_down++;
  		p_min_port->counter_down++;
  		if (is_real_lid) {
-			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
-				p_min_port->remote_port_num;
-			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-				"Switch %s: set path to CA LID %u through port %u\n",
-				__osm_ftree_tuple_to_str(p_remote_sw->tuple),
-				cl_ntoh16(target_lid),
-				p_min_port->remote_port_num);
-
+			/* This LID may already be in the LFT if the reverse_hop feature is used */
+			/* We update the LFT only if this LID isn't already present. */
+			if (p_remote_sw->p_osm_sw->
+			    new_lft[cl_ntoh16(target_lid)] == OSM_NO_PATH) {
+				p_remote_sw->p_osm_sw->
+				    new_lft[cl_ntoh16(target_lid)] =
+				    p_min_port->remote_port_num;
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+					"Switch %s: set path to CA LID %u through port %u\n",
+					__osm_ftree_tuple_to_str(p_remote_sw->
+								 tuple),
+					cl_ntoh16(target_lid),
+					p_min_port->remote_port_num);
+			}
  			/* On the remote switch that is pointed by the min_group,
  			   set hops for ALL the ports in the remote group. */

@@ -2223,7 +2290,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  							cl_ntoh16(target_lid),
  							p_port->remote_port_num,
  							target_rank -
-							p_remote_sw->rank);
+							p_remote_sw->rank +
+							2 * reverse_hops);
  			}
  		}

@@ -2234,7 +2302,9 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  							       target_lid,	/* LID that we're routing to */
  							       target_rank,	/* rank of the LID that we're routing to */
  							       is_real_lid,	/* whether this target LID is real or dummy */
-							       is_main_path);	/* whether this is path to HCA that should by tracked by counters */
+							       is_main_path,	/* whether this is path to HCA that should by tracked by counters */
+							       reverse_hop_credit,	/* Remaining reverse_hops allowed */
+							       reverse_hops);	/* Number of reverse_hops done up to this point */
  	}

  	/* we're done for the third case */
@@ -2278,7 +2348,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  		p_remote_sw = p_group->remote_hca_or_sw.p_sw;

  		/* skip if target lid has been already set on remote switch fwd tbl */
-		if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
+		if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] !=
+		    OSM_NO_PATH)
  			continue;

  		if (p_sw->is_leaf) {
@@ -2297,7 +2368,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,

  		cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port);
  		p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
-			p_port->remote_port_num;
+		    p_port->remote_port_num;

  		/* On the remote switch that is pointed by the p_group,
  		   set hops for ALL the ports in the remote group. */
@@ -2310,7 +2381,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  						cl_ntoh16(target_lid),
  						p_port->remote_port_num,
  						target_rank -
-						p_remote_sw->rank);
+						p_remote_sw->rank +
+						2 * reverse_hops);
  		}

  		/* Recursion step:
@@ -2320,7 +2392,37 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
  							       target_lid,	/* LID that we're routing to */
  							       target_rank,	/* rank of the LID that we're routing to */
  							       TRUE,	/* whether the target LID is real or dummy */
-							       FALSE);	/* whether this is path to HCA that should by tracked by counters */
+							       FALSE,	/* whether this is path to HCA that should by tracked by counters */
+							       reverse_hop_credit,	/* Remaining reverse_hops allowed */
+							       reverse_hops);	/* Number of reverse_hops done up to this point */
+	}
+
+	/* If we don't have any reverse hop credits, we are done */
+	if (reverse_hop_credit == 0)
+		return;
+
+	/* We explore all the down group ports */
+	/* We try to reverse jump for each of them */
+	/* They already have a route to us from the upgoing_by_going_down started earlier */
+	/* This is only so it'll continue exploring up, after this step backwards */
+	for (i = 0; i < p_sw->down_port_groups_num; i++) {
+		p_group = p_sw->down_port_groups[i];
+		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
+
+		/* Skip this port group unless it points to a switch */
+		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+			continue;
+
+		/* Recursion step:
+		   Assign downgoing ports by stepping up, after doing one step down starting on REMOTE switch. */
+		__osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
+							       p_sw,	/* this switch - prev. position switch for the function */
+							       target_lid,	/* LID that we're routing to */
+							       target_rank,	/* rank of the LID that we're routing to */
+							       TRUE,	/* whether the target LID is real or dummy */
+							       TRUE,	/* whether this is path to HCA that should by tracked by counters */
+							       reverse_hop_credit - 1,	/* Remaining reverse_hops allowed */
+							       reverse_hops + 1);	/* Number of reverse_hops done up to this point */
  	}

  }				/* ftree_fabric_route_downgoing_by_going_up() */
@@ -2408,7 +2510,9 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
  								       hca_lid,	/* LID that we're routing to */
  								       p_sw->rank + 1,	/* rank of the LID that we're routing to */
  								       TRUE,	/* whether this HCA LID is real or dummy */
-								       TRUE);	/* whether this path to HCA should by tracked by counters */
+								       TRUE,	/* whether this path to HCA should by tracked by counters */
+								       0,	/* Number of reverse hops allowed */
+								       0);	/* Number of reverse hops done yet */

  			/* count how many real targets have been routed from this leaf switch */
  			routed_targets_on_leaf++;
@@ -2433,7 +2537,9 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
  									       0,	/* LID that we're routing to - ignored for dummy HCA */
  									       0,	/* rank of the LID that we're routing to - ignored for dummy HCA */
  									       FALSE,	/* whether this HCA LID is real or dummy */
-									       TRUE);	/* whether this path to HCA should by tracked by counters */
+									       TRUE,	/* whether this path to HCA should by tracked by counters */
+									       0,	/* Number of reverse hops allowed */
+									       0);	/* Number of reverse hops done yet */
  			}
  		}
  	}
@@ -2518,7 +2624,9 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
  								       hca_lid,	/* LID that we're routing to */
  								       p_sw->rank + 1,	/* rank of the LID that we're routing to */
  								       TRUE,	/* whether this HCA LID is real or dummy */
-								       TRUE);	/* whether this path to HCA should by tracked by counters */
+								       TRUE,	/* whether this path to HCA should by tracked by counters */
+								       p_hca_port_group->is_io ? p_ftree->p_osm->subn.opt.max_reverse_hops : 0,	/* Number of reverse hops allowed */
+								       0);	/* Number of reverse hops done yet */
  		}
  		/* done with all the port groups of this HCA - go to next HCA */
  	}
@@ -2570,7 +2678,9 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
  							       p_sw->base_lid,	/* LID that we're routing to */
  							       p_sw->rank,	/* rank of the LID that we're routing to */
  							       TRUE,	/* whether the target LID is a real or dummy */
-							       FALSE);	/* whether this path should by tracked by counters */
+							       FALSE,	/* whether this path to HCA should by tracked by counters */
+							       0,	/* Number of reverse hops allowed */
+							       0);	/* Number of reverse hops done yet */
  	}

  	OSM_LOG_EXIT(&p_ftree->p_osm->log);
@@ -2802,6 +2912,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
  	uint8_t i;
  	uint8_t remote_port_num;
  	boolean_t is_cn = FALSE;
+	boolean_t is_io = FALSE;
  	int res = 0;

  	for (i = 0; i < osm_node_get_num_physp(p_node); i++) {
@@ -2879,9 +2990,31 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
  				"Marking CN port GUID 0x%016" PRIx64 "\n",
  				cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
  		} else {
-			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-				"Marking non-CN port GUID 0x%016" PRIx64 "\n",
-				cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
+			if (__osm_ftree_fabric_ios_provided(p_ftree)) {
+				name_map_item_t *p_elem =
+				    (name_map_item_t *)
+				    cl_qmap_get(&p_ftree->io_guid_tbl,
+						cl_ntoh64
+						(osm_physp_get_port_guid
+						 (p_osm_port)));
+				if (p_elem !=
+				    (name_map_item_t *)
+				    cl_qmap_end(&p_ftree->io_guid_tbl))
+					is_io = TRUE;
+
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+					"Marking I/O port GUID 0x%016" PRIx64
+					"\n",
+					cl_ntoh64(osm_physp_get_port_guid
+						  (p_osm_port)));
+
+			} else {
+				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+					"Marking non-CN port GUID 0x%016" PRIx64
+					"\n",
+					cl_ntoh64(osm_physp_get_port_guid
+						  (p_osm_port)));
+			}
  		}

  		__osm_ftree_hca_add_port(p_hca,	/* local ftree_hca object */
@@ -2894,7 +3027,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree,
  					 remote_node_guid,	/* remote node guid */
  					 remote_node_type,	/* remote node type */
  					 (void *)p_remote_sw,	/* remote ftree_hca/sw object */
-					 is_cn);	/* whether this port is compute node */
+					 is_cn, is_io);	/* whether this port is a compute node, whether it is an I/O node */
  	}

  Exit:
@@ -3354,6 +3487,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
  		if (parse_node_map(p_ftree->p_osm->subn.opt.cn_guid_file,
  				   add_guid_item_to_map,
  				   &p_ftree->cn_guid_tbl)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: "
+				"Problem parsing CN guid file\n");
  			status = -1;
  			goto Exit;
  		}
@@ -3366,6 +3501,29 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree)
  		}
  	}

+
+	if (__osm_ftree_fabric_ios_provided(p_ftree)) {
+		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+			"Fetching I/O nodes from file %s\n",
+			p_ftree->p_osm->subn.opt.io_guid_file);
+
+		if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file,
+				   add_guid_item_to_map,
+				   &p_ftree->io_guid_tbl)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+				"ERR AB23: " "Problem parsing I/O guid file\n");
+			status = -1;
+			goto Exit;
+		}
+
+		if (!cl_qmap_count(&p_ftree->io_guid_tbl)) {
+			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+				"ERR AB23: "
+				"I/O node guids file has no valid guids\n");
+			status = -1;
+			goto Exit;
+		}
+	}
  Exit:
  	OSM_LOG_EXIT(&p_ftree->p_osm->log);
  	return status;
-- 
1.6.1
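
For readers following the hop accounting in this patch, here is a minimal sketch of how the two new parameters interact. It is not code from osm_ucast_ftree.c: the struct and both function names are hypothetical, and it only mirrors the expression "target_rank - p_remote_sw->rank + 2 * reverse_hops" used in the downgoing-by-going-up hunks, on the assumption that each reverse hop costs one extra step down plus one extra step up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping pair mirroring the patch's two new arguments. */
struct reverse_hop_state {
	uint16_t credit;	/* reverse hops still allowed (reverse_hop_credit) */
	uint16_t done;		/* reverse hops taken so far (reverse_hops) */
};

/* Spend one reverse-hop credit; returns false when none is left. */
static bool take_reverse_hop(struct reverse_hop_state *s)
{
	assert(s != NULL);
	if (s->credit == 0)
		return false;
	s->credit--;
	s->done++;
	return true;
}

/* Hop count from a switch of rank sw_rank to a target of rank target_rank,
 * adding two hops (one down, one back up) per reverse hop already taken. */
static unsigned hops_to_target(uint8_t target_rank, uint8_t sw_rank,
			       const struct reverse_hop_state *s)
{
	assert(s != NULL && target_rank >= sw_rank);
	return (unsigned)(target_rank - sw_rank) + 2u * s->done;
}
```

Starting with a credit of 1, a root switch (rank 0) is 3 hops from a rank-3 target before the reverse hop and 5 hops after it, and a second call to take_reverse_hop() is refused.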


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb 12 01:11:42 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 10:11:42 +0100
Subject: [ofa-general] [PATCH 3/3] Added documentation for io_guid_file and
 max_reverse_hop feature
In-Reply-To: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
References: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
Message-ID: <4993E7CE.3090908@ext.bull.net>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
  opensm/doc/current-routing.txt |   32 ++++++++++++++++++++++++++++++++
  opensm/man/opensm.8.in         |   27 +++++++++++++++++++++++++++
  2 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
index 0034d0e..1302860 100644
--- a/opensm/doc/current-routing.txt
+++ b/opensm/doc/current-routing.txt
@@ -237,6 +237,38 @@ in the same directory where the OpenSM log resides. This ordering file provides
  the CN order that may be used to create efficient communication pattern, that
  will match the routing tables.

+Routing between non-CN nodes
+
+
+The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree.
+In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes.
+In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them,
+there will not necessarily be a route between N1, N2 and N3.
+Such routes would require using at least one of the switches the wrong way around
+(that is, leaving one of the top switches through a downgoing port when the route is supposed to go up).
+
+  Spine1   Spine2    Spine3
+   / \     /  |  \    /   \
+  /   \   /   |   \  /     \
+ N1  Switch   N2  Switch    N3
+      /|\          /|\
+     / | \        / | \
+    Going down to compute nodes
+
+To solve this problem, a list of non-CN nodes can be specified with the '-G' or '--io_guid_file' option.
+These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops').
+With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.
+
+In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3.
+With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.
+
+Please note that using max_reverse_hops creates routes that use the switches in a counter-stream way.
+This option should never be used to connect nodes with high bandwidth traffic between them! It should only be used
+to allow connectivity for HA purposes or similar.
+Also, having routes the other way around can in theory cause credit loops.
+
+Use these options with extreme care!
+

  Usage:

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 7690980..ce14c02 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -22,6 +22,8 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
  [\-S | \-\-sadb_file <file name>]
  [\-a | \-\-root_guid_file <path to file>]
  [\-u | \-\-cn_guid_file <path to file>]
+[\-G | \-\-io_guid_file <path to file>]
+[\-H | \-\-max_reverse_hops <max reverse hops allowed>]
  [\-X | \-\-guid_routing_order_file <path to file>]
  [\-m | \-\-ids_guid_file <path to file>]
  [\-o(nce)]
@@ -183,6 +185,16 @@ algorithm to the guids provided in the given file (one to a line).
  Set the compute nodes for the Fat-Tree routing algorithm
  to the guids provided in the given file (one to a line).
  .TP
+\fB\-G\fR, \fB\-\-io_guid_file\fR <file name>
+Set the I/O nodes for the Fat-Tree routing algorithm
+to the guids provided in the given file (one to a line).
+I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches
+the wrong way around to improve connectivity.
+.TP
+\fB\-H\fR, \fB\-\-max_reverse_hops\fR <max reverse hops allowed>
+Set the maximum number of reverse hops an I/O node is allowed
+to make. A reverse hop is the use of a switch the wrong way around.
+.TP
  \fB\-m\fR, \fB\-\-ids_guid_file\fR <file name>
  Name of the map file with set of the IDs which will be used
  by Up/Down routing algorithm instead of node GUIDs
@@ -800,6 +812,21 @@ in the same directory where the OpenSM log resides. This ordering file provides
  the CN order that may be used to create efficient communication pattern, that
  will match the routing tables.

+Routing between non-CN nodes
+
+The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree.
+In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes.
+To solve this problem, a list of non-CN nodes can be specified with the '-G' or '--io_guid_file' option.
+These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops').
+With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.
+
+Please note that using max_reverse_hops creates routes that use the switches in a counter-stream way.
+This option should never be used to connect nodes with high bandwidth traffic between them! It should only be used
+to allow connectivity for HA purposes or similar.
+Also, having routes the other way around can in theory cause credit loops.
+
+Use these options with extreme care!
+
  Activation through OpenSM

  Use '-R ftree' option to activate the fat-tree algorithm.
-- 
1.6.1
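
To make the N1/N2/N3 example from the documentation concrete, here is a small sketch that computes how many reverse hops are needed between top switches. It is an illustration only, not OpenSM code: the adjacency table is my reading of the ASCII scheme (the left lower switch links Spine1 and Spine2, the right one links Spine2 and Spine3), and the function name is hypothetical. One reverse hop is counted as one step from a spine down to a lower switch and back up to another spine.

```c
#include <assert.h>

#define NUM_SPINES   3	/* Spine1, Spine2, Spine3 -> indices 0, 1, 2 */
#define NUM_LOWER_SW 2	/* the two "Switch" boxes in the scheme */

/* lower_sw_links[s][k] is 1 when lower switch s has an uplink to spine k. */
static const int lower_sw_links[NUM_LOWER_SW][NUM_SPINES] = {
	{1, 1, 0},	/* left switch:  Spine1, Spine2 */
	{0, 1, 1},	/* right switch: Spine2, Spine3 */
};

/* Breadth-first search over spines. Returns the minimum number of
 * reverse hops needed to get from spine src to spine dst, or -1. */
static int min_reverse_hops(int src, int dst)
{
	int dist[NUM_SPINES];
	int queue[NUM_SPINES];
	int head = 0, tail = 0;
	int s, k;

	assert(src >= 0 && src < NUM_SPINES);
	assert(dst >= 0 && dst < NUM_SPINES);

	for (k = 0; k < NUM_SPINES; k++)
		dist[k] = -1;
	dist[src] = 0;
	queue[tail++] = src;

	while (head < tail) {
		int cur = queue[head++];

		if (cur == dst)
			return dist[cur];
		for (s = 0; s < NUM_LOWER_SW; s++) {
			if (!lower_sw_links[s][cur])
				continue;
			/* going down to switch s and up again is one reverse hop */
			for (k = 0; k < NUM_SPINES; k++) {
				if (lower_sw_links[s][k] && dist[k] < 0) {
					dist[k] = dist[cur] + 1;
					queue[tail++] = k;
				}
			}
		}
	}
	return -1;
}
```

With N1 on Spine1, N2 on Spine2 and N3 on Spine3, min_reverse_hops(0, 1) and min_reverse_hops(1, 2) both give 1, while min_reverse_hops(0, 2) gives 2, which matches the max_reverse_hops values quoted in the documentation.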


From vlad at lists.openfabrics.org  Thu Feb 12 03:19:50 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 12 Feb 2009 03:19:50 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090212-0200 daily build status
Message-ID: <20090212111950.91876E60888@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at dev.mellanox.co.il  Thu Feb 12 03:20:44 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 12 Feb 2009 13:20:44 +0200
Subject: [ofa-general] sminfo report iberror in the first configuration
	on	RHEL5.3
In-Reply-To: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
References: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
Message-ID: <4994060C.4050001@mellanox.co.il>

Wen Hao Wang wrote:
>
> Hi all:
>
> I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped 
> in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and 
> wrote network interface configuration file ifcfg-ib0. ifup ib0 also 
> succeeded. But IB utilities report Connection timed out.
>
>
> [root at xblade06 network-scripts]# sminfo
> ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> sminfo: iberror: failed: query
>
> I had to reboot the blade and rerun "openibd start". Then sminfo 
> reported correct contents. I do not suppose this reboot is required. 
> Did I miss any configuration step?
>
> Moreover, "openibd start" report one warning message about hwconf. 
> Anyone has comments about this?
>
> [root at xblade07 ~]# /etc/init.d/openibd start
> Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such 
> file or directory
> [ OK ]
>
> Thanks a lot!
>
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
>
>   
Doug??

Tziporet


From tziporet at dev.mellanox.co.il  Thu Feb 12 03:28:29 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 12 Feb 2009 13:28:29 +0200
Subject: [ewg] Re: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes
In-Reply-To: <499311B4.4090607@nasa.gov>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>
	<499311B4.4090607@nasa.gov>
Message-ID: <499407DD.4070307@mellanox.co.il>

Jeff Becker wrote:
>
> Thanks to NASA's developing relationship with Novell, I got access to
> SLES11 rc3 iso's. I'm downloading them now, and will start on the
> backports when that's done.
>   

Thanks a lot

Tziporet


From ruffing at motama.com  Thu Feb 12 03:48:19 2009
From: ruffing at motama.com (Jan Ruffing)
Date: Thu, 12 Feb 2009 12:48:19 +0100
Subject: [ofa-general] Drop in TCP performance when using OFED?
Message-ID: <49940C83.5020909@motama.com>

Hello,

After I installed the OFED (1.4 beta), I noticed a drop in TCP
performance via Infiniband: from 10 GBit/s to less than 8 GBit/s.
Is that "expected behaviour"? Is there a way to avoid this performance loss?

The HCA used in both test machines is a Mellanox Infinihost III Lx DDR
HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel.


Performance with Open Suse 11 "out of the box", using Open Suse 11
Infiniband packages:

tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size:   515 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.1 port 47730 connected with 192.168.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  10.0 Gbits/sec


Performance after the installation of OFED 1.4 beta:

tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size:   902 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.1 port 38864 connected with 192.168.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  8.43 GBytes  7.24 Gbits/sec
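
As a quick sanity check on the two iperf results, the sketch below recomputes the bandwidth from the transfer totals and the relative drop between the runs. It assumes that iperf reports GBytes as 2^30 bytes and Gbits as 10^9 bits; that is an assumption about the tool's units, and both function names are hypothetical.

```c
#include <assert.h>

/* Convert a transfer total to Gbit/s, assuming GBytes means 2^30 bytes
 * and Gbits means 10^9 bits (an assumption about iperf's units). */
static double gbytes_to_gbits_per_sec(double gbytes, double seconds)
{
	assert(seconds > 0.0);
	return gbytes * 1073741824.0 * 8.0 / seconds / 1e9;
}

/* Fractional drop from one bandwidth figure to another. */
static double relative_drop(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before;
}
```

Under that assumption, 8.43 GBytes in 10 s works out to about 7.24 Gbits/sec and 11.6 GBytes to just under 10 Gbits/sec, so the reported figures are self-consistent; relative_drop(10.0, 7.24) is about 0.28, a drop of roughly 28%.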


Thanks in advance,
Jan

-- 
Jan Ruffing
Software Developer

Motama GmbH
Lortzingstraße 10 · 66111 Saarbrücken · Germany
tel +49 681 940 85 50 · fax +49 681 940 85 49
ruffing at motama.com · www.motama.com

Companies register · district council Saarbrücken · HRB 15249
CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger

This e-mail may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this e-mail 
in error) please notify the sender immediately and destroy this 
e-mail. Any unauthorized copying, disclosure or distribution of the 
material in this e-mail is strictly forbidden.


From hal.rosenstock at gmail.com  Thu Feb 12 04:04:44 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 12 Feb 2009 07:04:44 -0500
Subject: Re: [ofa-general] sminfo report iberror in the first
	configuration on RHEL5.3
In-Reply-To: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
References: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
Message-ID: <f0e08f230902120404sc03d51ayc34bc3327d5a588b@mail.gmail.com>

On Thu, Feb 12, 2009 at 2:37 AM, Wen Hao Wang <wangwhao at cn.ibm.com> wrote:
> Hi all:
>
> I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in
> RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and wrote
> network interface configuration file ifcfg-ib0. ifup ib0 also succeeded. But
> IB utilities report Connection timed out.
>
>
> [root at xblade06 network-scripts]# sminfo
> ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> sminfo: iberror: failed: query

It looks like the SM found the blade and at least configured the SMLID
but somehow LID routing did not work between the blade and the SM (at
LID 9). Was this problem persistent (without rebooting the blade) ?
Was the blade IB port active ?

-- Hal

> I had to reboot the blade and rerun "openibd start". Then sminfo reported
> correct contents. I do not suppose this reboot is required. Did I miss any
> configuration step?
>
> Moreover, "openibd start" report one warning message about hwconf. Anyone
> has comments about this?
>
> [root at xblade07 ~]# /etc/init.d/openibd start
> Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or
> directory
> [ OK ]
>
> Thanks a lot!
>
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>


From nicolas.morey-chaisemartin at ext.bull.net  Thu Feb 12 04:20:36 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 12 Feb 2009 13:20:36 +0100
Subject: [ofa-general] sminfo report iberror in the first configuration
	on	RHEL5.3
In-Reply-To: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
References: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
Message-ID: <49941414.2050400@ext.bull.net>

Wen Hao Wang wrote:
>
> Hi all:
>
> I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped 
> in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and 
> wrote network interface configuration file ifcfg-ib0. ifup ib0 also 
> succeeded. But IB utilities report Connection timed out.
>
>
> [root at xblade06 network-scripts]# sminfo
> ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> sminfo: iberror: failed: query
>
> I had to reboot the blade and rerun "openibd start". Then sminfo 
> reported correct contents. I do not suppose this reboot is required. 
> Did I miss any configuration step?
>
> Moreover, "openibd start" report one warning message about hwconf. 
> Anyone has comments about this?
>
> [root at xblade07 ~]# /etc/init.d/openibd start
> Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such 
> file or directory
> [ OK ]
>
> Thanks a lot!
>
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
>
Sounds to me as if you don't have any Subnet Manager (OpenSM or a managed 
switch) running.
$sminfo
sminfo: sm lid 2 sm guid 0x8f1040041254a, activity count 751941 priority 
3 state 3 SMINFO_MASTER
$ sminfo -P 2
ibwarn: [17975] mad_rpc: _do_madrpc failed; dport (Lid 3945)
sminfo: iberror: failed: query

(we don't have any SM on the subnet connected to port 2)

Your reboot might have started OpenSM, thus making it work.

Nicolas


From tziporet at dev.mellanox.co.il  Thu Feb 12 04:21:30 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 12 Feb 2009 14:21:30 +0200
Subject: [ofa-general] Drop in TCP performance when using OFED?
In-Reply-To: <49940C83.5020909@motama.com>
References: <49940C83.5020909@motama.com>
Message-ID: <4994144A.8010102@mellanox.co.il>

Jan Ruffing wrote:
> Hello,
>
> After I installed the OFED (1.4 beta), I noticed a drop in TCP
> performance via Infiniband: from 10 GBit/s to less than 8 GBit/s.
> Is that "expected behaviour"? Is there a way to avoid this performance loss?
>
> The HCA used in both test machines is a Mellanox Infinihost III Lx DDR
> HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel.
>   

Is it SDP or IPoIB?
What is the FW version you use?
>
> Performance with Open Suse 11 "out of the box", using Open Suse 11
> Infiniband packages:
>
> tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
> ------------------------------------------------------------
> Client connecting to 192.168.2.2, TCP port 5001
> TCP window size: 515 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.2.1 port 47730 connected with 192.168.2.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 11.6 GBytes 10.0 Gbits/sec
>
>
> Performance after the installation of OFED 1.4 beta:
>
> tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
> ------------------------------------------------------------
> Client connecting to 192.168.2.2, TCP port 5001
> TCP window size:   902 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.2.1 port 38864 connected with 192.168.2.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  8.43 GBytes  7.24 Gbits/sec
>
>
> Thanks in advance,
> Jan
>
>   


From hal.rosenstock at gmail.com  Thu Feb 12 04:41:28 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 12 Feb 2009 07:41:28 -0500
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090207123355.GP17713@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
Message-ID: <f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>

Sasha,

On 2/7/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 14:12 Fri 06 Feb     , Hal Rosenstock wrote:
>>
>> I'm looking at adding pkey support into the OpenSM vendor layer. The
>> pkey table is a per port structure and is part of ib_port_attr_t. That
>> structure also includes num_pkeys. There is only one related API:
>> osm_vendor_get_all_port_attr which takes several pointers, the second
>> one is a pointer to a preallocated array of port attributes (memory
>> allocation for that is done by the client). ib_port_attr_t includes a
>> pointer to the pkey table. So the only way this can work is if that
>> allocation is also done by the client which makes that a valid
>> parameter on input (as well as output).
>
> This could be a client choice: if pkey table pointer is initialized as
> NULL osm_vendor_get_all_port_attr() allocates memory and initialize the
> table and its size, otherwise it fills up only provided by client pkey
> table entries.

That's what I originally thought too but I'm not so sure looking at
the other vendor layers. For example, osm_vendor_al.c (which I think
is used in Windows currently) has the following code in
osm_vendor_get_all_port_attr (and other vendor layers except umad are
similar):

                        for (port_num = 0; port_num < num_ports; port_num++) {
                                p_attr_array[port_count] =
                                    *__osm_ca_info_get_port_attr_ptr(p_ca_info,
                                                                     port_num);
                                port_count++;
                        }

and

static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t *
                                                       const p_ca_info,
                                                       IN const uint8_t index)
{
        return (&p_ca_info->p_attr->p_port_attr[index]);
}

so I'm thinking the tables need to be supplied by the underlying
vendor library (al, umad, ...). Do you concur ?

-- Hal

> Sasha
>
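
For reference, the "client choice" convention Sasha suggests in the
quoted text (a NULL table pointer means the callee allocates, otherwise
only the client-provided entries are filled) could look roughly like the
sketch below. The struct and function names are hypothetical stand-ins,
not the real ib_port_attr_t or any existing vendor-layer call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the pkey-related fields of ib_port_attr_t. */
struct port_attr_sketch {
	uint16_t num_pkeys;     /* in: capacity of client table; out: valid entries */
	uint16_t *p_pkey_table; /* NULL on input means "callee allocates" */
};

/*
 * Fill attr from a source table of src_count entries.
 * Returns 0 on success, -1 on allocation failure.
 */
static int fill_pkey_table(struct port_attr_sketch *attr,
			   const uint16_t *src, uint16_t src_count)
{
	uint16_t n;

	assert(attr != NULL && src != NULL);

	if (src_count == 0) {
		/* Nothing to copy; leave the table pointer untouched. */
		attr->num_pkeys = 0;
		return 0;
	}

	if (attr->p_pkey_table == NULL) {
		/* Callee allocates; the caller must free() the table later. */
		attr->p_pkey_table = malloc(src_count * sizeof(*src));
		if (attr->p_pkey_table == NULL)
			return -1;
		n = src_count;
	} else {
		/* Client-provided table: never write past its capacity. */
		n = attr->num_pkeys < src_count ? attr->num_pkeys : src_count;
	}

	memcpy(attr->p_pkey_table, src, n * sizeof(*src));
	attr->num_pkeys = n;
	return 0;
}
```

The catch raised above still applies: this only works if every vendor
layer treats the pointer as valid on input, which the osm_vendor_al.c
code shown does not, since it copies the whole structure by value.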


From hal.rosenstock at gmail.com  Thu Feb 12 05:13:47 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 12 Feb 2009 08:13:47 -0500
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
Message-ID: <f0e08f230902120513k4e1ba6a7y40139648f73f27ed@mail.gmail.com>

On Thu, Feb 12, 2009 at 7:41 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> so I'm thinking the tables need to be supplied by the underlying
> vendor library (al, umad, ...). Do you concur ?

If so, this can be supported as part of umad or better yet as part of
OpenSM umad vendor with no umad changes.

-- Hal


From dledford at redhat.com  Thu Feb 12 05:20:30 2009
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 12 Feb 2009 08:20:30 -0500
Subject: [ofa-general] sminfo report iberror in the first configuration
	on	RHEL5.3
In-Reply-To: <4994060C.4050001@mellanox.co.il>
References: <OFC3E4FD73.F9810D25-ON4825755B.00253823-4825755B.0029DBF8@cn.ibm.com>
	<4994060C.4050001@mellanox.co.il>
Message-ID: <1234444830.10037.313.camel@firewall.xsintricity.com>

On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> Wen Hao Wang wrote:
> >
> > Hi all:
> >
> > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped 
> > in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and 
> > wrote network interface configuration file ifcfg-ib0. ifup ib0 also 
> > succeeded. But IB utilities report Connection timed out.
> >
> >
> > [root at xblade06 network-scripts]# sminfo
> > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > sminfo: iberror: failed: query
> >
> > I had to reboot the blade and rerun "openibd start". Then sminfo 
> > reported correct contents. I do not suppose this reboot is required. 
> > Did I miss any configuration step?

There was an unintentional bug in the rhel5.2 openibd init script in
that it automatically turned itself on during install (generally, most
init scripts should default to *not* turning themselves on during
install of the package, nor should they start themselves during install
of the package...this is for security reasons, imagine if you installed
the bind name server on your box and it automatically started up before
you had a chance to configure it).  In rhel5.3 we fixed that bug.  So,
you may need to 'chkconfig --level 2345 openibd on' to make sure openibd
starts up each time.  The error you list above is consistent with not
all of the kernel modules being loaded when you tried to use the sminfo
program.

> > Moreover, "openibd start" report one warning message about hwconf. 
> > Anyone has comments about this?
> >
> > [root at xblade07 ~]# /etc/init.d/openibd start
> > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such 
> > file or directory
> > [ OK ]

Can you see if the kudzu package is installed on your machine?  The
openib package uses this config file written by kudzu to determine what
hardware drivers to load.  I suppose I should put a specific requires in
the rpm for that.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband


From ruffing at motama.com  Thu Feb 12 05:35:49 2009
From: ruffing at motama.com (Jan Ruffing)
Date: Thu, 12 Feb 2009 14:35:49 +0100
Subject: [ofa-general] Drop in TCP performance when using OFED?
In-Reply-To: <4994144A.8010102@mellanox.co.il>
References: <49940C83.5020909@motama.com> <4994144A.8010102@mellanox.co.il>
Message-ID: <499425B5.2050000@motama.com>

Tziporet Koren wrote:
> Jan Ruffing wrote:
>> After I installed the OFED (1.4 beta), I noticed a drop in TCP
>> performance via Infiniband: from 10 GBit/s to less than 8 GBit/s.
>> Is that "expected behaviour"? Is there a way to avoid this
>> performance loss?
>>
>> The HCA used in both test machines is a Mellanox Infinihost III Lx DDR
>> HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel.
>>   
>
> Is it SDP or IPoIB?
> What is the FW version you use?
That's using IPoIB.
The FW version is 1.2.0 (according to ibv_devinfo).

-- 
Jan Ruffing
Software Developer

Motama GmbH
Lortzingstraße 10 · 66111 Saarbrücken · Germany
tel +49 681 940 85 50 · fax +49 681 940 85 49
ruffing at motama.com · www.motama.com

Companies register · district council Saarbrücken · HRB 15249
CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger



From kliteyn at dev.mellanox.co.il  Thu Feb 12 06:55:39 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 12 Feb 2009 16:55:39 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa.c: fixing SA MAD dump 
Message-ID: <4994386B.1040703@dev.mellanox.co.il>

Hi Sasha,

osm_sa_send() returns the MAD to the pool after sending it,
so dumping the MAD after sending it is wrong - fixing.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_sa.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 185557f..416d44a 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -498,9 +498,9 @@ void osm_sa_respond(osm_sa_t *sa, osm_madw_t *madw, size_t attr_size,
 		free(item);
 	}

+	osm_dump_sa_mad(sa->p_log, resp_sa_mad, OSM_LOG_FRAMES);
 	osm_sa_send(sa, resp_madw, FALSE);

-	osm_dump_sa_mad(sa->p_log, resp_sa_mad, OSM_LOG_FRAMES);
 Exit:
 	/* need to set the mem free ... */
 	item = cl_qlist_remove_head(list);
-- 
1.5.1.4
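
The ordering problem this patch addresses can be shown in isolation. The
sketch below is not OpenSM code: the one-slot "pool", the struct and all
function names are hypothetical, and the send routine is simply assumed
to release the buffer the way the description above says osm_sa_send()
does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical one-slot "pool" entry; stands in for a pooled MAD. */
struct mad_sketch {
	int in_use;
	char payload[16];
};

/* Hypothetical send: transmits, then hands the buffer back to the pool. */
static void send_and_release(struct mad_sketch *mad)
{
	/* ... transmit mad->payload here ... */
	memset(mad->payload, 0, sizeof(mad->payload)); /* pool may scrub or reuse it */
	mad->in_use = 0;
}

/* Format the payload into out; returns the number of characters written. */
static int dump_mad(const struct mad_sketch *mad, char *out, size_t len)
{
	assert(mad->in_use); /* dumping a released buffer is a bug */
	return snprintf(out, len, "MAD: %s", mad->payload);
}

/* Correct ordering: dump first, send (and release) last. */
static int respond(struct mad_sketch *mad, char *log, size_t len)
{
	int n = dump_mad(mad, log, len);

	send_and_release(mad);
	return n;
}
```

Swapping the two calls in respond() would trip the assertion in this
sketch; in real code the buffer may already be reused by then, so the
dump would show stale or unrelated contents.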


From kliteyn at dev.mellanox.co.il  Thu Feb 12 07:01:22 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 12 Feb 2009 17:01:22 +0200
Subject: [ofa-general] [PATCH] opensm/osm_state_mgr.c: small bug in scanning
	lid table
Message-ID: <499439C2.40206@dev.mellanox.co.il>

Hi Sasha,

ref_size and curr_size hold the size of the array, which is
indexed by LID starting from 0, so max_lid is one past the
highest LID that can actually be in use.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_state_mgr.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index f5d3837..0a27044 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -932,7 +932,7 @@ static void __osm_state_mgr_check_tbl_consistency(IN osm_sm_t * sm)
 	/* They should be the same, but compare it anyway */
 	max_lid = (ref_size > curr_size) ? ref_size : curr_size;

-	for (lid = 1; lid <= max_lid; lid++) {
+	for (lid = 1; lid < max_lid; lid++) {
 		p_port_ref = NULL;
 		p_port_stored = NULL;
 		cl_ptr_vector_at(p_port_lid_tbl, lid, (void *)&p_port_stored);
@@ -1006,7 +1006,7 @@ static void cleanup_switch(cl_map_item_t *item, void *log)

 	if (!sw->new_lft)
 		return;
-	
+
 	if (memcmp(sw->lft, sw->new_lft, IB_LID_UCAST_END_HO + 1))
 		osm_log(log, OSM_LOG_ERROR, "ERR 331D: "
 			"LFT of switch 0x%016" PRIx64 " is not up to date.\n",
-- 
1.5.1.4
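
The off-by-one can be illustrated with a plain C array instead of a
cl_ptr_vector. The helper below is a hypothetical sketch, not OpenSM
code; it only shows why a loop bounded by the table size must use `<`
rather than `<=`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count non-NULL entries of a LID-indexed table. `size` is the number of
 * slots (valid indexes 0..size-1); LID 0 is reserved, so start at 1.
 */
static size_t count_used_lids(void *const *tbl, size_t size)
{
	size_t lid, used = 0;

	for (lid = 1; lid < size; lid++) /* `<=` would touch tbl[size] */
		if (tbl[lid] != NULL)
			used++;
	return used;
}
```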


From purdy at sgi.com  Thu Feb 12 07:14:36 2009
From: purdy at sgi.com (Dale Purdy)
Date: Thu, 12 Feb 2009 09:14:36 -0600
Subject: [ofa-general] [PATCH] Fix credit loop checking
Message-ID: <20090212151436.GA17309@sgi.com>


 ibdiagnet/ibdiagui and ibdmchk assume that up/down routing is being
 used if it is able to find roots, whether the roots are correct or not.
 If it finds roots it does its credit loop validation based on whether
 the up/down rules are followed instead of doing a rigorous credit loop
 check.  If the roots aren't correct, this can lead to determination of
 credit loop problems on topologies that don't have problems.  ibdmchk
 has an option to supply one's own root_guids file to override this if
 you actually are using up/down and have your own roots that were used
 to set up the routing, but ibdiagnet/ibdiagui doesn't.  In any case
 there shouldn't be assumptions about what the routing algorithm is, or
 what the roots are when checking for credit loops.

 Add a -u option to ibdiagnet/ibdiagui.  Change ibdiagnet/ibdiagui and
 ibdmchk to only do the simple up/down rule checking against roots when
 the -u option is used.  Otherwise the full credit loop check is done.

Signed-off-by: Dale Purdy <purdy at sgi.com>
---
 ibdiag/doc/ibdiagnet.pod  |   11 ++++++++++-
 ibdiag/doc/ibdiagui.pod   |    2 +-
 ibdiag/src/ibdebug.tcl    |   11 +++++++----
 ibdiag/src/ibdebug_if.tcl |   10 +++++++---
 ibdm/src/osm_check.cpp    |   22 +++++++++-------------
 5 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/ibdiag/doc/ibdiagnet.pod b/ibdiag/doc/ibdiagnet.pod
index d2cf460..cdc78ed 100644
--- a/ibdiag/doc/ibdiagnet.pod
+++ b/ibdiag/doc/ibdiagnet.pod
@@ -4,7 +4,7 @@ B<ibdiagnet - IB diagnostic net>
 
 =head1 SYNOPSYS
 
-ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>]
+ibdiagnet [-c <count>] [-v] [-r] [-u] [-o <out-dir>]
   [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt]
   [-pm] [-pc] [-P <<PM>=<Value>>]
   [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
@@ -135,6 +135,15 @@ Provides a report of the fabric qualities
 
 =back
 
+=item B<-u>              :
+
+=over
+
+=item
+Credit loop check based on UpDown rules
+
+=back
+
 =item B<-t <topo-file>>  :
 
 =over
diff --git a/ibdiag/doc/ibdiagui.pod b/ibdiag/doc/ibdiagui.pod
index 4e0250f..86a2df9 100644
--- a/ibdiag/doc/ibdiagui.pod
+++ b/ibdiag/doc/ibdiagui.pod
@@ -4,7 +4,7 @@ B<ibdiagui - IB Diagnostic GUI>
 
 =head1 SYNOPSYS
 
-ibdiagui [-c <count>] [-v] [-r] [-o <out-dir>]
+ibdiagui [-c <count>] [-v] [-r] [-u] [-o <out-dir>]
      [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>]
      [-pm] [-pc] [-P <PM counter>=<Trash Limit>]
      [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
diff --git a/ibdiag/src/ibdebug.tcl b/ibdiag/src/ibdebug.tcl
index 3a464f2..04a8566 100644
--- a/ibdiag/src/ibdebug.tcl
+++ b/ibdiag/src/ibdebug.tcl
@@ -4391,10 +4391,13 @@ proc DumpFabQualities {} {
     inform "-I-ibdiagnet:check.credit.loops.header"
 
     # report credit loops
-    ibdmCalcMinHopTables $fabric
-    set roots [ibdmFindRootNodesByMinHop $fabric]
-    # just flush out any logs
-    set report [ibdmGetAndClearInternalLog]
+    set roots ""
+    if { [info exists G(argv:updown)] } {
+	ibdmCalcMinHopTables $fabric
+	set roots [ibdmFindRootNodesByMinHop $fabric]
+	# just flush out any logs
+	set report [ibdmGetAndClearInternalLog]
+    }
     if {[llength $roots]} {
 	inform "-I-reporting:found.roots" $roots
 	ibdmReportNonUpDownCa2CaPaths $fabric $roots
diff --git a/ibdiag/src/ibdebug_if.tcl b/ibdiag/src/ibdebug_if.tcl
index 21afc45..cf1b571 100644
--- a/ibdiag/src/ibdebug_if.tcl
+++ b/ibdiag/src/ibdebug_if.tcl
@@ -163,6 +163,10 @@ proc SetInfoArgv {} {
 	-t,param "topo-file"
 	-t,desc  "Specifies the topology file name"
 
+	-u,name  "updown"
+	-u,desc  "Indicates that UpDown rule checking should be done against automaticly determined roots"
+	-u,arglen   0
+
 	-v,name  "verbose"
 	-v,desc  "Instructs the tool to run in verbose mode"
 	-v,arglen   0
@@ -322,8 +326,8 @@ proc SetToolsFlags {} {
     array set TOOLS_FLAGS {
 	ibping     "(n|l|d) . c w v o     . t s i p "
 	ibdiagpath "(n|l|d) . c   v o smp . t s i p    . pm pc P . lw ls sl ."
-	ibdiagui   "          c   v r o   . t s i p    . pm pc P . lw ls ."
-	ibdiagnet  "          c   v r o   . t s i p wt . pm pc P . lw ls    . skip load_db csv"
+	ibdiagui   "          c   v r u o   . t s i p    . pm pc P . lw ls ."
+	ibdiagnet  "          c   v r u o   . t s i p wt . pm pc P . lw ls    . skip load_db csv"
 	ibcfg    "(n|l|d) (c|q)       . t s i p o"
 	ibmad    "(m) (a) (n|l|d)     . t s i p o ; (q) a"
 	ibsac    "(m) (a) k           . t s i p o ; (q) a"
@@ -2535,7 +2539,7 @@ proc showHelpPage { args } {
             Hop-count information:
             maximal hop-count, an example path, and a hop-count histogram
             All CA-to-CA paths traced
-            Credit loop report
+            Credit loop report (based on UpDown if -u option is provided)
             mgid-mlid-HCAs matching table
             Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not
             reported.
diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp
index 1c18c1c..09a3637 100644
--- a/ibdm/src/osm_check.cpp
+++ b/ibdm/src/osm_check.cpp
@@ -552,21 +552,17 @@ int main (int argc, char **argv) {
   list <IBNode *> rootNodes;
   int anyErr = 0;
 
-  if (RootsFileName.size())
-    {
-      if (TopoFile.size())
-	{
-	  rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName);
-	}
-      else
-	{
-	  rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName);
-	}
-    }
-  else
-    {
+  if (UseUpDown) {
+    if (RootsFileName.size()) {
+      if (TopoFile.size()) {
+        rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName);
+      } else {
+        rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName);
+      }
+    } else {
       rootNodes = SubnMgtFindRootNodesByMinHop(&fabric);
     }
+  }
 
   if (!rootNodes.empty()) {
     cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl;
-- 
1.5.6.5
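
To illustrate what a check that makes no assumption about the routing
algorithm or the roots amounts to: a credit loop is a cycle in the
directed graph of channel dependencies, so it can be found with an
ordinary depth-first search. The sketch below is not the ibdm
implementation; the adjacency-matrix representation, MAX_CH and the
function names are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_CH 16 /* hypothetical upper bound on channels in this sketch */

/* dep[a][b] != 0 means a packet holding channel a may wait for channel b. */
static int visit(int n, unsigned char dep[MAX_CH][MAX_CH],
		 unsigned char *color, int u)
{
	int v;

	color[u] = 1; /* on the current DFS path */
	for (v = 0; v < n; v++) {
		if (!dep[u][v])
			continue;
		if (color[v] == 1) /* back edge: cycle found */
			return 1;
		if (color[v] == 0 && visit(n, dep, color, v))
			return 1;
	}
	color[u] = 2; /* fully explored, no cycle through u */
	return 0;
}

/* Returns 1 if the dependency graph over n channels contains a cycle. */
static int has_credit_loop(int n, unsigned char dep[MAX_CH][MAX_CH])
{
	unsigned char color[MAX_CH];
	int u;

	assert(n >= 0 && n <= MAX_CH);
	memset(color, 0, sizeof(color));
	for (u = 0; u < n; u++)
		if (color[u] == 0 && visit(n, dep, color, u))
			return 1;
	return 0;
}
```

An up/down rule check needs the roots to decide which links count as
"up"; the cycle search above needs only the dependencies themselves,
which is why wrong roots cannot produce false reports with it.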


From stan.smith at intel.com  Thu Feb 12 08:51:55 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Thu, 12 Feb 2009 08:51:55 -0800
Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names
	missing from osm_vendor_t ?
In-Reply-To: <20090211014635.GS26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>
	<20090211014635.GS26139@sashak.voltaire.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E19319BF21F@orsmsx501.amr.corp.intel.com>

Sasha Khapyorsky wrote:
> On 16:34 Mon 09 Feb     , Smith, Stan wrote:
>>
>> Path of least resistance thinking would point towards not doing a
>> switch as the vendor-ibal to vendor-ibumad would be part of the
>> Windows OpenSM migration to OFED 1.4x OpenSM. My thinking is that
>> making a switch to vendor-ibumad would be a good deal more
>> work/involved just to get saquery working.
>
> For just saquery it would be overkill. (BTW I posted patch which
> cleans osm vendor calls from saquery - hope the problem of vendor-ibal
> extending will be eliminated soon).

Thank you very much! Yes, your new saquery patches will eliminate the
vendor-ibal issues and any proposed vendor-ibal mods.

Stan.

>
> I was thinking about vendor switching in context of OpenSM itself - in
> order to unify OpenSM/umad access layer between different systems (and
> eventually to cleanup all those osm vendor mess).
>
>> Not knowing the Windows OpenSM code base, moving part of it forward
>> seems like a task 'which' could blossom into a good deal more work
>> for the small return of saquery working? Frankly, I'd rather see
>> work put into getting OFED OpenSM 1.4 working on Windows.
>
> Sure, it could be done as part of WinOF OpenSM upgrade process (doing
> this just for fun against an outdated OpenSM codebase doesn't buy
> much).
>
> Sasha


From sashak at voltaire.com  Thu Feb 12 12:00:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 12 Feb 2009 22:00:25 +0200
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
Message-ID: <20090212200025.GC14416@sashak.voltaire.com>

Hi Hal,

On 07:41 Thu 12 Feb     , Hal Rosenstock wrote:
> 
> That's what I originally thought too but I'm not so sure looking at
> the other vendor layers. For example, osm_vendor_al.c (which I think
> is used in Windows currently) has the following code in
> osm_vendor_get_all_port_attr (and other vendor layers except umad are
> similar):
> 
>                         for (port_num = 0; port_num < num_ports; port_num++) {
>                                 p_attr_array[port_count] =
>                                     *__osm_ca_info_get_port_attr_ptr(p_ca_info,
>                                                                      port_num);
>                                 port_count++;
>                         }
> 
> and
> 
> static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t *
>                                                        const p_ca_info,
>                                                        IN const uint8_t index)
> {
>         return (&p_ca_info->p_attr->p_port_attr[index]);
> }
> 
> so I'm thinking the tables need to be supplied by the underlying
> vendor library (al, umad, ...). Do you concur ?

It is already supplied by libibumad - by umad_get_ca()
(ca.ports[i]->pkeys). I think you just need to copy this to
ib_port_attr_t structure.

Sasha


From sashak at voltaire.com  Thu Feb 12 12:12:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 12 Feb 2009 22:12:02 +0200
Subject: [ofa-general] Re: [PATCH v3] opensm/osm_console.c : Added
	dump_portguid function
	to console to generate a list of port guids matching one or more
	regexps
In-Reply-To: <4993C5C3.6020700@ext.bull.net>
References: <4993C5C3.6020700@ext.bull.net>
Message-ID: <20090212201202.GD14416@sashak.voltaire.com>

On 07:46 Thu 12 Feb     , Nicolas Morey Chaisemartin wrote:
> This adds a dump_portguid function to the OpenSM console, which makes it 
> really easy to generate cn_guid_file, root_guid_file and such
> by dumping into a file all port guids whose nodedesc matches at least one 
> of the provided regexps
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Feb 12 12:20:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 12 Feb 2009 22:20:20 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa.c: fixing SA MAD dump
In-Reply-To: <4994386B.1040703@dev.mellanox.co.il>
References: <4994386B.1040703@dev.mellanox.co.il>
Message-ID: <20090212202020.GE14416@sashak.voltaire.com>

On 16:55 Thu 12 Feb     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> osm_sa_send() returns the MAD to the pool after sending it,
> so dumping the MAD after sending it is wrong - fixing.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.


From sashak at voltaire.com  Thu Feb 12 12:31:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 12 Feb 2009 22:31:05 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: small bug in
	scanning lid table
In-Reply-To: <499439C2.40206@dev.mellanox.co.il>
References: <499439C2.40206@dev.mellanox.co.il>
Message-ID: <20090212203105.GF14416@sashak.voltaire.com>

On 17:01 Thu 12 Feb     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> ref_size and curr_size return the size of the array,
> which counts LIDs from 0, so max_lid will be out of
> actual LIDs that are used.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From arlin.r.davis at intel.com  Thu Feb 12 14:25:50 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 12 Feb 2009 14:25:50 -0800
Subject: [ofa-general] Question on dat_ep_post_rdma_write with
	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <499389BB.6060806@cs.anu.edu.au>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>


>I am a bit confused by the description of
>DAT_COMPLETION_SUPPRESS_FLAG.
>
>It looks like it suppresses notification after DTO operations. Is 
>that always true?

Yes, with the exception of errors. 

>I have found that when I use dat_ep_post_rdma_write to transfer more
>than 128k of data (within 1 iov), an event is delivered to the server
>side (verified with the cookie), and the client side receives an event
>with Invalid_DAT_EVENT_NUMBER.

What side is server and which is client? You will not see any 
indication on the remote side of an rdma_write. If you see an 
event with invalid event number then there is a failure during 
the operation or the QP went into error state.

What version of uDAPL are you using? 2.0 or 1.2? 

Is this IB or iWARP?

-arlin

From andy.grover at oracle.com  Thu Feb 12 14:26:05 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 12 Feb 2009 14:26:05 -0800
Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp
Message-ID: <4994A1FD.2060704@oracle.com>

Hi Vlad,

Bringing up an old issue...

With RDS-level flow control enabled, RDS attempts to set rnr_counter to
0 on an already connected QP by transitioning through SQD state. SQD is
not supported on ConnectX, and so we either need to do it the right way or
make other plans.

Is there an alternative way to adjust rnr_counter, or should we just
assume this is unchangeable once connected?

Thanks -- Regards -- Andy


From Jie.Cai at cs.anu.edu.au  Thu Feb 12 14:42:26 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 13 Feb 2009 09:42:26 +1100
Subject: [ofa-general] Question on dat_ep_post_rdma_write
	with	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
Message-ID: <4994A5D2.1040405@cs.anu.edu.au>


Davis, Arlin R wrote:
>> I am a bit confused by the description of
>> DAT_COMPLETION_SUPPRESS_FLAG.
>>
>> It looks like it suppresses notification after DTO operations. Is 
>> that always true?
>>     
>
> Yes, with the exception of errors. 
>
>   
>> I have found that when I use dat_ep_post_rdma_write to transfer more
>> than 128k of data (within 1 iov), an event is delivered to the server
>> side (verified with the cookie), and the client side receives an event
>> with Invalid_DAT_EVENT_NUMBER.
>>     
>
> What side is server and which is client? 
The server side did the RDMA write, and the client side is the remote side.

> You will not see any 
> indication on the remote side of an rdma_write. If you see an 
> event with invalid event number then there is a failure during 
> the operation or the QP went into error state.
>
> What version of uDAPL are you using? 2.0 or 1.2? 
>   
I am using uDAPL 2.0.

However, when I used DAT_COMPLETION_SUPPRESS_FLAG
on the server side and the data being transferred was larger than 128KB,
an event still arrived on the server side with the RDMA write cookie.

Is there a limit on the size of the data being transferred?


> Is this IB or iWARP?
>
>   
This is IB, and I am using Mellanox ConnectX IB HCAs.
> -arlin

- Jie


From sean.hefty at intel.com  Thu Feb 12 14:34:49 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Feb 2009 14:34:49 -0800
Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp
In-Reply-To: <4994A1FD.2060704@oracle.com>
References: <4994A1FD.2060704@oracle.com>
Message-ID: <EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>

>With RDS-level flow control enabled, RDS attempts to set rnr_counter to
>0 on an already connected QP by transitioning through SQD state. SQD is
>not supported on ConnectX, and so we either need do it the right way or
>make other plans.

Can this be set to 0 when connecting?

- Sean


From andy.grover at oracle.com  Thu Feb 12 14:43:49 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 12 Feb 2009 14:43:49 -0800
Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp
In-Reply-To: <EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>
References: <4994A1FD.2060704@oracle.com>
	<EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>
Message-ID: <4994A625.9060008@oracle.com>

Sean Hefty wrote:
>> With RDS-level flow control enabled, RDS attempts to set rnr_counter to
>> 0 on an already connected QP by transitioning through SQD state. SQD is
>> not supported on ConnectX, and so we either need do it the right way or
>> make other plans.
> 
> Can this be set to 0 when connecting?

Yes of course, it would just be a little nicer if we could change once
connected, instead of only when initiating the connection, so I wanted
to find out if that was also possible.

Thanks -- Regards -- Andy


From rdreier at cisco.com  Thu Feb 12 14:47:32 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Feb 2009 14:47:32 -0800
Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp
In-Reply-To: <4994A625.9060008@oracle.com> (Andy Grover's message of "Thu, 12
	Feb 2009 14:43:49 -0800")
References: <4994A1FD.2060704@oracle.com>
	<EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>
	<4994A625.9060008@oracle.com>
Message-ID: <ada63jf6znf.fsf@cisco.com>

>> With RDS-level flow control enabled, RDS attempts to set rnr_counter to
>> 0 on an already connected QP by transitioning through SQD state. SQD is
>> not supported on ConnectX, and so we either need do it the right way or
>> make other plans.

Is SQD really not supported by ConnectX?  If so it is likely a temporary
firmware issue I would think.

 - R.


From arlin.r.davis at intel.com  Thu Feb 12 14:48:36 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 12 Feb 2009 14:48:36 -0800
Subject: [ofa-general] Question on dat_ep_post_rdma_write with
	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <4994A5D2.1040405@cs.anu.edu.au>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
	<4994A5D2.1040405@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>

 
>However, when I used DAT_COMPLETION_SUPPRESS_FLAG
>at server side and the data been transfered is larger than 128KB,
>there is an event come to the server side with rdma write cookie.

You are most likely running into access violations or some other
error. You should see the following message with any DTO error:

"DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n"

What is the DTO event status on the server side?

>Is there an limitation on the size of data been transfered?

It is based on the HCA max_msg_sz, usually 2 GBytes (see ibv_devinfo -v).

-arlin


From Jie.Cai at cs.anu.edu.au  Thu Feb 12 15:09:24 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 13 Feb 2009 10:09:24 +1100
Subject: [ofa-general] Question on dat_ep_post_rdma_write
	with	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
	<4994A5D2.1040405@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>
Message-ID: <4994AC24.3090907@cs.anu.edu.au>


Davis, Arlin R wrote:
>  
>   
>> However, when I used DAT_COMPLETION_SUPPRESS_FLAG
>> at server side and the data been transfered is larger than 128KB,
>> there is an event come to the server side with rdma write cookie.
>>     
>
> You are most likely running into access violations or some other
> error. You should see the following message with any DTO error:
>
> "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n"
>   
I didn't see this error message.
> What is the DTO event status on the server side?
>
>   
12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP
>> Is there an limitation on the size of data been transfered?
>>     
>
> based on the HCA max_msg_sz, usually 2GBytes (ibv_devinfo -v). 
>
> -arlin
>
>   


From arlin.r.davis at intel.com  Thu Feb 12 15:52:22 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 12 Feb 2009 15:52:22 -0800
Subject: [ofa-general] Question on dat_ep_post_rdma_write with
	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <4994AC24.3090907@cs.anu.edu.au>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
	<4994A5D2.1040405@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>
	<4994AC24.3090907@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A5D3661@orsmsx506.amr.corp.intel.com>

 
>> You are most likely running into access violations or some other
>> error. You should see the following message with any DTO error:
>>
>> "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n"
>>   
>I didn't see this error message.

What dapl packages are installed? rpm -qa | grep dapl

What provider device name are you using? ofa-v2-ib0?

>>   
>12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP

Your DTO event string mapping looks odd. You have an error minor 
status along with a success major status. 

Does event.event_number == DAT_DTO_COMPLETION_EVENT?

What is event.event_data.dto_completion_event_data.status?

-arlin

From Jie.Cai at cs.anu.edu.au  Thu Feb 12 16:13:19 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 13 Feb 2009 11:13:19 +1100
Subject: [ofa-general] Question on dat_ep_post_rdma_write
	with	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <E3280858FA94444CA49D2BA02341C9833A5D3661@orsmsx506.amr.corp.intel.com>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
	<4994A5D2.1040405@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>
	<4994AC24.3090907@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D3661@orsmsx506.amr.corp.intel.com>
Message-ID: <4994BB1F.8080902@cs.anu.edu.au>


Davis, Arlin R wrote:
>  
>
>   
>>> You are most likely running into access violations or some other
>>> error. You should see the following message with any DTO error:
>>>
>>> "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n"
>>>   
>>>       
>> I didn't see this error message.
>>     
>
> What dapl packages are installed? rpm -qa | grep dapl
>   
dapl-devel-2.0.7-1.ofed1.3
dapl-1.2.5-1.ofed1.3
dapl-devel-static-2.0.7-1.ofed1.3
dapl-devel-1.2.5-1.ofed1.3
dapl-utils-2.0.7-1.ofed1.3
dapl-2.0.7-1.ofed1.3


> What provider device name are you using? ofa-v2-ib0?
>   

yes, I am using ofa-v2-ib0.
>   
>>>   
>>>       
>> 12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP
>>     
>
> Your DTO event string mapping looks odd. You have an error minor 
> status along with a success major status. 
>
> Does event.event_number == DAT_DTO_COMPLETION_EVENT?
>   
Yes, it is a DAT_DTO_COMPLETION_EVENT.
> What is event.event_data.dto_completion_event_data.status?
>   
I printed it out; event.event_data.dto_completion_event_data.status
is 4.
> -arlin


From arlin.r.davis at intel.com  Thu Feb 12 16:05:13 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 12 Feb 2009 16:05:13 -0800
Subject: [ofa-general] Question on dat_ep_post_rdma_write with
	DAT_COMPLETION_SUPPRESS_FLAG.
In-Reply-To: <4994AC24.3090907@cs.anu.edu.au>
References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D34BA@orsmsx506.amr.corp.intel.com>
	<4994A5D2.1040405@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A5D3529@orsmsx506.amr.corp.intel.com>
	<4994AC24.3090907@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A5D369C@orsmsx506.amr.corp.intel.com>

 
>>   
>12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP

You are applying the return-code-to-string mapping to a DTO completion
status. The value 4 matches the dat_return_subtype DAT_RESOURCE_TEP
(dat_error.h), but as a DTO completion status it is really a
DAT_DTO_ERR_LOCAL_PROTECTION error.

See dat.h for the DTO completion status values:

typedef enum dat_dto_completion_status
{
    DAT_DTO_SUCCESS                  = 0,
    DAT_DTO_ERR_FLUSHED              = 1,
    DAT_DTO_ERR_LOCAL_LENGTH         = 2,
    DAT_DTO_ERR_LOCAL_EP             = 3,
    DAT_DTO_ERR_LOCAL_PROTECTION     = 4,  <<<<<
    DAT_DTO_ERR_BAD_RESPONSE         = 5,
    DAT_DTO_ERR_REMOTE_ACCESS        = 6,
    DAT_DTO_ERR_REMOTE_RESPONDER     = 7,
    DAT_DTO_ERR_TRANSPORT            = 8,
    DAT_DTO_ERR_RECEIVER_NOT_READY   = 9,
    DAT_DTO_ERR_PARTIAL_PACKET       = 10,
    DAT_RMR_OPERATION_FAILED         = 11
} DAT_DTO_COMPLETION_STATUS;
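
As a rough sketch of the point above (not part of uDAPL), a DTO completion
status should be looked up in its own name table rather than passed through
the return-code string mapping. The table below is a local copy of the dat.h
listing; dto_status_name and dto_status_from_name are hypothetical helpers
invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the DAT_DTO_COMPLETION_STATUS names, indexed by value,
 * taken from the dat.h listing above. */
static const char *const dto_status_names[] = {
    "DAT_DTO_SUCCESS",                 /* 0 */
    "DAT_DTO_ERR_FLUSHED",             /* 1 */
    "DAT_DTO_ERR_LOCAL_LENGTH",        /* 2 */
    "DAT_DTO_ERR_LOCAL_EP",            /* 3 */
    "DAT_DTO_ERR_LOCAL_PROTECTION",    /* 4 */
    "DAT_DTO_ERR_BAD_RESPONSE",        /* 5 */
    "DAT_DTO_ERR_REMOTE_ACCESS",       /* 6 */
    "DAT_DTO_ERR_REMOTE_RESPONDER",    /* 7 */
    "DAT_DTO_ERR_TRANSPORT",           /* 8 */
    "DAT_DTO_ERR_RECEIVER_NOT_READY",  /* 9 */
    "DAT_DTO_ERR_PARTIAL_PACKET",      /* 10 */
    "DAT_RMR_OPERATION_FAILED",        /* 11 */
};

#define DTO_STATUS_COUNT \
    (sizeof(dto_status_names) / sizeof(dto_status_names[0]))

/* Hypothetical helper, not a uDAPL API: name for a raw status value. */
static const char *dto_status_name(int status)
{
    assert(DTO_STATUS_COUNT == 12); /* table must cover values 0..11 */
    if (status < 0 || (size_t)status >= DTO_STATUS_COUNT)
        return "UNKNOWN_DTO_STATUS";
    return dto_status_names[status];
}

/* Hypothetical reverse lookup: status value for a name, or -1. */
static int dto_status_from_name(const char *name)
{
    size_t i;

    for (i = 0; i < DTO_STATUS_COUNT; i++)
        if (strcmp(dto_status_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

With this, the status value 4 reported earlier in the thread maps to
DAT_DTO_ERR_LOCAL_PROTECTION instead of a DAT_RESOURCE_TEP subtype string.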

-arlin


From wangwhao at cn.ibm.com  Thu Feb 12 16:05:48 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Fri, 13 Feb 2009 08:05:48 +0800
Subject: Re: [ofa-general] sminfo report iberror in the first
	configuration	on RHEL5.3
In-Reply-To: <1234444830.10037.313.camel@firewall.xsintricity.com>
Message-ID: <OF3B0E1DC3.D4DC6EB6-ON4825755B.00835D8B-4825755C.0000863D@cn.ibm.com>


Doug Ledford <dledford at redhat.com> wrote on 2009-02-12 21:20:30:

> On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> > Wen Hao Wang wrote:
> > >
> > > Hi all:
> > >
> > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped
> > > in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and
> > > wrote network interface configuration file ifcfg-ib0. ifup ib0 also
> > > succeeded. But IB utilities report Connection timed out.
> > >
> > > [root at xblade06 network-scripts]# sminfo
> > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > > sminfo: iberror: failed: query
> > >
> > > I had to reboot the blade and rerun "openibd start". Then sminfo
> > > reported correct contents. I do not suppose this reboot is required.
> > > Did I miss any configuration step?
>
> There was an unintentional bug in the rhel5.2 openibd init script in
> that it automatically turned itself on during install (generally, most
> init scripts should default to *not* turning themselves on during
> install of the package, nor should they start themselves during install
> of the package...this is for security reasons, imagine if you installed
> the bind name server on your box and it automatically started up before
> you had a chance to configure it).  In rhel5.3 we fixed that bug.  So,

Yeah. I heard of this bug.

> you may need to 'chkconfig --level 2345 openibd on' to make sure openibd
> starts up each time.  The error you list above is consistent with not
> all of the kernel modules being loaded when you tried to use the sminfo
> program.

Even after reboot, service openibd is not started automatically.
[root at xblade06 ~]# chkconfig --list openibd
openibd         0:off   1:off   2:off   3:off   4:off   5:off   6:off

I agree with you that maybe some modules were not loaded. But which ones?
Before the reboot, I ran "/etc/init.d/openibd start" and "/etc/init.d/network
restart". No error was reported. "openibd status" also looked good.

>
> > > Moreover, "openibd start" report one warning message about hwconf.
> > > Anyone has comments about this?
> > >
> > > [root at xblade07 ~]# /etc/init.d/openibd start
> > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such
> > > file or directory
> > > [ OK ]
>
> Can you see if the kudzu package is installed on your machine?  The
> openib package uses this config file written by kudzu to determine what
> hardware drivers to load.  I suppose I should put a specific requires in
> the rpm for that.

kudzu is installed.
[root at xblade06 ~]# rpm -q kudzu
kudzu-1.2.57.1.21-1

>
> --
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband
>
> [attachment "signature.asc" deleted by Wen Hao Wang/China/IBM]


Thanks!

Wen Hao Wang
Email: wangwhao at cn.ibm.com

From sean.hefty at intel.com  Thu Feb 12 16:09:01 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Feb 2009 16:09:01 -0800
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
Message-ID: <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com>

>	It seems that even when the rdma-cm consumer binds to a specific address,
>	the rdma-cm address resolution code follows the order of the devices/rules
>	in routing table. So the user can't really dictate an outgoing interface
>	based on the src address provided to rdma_resolve_addr.
>
>Did you have the chance to look into that?

I'm running 2.6.28 with 1 HCA with 2 ports.  I added debug output around calls
to rdma_translate_ip() and cma_acquire_dev().  The short answer is that things
appear to work as expected.

ib0 is on port 1 - 192.168.0.102
ib1 is on port 2 - 192.168.0.122

If I run ucmatose -b ip_addr (with or without -s option), I see that
rdma_translate_ip() returns different net_device structures for the different
input addresses.  cma_acquire_dev() also indicates that different ports on the
same HCA are being used for the two addresses.

If I unplug one of the ports, I can no longer connect if I use the IP address that
corresponds to that port, but the other port still works.  It doesn't matter
which port I unplug, as long as I use the correct IP address.

- Sean


From wangwhao at cn.ibm.com  Thu Feb 12 16:10:22 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Fri, 13 Feb 2009 08:10:22 +0800
Subject: [ofa-general] sminfo report iberror in the first configuration on
	RHEL5.3
In-Reply-To: <49941414.2050400@ext.bull.net>
Message-ID: <OF9FB55093.44FA70EE-ON4825755C.0000A1B0-4825755C.0000F13B@cn.ibm.com>


Nicolas Morey Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net> wrote on
2009-02-12 20:20:36:

> Wen Hao Wang wrote:
> >
> > Hi all:
> >
> > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped
> > in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and
> > wrote network interface configuration file ifcfg-ib0. ifup ib0 also
> > succeeded. But IB utilities report Connection timed out.
> >
> >
> > [root at xblade06 network-scripts]# sminfo
> > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > sminfo: iberror: failed: query
> >
> > I had to reboot the blade and rerun "openibd start". Then sminfo
> > reported correct contents. I do not suppose this reboot is required.
> > Did I miss any configuration step?
> >
> > Moreover, "openibd start" report one warning message about hwconf.
> > Anyone has comments about this?
> >
> > [root at xblade07 ~]# /etc/init.d/openibd start
> > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such
> > file or directory
> > [ OK ]
> >
> > Thanks a lot!
> >
> > Wen Hao Wang
> > Email: wangwhao at cn.ibm.com
> >
> Sounds to me as if you haven't any Subnet Manager (OpenSM or managed
> switch) running.
> $sminfo
> sminfo: sm lid 2 sm guid 0x8f1040041254a, activity count 751941 priority
> 3 state 3 SMINFO_MASTER
> $ sminfo -P 2
> ibwarn: [17975] mad_rpc: _do_madrpc failed; dport (Lid 3945)
> sminfo: iberror: failed: query
>
> (we don't have any SM on the subnet connected to port 2)
>
> Your reboot might have started OpenSM, thus making it work.
>
> Nicolas
>
>

OpenSM is running on another machine with LID 9, while this machine
(xblade06) has LID 8. Here is the output after reboot:

[root at xblade06 ~]# sminfo
sminfo: sm lid 9 sm guid 0x2c90300013101, activity count 618300 priority 0 state 3 SMINFO_MASTER
[root at xblade06 ~]# ps -ef|grep opensm
root      5369  5234  0 00:08 pts/0    00:00:00 grep opensm
[root at xblade06 ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              0002c903000134b0
[root at xblade06 ~]# ibnetdiscover |grep 2c903000134b0
# Initiated from node 0002c903000134b0 port 0002c903000134b1
[10]    "H-0002c903000134b0"[1](2c903000134b1)          # "xblade06 HCA-1" lid 8 4xSDR
caguid=0x2c903000134b0
Ca      2 "H-0002c903000134b0"          # "xblade06 HCA-1"

Thanks!

Wen Hao Wang
Email: wangwhao at cn.ibm.com

From wangwhao at cn.ibm.com  Thu Feb 12 16:17:37 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Fri, 13 Feb 2009 08:17:37 +0800
Subject: [ofa-general] sminfo report iberror in the first configuration on
	RHEL5.3
In-Reply-To: <f0e08f230902120404sc03d51ayc34bc3327d5a588b@mail.gmail.com>
Message-ID: <OF247906DD.B08D02A3-ON4825755C.000114B6-4825755C.00019B35@cn.ibm.com>


Hal Rosenstock <hal.rosenstock at gmail.com> wrote on 2009-02-12 20:04:44:

> On Thu, Feb 12, 2009 at 2:37 AM, Wen Hao Wang <wangwhao at cn.ibm.com> wrote:
> > Hi all:
> >
> > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in
> > RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and wrote
> > network interface configuration file ifcfg-ib0. ifup ib0 also succeeded. But
> > IB utilities report Connection timed out.
> >
> >
> > [root at xblade06 network-scripts]# sminfo
> > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > sminfo: iberror: failed: query
>
> It looks like the SM found the blade and at least configured the SMLID
> but somehow LID routing did not work between the blade and the SM (at
> LID 9). Was this problem persistent (without rebooting the blade) ?
> Was the blade IB port active ?
>
> -- Hal

Before the reboot, I tried the following operations:
openibd restart
network restart
ibcheckerrors
ibclearerrors

But none of them helped. I had no idea what else I could do, so I tried a
reboot.

If I remember correctly, the port state was LinkUp before rebooting, and now
it is Active.

>
> > I had to reboot the blade and rerun "openibd start". Then sminfo reported
> > correct contents. I do not suppose this reboot is required. Did I miss any
> > configuration step?
> >
> > Moreover, "openibd start" report one warning message about hwconf. Anyone
> > has comments about this?
> >
> > [root at xblade07 ~]# /etc/init.d/openibd start
> > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or
> > directory
> > [ OK ]
> >
> > Thanks a lot!
> >
> > Wen Hao Wang
> > Email: wangwhao at cn.ibm.com

Thanks!

Wen Hao Wang
Email: wangwhao at cn.ibm.com

From hal.rosenstock at gmail.com  Thu Feb 12 16:41:40 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 12 Feb 2009 19:41:40 -0500
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090212200025.GC14416@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
	<20090212200025.GC14416@sashak.voltaire.com>
Message-ID: <f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>

Sasha,

On Thu, Feb 12, 2009 at 3:00 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 07:41 Thu 12 Feb     , Hal Rosenstock wrote:
>>
>> That's what I originally thought too but I'm not so sure looking at
>> the other vendor layers. For example, osm_vendor_al.c (which I think
>> is used in Windows currently) has the following code in
>> osm_vendor_get_all_port_attr (and other vendor layers except umad are
>> similar):
>>
>>                         for (port_num = 0; port_num < num_ports; port_num++) {
>>                                 p_attr_array[port_count] =
>>                                     *__osm_ca_info_get_port_attr_ptr(p_ca_info,
>>                                                                      port_num);
>>                                 port_count++;
>>                         }
>>
>> and
>>
>> static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t *
>>                                                        const p_ca_info,
>>                                                        IN const uint8_t index)
>> {
>>         return (&p_ca_info->p_attr->p_port_attr[index]);
>> }
>>
>> so I'm thinking the tables need to be supplied by the underlying
>> vendor library (al, umad, ...). Do you concur ?
>
> It is already supplied by libibumad - by umad_get_ca()
> (ca.ports[i]->pkeys). I think you just need to copy this to
> ib_port_attr_t structure.

Yes, but the other vendor layers do not take the per-port pkey/guid table
pointers as input parameters. They require a buffer large enough for these
tables and set the port pointers appropriately on output. So if we use
these pointers as input, then we definitely break the other vendor layers.

-- Hal

> Sasha
>


From sean.hefty at intel.com  Thu Feb 12 16:56:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Feb 2009 16:56:19 -0800
Subject: [ofa-general] [ib-mgmt] ibdiag_common.h question
Message-ID: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com>

I noticed the following in ibdiag_common.h:

#define	DEBUG	if (ibdebug || ibverbose) IBWARN
#define	VERBOSE	if (ibdebug || ibverbose > 1) IBWARN

Because each macro expands to an unbraced if, an else that follows a
DEBUG or VERBOSE call binds to the macro's if instead of the caller's.
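
To make the mismatch concrete, here is a minimal sketch. It assumes a plain
function warn() as a stand-in for IBWARN (which is a variadic macro in the
real headers); DEBUG_UNSAFE, DEBUG_SAFE, and the else_ran_* helpers are
hypothetical names for this illustration. DEBUG_SAFE shows one common
alternative, the do { } while (0) wrapper, not a proposed patch.

```c
#include <assert.h>
#include <stdio.h>

static int ibdebug;    /* stand-ins for the globals in ibdiag_common.h */
static int ibverbose;

/* Hypothetical stand-in for IBWARN. */
static void warn(const char *msg)
{
    fprintf(stderr, "ibwarn: %s\n", msg);
}

/* Same shape as the DEBUG macro quoted above: an unbraced if. */
#define DEBUG_UNSAFE if (ibdebug || ibverbose) warn

/* One common alternative: wrap the test in do { } while (0). */
#define DEBUG_SAFE(msg) \
    do { if (ibdebug || ibverbose) warn(msg); } while (0)

/* Returns 1 if the else branch ran. */
static int else_ran_unsafe(int ok)
{
    int ran = 0;

    if (ok)
        DEBUG_UNSAFE("ok path");
    else            /* binds to the if hidden inside DEBUG_UNSAFE */
        ran = 1;
    return ran;
}

static int else_ran_safe(int ok)
{
    int ran = 0;

    if (ok)
        DEBUG_SAFE("ok path");
    else            /* binds to if (ok), as the indentation suggests */
        ran = 1;
    return ran;
}

/* With both flags at 0, the unsafe form runs the else branch when ok is
 * true and skips it when ok is false, the reverse of what was written. */
static void demo(void)
{
    assert(else_ran_unsafe(1) == 1);
    assert(else_ran_unsafe(0) == 0);
    assert(else_ran_safe(1) == 0);
    assert(else_ran_safe(0) == 1);
}
```

Compilers typically flag the unsafe form with a dangling-else warning, which
is a quick way to find affected call sites.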

- Sean


From rdreier at cisco.com  Thu Feb 12 21:43:14 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Feb 2009 21:43:14 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from
	build_rdma_recv().
In-Reply-To: <20090211222915.19520.22647.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 11 Feb 2009 16:29:15 -0600")
References: <20090211222915.19520.22647.stgit@dell3.ogc.int>
Message-ID: <adad4dmao3x.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Thu Feb 12 21:47:34 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Feb 2009 21:47:34 -0800
Subject: [ofa-general] Re: [PATCH v3] RDMA/nes: Account for freed pbl after
	hw operation
In-Reply-To: <20090202231521.GA6220@ctung-MOBL> (Chien Tung's message of "Mon, 
	2 Feb 2009 17:15:21 -0600")
References: <20090202231521.GA6220@ctung-MOBL>
Message-ID: <ada8woaanwp.fsf@cisco.com>

looks good, applied... one comment:

 > Add proper pbl accounting in case nes_reg_mr failed.
 > 
 > Signed-off-by: Don Wood <donald.e.wood at intel.com>

when you forward someone else's patch, you should add your
Signed-off-by line after theirs.

 - R.


From sean.hefty at intel.com  Thu Feb 12 23:21:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Feb 2009 23:21:21 -0800
Subject: [ofa-general] [ib-diag] sminfo: add support for WinOF
Message-ID: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>

Allow sminfo to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Would there be any objection to including the windows source files (.c and .h)
in the mgmt tree?
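
A note on the statestr[] change in the diff below: with the designated
initializers replaced by positional ones (presumably because the Windows
compiler does not accept them), the table relies on matching the enum order.
The sketch below is one possible guard, not part of this patch. The enum
order is assumed from the patch context, and statestr_size_check is a
hypothetical name. It catches a length mismatch at compile time but not a
reordering.

```c
#include <assert.h>
#include <string.h>

/* State values as used by sminfo; order assumed from the patch context. */
enum {
    SMINFO_NOTACT,
    SMINFO_DISCOVER,
    SMINFO_STANDBY,
    SMINFO_MASTER,
    SMINFO_STATE_LAST
};

/* Positional initializers: entry i must describe state i. */
static const char *statestr[] = {
    "SMINFO_NOTACT",
    "SMINFO_DISCOVER",
    "SMINFO_STANDBY",
    "SMINFO_MASTER",
};

/* Hypothetical compile-time check: the array size becomes negative, and
 * the build fails, if the table and the enum disagree in length. */
typedef char statestr_size_check[
    (sizeof(statestr) / sizeof(statestr[0]) == (size_t)SMINFO_STATE_LAST)
        ? 1 : -1];

#define STATESTR(s) \
    (((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???")

/* In-range states map to their names; anything else maps to "???". */
static void demo(void)
{
    int bogus = 7;

    assert(strcmp(STATESTR(SMINFO_MASTER), "SMINFO_MASTER") == 0);
    assert(strcmp(STATESTR(SMINFO_NOTACT), "SMINFO_NOTACT") == 0);
    assert(strcmp(STATESTR(bogus), "???") == 0);
}
```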

 infiniband-diags/Makefile.am                |    2 +
 infiniband-diags/include/ibdiag_common.h    |    2 +
 infiniband-diags/include/linux/ibdiag_osd.h |   43 +++++++++++++++++++++++++++
 infiniband-diags/src/ibdiag_common.c        |   13 ++++----
 infiniband-diags/src/sminfo.c               |   15 ++++-----
 5 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
index f9cc5bd..0d32abd 100644
--- a/infiniband-diags/Makefile.am
+++ b/infiniband-diags/Makefile.am
@@ -1,5 +1,5 @@
 
-INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
+INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband -I$(srcdir)/include/linux
 
 if DEBUG
 DBGFLAGS = -ggdb -D_DEBUG_
diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h
index 4783b8e..2dea873 100644
--- a/infiniband-diags/include/ibdiag_common.h
+++ b/infiniband-diags/include/ibdiag_common.h
@@ -52,7 +52,7 @@ extern int ibd_timeout;
 #undef DEBUG
 #define	DEBUG	if (ibdebug || ibverbose) IBWARN
 #define	VERBOSE	if (ibdebug || ibverbose > 1) IBWARN
-#define IBERROR(fmt, args...)	iberror(__FUNCTION__, fmt, ## args)
+#define IBERROR(fmt, ...)	iberror(__FUNCTION__, fmt, ## __VA_ARGS__)
 
 struct ibdiag_opt {
 	const char *name;
diff --git a/infiniband-diags/include/linux/ibdiag_osd.h b/infiniband-diags/include/linux/ibdiag_osd.h
new file mode 100644
index 0000000..5c6faa9
--- /dev/null
+++ b/infiniband-diags/include/linux/ibdiag_osd.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2009 Intel Corp, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _IBDIAG_OSD_H_
+#define _IBDIAG_OSD_H_
+
+#include <unistd.h>
+#include <inttypes.h>
+#include <config.h>
+
+#define CDECL
+
+#endif /* _IBDIAG_OSD_H_ */
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index bda1efa..154e00c 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -43,15 +43,14 @@
 #include <stdlib.h>
 #include <stdarg.h>
 #include <sys/types.h>
-#include <unistd.h>
 #include <ctype.h>
-#include <config.h>
 #include <getopt.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 #include <ibdiag_common.h>
 #include <ibdiag_version.h>
+#include "ibdiag_osd.h"
 
 int ibdebug;
 int ibverbose;
@@ -204,7 +203,7 @@ static const struct ibdiag_opt common_opts[] = {
 	{ "usage", 'u', 0, NULL, "usage message" },
 	{ "help", 'h', 0, NULL, "help message" },
 	{ "version", 'V', 0, NULL, "show version" },
-	{}
+	{ 0 }
 };
 
 static void make_opt(struct option *l, const struct ibdiag_opt *o,
@@ -254,11 +253,11 @@ static struct option *make_long_opts(const char *exclude_str,
 
 static void make_str_opts(const struct option *o, char *p, unsigned size)
 {
-	int i, n = 0;
+	unsigned i, n = 0;
 
 	for (n = 0; o->name  && n + 2 + o->has_arg < size; o++) {
-		p[n++] = o->val;
-		for (i = 0; i < o->has_arg; i++)
+		p[n++] = (char) o->val;
+		for (i = 0; i < (unsigned) o->has_arg; i++)
 			p[n++] = ':';
 	}
 	p[n] = '\0';
@@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt,
 	char str_opts[1024];
 	const struct ibdiag_opt *o;
 
-	memset(opts_map, 0, sizeof(opts_map));
+	memset((void *) opts_map, 0, sizeof(opts_map));
 
 	prog_name = argv[0];
 	prog_args = usage_args;
diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
index e96c782..7767668 100644
--- a/infiniband-diags/src/sminfo.c
+++ b/infiniband-diags/src/sminfo.c
@@ -37,14 +37,13 @@
 
 #include <stdio.h>
 #include <stdlib.h>
-#include <unistd.h>
-#include <inttypes.h>
 #include <getopt.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
 #include "ibdiag_common.h"
+#include "ibdiag_osd.h"
 
 static uint8_t sminfo[1024];
 
@@ -59,10 +58,10 @@ enum {
 };
 
 char *statestr[] = {
-	[SMINFO_NOTACT] "SMINFO_NOTACT",
-	[SMINFO_DISCOVER] "SMINFO_DISCOVER",
-	[SMINFO_STANDBY] "SMINFO_STANDBY",
-	[SMINFO_MASTER] "SMINFO_MASTER",
+	"SMINFO_NOTACT",
+	"SMINFO_DISCOVER",
+	"SMINFO_STANDBY",
+	"SMINFO_MASTER",
 };
 
 #define STATESTR(s)	(((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???")
@@ -88,7 +87,7 @@ static int process_opt(void *context, int ch, char *optarg)
 	return 0;
 }
 
-int main(int argc, char **argv)
+int CDECL main(int argc, char **argv)
 {
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
 	int mod = 0;
@@ -100,7 +99,7 @@ int main(int argc, char **argv)
 		{ "state", 's', 1, "<0-3>", "set SM state"},
 		{ "priority", 'p', 1, "<0-15>", "set SM priority"},
 		{ "activity", 'a', 1, NULL, "set activity count"},
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<sm_lid|sm_dr_path> [modifier]";
 

From sean.hefty at intel.com  Thu Feb 12 23:31:31 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Feb 2009 23:31:31 -0800
Subject: [ofa-general] [ibmad] libibmad: add MAD_EXPORT to exported calls
Message-ID: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com>

From: Stan Smith <stan.smith at intel.com>

ibtracert and ibroute need xdump and smp_query_via exported
from the library.  Add MAD_EXPORT to the calls for Windows support.

Signed-off-by: Stan Smith <stan.smith at intel.com>
Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 libibmad/include/infiniband/mad.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index bd62ec7..1aaaa1b 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -748,7 +748,7 @@ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
 			      unsigned mod, unsigned timeout);
 MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
 			    unsigned mod, unsigned timeout);
-uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
+MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
 		       unsigned mod, unsigned timeout, const void *srcport);
 uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
 		     unsigned timeout, const void *srcport);
@@ -875,7 +875,7 @@ static inline uint64_t htonll(uint64_t x)
 	exit(-1); \
 } while(0)
 
-void xdump(FILE * file, char *msg, void *p, int size);
+MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size);
 
 END_C_DECLS
 #endif				/* _MAD_H_ */
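
For readers unfamiliar with the pattern, an export macro such as MAD_EXPORT is commonly defined along the following lines. This is only a sketch under assumptions: the names EXAMPLE_EXPORT and example_sum are hypothetical, and the actual MAD_EXPORT definition used by libibmad is not shown in this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for an export macro such as MAD_EXPORT. */
#if defined(_WIN32)
#define EXAMPLE_EXPORT __declspec(dllexport)	/* make the symbol visible in the DLL */
#else
#define EXAMPLE_EXPORT				/* expands to nothing elsewhere */
#endif

/* Declaration as it would appear in a public header. */
EXAMPLE_EXPORT uint32_t example_sum(const uint8_t *buf, unsigned size);

/* Hypothetical exported function: sums the bytes of a buffer. */
EXAMPLE_EXPORT uint32_t example_sum(const uint8_t *buf, unsigned size)
{
	uint32_t sum = 0;
	unsigned i;

	for (i = 0; i < size; i++)
		sum += buf[i];
	return sum;
}
```

On non-Windows builds the macro expands to nothing, so tagging a declaration with it should not change how existing Linux callers build.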


From nicolas.morey-chaisemartin at ext.bull.net  Fri Feb 13 01:24:24 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Fri, 13 Feb 2009 10:24:24 +0100
Subject: [ofa-general] [PATCH 1/3 v2] opensm: Added io_guid_file and
 max_reverse_hops options
In-Reply-To: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>
References: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>
Message-ID: <49953C48.3030203@ext.bull.net>

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl and wouldn't be read from the cached option file.

  opensm/include/opensm/osm_subnet.h |    6 ++++++
  opensm/opensm/main.c               |   26 +++++++++++++++++++++++++-
  opensm/opensm/osm_subnet.c         |   14 ++++++++++++++
  3 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 8863e47..671b51f 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -190,6 +190,8 @@ typedef struct osm_subn_opt {
  	char *lfts_file;
  	char *root_guid_file;
  	char *cn_guid_file;
+	char *io_guid_file;
+	uint16_t max_reverse_hops;
  	char *ids_guid_file;
  	char *guid_routing_order_file;
  	char *sa_db_file;
@@ -383,6 +385,10 @@ typedef struct osm_subn_opt {
  *		Name of the file that contains list of compute node guids that
  *		will be used by fat-tree routing (provided by User)
  *
+*	io_guid_file
+*		Name of the file that contains list of I/O node guids that
+*		will be used by fat-tree routing (provided by User)
+*
  *	ids_guid_file
  *		Name of the file that contains list of ids which should be
  *		used by Up/Down algorithm instead of node GUIDs
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index a8dc9e6..b5e3337 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -212,6 +212,12 @@ static void show_usage(void)
  	printf("--cn_guid_file, -u <path to file>\n"
  	       "          Set the compute nodes for the Fat-Tree routing algorithm\n"
  	       "          to the guids provided in the given file (one to a line)\n\n");
+	printf("--io_guid_file, -G <path to file>\n"
+	       "          Set the I/O nodes for the Fat-Tree routing algorithm\n"
+	       "          to the guids provided in the given file (one to a line)\n\n");
+	printf("--max_reverse_hops, -H <hop_count>\n"
+	       "          Set the max number of hops the wrong way around\n"
+	       "          an I/O node is allowed to do (connectivity for I/O nodes on top switches)\n\n");
  	printf("--ids_guid_file, -m <path to file>\n"
  	       "          Name of the map file with set of the IDs which will be used\n"
  	       "          by Up/Down routing algorithm instead of node GUIDs\n"
@@ -526,7 +532,7 @@ int main(int argc, char *argv[])
  	uint32_t val;
  	unsigned config_file_done = 0;
  	const char *const short_option =
-	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:";
+	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:";

  	/*
  	   In the array below, the 2nd parameter specifies the number
@@ -570,6 +576,8 @@ int main(int argc, char *argv[])
  		{"sadb_file", 1, NULL, 'S'},
  		{"root_guid_file", 1, NULL, 'a'},
  		{"cn_guid_file", 1, NULL, 'u'},
+		{"io_guid_file", 1, NULL, 'G'},
+		{"max_reverse_hops", 1, NULL, 'H'},
  		{"ids_guid_file", 1, NULL, 'm'},
  		{"guid_routing_order_file", 1, NULL, 'X'},
  		{"stay_on_fatal", 0, NULL, 'y'},
@@ -880,6 +888,22 @@ int main(int argc, char *argv[])
  			       opt.cn_guid_file);
  			break;

+		case 'G':
+			/*
+			   Specifies I/O node guids file
+			 */
+			opt.io_guid_file = optarg;
+			printf(" I/O Node Guid File: %s\n",
+			       opt.io_guid_file);
+			break;
+		case 'H':
+			/*
+			   Specifies the max number of reverse hops for I/O nodes
+			 */
+			opt.max_reverse_hops = atoi(optarg);
+			printf(" Max Reverse Hops: %d\n",
+			       opt.max_reverse_hops);
+			break;
  		case 'm':
  			/* Specifies ids guid file */
  			SET_STR_OPT(opt.ids_guid_file, optarg);
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 69937c1..2ee7cf7 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -146,6 +146,8 @@ static const opt_rec_t opt_tbl[] = {
  	{ "lfts_file", OPT_OFFSET(lfts_file), opts_parse_charp, NULL, 0 },
  	{ "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL, 0 },
  	{ "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 },
+	{ "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 },
+	{ "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 },
  	{ "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 },
  	{ "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 },
  	{ "sa_db_file", OPT_OFFSET(sa_db_file), opts_parse_charp, NULL, 0 },
@@ -578,6 +580,8 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
  	p_opt->lfts_file = NULL;
  	p_opt->root_guid_file = NULL;
  	p_opt->cn_guid_file = NULL;
+	p_opt->io_guid_file = NULL;
+	p_opt->max_reverse_hops = 0;
  	p_opt->ids_guid_file = NULL;
  	p_opt->guid_routing_order_file = NULL;
  	p_opt->sa_db_file = NULL;
@@ -1393,6 +1397,16 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
  		p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str);

  	fprintf(out,
+		"# The file holding the fat-tree I/O node guids\n"
+		"# One guid in each line\nio_guid_file %s\n\n",
+		p_opts->io_guid_file ? p_opts->io_guid_file : null_str);
+
+	fprintf(out,
+		"# Number of reverse hops allowed for I/O nodes \n"
+		"# Used for connectivity between I/O nodes connected to Top Switches\nmax_reverse_hops %d\n\n",
+		p_opts->max_reverse_hops);
+
+	fprintf(out,
  		"# The file holding the node ids which will be used by"
  		" Up/Down algorithm instead\n# of GUIDs (one guid and"
  		" id in each line)\nids_guid_file %s\n\n",
-- 
1.6.1
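
The 'H' case above stores the result of atoi() in a uint16_t, so an out-of-range argument such as 70000 would be silently reduced modulo 65536 (to 4464). A range-checked alternative might look like the sketch below. It is not part of the patch, and parse_hops is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse a decimal hop count
 * and reject anything that does not fit in a uint16_t.
 * Returns 0 on success and -1 on error; *out is only written on success.
 */
int parse_hops(const char *str, uint16_t *out)
{
	char *end;
	unsigned long val;

	/* Reject NULL, empty strings and an explicit leading minus sign. */
	if (!str || *str == '\0' || *str == '-')
		return -1;

	errno = 0;
	val = strtoul(str, &end, 10);

	/* Reject overflow, trailing garbage and values above 65535. */
	if (errno || *end != '\0' || val > UINT16_MAX)
		return -1;

	*out = (uint16_t) val;
	return 0;
}
```

A caller in the option loop could then report an error and exit when parse_hops(optarg, &opt.max_reverse_hops) returns -1, instead of accepting a wrapped value.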


From prabhat.sharda at gmail.com  Fri Feb 13 03:07:27 2009
From: prabhat.sharda at gmail.com (prabhat sharda)
Date: Fri, 13 Feb 2009 16:37:27 +0530
Subject: [ofa-general] Installation problem: cannot find -libverbs
Message-ID: <743c0f8a0902130307r2999859bk931c2759a07635b8@mail.gmail.com>

Hi,

I am new to OFED. I am trying to install OFED-1.4 on RHEL 5.2,
which, per the release notes, is a supported platform.

While installing OFED through the menu with all packages selected, the
process halts with the message below:

"Failed to build tgt-generic RPM
See /tmp/OFED.28181.logs/tgt-generic.rpmbuild.log "

The log file listed above has the below error message:

***************************************************************************
cc iscsi/conn.o iscsi/param.o iscsi/session.o iscsi/iscsid.o
iscsi/target.o iscsi/chap.o iscsi/transport.o iscsi/iscsi_tcp.o
iscsi/isns.o iscsi/libcrc32c.o bs_rdwr.o bs_aio.o iscsi/iscsi_rdma.o
tgtd.o mgmt.o target.o scsi.o log.o driver.o util.o work.o parser.o
spc.o sbc.o mmc.o osd.o scc.o smc.o ssc.o bs_ssc.o bs.o -o tgtd
-lcrypto -L /usr/OFED/lib64 -libverbs -lrdmacm -lpthread
/usr/bin/ld: cannot find -libverbs
collect2: ld returned 1 exit status
make: *** [tgtd] Error 1
make: Leaving directory `/var/tmp/OFED_topdir/BUILD/tgt-generic/usr'
error: Bad exit status from /var/tmp/rpm-tmp.42646 (%build)


RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.42646 (%build)
****************************************************************


My installation machine is 32-bit. On examining it, I found that there
is no folder "/usr/OFED/lib64", whereas "lib" exists under
"/usr/OFED" and contains the files "libibverbs.a", "libibverbs.so",
"libibverbs.so.1" and "libibverbs.so.1.0.0".

Can anyone help me resolve this issue? Let me know if I missed
something I should have checked. Thanks in advance.

Regards,
Prabhat


From vlad at lists.openfabrics.org  Fri Feb 13 03:15:05 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 13 Feb 2009 03:15:05 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090213-0200 daily build status
Message-ID: <20090213111506.42481E60F24@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Fri Feb 13 03:58:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 13 Feb 2009 06:58:12 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
	<20090212200025.GC14416@sashak.voltaire.com>
	<f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
Message-ID: <f0e08f230902130358g23e4d8ddqf896ab24eb97390d@mail.gmail.com>

On Thu, Feb 12, 2009 at 7:41 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> Yes but rather than using supplied pointers (as inputs for the per
> port pkey/guid tables), the other vendor layers require a large enough
> buffer for these tables and set the port pointers appropriately (on
> output) rather than supplying these pointers as input parameters. So
> if we use these as input, then we definitely break the other vendor
> layers.

Another choice is to ifdef these differences between Linux and Windows
at least until umad is used there.

-- Hal


From diego.guella at sircomtech.com  Fri Feb 13 07:32:28 2009
From: diego.guella at sircomtech.com (Diego Guella)
Date: Fri, 13 Feb 2009 16:32:28 +0100
Subject: [ofa-general] ib_create_qp and ib_get_err_str weirdness
Message-ID: <01fa01c98df0$47baed30$0100000a@DIEGO>

Hello,

I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR single-port card.
I noticed a strange behavior of ib_create_qp function:

-----
memset(&qp_create, 0, sizeof(qp_create));
qp_create.qp_type = IB_QPT_RELIABLE_CONN; // Reliable Connected
qp_create.sq_depth = ctx->qdepth;
qp_create.rq_depth = ctx->qdepth;
qp_create.sq_sge = ctx->hca_attr->max_sges;
qp_create.rq_sge = ctx->hca_attr->max_sges;
qp_create.h_sq_cq = ctx->cq_h;
qp_create.h_rq_cq = ctx->cq_h;
qp_create.h_srq = NULL;
qp_create.sq_signaled = 1;
ctx->qp_h = 0;
rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h);
-----
return value ("rc") is 3 (=IB_INVALID_PARAMETER).

I spent some time figuring out the problem was the SQ SGE value:
http://lists.openfabrics.org/pipermail/general/2006-June/023417.html

According to iba/ib_al.h:
-----
* IB_INVALID_MAX_SGE
* The requested maximum number of scatter-gather entries for the send or
* receive queue could not be supported.
-----
So why isn't the return value 22 (=IB_INVALID_MAX_SGE)?

In the discussion I mentioned, it turned out that even when using
hca_attr->max_sges, ib_create_qp can still fail.
That is my case.
I need to send some audio buffers (32 or more) from an IO node to a
computing node using RDMA WRITE.
The buffers are owned by the audio driver, and I have no
guarantee that they are contiguous.
I was trying to send them using the lowest possible number of WRs, each one
with the highest possible number of SGEs.
But, given the unreliability of hca_attr->max_sges, how do you recommend
achieving this goal?
Should I post one WR for each buffer I want to send through RDMA WRITE?
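
The batching described above (as few WRs as possible, each carrying as many SGEs as the QP accepts) can be sketched as a small planning step. This is an illustration only: struct wr_batch and plan_batches are hypothetical names, no verbs or IBAL call is made, and max_sge is whatever limit QP creation actually succeeded with, which may be lower than hca_attr->max_sges.

```c
#include <stddef.h>

/* Hypothetical description of one work request: a run of buffers. */
struct wr_batch {
	size_t first_buf;	/* index of the first buffer in this WR */
	size_t num_sge;		/* number of scatter-gather entries used */
};

/*
 * Split num_bufs buffers into batches of at most max_sge entries each.
 * Fills out[] and returns the number of batches, or 0 if max_sge is 0
 * or out[] (max_batches entries) is too small to hold the plan.
 */
size_t plan_batches(size_t num_bufs, size_t max_sge,
		    struct wr_batch *out, size_t max_batches)
{
	size_t needed, i;

	if (max_sge == 0)
		return 0;

	/* Round up: a partly filled last batch still needs its own WR. */
	needed = (num_bufs + max_sge - 1) / max_sge;
	if (needed > max_batches)
		return 0;

	for (i = 0; i < needed; i++) {
		size_t remaining = num_bufs - i * max_sge;

		out[i].first_buf = i * max_sge;
		out[i].num_sge = remaining < max_sge ? remaining : max_sge;
	}
	return needed;
}
```

With 32 buffers and a working limit of 6 SGEs this plans 6 WRs, the last one carrying 2 entries. Note that an RDMA WRITE gathers its local SGEs into one contiguous remote region, so this only helps if the destination layout allows that.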


Another less-related problem:
ib_get_err_str is not correct for every input value, for example I noticed 
that for
ib_get_err_str(IB_INVALID_PD_HANDLE) the string returned is 
IB_INVALID_MR_HANDLE
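
A mismatch like this can happen when a positional string table drifts out of step with the enum it describes. Whether that is the cause here is an assumption, since the WinOF source is not shown in this message. The sketch below uses hypothetical names (ex_status, ex_get_err_str) to show a bounds-checked lookup with a compile-time size check.

```c
#include <stddef.h>

/* Hypothetical status codes, for illustration only. */
enum ex_status {
	EX_SUCCESS,
	EX_INVALID_PD_HANDLE,
	EX_INVALID_MR_HANDLE,
	EX_STATUS_COUNT		/* must stay last */
};

/* Positional table: entries must appear in the same order as the enum. */
static const char *const ex_status_str[] = {
	"EX_SUCCESS",
	"EX_INVALID_PD_HANDLE",
	"EX_INVALID_MR_HANDLE",
};

/* Negative array size (a compile error) if the entry counts differ. */
typedef char ex_table_size_check[
	(sizeof(ex_status_str) / sizeof(ex_status_str[0]) == EX_STATUS_COUNT)
	? 1 : -1];

/* Return the name of a status code, or "???" for out-of-range values. */
const char *ex_get_err_str(unsigned status)
{
	return status < EX_STATUS_COUNT ? ex_status_str[status] : "???";
}
```

The size check catches an entry added to only one of the two lists. It does not catch two entries swapped within the table, so the ordering still needs review.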


I don't know if these problems apply to Linux too, so I'm including the
general list.

Thanks and best regards,
Diego


From or.gerlitz at gmail.com  Fri Feb 13 07:39:55 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 13 Feb 2009 17:39:55 +0200
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for 
	bind
In-Reply-To: <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>
	<15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>
	<8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com>
Message-ID: <15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com>

On Fri, Feb 13, 2009 at 2:09 AM, Sean Hefty <sean.hefty at intel.com> wrote:
> If I run ucmatose -b ip_addr (with or without -s option), I see that
> rdma_translate_ip() returns different net_device structures for the different
> input addresses.  cma_acquire_dev() also indicates that different ports on the
> same HCA are being used for the two addresses.

> If I unplug one of the ports, I can no longer connect if I use the IP address that
> corresponds to that port, but the other port still works.  It doesn't matter
> which port I unplug, as long as I use the correct IP address.

I wasn't sure whether you actually ran the whole test or just let rdma_bind
be called and saw the above. Anyway, if you send me a patch with
the prints you've added, I can repeat it in my setup and we'll see.

Or.


From or.gerlitz at gmail.com  Fri Feb 13 07:48:16 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 13 Feb 2009 17:48:16 +0200
Subject: ***SPAM*** Re: [ofa-general] mlx4 changing RNR_RETRY for an
	established qp
In-Reply-To: <4994A625.9060008@oracle.com>
References: <4994A1FD.2060704@oracle.com>
	<EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>
	<4994A625.9060008@oracle.com>
Message-ID: <15ddcffd0902130748n3b02c6b5v9be07d324c287692@mail.gmail.com>

On Fri, Feb 13, 2009 at 12:43 AM, Andy Grover <andy.grover at oracle.com> wrote:
> Yes of course, it would just be a little nicer if we could change once
> connected, instead of only when initiating the connection, so I wanted
> to find out if that was also possible.

Hi Andy,

A couple of months ago I sent Olaf a comment on an alternative way
for you to change the RNR value, see
http://oss.oracle.com/pipermail/rds-devel/2008-May/000595.html - the
archive copy has awfully long lines, so I'll also forward it to you
directly.

Or.


From or.gerlitz at gmail.com  Fri Feb 13 07:50:00 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 13 Feb 2009 17:50:00 +0200
Subject: Fwd: [ofa-general] RDS flow control
In-Reply-To: <4833D223.5090007@voltaire.com>
References: <200805121157.38135.jon@opengridcomputing.com>
	<200805191006.00114.olaf.kirch@oracle.com>
	<20080520204522.GD31790@opengridcomputing.com>
	<200805202313.40213.olaf.kirch@oracle.com> <ada1w3w4qor.fsf@cisco.com>
	<4833D223.5090007@voltaire.com>
Message-ID: <15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com>

---------- Forwarded message ----------
From: Or Gerlitz <ogerlitz at voltaire.com>
Date: Wed, May 21, 2008 at 9:41 AM
Subject: Re: [ofa-general] RDS flow control
To: Olaf Kirch <olaf.kirch at oracle.com>
Cc: Roland Dreier <rdreier at cisco.com>, rds-devel at oss.oracle.com,
general at lists.openfabrics.org


Roland Dreier wrote:
>
>  > Is there a way of changing the RNR retry count back to 0 after
>  > establishing the connection?
>
> Yes... quite complicated but possible.  Basically you have to transition
> the QP to the "send queue drained" (SQD) state, change the rnr retry
> value in an SQD->SQD transition and then transition back to RTS.
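
The RTS->SQD->SQD->RTS sequence can be modelled in a few lines. This is a toy state model under assumptions, not verbs code: the model_* names are hypothetical, and a real implementation would presumably also have to wait for the send queue to drain before modifying the QP in the SQD state.

```c
/* Hypothetical model of the QP states involved; not a verbs API. */
enum model_qp_state { MODEL_RTS, MODEL_SQD };

struct model_qp {
	enum model_qp_state state;
	unsigned rnr_retry;	/* 3-bit field, so valid values are 0..7 */
};

/* Move between RTS and SQD. Returns 0 on success, -1 if not allowed. */
int model_set_state(struct model_qp *qp, enum model_qp_state next)
{
	if (qp->state == next)
		return -1;	/* SQD->SQD goes through model_set_rnr_retry */
	qp->state = next;
	return 0;
}

/* SQD->SQD transition that changes rnr_retry; rejected in any other state. */
int model_set_rnr_retry(struct model_qp *qp, unsigned rnr_retry)
{
	if (qp->state != MODEL_SQD || rnr_retry > 7)
		return -1;
	qp->rnr_retry = rnr_retry;
	return 0;
}

/* The full RTS->SQD->SQD->RTS sequence described above. */
int model_change_rnr_retry(struct model_qp *qp, unsigned rnr_retry)
{
	if (qp->state != MODEL_RTS)
		return -1;
	if (model_set_state(qp, MODEL_SQD))
		return -1;
	if (model_set_rnr_retry(qp, rnr_retry)) {
		model_set_state(qp, MODEL_RTS);	/* leave the QP usable */
		return -1;
	}
	return model_set_state(qp, MODEL_RTS);
}
```

The point of the model is only that the value cannot be changed while the QP stays in RTS, which is why the sequence is described as complicated.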

In case the RTS->SQD->SQD->RTS transition is not applicable, or just for the
sake of being aware of more solutions, I gave it some thought, and it seems
possible for you to build a protocol in which the two sides exchange (through
the private data carried by the CM messages) whether each supports credit
management, and based on that && HW support of the IB_DEVICE_RC_RNR_NAK_GEN
device capability decide what value to place into the QP RNR retries.

On the passive side of the connection it's trivial, since the rdma-cm uses
the values you place into the conn_param parameters of rdma_accept.

On the active side, things are a bit more complex, but with some changes, I
think you would be able to do it also in a different way than the SQD one:
the RNR retries are set into the QP once it is moved to RTS
(Ready-To-Send). So, if you manage to get the QP into your hands --before--
the RTU is sent (since this point in time is the last synchronization step
provided to you by the IB CM), you could set the RNR retries value according
to info carried in the REP message sent by the passive side (which you have
posted in the private data to rdma_accept, etc).

This would be possible if you enhance the rdma-cm to deliver the
RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP
port space (e.g. conditioned on some new field in conn_param), where today it
is supported only for PS_SDP ones.

Once this change is in place, you will get the RDMA_CM_EVENT_CONNECT_RESPONSE
event, decide what RNR retry value you want to use, and call rdma_accept
providing this value (one more little change is needed here in cma.c). The
rdma cm would then override the value set by cm_init_qp_rts_attr, see
cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr ->
cm_init_qp_rts_attr

and you are done...

Or.

From chien.tin.tung at intel.com  Fri Feb 13 07:59:35 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Fri, 13 Feb 2009 08:59:35 -0700
Subject: [ofa-general] RE: [PATCH v3] RDMA/nes: Account for freed pbl after
	hw operation
In-Reply-To: <ada8woaanwp.fsf@cisco.com>
References: <20090202231521.GA6220@ctung-MOBL> <ada8woaanwp.fsf@cisco.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830323437DF7@azsmsx501.amr.corp.intel.com>


>when you forward someone else's patch, you should add your
>Signed-off-by line after theirs.

Will do.

Chien

From dledford at redhat.com  Fri Feb 13 08:13:32 2009
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 13 Feb 2009 11:13:32 -0500
Subject: [ofa-general] sminfo report iberror in the first
	configuration	on RHEL5.3
In-Reply-To: <OF3B0E1DC3.D4DC6EB6-ON4825755B.00835D8B-4825755C.0000863D@cn.ibm.com>
References: <OF3B0E1DC3.D4DC6EB6-ON4825755B.00835D8B-4825755C.0000863D@cn.ibm.com>
Message-ID: <1234541612.751.1.camel@firewall.xsintricity.com>

On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote:
> Doug Ledford <dledford at redhat.com> wrote on 2009-02-12 21:20:30:
> 
> > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> > > Wen Hao Wang wrote:
> > > >
> > > > Hi all:
> > > >
> > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped
> > > > in the RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers
> > > > and wrote the network interface configuration file ifcfg-ib0. ifup ib0
> > > > also succeeded. But the IB utilities report "Connection timed out".
> > > >
> > > > [root at xblade06 network-scripts]# sminfo
> > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > > > sminfo: iberror: failed: query
> > > >
> > > > I had to reboot the blade and rerun "openibd start". Then sminfo
> > > > reported correct contents. I do not suppose this reboot is required.
> > > > Did I miss any configuration step?
> > 
> > There was an unintentional bug in the rhel5.2 openibd init script in
> > that it automatically turned itself on during install (generally, most
> > init scripts should default to *not* turning themselves on during
> > install of the package, nor should they start themselves during install
> > of the package...this is for security reasons, imagine if you installed
> > the bind name server on your box and it automatically started up before
> > you had a chance to configure it).  In rhel5.3 we fixed that bug.  So,
> 
> Yeah. I heard of this bug.
> 
> > you may need to 'chkconfig --level 2345 openibd on' to make sure openibd
> > starts up each time.  The error you list above is consistent with not
> > all of the kernel modules being loaded when you tried to use the sminfo
> > program.
> 
> Even after reboot, service openibd is not started automatically.
> [root at xblade06 ~]# chkconfig --list openibd
> openibd         0:off   1:off   2:off   3:off   4:off   5:off   6:off

That's because you have to run the command I listed in my first email to
turn it on.

> I agree with you that maybe some modules were not loaded. But which ones?
> Before the reboot, I ran "/etc/init.d/openibd start" and
> "/etc/init.d/network restart". No error was reported. "openibd status"
> also looked good.

Running start on a service does not enable that service at the next
reboot.  You must specifically enable the service in order for it to
start automatically.

> > 
> > > > Moreover, "openibd start" reports one warning message about hwconf.
> > > > Does anyone have comments about this?
> > > >
> > > > [root at xblade07 ~]# /etc/init.d/openibd start
> > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or directory
> > > > [ OK ]
> > 
> > Can you see if the kudzu package is installed on your machine?  The
> > openib package uses this config file written by kudzu to determine what
> > hardware drivers to load.  I suppose I should put a specific requires in
> > the rpm for that.
> 
> kudzu is installed.
> [root at xblade06 ~]# rpm -q kudzu
> kudzu-1.2.57.1.21-1

Make sure kudzu has been run at least once then (it would appear to be
turned off on your machine or else /etc/sysconfig/hwconf would exist).
You can run it manually from the command line and that should be
sufficient for the openibd init script's needs.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband


From sean.hefty at intel.com  Fri Feb 13 10:19:43 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 13 Feb 2009 10:19:43 -0800
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for
	bind
In-Reply-To: <15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com>
References: <Pine.LNX.4.64.0902041755450.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902041756410.26058@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051211380.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051212250.6575@zuben.voltaire.com>	
	<Pine.LNX.4.64.0902051321440.7389@zuben.voltaire.com>	
	<15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com>	
	<8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com>
	<15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com>
Message-ID: <56DD47B66EFC4D23A33D6201E4093128@amr.corp.intel.com>

>I wasn't sure if you actually run the whole test or just let rdma_bind
>to be called and see the above. Anyway, if you send me a patch with
>the prints you've added, I can repeat it in my setup and we'll see.

I let ucmatose run successfully.  It's kind of a hassle for me to generate a
patch for this (I made them directly on the kernel code on my test systems), but
these are the changes:

rdma_translate_ip (addr.c)
Add printk after ip_dev_find to display the ip and dev variables.

cma_acquire_dev (cma.c)
Add printk after ib_find_cached_gid to display cma_dev and id_priv->id.port_num

- Sean


From sean.hefty at intel.com  Fri Feb 13 10:39:48 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 13 Feb 2009 10:39:48 -0800
Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
Message-ID: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>

>diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
>index bda1efa..154e00c 100644
>--- a/infiniband-diags/src/ibdiag_common.c
>+++ b/infiniband-diags/src/ibdiag_common.c
>@@ -43,15 +43,14 @@
> #include <stdlib.h>
> #include <stdarg.h>
> #include <sys/types.h>
>-#include <unistd.h>
> #include <ctype.h>
>-#include <config.h>
> #include <getopt.h>
>
> #include <infiniband/umad.h>
> #include <infiniband/mad.h>
> #include <ibdiag_common.h>
> #include <ibdiag_version.h>
>+#include "ibdiag_osd.h"

I think it'll be easier to just put this include in ibdiag_common.h...

>diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
>index e96c782..7767668 100644
>--- a/infiniband-diags/src/sminfo.c
>+++ b/infiniband-diags/src/sminfo.c
>@@ -37,14 +37,13 @@
>
> #include <stdio.h>
> #include <stdlib.h>
>-#include <unistd.h>
>-#include <inttypes.h>
> #include <getopt.h>
>
> #include <infiniband/umad.h>
> #include <infiniband/mad.h>
>
> #include "ibdiag_common.h"
>+#include "ibdiag_osd.h"

...and avoid adding it to all the source files.  I'll update my patches, but
wait for comments against this patch before re-submitting.

- Sean


From andy.grover at oracle.com  Fri Feb 13 11:05:27 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Fri, 13 Feb 2009 11:05:27 -0800
Subject: Fwd: [ofa-general] RDS flow control
In-Reply-To: <15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com>
References: <200805121157.38135.jon@opengridcomputing.com>	
	<200805191006.00114.olaf.kirch@oracle.com>	
	<20080520204522.GD31790@opengridcomputing.com>	
	<200805202313.40213.olaf.kirch@oracle.com>
	<ada1w3w4qor.fsf@cisco.com>	 <4833D223.5090007@voltaire.com>
	<15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com>
Message-ID: <4995C477.5010109@oracle.com>

Thanks Or! This is exactly the kind of info I was looking for.

Regards -- Andy

Or Gerlitz wrote:
> ---------- Forwarded message ----------
> From: Or Gerlitz <ogerlitz at voltaire.com>
> Date: Wed, May 21, 2008 at 9:41 AM
> Subject: Re: [ofa-general] RDS flow control
> To: Olaf Kirch <olaf.kirch at oracle.com>
> Cc: Roland Dreier <rdreier at cisco.com>, rds-devel at oss.oracle.com,
> general at lists.openfabrics.org
> 
> 
> Roland Dreier wrote:
>>  > Is there a way of changing the RNR retry count back to 0 after
>>  > establishing the connection?
>>
>> Yes... quite complicated but possible.  Basically you have to transition
>> the QP to the "send queue drained" (SQD) state, change the rnr retry
>> value in an SQD->SQD transition and then transition back to RTS.
> 
> In case the RTS->SQD->SQD->RTS transition is not applicable or just for the
> sake of being aware to more solutions, I gave it some thought and its seems
> possible for you to build a protocol which uses exchange (through the
> private data carried by the CM messages) whether each side supports credit
> management, and based on that && HW support of the IB_DEVICE_RC_RNR_NAK_GEN
> device capability decide what value to place into the QP RNR retries.
> 
> On the passive side of the connection its trivial, since the rdma-cm uses
> the values you place into the conn_param parameters of rdma_accept.
> 
> On the active side, things are a bit more complex, but with some changes, I
> think you would be able to do it also in a different way than the SQD one:
> the RNR retries are set into the QP once it is being moved to RTS
> (Ready-To-Send). So, if you managed to get the QP into your hands --before--
> the RTU is sent (since this point in time is the last synchronization step
> provided to you by the IB CM), you could set the RNR retries value according
> to info carried in the REP message sent by the passive side (which you have
> posted in the private data to rdma_accept, etc).
> 
> This would be possible if you enhanced the rdma-cm to deliver the
> RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP
> port space (e.g. conditioned on some new field in conn_param), whereas today
> it is supported only for PS_SDP ones.
> 
> Once this change is in place, you will get the RDMA_CM_EVENT_CONNECT_RESPONSE
> event, decide what RNR retry value you want to use, and call rdma_accept
> providing this value (one more little change is needed here in cma.c); the
> rdma cm would then override the value set by cm_init_qp_rts_attr, see
> cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr ->
> cm_init_qp_rts_attr
> 
> and you are done...
> 
> Or.
> 
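
The decision Or describes (pick the RNR retry count from the credit support advertised in the CM private data plus the device capability) could be sketched roughly as below. This is only an illustration: the struct, the function name and the policy itself are assumptions, not RDS or rdma-cm code. The one fixed fact used is that an RNR retry value of 7 means "retry indefinitely" in InfiniBand.

```c
#include <assert.h>
#include <stdbool.h>

/* 7 is the InfiniBand encoding for "retry forever" after an RNR NAK. */
enum { RNR_RETRY_NONE = 0, RNR_RETRY_INFINITE = 7 };

/* Hypothetical summary of what a side knows once the CM exchange is done. */
struct peer_caps {
	bool local_credits;   /* we implement credit-based flow control */
	bool remote_credits;  /* peer advertised it in the CM private data */
	bool hw_rnr_nak_gen;  /* device reports IB_DEVICE_RC_RNR_NAK_GEN */
};

/*
 * Assumed policy, for illustration only:
 *  - both sides manage credits: a receiver should never run dry, so
 *    do not retry on RNR;
 *  - otherwise, if the hardware can generate RNR NAKs, lean on them
 *    and retry indefinitely;
 *  - otherwise the retry count has nothing to act on, so use 0.
 */
static int choose_rnr_retry(const struct peer_caps *c)
{
	if (c->local_credits && c->remote_credits)
		return RNR_RETRY_NONE;
	if (c->hw_rnr_nak_gen)
		return RNR_RETRY_INFINITE;
	return RNR_RETRY_NONE;
}
```

On the passive side the result would be placed in the conn_param passed to rdma_accept; on the active side it would have to be applied before the RTU is sent, as described above.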


From ralph.campbell at qlogic.com  Fri Feb 13 11:31:02 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 13 Feb 2009 11:31:02 -0800
Subject: [ofa-general] [PATCH] opensm: fix structure definition for trap
	257-258
Message-ID: <1234553462.3948.31.camel@chromite.mv.qlogic.com>

I was looking at a structure definition for trap messages in the opensm
code and noticed this minor bug.
Here is a patch to correct the problem.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 0f9d110..cc92f36 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -7176,10 +7176,9 @@ typedef struct _ib_mad_notice_attr	// Total Size calc  Accumulated
 			ib_net16_t pad1;	// 2
 			ib_net16_t lid1;	// 2
 			ib_net16_t lid2;	// 2
-			ib_net32_t key;	// 2
-			uint8_t sl;	// 1
-			ib_net32_t qp1;	// 4
-			ib_net32_t qp2;	// 4
+			ib_net32_t key;	// 4
+			ib_net32_t qp1;	// 4b sl, 4b pad, 24b qp1
+			ib_net32_t qp2;	// 8b pad, 24b qp2
 			ib_gid_t gid1;	// 16
 			ib_gid_t gid2;	// 16
 		} PACK_SUFFIX ntc_257_258;
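
With the SL folded into the top bits of the 32-bit qp1 word, a reader of the notice has to unpack the fields itself. A minimal sketch of that unpacking follows, assuming the layout given in the patch comments ("4b sl, 4b pad, 24b qp1") with the SL in the most significant bits of the network-order word. The helper names are hypothetical and are not part of opensm.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* SL: top 4 bits of the network-order qp1 word (assumed layout). */
static uint8_t ntc_257_258_sl(uint32_t qp1_be)
{
	return (uint8_t)(ntohl(qp1_be) >> 28);
}

/* QP number: low 24 bits of a network-order qp1 or qp2 word. */
static uint32_t ntc_257_258_qpn(uint32_t qp_be)
{
	return ntohl(qp_be) & 0x00FFFFFFu;
}
```

For example, a wire word of 0x50ABCDEF would decode to SL 5 and QP number 0xABCDEF.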


From vst at vlnb.net  Fri Feb 13 12:02:54 2009
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Fri, 13 Feb 2009 23:02:54 +0300
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <4980B8DE.3060806@harr.org>
References: <48E386F6.5040502@fusionio.com>	<48EBA581.4040301@mellanox.com>	<48EBA72B.4000909@harr.org>	<48EBBDB1.1080203@harr.org>	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>
	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vlnb.net>
	<4980B8DE.3060806@harr.org>
Message-ID: <4995D1EE.4000807@vlnb.net>

Cameron Harr, on 01/28/2009 10:58 PM wrote:
> I've attached a spreadsheet with some of my findings. In the Summary 
> tab, I have a baseline with no affinity set. For other 5 tests, see below.
> 
> Vladislav Bolkhovitin wrote:
>> Try the following variants:
>>
>> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 and 
>> 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 to 
>> CPU5, fct2-worker to CPU7
>>
>> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to 
>> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for other 
>> processes.
>>
>> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to 
>> all CPUs, except CPU0 and CPU4, no affinity for other processes.
> These are tests 1, 2 and 3, respectively
>> Or other similar variants you'd like (even CPUs relate to physical 
>> CPU0, odd CPUs relate to physical CPU1). For instance, you can try to 
>> affine IRQs 169 and 177 to CPU1.
> I did two other tests (Tests 4 and 5) that have the mlx4_core (comp) IRQ
> (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169,
> 177) pinned to CPU4, fct0 and scsi_tgt0 on CPUs 2&3, and fct1 and scsi_tgt1
> on CPUs 4&6 (test 4) or on CPUs 5&6 (test 5).
>> There is no point in running with srptthread=1, for it just produces a
>> baseline with no affinity at all.
> I ran with these anyway to look at differences among the tests. Having 
> this thread enabled always results in better performance.
>> Please do each run several times and write down the average result
>> across runs and the approximate variation between them in %. Otherwise
>> we can't draw any reliable conclusions.
> I ran each test 3 times and took the averages. In order to get a quick 
> look at performance per run, I added a column in the summary that sums 
> the IOPs for each test with SRPT thread enabled and then not enabled. 
> Test 4 seems to give the best results. Here's a brief summary of that 
> summary with just SRPT thread=0:
> 
> Baseline: 356226.39
> Test 1:   371217.6533
> Test 2:   370553.78
> Test 3:   373295.2033
> Test 4:   399385.2233
> Test 5:   393204.5833

The Linux CPU scheduler does a really impressive job!

It would be interesting to see whether anything changes with:

1. The latest SVN. It has some changes which might make a difference.

2. The pass-through dev handler instead of BLOCKIO, which you are using.

Thanks,
Vlad
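
For anyone repeating the IRQ pinning from the tests above: /proc/irq/<irq>/smp_affinity takes a hexadecimal CPU bitmask, so pinning to a single CPU means writing the value 1 << cpu in hex. A small sketch of building that string follows; the helper name is hypothetical and it only handles CPUs 0 to 31.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the single-CPU hex mask accepted by /proc/irq/<irq>/smp_affinity.
 * Returns 0 on success, -1 if cpu is out of range or buf is too small.
 */
static int irq_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
	int n;

	if (cpu >= 32)
		return -1;
	n = snprintf(buf, len, "%x", 1u << cpu);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

With cpu = 4 this produces "10", which as root could be written to /proc/irq/169/smp_affinity to keep IRQ 169 on CPU4.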


From chien.tin.tung at intel.com  Fri Feb 13 13:24:31 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 13 Feb 2009 15:24:31 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: Inform hardware that asynchronous
	event has been handled
Message-ID: <20090213212431.GA7092@ctung-MOBL>

From: Don Wood <donald.e.wood at intel.com>

When asynchronous events are processed by software, it is necessary
to let the hardware know that software has handled the event.  This
frees up the entry in the asynchronous event queue.

Signed-off-by: Don Wood <donald.e.wood at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
---
diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 5d139db..d612aec 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -2269,6 +2269,8 @@ static void nes_process_aeq(struct nes_device *nesdev, struct nes_hw_aeq *aeq)
 
 		if (++head >= aeq_size)
 			head = 0;
+
+		nes_write32(nesdev->regs+NES_AEQ_ALLOC, 1 << 16);
 	}
 	while (1);
 	aeq->aeq_head = head;
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index bc0b4de..498d43e 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -61,6 +61,7 @@ enum pci_regs {
 	NES_CQ_ACK = 0x0034,
 	NES_WQE_ALLOC = 0x0040,
 	NES_CQE_ALLOC = 0x0044,
+	NES_AEQ_ALLOC = 0x0048
 };
 
 enum indexed_regs {
-- 
1.5.2.2
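
The shape of the fix, one acknowledgement per consumed entry inside the loop that advances the head with wraparound, can be tried outside the driver with a stand-in for the register. Everything below is a hypothetical model: fake_write32 replaces nes_write32, and the value 1 << 16 is copied from the patch without any claim about its bit layout.

```c
#include <assert.h>
#include <stdint.h>

#define AEQ_ACK_ONE (1u << 16)	/* value taken verbatim from the patch */

/* Toy model of the queue state; acks counts writes to the ack register. */
struct fake_aeq {
	uint32_t head;
	uint32_t size;
	uint32_t acks;
};

/* Stand-in for nes_write32(nesdev->regs + NES_AEQ_ALLOC, val). */
static void fake_write32(struct fake_aeq *q, uint32_t val)
{
	if (val == AEQ_ACK_ONE)
		q->acks++;
}

/* Consume 'pending' entries, acknowledging each one after the head moves. */
static void process_aeq(struct fake_aeq *q, uint32_t pending)
{
	uint32_t head = q->head;

	while (pending--) {
		/* ...the event at index 'head' would be handled here... */
		if (++head >= q->size)
			head = 0;
		fake_write32(q, AEQ_ACK_ONE);
	}
	q->head = head;
}
```

Without the write inside the loop, the model's ack counter stays at zero however many entries are consumed, which corresponds to the entries never being freed in the description above.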


From sean.hefty at intel.com  Fri Feb 13 14:55:17 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 13 Feb 2009 14:55:17 -0800
Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
Message-ID: <6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>

Modify the openib_scm provider to support both OFED and WinOF releases.
This takes advantage of having a libibverbs compatibility library.*

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
* If only there were a sockets compatibility layer... gurgle
This is only build-tested on Windows, but does run on Linux.
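
The non-Windows dapl_poll() in the patch below is a zero-timeout poll() on a single descriptor. A standalone sketch of the same idea follows, using the hypothetical name probe_fd so it can be exercised on its own, for instance with a pipe standing in for the CM socket.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Zero-timeout readiness probe on one descriptor.
 * Returns the revents bits, 0 if nothing is pending,
 * or -1 if poll() itself failed.
 */
static int probe_fd(int fd, short events)
{
	struct pollfd p = { .fd = fd, .events = events, .revents = 0 };
	int ret = poll(&p, 1, 0);

	if (ret <= 0)
		return ret;
	return p.revents;
}
```

Because the timeout is zero the call never blocks, which is why cr_thread can probe each connection this way while walking its list and only sleep afterwards in dapl_select().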
 
diff --git a/Makefile.am b/Makefile.am
index bfc93f7..5044e36 100755
--- a/Makefile.am
+++ b/Makefile.am
@@ -49,7 +49,8 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG
                                 -DOPENIB -DCQ_WAIT_OBJECT \
                                 -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \
                                 -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \
-                                -I$(srcdir)/dapl/openib_scm
+                                -I$(srcdir)/dapl/openib_scm \
+				-I$(srcdir)/dapl/openib_scm/linux
 
 if HAVE_LD_VERSION_SCRIPT
     dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map
diff --git a/dapl/openib_scm/README b/dapl/openib_scm/README
deleted file mode 100644
index 239dfe6..0000000
--- a/dapl/openib_scm/README
+++ /dev/null
@@ -1,40 +0,0 @@
-
-OpenIB uDAPL provider using socket-based CM, in leiu of uCM/uAT, to setup QP/channels.
-
-to build:
-
-cd dapl/udapl
-make VERBS=openib_scm clean
-make VERBS=openib_scm
-
-
-Modifications to common code:
-
-- added dapl/openib_scm directory 
-
-	dapl/udapl/Makefile
-
-New files for openib_scm provider
-
-	dapl/openib/dapl_ib_cq.c
-	dapl/openib/dapl_ib_dto.h
-	dapl/openib/dapl_ib_mem.c
-	dapl/openib/dapl_ib_qp.c
-	dapl/openib/dapl_ib_util.c
-	dapl/openib/dapl_ib_util.h
-	dapl/openib/dapl_ib_cm.c
-
-A simple dapl test just for openib_scm testing...
-
-	test/dtest/dtest.c
-	test/dtest/makefile
-
-	server:	dtest -s 
-	client:	dtest -h hostname
-
-known issues:
-
-	no memory windows support in ibverbs, dat_create_rmr fails.
-	
-
-
diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c
index 80a7d5e..9a15e42 100644
--- a/dapl/openib_scm/dapl_ib_cm.c
+++ b/dapl/openib_scm/dapl_ib_cm.c
@@ -52,26 +52,169 @@
 #include "dapl_cr_util.h"
 #include "dapl_name_service.h"
 #include "dapl_ib_util.h"
-
-#include <stdio.h>
-#include <unistd.h>
-#include <fcntl.h>
-#include <netinet/tcp.h>
-#include <byteswap.h>
-#include <poll.h>
-
-#include <sys/socket.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-
-#if __BYTE_ORDER == __LITTLE_ENDIAN
-static inline uint64_t cpu_to_be64(uint64_t x) {return bswap_64(x);}
-#elif __BYTE_ORDER == __BIG_ENDIAN
-static inline uint64_t cpu_to_be64(uint64_t x) {return x;}
-#endif
+#include "dapl_osd.h"
 
 extern int g_scm_pipe[2];
 
+#if defined(_WIN32) || defined(_WIN64)
+enum DAPL_FD_EVENTS {
+	DAPL_FD_READ	= 0x1,
+	DAPL_FD_WRITE	= 0x2,
+	DAPL_FD_ERROR	= 0x4
+};
+
+static int dapl_config_socket(DAPL_SOCKET s)
+{
+	unsigned long nonblocking = 1;
+	return ioctlsocket(s, FIONBIO, &nonblocking);
+}
+
+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr, 
+			       int addrlen)
+{
+	int err;
+
+	connect(s, addr, addrlen);
+	err = WSAGetLastError();
+	return (err == WSAEWOULDBLOCK) ? EAGAIN : err;
+}
+
+struct dapl_fd_set {
+	struct fd_set set[3];
+};
+
+static struct dapl_fd_set *dapl_alloc_fd_set(void)
+{
+	return dapl_os_alloc(sizeof(struct dapl_fd_set));
+}
+
+static void dapl_fd_zero(struct dapl_fd_set *set)
+{
+	FD_ZERO(&set->set[0]);
+	FD_ZERO(&set->set[1]);
+	FD_ZERO(&set->set[2]);
+}
+
+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
+			enum DAPL_FD_EVENTS event)
+{
+	FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 0 : 1]);
+	FD_SET(s, &set->set[2]);
+	return 0;
+}
+
+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event)
+{
+	struct fd_set rw_fds;
+	struct fd_set err_fds;
+	struct timeval tv;
+	int ret;
+
+	FD_ZERO(&rw_fds);
+	FD_ZERO(&err_fds);
+	FD_SET(s, &rw_fds);
+	FD_SET(s, &err_fds);
+
+	tv.tv_sec = 0;
+	tv.tv_usec = 0;
+
+	if (event == DAPL_FD_READ)
+		ret = select(1, &rw_fds, NULL, &err_fds, &tv);
+	else
+		ret = select(1, NULL, &rw_fds, &err_fds, &tv);
+
+	if (ret == 0)
+		return 0;
+	else if (FD_ISSET(s, &rw_fds))
+		return event;
+	else if (FD_ISSET(s, &err_fds))
+		return DAPL_FD_ERROR;
+	else
+		return WSAGetLastError();
+}
+
+static int dapl_select(struct dapl_fd_set *set)
+{
+	return select(0, &set->set[0], &set->set[1], &set->set[2], NULL);
+}
+#else // _WIN32 || _WIN64
+enum DAPL_FD_EVENTS {
+	DAPL_FD_READ	= POLLIN,
+	DAPL_FD_WRITE	= POLLOUT,
+	DAPL_FD_ERROR	= POLLERR
+};
+
+static int dapl_config_socket(DAPL_SOCKET s)
+{
+	int ret;
+
+	ret = fcntl(s, F_GETFL); 
+	if (ret >= 0)
+		ret = fcntl(s, F_SETFL, ret | O_NONBLOCK);
+	return ret;
+}
+
+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr, int addrlen)
+{
+	int ret;
+
+	ret = connect(s, addr, addrlen);
+
+	return (ret && errno == EINPROGRESS) ? EAGAIN : ret;
+}
+
+struct dapl_fd_set {
+	int index;
+	struct pollfd set[DAPL_FD_SETSIZE];
+};
+
+static struct dapl_fd_set *dapl_alloc_fd_set(void)
+{
+	return dapl_os_alloc(sizeof(struct dapl_fd_set));
+}
+
+static void dapl_fd_zero(struct dapl_fd_set *set)
+{
+	set->index = 0;
+}
+
+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
+			enum DAPL_FD_EVENTS event)
+{
+	if (set->index == DAPL_FD_SETSIZE - 1) {
+		dapl_log(DAPL_DBG_TYPE_ERR, 
+			 "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n", 
+			 set->index + 1);
+		return -1;
+	}
+
+	set->set[set->index].fd = s;
+	set->set[set->index].revents = 0;
+	set->set[set->index++].events = event;
+	return 0;
+}
+
+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event)
+{
+	struct pollfd fds;
+	int ret;
+
+	fds.fd = s;
+	fds.events = event;
+	fds.revents = 0;
+	ret = poll(&fds, 1, 0);
+	if (ret <= 0)
+		return ret;
+
+	return fds.revents;
+}
+
+static int dapl_select(struct dapl_fd_set *set)
+{
+	return poll(set->set, set->index, -1);
+}
+#endif
+
 static struct ib_cm_handle *dapli_cm_create(void)
 { 
 	struct ib_cm_handle *cm_ptr;
@@ -85,7 +228,7 @@ static struct ib_cm_handle *dapli_cm_create(void)
 
 	(void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr));
 	cm_ptr->dst.ver = htons(DSCM_VER);
-	cm_ptr->socket = -1;
+	cm_ptr->socket = DAPL_INVALID_SOCKET;
 	return cm_ptr;
 bail:
 	dapl_os_free(cm_ptr, sizeof(*cm_ptr));
@@ -100,8 +243,8 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr)
 	
 	/* cleanup, never made it to work queue */
 	if (cm_ptr->state == SCM_INIT) {
-		if (cm_ptr->socket >= 0)  
-			close(cm_ptr->socket);
+		if (cm_ptr->socket != DAPL_INVALID_SOCKET)  
+			closesocket(cm_ptr->socket);
 		dapl_os_free(cm_ptr, sizeof(*cm_ptr));
 		return;
 	}
@@ -112,9 +255,9 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr)
 		cm_ptr->ep->cm_handle = IB_INVALID_HANDLE;
 
 	/* close socket if still active */
-	if (cm_ptr->socket >= 0) {
-		close(cm_ptr->socket);
-		cm_ptr->socket = -1;
+	if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
+		closesocket(cm_ptr->socket);
+		cm_ptr->socket = DAPL_INVALID_SOCKET;
 	}
 	dapl_os_unlock(&cm_ptr->lock);
 
@@ -172,14 +315,14 @@ dapli_socket_disconnect(dp_ib_cm_handle_t	cm_ptr)
 		return DAT_SUCCESS;
 	} else {
 		/* send disc date, close socket, schedule destroy */
-		if (cm_ptr->socket >= 0) { 
-			if (write(cm_ptr->socket, 
-				  &disc_data, sizeof(disc_data)) == -1)
+		if (cm_ptr->socket != DAPL_INVALID_SOCKET) { 
+			if (send(cm_ptr->socket, (char *) &disc_data,
+					sizeof(disc_data), 0) == -1)
 				dapl_log(DAPL_DBG_TYPE_WARN, 
 					 " cm_disc: write error = %s\n", 
 					 strerror(errno));
-			close(cm_ptr->socket);
-			cm_ptr->socket = -1;
+			closesocket(cm_ptr->socket);
+			cm_ptr->socket = DAPL_INVALID_SOCKET;
 		}
 		cm_ptr->state = SCM_DISCONNECTED;
 	}
@@ -211,7 +354,7 @@ void
 dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
 {
 	int		len, opt = 1;
-	struct iovec    iovec[2];
+	struct iovec iov[2];
 	struct dapl_ep	*ep_ptr = cm_ptr->ep;
 
 	if (err) {
@@ -226,18 +369,21 @@ dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
 		     " socket connected, write QP and private data\n"); 
 
 	/* no delay for small packets */
-	setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt));
+	setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY,
+		(char *) &opt, sizeof(opt));
 
 	/* send qp info and pdata to remote peer */
-	iovec[0].iov_base = &cm_ptr->dst;
-	iovec[0].iov_len  = sizeof(ib_qp_cm_t);
+	iov[0].iov_base = (void *) &cm_ptr->dst;
+	iov[0].iov_len = sizeof(ib_qp_cm_t);
 	if (cm_ptr->dst.p_size) {
-		iovec[1].iov_base = cm_ptr->p_data;
-		iovec[1].iov_len  = ntohl(cm_ptr->dst.p_size);
+		iov[1].iov_base = cm_ptr->p_data;
+		iov[1].iov_len = ntohl(cm_ptr->dst.p_size);
+		len = writev(cm_ptr->socket, iov, 2);
+	} else {
+		len = writev(cm_ptr->socket, iov, 1);
 	}
 
-	len = writev(cm_ptr->socket, iovec, (cm_ptr->dst.p_size ? 2:1));
-    	if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) {
+	if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			 " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n",
 			 strerror(errno), len,
@@ -253,9 +399,9 @@ dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
         dapl_dbg_log(DAPL_DBG_TYPE_CM,
                      " connected: sending SRC GID subnet %016llx id %016llx\n",
                      (unsigned long long) 
-			cpu_to_be64(cm_ptr->dst.gid.global.subnet_prefix),
+			htonll(cm_ptr->dst.gid.global.subnet_prefix),
                      (unsigned long long) 
-			cpu_to_be64(cm_ptr->dst.gid.global.interface_id));
+			htonll(cm_ptr->dst.gid.global.interface_id));
 
 	/* queue up to work thread to avoid blocking consumer */
 	cm_ptr->state = SCM_RTU_PENDING;
@@ -290,25 +436,23 @@ dapli_socket_connect(DAPL_EP		*ep_ptr,
 		return DAT_INSUFFICIENT_RESOURCES;
 
 	/* create, connect, sockopt, and exchange QP information */
-	if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) {
+	if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) == DAPL_INVALID_SOCKET) {
 		dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
 		return DAT_INSUFFICIENT_RESOURCES;
 	}
 
-	/* non-blocking */
-	ret = fcntl(cm_ptr->socket, F_GETFL); 
-        if (ret < 0 || fcntl(cm_ptr->socket,
-                              F_SETFL, ret | O_NONBLOCK) < 0) {
-                dapl_log(DAPL_DBG_TYPE_ERR,
-                         " socket connect: fcntl on socket %d ERR %d %s\n",
-                         cm_ptr->socket, ret,
-                         strerror(errno));
-                goto bail;
-        }
+	ret = dapl_config_socket(cm_ptr->socket); 
+	if (ret < 0) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			" socket connect: config socket %d ERR %d %s\n",
+			cm_ptr->socket, ret, strerror(errno));
+		goto bail;
+	}
 
 	((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual);
-	ret = connect(cm_ptr->socket, r_addr, sizeof(*r_addr));
-	if (ret && errno != EINPROGRESS) {
+	ret = dapl_connect_socket(cm_ptr->socket, (struct sockaddr *) r_addr,
+				sizeof(*r_addr));
+	if (ret && ret != EAGAIN) {
 		dapl_log(DAPL_DBG_TYPE_ERR,
 			 " socket connect ERROR: %s -> %s r_qual %d\n",
 			 strerror(errno), 
@@ -391,16 +535,13 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t	cm_ptr)
 {
 	DAPL_EP		*ep_ptr = cm_ptr->ep;
 	int		len;
-	struct iovec    iovec[2];
 	short		rtu_data = htons(0x0E0F);
 	ib_cm_events_t	event = IB_CME_DESTINATION_REJECT;
 
 	/* read DST information into cm_ptr, overwrite SRC info */
 	dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer QP data\n"); 
 
-	iovec[0].iov_base = &cm_ptr->dst;
-	iovec[0].iov_len  = sizeof(ib_qp_cm_t);
-	len = readv(cm_ptr->socket, iovec, 1);
+	len = recv(cm_ptr->socket, (char *) &cm_ptr->dst, sizeof(ib_qp_cm_t), 0);
 	if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver) != DSCM_VER) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 		     " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n",
@@ -456,9 +597,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t	cm_ptr)
 	/* read private data into cm_handle if any present */
 	dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read private data\n"); 
 	if (cm_ptr->dst.p_size) {
-		iovec[0].iov_base = cm_ptr->p_data;
-		iovec[0].iov_len  = cm_ptr->dst.p_size;
-		len = readv(cm_ptr->socket, iovec, 1);
+		len = recv(cm_ptr->socket, cm_ptr->p_data, cm_ptr->dst.p_size, 0);
 		if (len != cm_ptr->dst.p_size) {
 			dapl_log(DAPL_DBG_TYPE_ERR, 
 			    " CONN_RTU read pdata: ERR %s, rcnt=%d -> %s\n",
@@ -495,7 +634,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t	cm_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n"); 
 
 	/* complete handshake after final QP state change */
-	if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) {
+	if (send(cm_ptr->socket, (char *) &rtu_data, sizeof(rtu_data), 0) == -1) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			 " CONN_RTU: write error = %s\n", strerror(errno));
 		goto bail;
@@ -564,7 +703,7 @@ dapli_socket_listen(DAPL_IA		*ia_ptr,
 	cm_ptr->hca = ia_ptr->hca_ptr;
 	
 	/* bind, listen, set sockopt, accept, exchange data */
-	if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
+	if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) == DAPL_INVALID_SOCKET) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			 " ERR: listen socket create: %s\n", 
 			 strerror(errno));
@@ -572,7 +711,8 @@ dapli_socket_listen(DAPL_IA		*ia_ptr,
 		goto bail;
 	}
 
-	setsockopt(cm_ptr->socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt));
+	setsockopt(cm_ptr->socket, SOL_SOCKET, SO_REUSEADDR,
+		(char *) &opt, sizeof(opt));
 	addr.sin_port        = htons(serviceID);
 	addr.sin_family      = AF_INET;
 	addr.sin_addr.s_addr = INADDR_ANY;
@@ -625,7 +765,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
 
 	(void) dapl_os_memzero(acm_ptr, sizeof(*acm_ptr));
 	
-	acm_ptr->socket = -1;
+	acm_ptr->socket = DAPL_INVALID_SOCKET;
 	acm_ptr->sp = cm_ptr->sp;
 	acm_ptr->hca = cm_ptr->hca;
 
@@ -633,7 +773,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
 	acm_ptr->socket = accept(cm_ptr->socket, 
 				(struct sockaddr*)&acm_ptr->dst.ia_address, 
 				(socklen_t*)&len);
-	if (acm_ptr->socket < 0) {
+	if (acm_ptr->socket == DAPL_INVALID_SOCKET) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			" accept: ERR %s on FD %d l_cr %p\n",
 			strerror(errno),cm_ptr->socket,cm_ptr); 
@@ -664,7 +804,7 @@ dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read QP data\n"); 
 
 	/* read in DST QP info, IA address. check for private data */
-	len = read(acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t));
+	len = recv(acm_ptr->socket, (char *) &acm_ptr->dst, sizeof(ib_qp_cm_t), 0);
 	if (len != sizeof(ib_qp_cm_t) || 
 	    ntohs(acm_ptr->dst.ver) != DSCM_VER) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
@@ -700,8 +840,7 @@ dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
 
 	/* read private data into cm_handle if any present */
 	if (acm_ptr->dst.p_size) {
-		len = read( acm_ptr->socket, 
-			    acm_ptr->p_data, acm_ptr->dst.p_size);
+		len = recv(acm_ptr->socket, acm_ptr->p_data, acm_ptr->dst.p_size, 0);
 		if (len != acm_ptr->dst.p_size) {
 			dapl_log(DAPL_DBG_TYPE_ERR, 
 				     " accept read pdata: ERR %s, rcnt=%d\n",
@@ -757,14 +896,14 @@ dapli_socket_accept_usr(DAPL_EP		*ep_ptr,
 	DAPL_IA		*ia_ptr = ep_ptr->header.owner_ia;
 	dp_ib_cm_handle_t  cm_ptr = cr_ptr->ib_cm_handle;
 	ib_qp_cm_t	local;
-	struct iovec    iovec[2];
+	struct iovec	iov[2];
 	int		len;
 
 	if (p_size > IB_MAX_REP_PDATA_SIZE) 
 		return DAT_LENGTH_ERROR;
 
 	/* must have a accepted socket */
-	if (cm_ptr->socket < 0)
+	if (cm_ptr->socket == DAPL_INVALID_SOCKET)
 		return DAT_INTERNAL_ERROR;
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, 
@@ -844,14 +983,17 @@ dapli_socket_accept_usr(DAPL_EP		*ep_ptr,
 
 	local.ia_address = ia_ptr->hca_ptr->hca_address;
 	local.p_size = htonl(p_size);
-	iovec[0].iov_base = &local;
-	iovec[0].iov_len  = sizeof(ib_qp_cm_t);
+	iov[0].iov_base = (void *) &local;
+	iov[0].iov_len = sizeof(ib_qp_cm_t);
 	if (p_size) {
-		iovec[1].iov_base = p_data;
-		iovec[1].iov_len  = p_size;
+		iov[1].iov_base = p_data;
+		iov[1].iov_len = p_size;
+		len = writev(cm_ptr->socket, iov, 2);
+	} else {
+		len = writev(cm_ptr->socket, iov, 1);
 	}
-	len = writev(cm_ptr->socket, iovec, (p_size ? 2:1));
-    	if (len != (p_size + sizeof(ib_qp_cm_t))) {
+
+	if (len != (p_size + sizeof(ib_qp_cm_t))) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			 " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n",
 			 strerror(errno), len,
@@ -859,6 +1001,7 @@ dapli_socket_accept_usr(DAPL_EP		*ep_ptr,
 			     &cm_ptr->dst.ia_address)->sin_addr)); 
 		goto bail;
 	}
+
 	dapl_dbg_log(DAPL_DBG_TYPE_CM, 
 		     " ACCEPT_USR: local port=0x%x lid=0x%x"
 		     " qpn=0x%x psize=%d\n",
@@ -867,9 +1010,9 @@ dapli_socket_accept_usr(DAPL_EP		*ep_ptr,
         dapl_dbg_log(DAPL_DBG_TYPE_CM,
                      " ACCEPT_USR SRC GID subnet %016llx id %016llx\n",
                      (unsigned long long) 
-			cpu_to_be64(local.gid.global.subnet_prefix),
+			htonll(local.gid.global.subnet_prefix),
                      (unsigned long long) 
-			cpu_to_be64(local.gid.global.interface_id));
+			htonll(local.gid.global.interface_id));
 
 	/* save state and reference to EP, queue for RTU data */
 	cm_ptr->ep = ep_ptr;
@@ -894,7 +1037,7 @@ dapli_socket_accept_rtu(dp_ib_cm_handle_t	cm_ptr)
 	short		rtu_data = 0;
 
 	/* complete handshake after final QP state change */
-	len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data));
+	len = recv(cm_ptr->socket, (char *) &rtu_data, sizeof(rtu_data), 0);
 	if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) {
 		dapl_log(DAPL_DBG_TYPE_ERR, 
 			 " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n",
@@ -1108,9 +1251,9 @@ dapls_ib_remove_conn_listener (
 
 	/* close accepted socket, free cm_srvc_handle and return */
 	if (cm_ptr != NULL) {
-		if (cm_ptr->socket >= 0) {
-			close(cm_ptr->socket );
-			cm_ptr->socket = -1;
+		if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
+			closesocket(cm_ptr->socket);
+			cm_ptr->socket = DAPL_INVALID_SOCKET;
 		}
 	    	/* cr_thread will free */
 		cm_ptr->state = SCM_DESTROY;
@@ -1195,27 +1338,29 @@ dapls_ib_reject_connection(
 	IN DAT_COUNT psize,
 	IN const DAT_PVOID pdata)
 {
-	struct iovec iovec[2];
+	struct iovec iov[2];
 
 	dapl_dbg_log (DAPL_DBG_TYPE_EP,
 		      " reject(cm %p reason %x, pdata %p, psize %d)\n",
 		      cm_ptr, reason, pdata, psize);
 
 	/* write reject data to indicate reject */
-	if (cm_ptr->socket >= 0) {
+	if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
 		cm_ptr->dst.rej = (uint16_t)reason;
 		cm_ptr->dst.rej = htons(cm_ptr->dst.rej);
-		iovec[0].iov_base = &cm_ptr->dst;
-		iovec[0].iov_len  = sizeof(ib_qp_cm_t);
+
+		iov[0].iov_base = (void *) &cm_ptr->dst;
+		iov[0].iov_len = sizeof(ib_qp_cm_t);
 		if (psize) {
-			iovec[1].iov_base = pdata;
-			iovec[2].iov_len = psize;
-			writev(cm_ptr->socket, &iovec[0], 2);
-		} else
-			writev(cm_ptr->socket, &iovec[0], 1);
-
-		close(cm_ptr->socket);
-		cm_ptr->socket = -1;
+			iov[1].iov_base = pdata;
+			iov[1].iov_len = psize;
+			writev(cm_ptr->socket, iov, 2);
+		} else {
+			writev(cm_ptr->socket, iov, 1);
+		}
+
+		closesocket(cm_ptr->socket);
+		cm_ptr->socket = DAPL_INVALID_SOCKET;
 	}
 
 	/* cr_thread will destroy CR */
@@ -1444,138 +1589,141 @@ dapls_ib_get_cm_event (
 }
 
 /* outbound/inbound CR processing thread to avoid blocking applications */
-#define SCM_MAX_CONN 8192
 void cr_thread(void *arg) 
 {
-    struct dapl_hca	*hca_ptr = arg;
-    dp_ib_cm_handle_t	cr, next_cr;
-    int 		opt,ret,idx;
-    socklen_t		opt_len;
-    char		rbuf[2];
-    struct pollfd	ufds[SCM_MAX_CONN];
-     
-    dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n",hca_ptr);
-
-    dapl_os_lock( &hca_ptr->ib_trans.lock );
-    hca_ptr->ib_trans.cr_state = IB_THREAD_RUN;
-    while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) {
-	idx=0;
-	ufds[idx].fd = g_scm_pipe[0]; /* wakeup and process work */
-        ufds[idx].events = POLLIN;
-	ufds[idx].revents = 0;
-
-	if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
-            next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list);
-	else
-	    next_cr = NULL;
-
-	while (next_cr) {
-	    cr = next_cr;
-	    if ((cr->socket == -1 && cr->state == SCM_DESTROY) ||
-		hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
-
-		dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: Free %p\n", cr);
-		next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list,
-						(DAPL_LLIST_ENTRY*)&cr->entry );
-		dapl_llist_remove_entry(&hca_ptr->ib_trans.list, 
-					(DAPL_LLIST_ENTRY*)&cr->entry);
-		dapl_os_free(cr, sizeof(*cr));
-		continue;
-	    }
-
-	    if (idx==SCM_MAX_CONN-1) {
-		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n",idx+1);
-		continue;
-	    }
-		
-	    /* Add to ufds for poll, check for immediate work */
-	    ufds[++idx].fd = cr->socket; /* add listen or cr */
-	    ufds[idx].revents = 0;
-	    if (cr->state == SCM_CONN_PENDING)
-	    	ufds[idx].events = POLLOUT;
-	    else
-		ufds[idx].events = POLLIN;
-
-	    /* check socket for event, accept in or connect out */
-	    dapl_dbg_log(DAPL_DBG_TYPE_CM," poll cr=%p, fd=%d,%d\n", 
-				cr, cr->socket, ufds[idx].fd);
-	    dapl_os_unlock(&hca_ptr->ib_trans.lock);
-	    ret = poll(&ufds[idx],1,0);
-	    dapl_dbg_log(DAPL_DBG_TYPE_CM,
-			 " poll wakeup ret=%d cr->st=%d"
-			 " ev=0x%x fd=%d\n",
-			 ret,cr->state,ufds[idx].revents,ufds[idx].fd);
-
-	    /* data on listen, qp exchange, and on disconnect request */
-	    if ((ret == 1) && ufds[idx].revents == POLLIN) {
-		if (cr->socket > 0) {
-			if (cr->state == SCM_LISTEN)
-				dapli_socket_accept(cr);
-			else if (cr->state == SCM_ACCEPTING)
-				dapli_socket_accept_data(cr);
-			else if (cr->state == SCM_ACCEPTED)
-				dapli_socket_accept_rtu(cr);
-			else if (cr->state == SCM_RTU_PENDING)
-				dapli_socket_connect_rtu(cr);
-			else if (cr->state == SCM_CONNECTED)
-				dapli_socket_disconnect(cr);
+	struct dapl_hca	*hca_ptr = arg;
+	dp_ib_cm_handle_t cr, next_cr;
+	int opt, ret;
+	socklen_t opt_len;
+	char rbuf[2];
+	struct dapl_fd_set *set;
+	enum DAPL_FD_EVENTS event;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n", hca_ptr);
+	set = dapl_alloc_fd_set();
+	if (!set)
+		goto out;
+
+	dapl_os_lock(&hca_ptr->ib_trans.lock);
+	hca_ptr->ib_trans.cr_state = IB_THREAD_RUN;
+
+	while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) {
+		dapl_fd_zero(set);
+		dapl_fd_set(g_scm_pipe[0], set, DAPL_FD_READ);
+
+		if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
+			next_cr = dapl_llist_peek_head(&hca_ptr->ib_trans.list);
+		else
+			next_cr = NULL;
+
+		while (next_cr) {
+			cr = next_cr;
+			if ((cr->socket == DAPL_INVALID_SOCKET && cr->state == SCM_DESTROY) ||
+				hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
+				next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list,
+						(DAPL_LLIST_ENTRY*)&cr->entry);
+				dapl_llist_remove_entry(&hca_ptr->ib_trans.list, 
+						(DAPL_LLIST_ENTRY*)&cr->entry);
+				dapl_os_free(cr, sizeof(*cr));
+				continue;
+			}
+
+			event = (cr->state == SCM_CONN_PENDING) ?
+				DAPL_FD_WRITE : DAPL_FD_READ;
+			if (dapl_fd_set(cr->socket, set, event)) {
+				dapl_log(DAPL_DBG_TYPE_ERR,
+					 " cr_thread: DESTROY CR st=%d fd %d"
+					 " -> %s\n", cr->state, cr->socket,
+					 inet_ntoa(((struct sockaddr_in*)
+					     &cr->dst.ia_address)->sin_addr));
+				dapli_cm_destroy(cr);
+				continue;
+			}
+
+			dapl_dbg_log(DAPL_DBG_TYPE_CM, " poll cr=%p, fd=%d\n",
+				cr, cr->socket);
+			dapl_os_unlock(&hca_ptr->ib_trans.lock);
+
+			ret = dapl_poll(cr->socket, event);
+
+			dapl_dbg_log(DAPL_DBG_TYPE_CM,
+				" poll wakeup ret=%d cr->st=%d fd=%d\n",
+				ret, cr->state, cr->socket);
+
+			/* data on listen, qp exchange, and on disconnect request */
+			if (ret == DAPL_FD_READ) {
+				if (cr->socket != DAPL_INVALID_SOCKET) {
+					switch (cr->state) {
+					case SCM_LISTEN:
+						dapli_socket_accept(cr);
+						break;
+					case SCM_ACCEPTING:
+						dapli_socket_accept_data(cr);
+						break;
+					case SCM_ACCEPTED:
+						dapli_socket_accept_rtu(cr);
+						break;
+					case SCM_RTU_PENDING:
+						dapli_socket_connect_rtu(cr);
+						break;
+					case SCM_CONNECTED:
+						dapli_socket_disconnect(cr);
+						break;
+					default:
+						break;
+					}
+				}
+			/* connect socket is writable, check status */
+			} else if (ret == DAPL_FD_WRITE || ret == DAPL_FD_ERROR) {
+				if (cr->state == SCM_CONN_PENDING) {
+					opt = 0;
+					ret = getsockopt(cr->socket, SOL_SOCKET,
+						SO_ERROR, (char *) &opt, &opt_len);
+					if (!ret)
+						dapli_socket_connected(cr, opt);
+					else
+						dapli_socket_connected(cr, errno);
+				} else {
+					dapl_log(DAPL_DBG_TYPE_CM,
+						" CM poll ERR, wrong state(%d) -> %s SKIP\n", cr->state,
+						inet_ntoa(((struct sockaddr_in*)&cr->dst.ia_address)->sin_addr));
+				}
+			} else if (ret != 0) {
+				dapl_log(DAPL_DBG_TYPE_CM,
+					" CM poll warning %s, ret=%d st=%d -> %s\n",
+					strerror(errno), ret, cr->state,
+					inet_ntoa(((struct sockaddr_in*)
+						&cr->dst.ia_address)->sin_addr));
+
+				/* POLLUP, NVAL, or poll error, issue event if connected */
+				if (cr->state == SCM_CONNECTED)
+					dapli_socket_disconnect(cr);
+			} 
+
+			dapl_os_lock(&hca_ptr->ib_trans.lock);
+			next_cr =  dapl_llist_next_entry(&hca_ptr->ib_trans.list,
+				(DAPL_LLIST_ENTRY*)&cr->entry);
 		}
-	    /* connect socket is writable, check status */
-	    } else if ((ret == 1) && 
-			(ufds[idx].revents & POLLOUT ||
-			 ufds[idx].revents & POLLERR)) {
-		if (cr->state == SCM_CONN_PENDING) {
-			opt = 0;
-			ret = getsockopt(cr->socket, SOL_SOCKET, 
-					 SO_ERROR, &opt, &opt_len);
-			if (!ret)
-				dapli_socket_connected(cr,opt);
-			else
-				dapli_socket_connected(cr,errno);
-		} else {
-			dapl_log(DAPL_DBG_TYPE_CM,
-				 " CM poll ERR, wrong state(%d) -> %s SKIP\n",
-				 cr->state,
-				 inet_ntoa(((struct sockaddr_in*)
-					&cr->dst.ia_address)->sin_addr));
+
+		dapl_os_unlock(&hca_ptr->ib_trans.lock);
+		dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, fds=%d\n",
+			     set->index+1);
+		dapl_select(set);
+		dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n");
+
+		/* if pipe used to wakeup, consume */
+		if (dapl_poll(g_scm_pipe[0], DAPL_FD_READ) == DAPL_FD_READ) {
+			if (read(g_scm_pipe[0], rbuf, 2) == -1)
+				dapl_log(DAPL_DBG_TYPE_CM,
+					 " cr_thread: read pipe error = %s\n", 
+					 strerror(errno));
 		}
-	    } else if (ret != 0) {
-		dapl_log(DAPL_DBG_TYPE_CM,
-			 " CM poll warning %s, ret=%d revnt=%x st=%d -> %s\n",
-			 strerror(errno), ret, ufds[idx].revents, cr->state,
-			 inet_ntoa(((struct sockaddr_in*)
-				&cr->dst.ia_address)->sin_addr));
-
-		/* POLLUP, NVAL, or poll error, issue event if connected */
-		if (cr->state == SCM_CONNECTED)
-			dapli_socket_disconnect(cr);
-	    } 
-	    dapl_os_lock(&hca_ptr->ib_trans.lock);
-	    next_cr =  dapl_llist_next_entry(&hca_ptr->ib_trans.list,
-					     (DAPL_LLIST_ENTRY*)&cr->entry);
+		dapl_os_lock(&hca_ptr->ib_trans.lock);
 	} 
+
 	dapl_os_unlock(&hca_ptr->ib_trans.lock);
-	dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, %d\n", idx+1);
-	poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */
-	/* if pipe used to wakeup, consume */
-	if (ufds[0].revents == POLLIN)
-		if (read(g_scm_pipe[0], rbuf, 2) == -1)
-			dapl_log(DAPL_DBG_TYPE_CM,
-				 " cr_thread: read pipe error = %s\n",
-				 strerror(errno));
-	dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n");
-	dapl_os_lock(&hca_ptr->ib_trans.lock);
-    } 
-    dapl_os_unlock(&hca_ptr->ib_trans.lock);	
-    hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT;
-    dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr);
+	free(set);
+out:
+	hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT;
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr);
 }
-
-/*
- * Local variables:
- *  c-indent-level: 4
- *  c-basic-offset: 4
- *  tab-width: 8
- * End:
- */
diff --git a/dapl/openib_scm/dapl_ib_cq.c b/dapl/openib_scm/dapl_ib_cq.c
index 7d6bd4f..59fff11 100644
--- a/dapl/openib_scm/dapl_ib_cq.c
+++ b/dapl/openib_scm/dapl_ib_cq.c
@@ -46,97 +46,111 @@
  *
  **************************************************************************/
 
+#include "openib_osd.h"
 #include "dapl.h"
 #include "dapl_adapter_util.h"
 #include "dapl_lmr_util.h"
 #include "dapl_evd_util.h"
 #include "dapl_ring_buffer_util.h"
-#include <sys/poll.h>
-#include <signal.h>
 
-int dapli_cq_thread_init(struct dapl_hca *hca_ptr)
+#if defined(_WIN64) || defined(_WIN32)
+void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
 {
-        DAT_RETURN dat_status;
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr);
 
-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr);
+	if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN)
+		return;
 
-        /* create thread to process inbound connect request */
-	hca_ptr->ib_trans.cq_state = IB_THREAD_INIT;
-        dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread);
-        if (dat_status != DAT_SUCCESS)
-        {
-                dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-                             " cq_thread_init: failed to create thread\n");
-                return 1;
-        }
+	/* destroy cr_thread and lock */
+	hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
+	SetEvent(hca_ptr->ib_trans.ib_cq->event);
+	dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr);
+	while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
+		dapl_os_sleep_usec(20000);
+	}
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",dapl_os_getpid());
+}
+
+static void cq_thread(void *arg)
+{
+	struct dapl_hca *hca_ptr = arg;
+	struct dapl_evd *evd_ptr;
+	struct ibv_cq   *ibv_cq = NULL;
+
+	hca_ptr->ib_trans.cq_state = IB_THREAD_RUN;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr);
 	
-	/* wait for thread to start */
-	while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) {
-                struct timespec sleep, remain;
-                sleep.tv_sec = 0;
-                sleep.tv_nsec = 20000000; /* 20 ms */
-                dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-                             " cq_thread_init: waiting for cq_thread\n");
-                nanosleep (&sleep, &remain);
-        }
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",getpid());
-        return 0;
+	/* wait on DTO event, or signal to abort */
+	while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
+		if (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq, (void*)&evd_ptr)) {
+
+			if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) {
+				ibv_ack_cq_events(ibv_cq, 1);
+				return;
+			}
+
+			/* process DTO event via callback */
+			dapl_evd_dto_callback(hca_ptr->ib_hca_handle, evd_ptr->ib_cq_handle,
+				(void*)evd_ptr );
+
+			ibv_ack_cq_events(ibv_cq, 1);
+		}
+	}
+	hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr);
 }
 
+#else // _WIN32 || _WIN64
+
 void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
 {
-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr);
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr);
 
 	if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN)
 		return;
 
-        /* destroy cr_thread and lock */
-        hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
-        pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1);
-        dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr);
-        while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
-                struct timespec sleep, remain;
-                sleep.tv_sec = 0;
-                sleep.tv_nsec = 2000000; /* 2 ms */
-                dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-                             " cq_thread_destroy: waiting for cq_thread\n");
-                nanosleep (&sleep, &remain);
-        }
-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid());
+	/* destroy cr_thread and lock */
+	hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
+	pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1);
+	dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr);
+	while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
+		dapl_os_sleep_usec(20000);
+	}
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",dapl_os_getpid());
 }
 
 /* catch the signal */
 static void ib_cq_handler(int signum)
 {
-        return;
+	return;
 }
 
-void cq_thread( void *arg )
+static void cq_thread(void *arg)
 {
-        struct dapl_hca *hca_ptr = arg;
-        struct dapl_evd *evd_ptr;
-        struct ibv_cq   *ibv_cq = NULL;
+	struct dapl_hca *hca_ptr = arg;
+	struct dapl_evd *evd_ptr;
+	struct ibv_cq   *ibv_cq = NULL;
 	sigset_t	sigset;
 
 	sigemptyset(&sigset);
-        sigaddset(&sigset,SIGUSR1);
-        pthread_sigmask(SIG_UNBLOCK, &sigset, NULL);
-        signal(SIGUSR1, ib_cq_handler);
+	sigaddset(&sigset,SIGUSR1);
+	pthread_sigmask(SIG_UNBLOCK, &sigset, NULL);
+	signal(SIGUSR1, ib_cq_handler);
 
 	hca_ptr->ib_trans.cq_state = IB_THREAD_RUN;
-	
+
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr);
 	
-        /* wait on DTO event, or signal to abort */
-        while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
-                struct pollfd cq_fd = {
-                        .fd      = hca_ptr->ib_trans.ib_cq->fd,
-                        .events  = POLLIN,
-                        .revents = 0
-                };
+	/* wait on DTO event, or signal to abort */
+	while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
+		struct pollfd cq_fd = {
+			.fd      = hca_ptr->ib_trans.ib_cq->fd,
+			.events  = POLLIN,
+			.revents = 0
+		};
 		if ((poll(&cq_fd, 1, -1) == 1) &&
-			(!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq,  
-				   &ibv_cq, (void*)&evd_ptr))) {
+			(!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq, (void*)&evd_ptr))) {
 
 			if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) {
 				ibv_ack_cq_events(ibv_cq, 1);
@@ -144,15 +158,40 @@ void cq_thread( void *arg )
 			}
 
 			/* process DTO event via callback */
-			dapl_evd_dto_callback ( hca_ptr->ib_hca_handle,
-						evd_ptr->ib_cq_handle,
-						(void*)evd_ptr );
+			dapl_evd_dto_callback(hca_ptr->ib_hca_handle,
+				evd_ptr->ib_cq_handle, (void*)evd_ptr );
 
 			ibv_ack_cq_events(ibv_cq, 1);
 		} 
-        }
-        hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr);
+	}
+	hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr);
+}
+
+#endif // _WIN32 || _WIN64
+
+
+int dapli_cq_thread_init(struct dapl_hca *hca_ptr)
+{
+	DAT_RETURN dat_status;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr);
+
+	/* create thread to process inbound connect request */
+	hca_ptr->ib_trans.cq_state = IB_THREAD_INIT;
+	dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread);
+	if (dat_status != DAT_SUCCESS) {
+		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
+			" cq_thread_init: failed to create thread\n");
+		return 1;
+	}
+
+	/* wait for thread to start */
+	while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) {
+		dapl_os_sleep_usec(20000);
+	}
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",dapl_os_getpid());
+	return 0;
 }
 
 
@@ -308,11 +347,11 @@ dapls_ib_cq_alloc (
 	IN  DAPL_EVD		*evd_ptr,
 	IN  DAT_COUNT		*cqlen )
 {
+	struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq;
+
 	dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, 
 		"dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen );
 
-	struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq;
-
 #ifdef CQ_WAIT_OBJECT
 	if (evd_ptr->cq_wait_obj_handle)
 		channel = evd_ptr->cq_wait_obj_handle;
diff --git a/dapl/openib_scm/dapl_ib_dto.h b/dapl/openib_scm/dapl_ib_dto.h
index 45000b9..fa19d01 100644
--- a/dapl/openib_scm/dapl_ib_dto.h
+++ b/dapl/openib_scm/dapl_ib_dto.h
@@ -147,12 +147,6 @@ dapls_ib_post_send (
 	IN  const DAT_RMR_TRIPLET	*remote_iov,
 	IN  DAT_COMPLETION_FLAGS	completion_flags)
 {
-	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " post_snd: ep %p op %d ck %p sgs",
-		     "%d l_iov %p r_iov %p f %d\n",
-		     ep_ptr, op_type, cookie, segments, local_iov, 
-		     remote_iov, completion_flags);
-
 	ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES];
 	ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL;
 	struct ibv_send_wr wr;
@@ -163,6 +157,12 @@ dapls_ib_post_send (
 	int ret;
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " post_snd: ep %p op %d ck %p sgs "
+		     "%d l_iov %p r_iov %p f %d\n",
+		     ep_ptr, op_type, cookie, segments, local_iov, 
+		     remote_iov, completion_flags);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
 		     " post_snd: ep %p cookie %p segs %d l_iov %p\n",
 		     ep_ptr, cookie, segments, local_iov);
 
@@ -317,12 +317,6 @@ dapls_ib_post_ext_send (
 	IN  DAT_COMPLETION_FLAGS	completion_flags,
 	IN  DAT_IB_ADDR_HANDLE		*remote_ah)
 {
-	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " post_ext_snd: ep %p op %d ck %p sgs",
-		     "%d l_iov %p r_iov %p f %d\n",
-		     ep_ptr, op_type, cookie, segments, local_iov, 
-		     remote_iov, completion_flags, remote_ah);
-
 	ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES];
 	ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL;
 	struct ibv_send_wr wr;
@@ -331,6 +325,12 @@ dapls_ib_post_ext_send (
 	int ret;
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " post_ext_snd: ep %p op %d ck %p sgs "
+		     "%d l_iov %p r_iov %p f %d ah %p\n",
+		     ep_ptr, op_type, cookie, segments, local_iov, 
+		     remote_iov, completion_flags, remote_ah);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
 		     " post_snd: ep %p cookie %p segs %d l_iov %p\n",
 		     ep_ptr, cookie, segments, local_iov);
 
diff --git a/dapl/openib_scm/dapl_ib_mem.c b/dapl/openib_scm/dapl_ib_mem.c
index 54340ed..9a97e5e 100644
--- a/dapl/openib_scm/dapl_ib_mem.c
+++ b/dapl/openib_scm/dapl_ib_mem.c
@@ -1,4 +1,4 @@
-/*
+	/*
  * Copyright (c) 2005-2007 Intel Corporation.  All rights reserved.
  *
  * This Software is licensed under one of the following licenses:
@@ -35,13 +35,6 @@
  *
  **********************************************************************/
 
-#include <sys/ioctl.h>  /* for IOCTL's */
-#include <sys/types.h>  /* for socket(2) and related bits and pieces */
-#include <sys/socket.h> /* for socket(2) */
-#include <net/if.h>     /* for struct ifreq */
-#include <net/if_arp.h> /* for ARPHRD_ETHER */
-#include <unistd.h>		/* for _SC_CLK_TCK */
-
 #include "dapl.h"
 #include "dapl_adapter_util.h"
 #include "dapl_lmr_util.h"
@@ -215,10 +208,9 @@ dapls_ib_mr_register(IN  DAPL_IA *ia_ptr,
 	lmr->param.registered_address = (DAT_VADDR)(uintptr_t)virt_addr;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, 
-		     " mr_register: mr=%p addr=%p h %x pd %p ctx %p "
+		     " mr_register: mr=%p addr=%p pd %p ctx %p "
 		     "lkey=0x%x rkey=0x%x priv=%x\n", 
 		     lmr->mr_handle, lmr->mr_handle->addr, 
-		     lmr->mr_handle->handle,	
 		     lmr->mr_handle->pd, lmr->mr_handle->context,
 		     lmr->mr_handle->lkey, lmr->mr_handle->rkey, 
 		     length, dapls_convert_privileges(privileges));
diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c
index 92b45d5..d82d3f5 100644
--- a/dapl/openib_scm/dapl_ib_util.c
+++ b/dapl/openib_scm/dapl_ib_util.c
@@ -49,17 +49,13 @@
 static const char rcsid[] = "$Id:  $";
 #endif
 
+#include "openib_osd.h"
 #include "dapl.h"
 #include "dapl_adapter_util.h"
 #include "dapl_ib_util.h"
+#include "dapl_osd.h"
 
 #include <stdlib.h>
-#include <netinet/tcp.h>
-#include <sys/utsname.h>
-#include <sys/socket.h>
-#include <arpa/inet.h>
-#include <unistd.h>	
-#include <fcntl.h>
 
 int g_dapl_loopback_connection = 0;
 int g_scm_pipe[2];
@@ -88,52 +84,43 @@ char *dapl_ib_mtu_str(enum ibv_mtu mtu)
 	}
 }
 
-/* just get IP address for hostname */
-DAT_RETURN getipaddr( char *addr, int addr_len)
+static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len)
 {
-	struct sockaddr_in	*ipv4_addr = (struct sockaddr_in*)addr;
-	struct hostent		*h_ptr;
-	struct utsname		ourname;
+	struct sockaddr_in *sin;
+	struct addrinfo *res, hint, *ai;
+	int ret;
+	char hostname[256];
 
-	if (uname(&ourname) < 0)  {
-		 dapl_log(DAPL_DBG_TYPE_ERR, 
-			  " open_hca: uname err=%s\n", strerror(errno));
+	if (addr_len < sizeof(*sin)) {
 		return DAT_INTERNAL_ERROR;
 	}
 
-	h_ptr = gethostbyname(ourname.nodename);
-	if (h_ptr == NULL) {
-		 dapl_log(DAPL_DBG_TYPE_ERR, 
-			  " open_hca: gethostbyname err=%s\n", 
-			  strerror(errno));
-		return DAT_INTERNAL_ERROR;
+	ret = gethostname(hostname,256);
+	if (ret) 
+		return ret;
+
+	memset(&hint, 0, sizeof hint);
+	hint.ai_flags = AI_PASSIVE; 
+	hint.ai_family = AF_INET;
+	hint.ai_socktype = SOCK_STREAM;
+	hint.ai_protocol = IPPROTO_TCP;
+
+	ret = getaddrinfo(hostname, NULL, &hint, &res);
+	if (ret) 
+		return ret;
+
+	ret = DAT_INVALID_ADDRESS;
+	for (ai = res; ai; ai = ai->ai_next) {
+		sin = (struct sockaddr_in *) ai->ai_addr;
+		if (*((uint32_t *) &sin->sin_addr) != htonl(0x7f000001)) {
+			*((struct sockaddr_in *) addr) = *sin;
+			ret = DAT_SUCCESS;
+			break;
+		}
 	}
 
-	if (h_ptr->h_addrtype == AF_INET) {
-		int i;
-		struct in_addr  **alist =
-			(struct in_addr **)h_ptr->h_addr_list;
-
-		*(uint32_t*)&ipv4_addr->sin_addr = 0;
-		ipv4_addr->sin_family = AF_INET;
-		
-		/* Walk the list of addresses for host */
-		for (i=0; alist[i] != NULL; i++) {
-		       /* first non-loopback address */			
-		       if (*(uint32_t*)alist[i] != htonl(0x7f000001)) {
-                               dapl_os_memcpy(&ipv4_addr->sin_addr,
-                                              h_ptr->h_addr_list[i],
-                                              4);
-                               break;
-                       }
-               }
-               /* if no acceptable address found */
-               if (*(uint32_t*)&ipv4_addr->sin_addr == 0)
-			return DAT_INVALID_ADDRESS;
-	} else 
-		return DAT_INVALID_ADDRESS;
-
-	return DAT_SUCCESS;
+	freeaddrinfo(res);
+	return ret;
 }
 
 /*
@@ -165,6 +152,28 @@ int32_t dapls_ib_release (void)
 	return 0;
 }
 
+#if defined(_WIN64) || defined(_WIN32)
+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	return 0;
+}
+#else // _WIN64 || _WIN32
+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
+{
+	int opts;
+
+	opts = fcntl(channel->fd, F_GETFL); /* uCQ */
+	if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) {
+		dapl_log(DAPL_DBG_TYPE_ERR, 
+			 " dapls_config_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n", 
+			 channel->fd, opts, strerror(errno));
+		return errno;
+	}
+
+	return 0;
+}
+#endif
+
 /*
  * dapls_ib_open_hca
  *
@@ -187,7 +196,6 @@ DAT_RETURN dapls_ib_open_hca (
         IN   DAPL_HCA		*hca_ptr)
 {
 	struct ibv_device **dev_list;
-	int		opts;
 	int		i;
 	DAT_RETURN	dat_status = DAT_SUCCESS;
 
@@ -219,7 +227,7 @@ found:
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", 
 		     ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
 		     (unsigned long long)
-		     bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev)));
+		     ntohll(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev)));
 
 	hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev);
 	if (!hca_ptr->ib_hca_handle) {
@@ -268,13 +276,7 @@ found:
 		goto bail;
 	}
 
-	opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */
-	if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, 
-			      F_SETFL, opts | O_NONBLOCK) < 0) {
-		dapl_log(DAPL_DBG_TYPE_ERR, 
-			 " open_hca: fcntl on ib_cq->fd %d ERR %d %s\n", 
-			 hca_ptr->ib_trans.ib_cq->fd, opts,
-			 strerror(errno));
+	if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) {
 		goto bail;
 	}
 
@@ -309,16 +311,11 @@ found:
 	
 	/* wait for thread */
 	while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
-		struct timespec	sleep, remain;
-		sleep.tv_sec = 0;
-		sleep.tv_nsec = 2000000; /* 2 ms */
-		dapl_dbg_log(DAPL_DBG_TYPE_UTIL, 
-			     " open_hca: waiting for cr_thread\n");
-		nanosleep (&sleep, &remain);
+		dapl_os_sleep_usec(20000);
 	}
 
 	/* get the IP address of the device */
-	dat_status = getipaddr((char*)&hca_ptr->hca_address, 
+	dat_status = getlocalipaddr((DAT_SOCK_ADDR*) &hca_ptr->hca_address,
 				sizeof(DAT_SOCK_ADDR6));
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, 
@@ -376,16 +373,13 @@ DAT_RETURN dapls_ib_close_hca (	IN   DAPL_HCA	*hca_ptr )
 			 " thread_destroy: thread wakeup err = %s\n", 
 			 strerror(errno));
 	while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) {
-		struct timespec	sleep, remain;
-		sleep.tv_sec = 0;
-		sleep.tv_nsec = 2000000; /* 2 ms */
 		dapl_dbg_log(DAPL_DBG_TYPE_UTIL, 
 			     " close_hca: waiting for cr_thread\n");
 		if (write(g_scm_pipe[1], "w", sizeof "w") == -1)
 			dapl_log(DAPL_DBG_TYPE_UTIL, 
 				 " thread_destroy: thread wakeup err = %s\n", 
 				 strerror(errno));
-		nanosleep (&sleep, &remain);
+		dapl_os_sleep_usec(20000);
 	}
 	dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
 
diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h
index 863da2b..fd1c24e 100644
--- a/dapl/openib_scm/dapl_ib_util.h
+++ b/dapl/openib_scm/dapl_ib_util.h
@@ -49,8 +49,8 @@
 #ifndef _DAPL_IB_UTIL_H_
 #define _DAPL_IB_UTIL_H_
 
+#include "openib_osd.h"
 #include <infiniband/verbs.h>
-#include <byteswap.h>
 
 #ifdef DAT_EXTENSIONS
 #include <dat2/dat_ib_extensions.h>
@@ -73,8 +73,6 @@ typedef	struct ibv_wc		ib_work_completion_t;
 typedef	struct ibv_context	*ib_hca_handle_t;
 typedef ib_hca_handle_t		dapl_ibal_ca_t;
 
-/* CM mappings, user CM not complete use SOCKETS */
-
 /* destination info to exchange, define wire protocol version */
 #define DSCM_VER 3
 typedef struct _ib_qp_cm
@@ -86,7 +84,7 @@ typedef struct _ib_qp_cm
 	uint32_t		qpn;
 	uint32_t		p_size;
 	DAT_SOCK_ADDR6		ia_address;
-        union ibv_gid		gid;
+	union ibv_gid		gid;
 	uint16_t		qp_type; 
 } ib_qp_cm_t;
 
@@ -110,20 +108,18 @@ struct ib_cm_handle
 	struct dapl_llist_entry	entry;
 	DAPL_OS_LOCK		lock;
 	SCM_STATE		state;
-	int			socket;
+	DAPL_SOCKET		socket;
 	struct dapl_hca		*hca;
 	struct dapl_sp		*sp;	
-	struct dapl_ep 		*ep;	
+	struct dapl_ep 		*ep;
 	ib_qp_cm_t		dst;
-	unsigned char		p_data[256];
+	unsigned char		p_data[256];	/* must follow ib_qp_cm_t */
 	struct ibv_ah		*ah;
 };
 
 typedef struct ib_cm_handle	*dp_ib_cm_handle_t;
 typedef dp_ib_cm_handle_t	ib_cm_srvc_handle_t;
 
-DAT_RETURN getipaddr(char *addr, int addr_len);
-
 /* CM events */
 typedef enum 
 {
@@ -141,9 +137,6 @@ typedef enum
 
 } ib_cm_events_t;
 
-/* prototype for cm thread */
-void cr_thread (void *arg);
-
 /* Operation and state mappings */
 typedef enum	ibv_send_flags	ib_send_op_type_t;
 typedef	struct	ibv_sge		ib_data_segment_t;
@@ -289,7 +282,7 @@ typedef struct _ib_hca_transport
 	DAPL_OS_LOCK		cq_lock;	
 	int			max_inline_send;
 	ib_thread_state_t       cq_state;
-	DAPL_OS_THREAD          cq_thread;
+	DAPL_OS_THREAD			cq_thread;
 	struct ibv_comp_channel *ib_cq;
 	int			cr_state;
 	DAPL_OS_THREAD		thread;
@@ -317,7 +310,6 @@ typedef uint32_t ib_shm_transport_t;
 /* prototypes */
 int32_t	dapls_ib_init (void);
 int32_t	dapls_ib_release (void);
-void cq_thread (void *arg);
 void cr_thread(void *arg);
 int dapli_cq_thread_init(struct dapl_hca *hca_ptr);
 void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr);
@@ -349,7 +341,7 @@ dapl_convert_errno( IN int err, IN const char *str )
     if (!err)	return DAT_SUCCESS;
     	
 #if DAPL_DBG
-    if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT))
+    if ((err != EAGAIN) && (err != ETIMEDOUT))
 	dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err));
 #endif 
 
@@ -357,24 +349,15 @@ dapl_convert_errno( IN int err, IN const char *str )
     {
 	case EOVERFLOW	: return DAT_LENGTH_ERROR;
 	case EACCES	: return DAT_PRIVILEGES_VIOLATION;
-	case ENXIO	: 
-	case ERANGE	: 
 	case EPERM	: return DAT_PROTECTION_VIOLATION;		  
-	case EINVAL	:
-        case EBADF	: 
-	case ENOENT	:
-	case ENOTSOCK	: return DAT_INVALID_HANDLE;
+	case EINVAL	: return DAT_INVALID_HANDLE;
     	case EISCONN	: return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_CONNECTED;
     	case ECONNREFUSED : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_NOTREADY;
-	case ETIME	:	    
 	case ETIMEDOUT	: return DAT_TIMEOUT_EXPIRED;
     	case ENETUNREACH: return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_UNREACHABLE;
     	case EADDRINUSE	: return DAT_CONN_QUAL_IN_USE;
     	case EALREADY	: return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_ACTCONNPENDING;
-        case ENOSPC	: 
-	case ENOMEM	:
-        case E2BIG	:
-        case EDQUOT	: return DAT_INSUFFICIENT_RESOURCES;
+	case ENOMEM	: return DAT_INSUFFICIENT_RESOURCES;
         case EAGAIN	: return DAT_QUEUE_EMPTY;
 	case EINTR	: return DAT_INTERRUPTED_CALL;
     	case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_MALFORMED;
diff --git a/dapl/openib_scm/linux/openib_osd.h b/dapl/openib_scm/linux/openib_osd.h
new file mode 100644
index 0000000..235a82e
--- /dev/null
+++ b/dapl/openib_scm/linux/openib_osd.h
@@ -0,0 +1,21 @@
+#ifndef OPENIB_OSD_H
+#define OPENIB_OSD_H
+
+#include <endian.h>
+#include <netinet/in.h>
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#define htonll(x) (x)
+#define ntohll(x) (x)
+#elif __BYTE_ORDER == __LITTLE_ENDIAN
+#define htonll(x)  bswap_64(x)
+#define ntohll(x)  bswap_64(x)
+#endif
+
+#define DAPL_SOCKET int
+#define DAPL_INVALID_SOCKET -1
+#define DAPL_FD_SETSIZE 8192
+
+#define closesocket close
+
+#endif // OPENIB_OSD_H
diff --git a/dapl/openib_scm/windows/openib_osd.h b/dapl/openib_scm/windows/openib_osd.h
new file mode 100644
index 0000000..67c70ec
--- /dev/null
+++ b/dapl/openib_scm/windows/openib_osd.h
@@ -0,0 +1,39 @@
+#ifndef OPENIB_OSD_H
+#define OPENIB_OSD_H
+
+#ifndef FD_SETSIZE
+#define FD_SETSIZE 1024 /* Set before including winsock2 - see select help */
+#define DAPL_FD_SETSIZE FD_SETSIZE
+#endif
+
+#include <winsock2.h>
+#include <ws2tcpip.h>
+#include <io.h>
+#include <fcntl.h>
+
+#define ntohll _byteswap_uint64
+#define htonll _byteswap_uint64
+
+#define pipe(x) _pipe(x, 4096, _O_TEXT)
+#define read _read
+#define write _write
+#define DAPL_SOCKET SOCKET
+#define DAPL_INVALID_SOCKET INVALID_SOCKET
+
+/* allow casting to WSABUF */
+struct iovec
+{
+       u_long iov_len;
+       char FAR* iov_base;
+};
+
+static int writev(DAPL_SOCKET s, struct iovec *vector, int count)
+{
+       int len, ret;
+
+       ret = WSASend(s, (WSABUF *) vector, count, &len, 0, NULL, NULL);
+       return ret ? ret : len;
+}
+
+#endif // OPENIB_OSD_H
+
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 6fef9af..ae02944 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -302,6 +302,15 @@ dapl_os_thread_create (
 	IN  void			*data,
 	OUT DAPL_OS_THREAD		*thread_id );
 
+STATIC _INLINE_ void
+dapl_os_sleep_usec(int usec)
+{
+	struct timespec sleep, remain;
+
+	sleep.tv_sec = 0;
+	sleep.tv_nsec = usec * 1000;
+	nanosleep(&sleep, &remain);
+}
 
 /*
  * Lock Functions
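
A note on the dapl_os_sleep_usec() helper added above: it sets tv_sec to 0 and
tv_nsec to usec * 1000, which is fine for the 20000 usec delays used by the
callers in this patch, but nanosleep() rejects a tv_nsec of one second or more
with EINVAL. The following is only a minimal sketch of a variant that splits
the value into seconds and nanoseconds and resumes after EINTR. The names
example_usec_to_timespec and example_sleep_usec are hypothetical and are not
part of the patch or of uDAPL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper: convert microseconds to a normalized timespec,
 * so that tv_nsec always stays below one second. */
static struct timespec example_usec_to_timespec(long usec)
{
	struct timespec ts;

	assert(usec >= 0);
	ts.tv_sec = usec / 1000000L;
	ts.tv_nsec = (usec % 1000000L) * 1000L;
	return ts;
}

/* Hypothetical variant of dapl_os_sleep_usec(): sleeps for the full
 * interval, resuming with the remaining time if a signal interrupts. */
static int example_sleep_usec(long usec)
{
	struct timespec req = example_usec_to_timespec(usec);
	struct timespec rem;

	while (nanosleep(&req, &rem) == -1) {
		if (errno != EINTR)
			return -1;	/* EINVAL or another hard error */
		req = rem;		/* continue with the time left */
	}
	return 0;
}
```

Whether resuming after EINTR is desirable depends on the caller: the polling
loops in this patch re-check the thread state after every sleep, so an early
wakeup is harmless there and the simpler helper is sufficient.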


From vitto.giova at yahoo.it  Fri Feb 13 19:34:03 2009
From: vitto.giova at yahoo.it (Vittorio)
Date: Sat, 14 Feb 2009 04:34:03 +0100
Subject: [ofa-general] ***SPAM*** troubleshooting with infinband
Message-ID: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>

Hello!
This is my first message on the list, so I hope I'm not asking a silly or
already answered question.

I'm a student, and I'm porting an electromagnetic field simulator to a
parallel and distributed Linux cluster for my final thesis. I'm using both
OpenMP and MPI over InfiniBand to improve speed.

The OpenMP part is done, and now I'm having problems setting up MPI over
InfiniBand.
I have set up the kernel modules correctly,
installed the right drivers for the board (Mellanox HCA) and the userspace
programs,
and installed the MVAPICH2 MPI implementation.

However, I can't get all of this to work together.
For example, ibhosts correctly finds the two connected nodes:

Ca    : 0x0002c90300018b8e ports 2 " HCA-1"
Ca    : 0x0002c90300018b12 ports 2 "localhost HCA-1"

but ibping doesn't receive any responses:

ibwarn: [32052] ibping: Ping..
ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
ibwarn: [32052] main: ibping to Lid 2 failed

Subsequently, every other MPI operation fails.
Strangely enough, IPoIB works very well, and I can ping and connect
with no problems.

The two machines are identical and are connected directly with a crossover
cable (point to point).
lspci identifies the boards as
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0
2.5GT/s] (rev a0)

What could be causing all of this? Am I forgetting something?
Any help is greatly appreciated.
Thank you
Vittorio

From dotanba at gmail.com  Fri Feb 13 23:23:40 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Sat, 14 Feb 2009 09:23:40 +0200
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** troubleshooting with infinband
In-Reply-To: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>
References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>
Message-ID: <4996717C.8000005@gmail.com>

Vittorio wrote:
> Hello!
> This is my first message on the list, so I hope I'm not asking a silly
> or already answered question.
>
> I'm a student, and I'm porting an electromagnetic field simulator to a
> parallel and distributed Linux cluster for my final thesis. I'm using
> both OpenMP and MPI over InfiniBand to improve speed.
>
> The OpenMP part is done, and now I'm having problems setting up MPI
> over InfiniBand.
> I have set up the kernel modules correctly,
> installed the right drivers for the board (Mellanox HCA) and the
> userspace programs,
> and installed the MVAPICH2 MPI implementation.
>
> However, I can't get all of this to work together.
> For example, ibhosts correctly finds the two connected nodes:
>
> Ca    : 0x0002c90300018b8e ports 2 " HCA-1"
> Ca    : 0x0002c90300018b12 ports 2 "localhost HCA-1"
>
> but ibping doesn't receive any responses:
>
> ibwarn: [32052] ibping: Ping..
> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
> ibwarn: [32052] main: ibping to Lid 2 failed
>
> Subsequently, every other MPI operation fails.
> Strangely enough, IPoIB works very well, and I can ping and
> connect with no problems.
>
> The two machines are identical and are connected directly with a
> crossover cable (point to point).
> lspci identifies the boards as
> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, 
> PCIe 2.0 2.5GT/s] (rev a0)
>
> What could be causing all of this? Am I forgetting something?
> Any help is greatly appreciated.
> Thank you
> Vittorio
I suggest that you run ibv_rc_pingpong to check that the IB
connectivity is OK.
Then run rping to check that ib_cma is OK.

Those two tests are a good starting point for finding the problem
(run them on every active port that you have).


Dotan


From dotanba at gmail.com  Fri Feb 13 23:33:49 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Sat, 14 Feb 2009 09:33:49 +0200
Subject: ***SPAM*** Re: [ofa-general] non zero lkey in send(),
	write() with  num_sge > 1?
In-Reply-To: <809230.93598.qm@web111213.mail.gq1.yahoo.com>
References: <809230.93598.qm@web111213.mail.gq1.yahoo.com>
Message-ID: <499673DD.6090008@gmail.com>

Bill N wrote:
>>> Can stack pass num_sge > 1, and lkey !=0 as part of
>>>       
>> sg_list[] elements, in post_send() call?
>>     
>>>   
>>>       
>> What are you trying to achieve?
>>     
> [Bill]
> I just wanted to confirm, that even when Stag !=0,
> (a) there can be multiple SGEs in the list with different lkey and TO.
> And
> (b) HCAs have to validate each of the SGE entry against the lkey.
>
> Want to ensure that 
> - As RDMA ULP I can invoke post_send() with multiple lkeys and utilize the allocated MRs, HCAs are designed to handle that.
>
> Any example ULP we are aware of that does this?
>
> Regards,
> Bill
>   
Assuming we are talking about the following scenario, for example
with num_sge = 3:

sg_list[0].lkey=A
sg_list[1].lkey=B
sg_list[2].lkey=C


then here is the answer:

I checked the code of the ULPs that are part of the Linux kernel and 
found that none of them uses several
SGEs from different memory regions:
most of the ULPs use only one SGE, and those that use more than one use 
the same lkey for all of them.

 From my experience, I can tell you that the OFED stack supports this feature
(and many HCAs support it too).

If you see otherwise, there is a bug somewhere.
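
To make the scenario concrete, here is a minimal sketch in plain C. It
does not use libibverbs: struct sge only mirrors the addr/length/lkey
fields of a scatter/gather entry, and struct region, sge_valid and
sg_list_valid are hypothetical names that model the per-SGE lkey check,
not real driver or HCA code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registered memory region (MR). */
struct region {
	uint32_t lkey;
	uint64_t addr;
	uint64_t length;
};

/* Mirrors the addr/length/lkey fields of a scatter/gather entry. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* An SGE is valid if some region has its lkey and fully contains it. */
static int sge_valid(const struct sge *s, const struct region *mr,
		     size_t num_mr)
{
	size_t i;

	for (i = 0; i < num_mr; i++) {
		if (mr[i].lkey != s->lkey)
			continue;
		/* Containment test written so that it cannot overflow. */
		if (s->addr >= mr[i].addr &&
		    s->length <= mr[i].length &&
		    s->addr - mr[i].addr <= mr[i].length - s->length)
			return 1;
	}
	return 0;
}

/* Each entry is checked against its own lkey, so A, B and C may differ. */
static int sg_list_valid(const struct sge *sg_list, int num_sge,
			 const struct region *mr, size_t num_mr)
{
	int i;

	for (i = 0; i < num_sge; i++)
		if (!sge_valid(&sg_list[i], mr, num_mr))
			return 0;
	return 1;
}
```

In this model a list with three entries from three different regions
passes as long as every entry lies inside the region named by its own
lkey; an entry that uses the wrong lkey for its address fails the whole
list.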

Dotan 


From dotanba at gmail.com  Fri Feb 13 23:42:26 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Sat, 14 Feb 2009 09:42:26 +0200
Subject: [ofa-general] ib_create_qp and ib_get_err_str weirdness
In-Reply-To: <01fa01c98df0$47baed30$0100000a@DIEGO>
References: <01fa01c98df0$47baed30$0100000a@DIEGO>
Message-ID: <499675E2.3060703@gmail.com>

Hi.

Diego Guella wrote:
> Hello,
>
> I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR single-port card.
> I noticed a strange behavior of ib_create_qp function:
>
> -----
> memset(&qp_create, 0, sizeof(qp_create));
> qp_create.qp_type = IB_QPT_RELIABLE_CONN; // Reliable Connected
> qp_create.sq_depth = ctx->qdepth;
> qp_create.rq_depth = ctx->qdepth;
> qp_create.sq_sge = ctx->hca_attr->max_sges;
> qp_create.rq_sge = ctx->hca_attr->max_sges;
> qp_create.h_sq_cq = ctx->cq_h;
> qp_create.h_rq_cq = ctx->cq_h;
> qp_create.h_srq = NULL;
> qp_create.sq_signaled = 1;
> ctx->qp_h = 0;
> rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h);
> -----
> return value ("rc") is 3 (=IB_INVALID_PARAMETER).
>
> I spent some time figuring out the problem was the SQ SGE value:
> http://lists.openfabrics.org/pipermail/general/2006-June/023417.html
>
> According to iba/ib_al.h:
> -----
> * IB_INVALID_MAX_SGE
> * The requested maximum number of scatter-gather entries for the send or
> * receive queue could not be supported.
> -----
> so, why the return value isn't 22 (=IB_INVALID_MAX_SGE)?
>
> In the discussion I mentioned, it turned out that even using 
> hca_attr->max_sges there is the possibility that ib_create_qp fails.
> Which is my case.
> I have the need to send some audio buffers (32 or more) from an IO 
> node to a computing node using RDMA WRITE.
> The ownership of the buffers is of the audio driver, and I haven't the 
> guarantee that the audio buffers are contiguous.
> I was trying and send them using the lowest possible number of WR, 
> each one with the highest possible number of sge.
> But, given the hca_attr->max_sge unreliability, how do you recommend 
> to achieve this goal?
I have seen code that is aware of this problem: it tries to create a QP with the 
maximum number of sge and, upon failure, decreases this value
until the QP can be created.

If you use the maximum supported number of sge minus a constant (let's 
say: 2), it should always be o.k.
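
The decrease-until-it-works approach can be sketched in plain C as
follows. create_fn, create_with_fallback and fake_create are
hypothetical names: the callback stands in for the real QP creation
call, and fake_create only simulates a device whose real limit is lower
than the advertised one.

```c
/* Hypothetical callback: returns 0 if a QP with 'sge' entries was created. */
typedef int (*create_fn)(unsigned int sge, void *ctx);

/*
 * Start from the advertised maximum and decrease the SGE count until
 * creation succeeds. Returns the SGE count that worked, or 0 if even a
 * single SGE failed.
 */
static unsigned int create_with_fallback(unsigned int max_sge,
					 create_fn create, void *ctx)
{
	unsigned int sge;

	for (sge = max_sge; sge > 0; sge--)
		if (create(sge, ctx) == 0)
			return sge;
	return 0;
}

/* Demo stand-in: pretends the device really supports only *ctx entries. */
static int fake_create(unsigned int sge, void *ctx)
{
	unsigned int real_limit = *(unsigned int *)ctx;

	return sge <= real_limit ? 0 : -1;
}
```

Real code would probably want to retry only when the failure is
plausibly caused by the SGE count, and give up on other errors instead
of walking all the way down to 1.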

> Should I post a WR for each buffer I'd want to send through RDMA WRITE?
If you are talking about local buffers, then you can send data from 
several buffers using the same SR.
If you are talking about remote buffers, then you have to use a different 
SR for every remote buffer that you want to fill.
>
>
> Another less-related problem:
> ib_get_err_str is not correct for every input value, for example I 
> noticed that for
> ib_get_err_str(IB_INVALID_PD_HANDLE) the string returned is 
> IB_INVALID_MR_HANDLE
>
>
> I don't know if these problems apply to linux too, so I'm including 
> general list.
In Linux the return values are different (usually, -1 means that there 
is an error and that's all...).
I believe that the error exists only in the win-ofa code.

Dotan


From vlad at lists.openfabrics.org  Sat Feb 14 03:14:08 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 14 Feb 2009 03:14:08 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090214-0200 daily build status
Message-ID: <20090214111408.6399DE6101C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Sat Feb 14 04:07:57 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 14 Feb 2009 07:07:57 -0500
Subject: Re: [ofa-general] troubleshooting with infinband
In-Reply-To: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>
References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>
Message-ID: <f0e08f230902140407j7f8e1effke902d695456e44ef@mail.gmail.com>

On Fri, Feb 13, 2009 at 10:34 PM, Vittorio <vitto.giova at yahoo.it> wrote:
> Hello!
> This is my first message on the list so i hope that i'm not going to ask
> silly or already answered question
>
> i'm a student and i'm porting an electromagnetic field simulator to a
> parallel and distributed linux cluster for final thesis; i'm using both
> OpenMP and MPI over Infiniband to achieve speed improvements
>
> the openmp part is done and now i'm facing problems with setting up MPI over
> InfiniBand
> i have correctly set up the kernel modules
> installed the right drivers for the board (mellanox hca) and userspace
> programs
> installed the mvapich2 mpi implementation
>
> however i fail to run all of this together:
> for example ibhosts correctly finds the two nodes connected
>
> Ca    : 0x0002c90300018b8e ports 2 " HCA-1"
> Ca    : 0x0002c90300018b12 ports 2 "localhost HCA-1"
>
> but ibping doesn't receive responses
>
> ibwarn: [32052] ibping: Ping..
> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
> ibwarn: [32052] main: ibping to Lid 2 failed

This would be expected if no ibping server was running on the lid 2 machine.

-- Hal

> subsequently any other operation with MPI fails
> strangely enough however IPoIB works very well and i can ping and connect
> with no problems

> the two machines are identical and they use a crossover cable (point to
> point)
> lspci identifies the boards as
> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0
> 2.5GT/s] (rev a0)
>
> what can be the cause of all of this? am i forgetting something?
> any help is greatly appreciated
> Thank you
> Vittorio
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>


From hnrose at comcast.net  Sat Feb 14 04:37:36 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 07:37:36 -0500
Subject: [ofa-general] Test
Message-ID: <20090214123736.GA25106@comcast.net>

Please ignore.

-- Hal


From vitto.giova at yahoo.it  Sat Feb 14 05:49:54 2009
From: vitto.giova at yahoo.it (Vittorio)
Date: Sat, 14 Feb 2009 14:49:54 +0100
Subject: Re: [ofa-general] troubleshooting with infinband
In-Reply-To: <4996717C.8000005@gmail.com>
References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>
	<4996717C.8000005@gmail.com>
Message-ID: <4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com>

thanks for the suggestion, but i can't understand which kind of address i
should pass to the two commands
i tried ibping with the server running (as suggested) and it works with -G <port>
or with the lid

but what should i pass as the argument to ibv_rc_pingpong and rping?

thanks a lot
Vittorio

On Sat, Feb 14, 2009 at 8:23 AM, Dotan Barak <dotanba at gmail.com> wrote:

> Vittorio wrote:
>
>> Hello!
>> This is my first message on the list so i hope that i'm not going to ask
>> silly or already answered question
>>
>> i'm a student and i'm porting an electromagnetic field simulator to a
>> parallel and distributed linux cluster for final thesis; i'm using both
>> OpenMP and MPI over Infiniband to achieve speed improvements
>>
>> the openmp part is done and now i'm facing problems with setting up MPI
>> over InfiniBand
>> i have correctly set up the kernel modules
>> installed the right drivers for the board (mellanox hca) and userspace
>> programs
>> installed the mvapich2 mpi implementation
>>
>> however i fail to run all of this together:
>> for example ibhosts correctly finds the two nodes connected
>>
>> Ca    : 0x0002c90300018b8e ports 2 " HCA-1"
>> Ca    : 0x0002c90300018b12 ports 2 "localhost HCA-1"
>>
>> but ibping doesn't receive responses
>>
>> ibwarn: [32052] ibping: Ping..
>> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
>> ibwarn: [32052] main: ibping to Lid 2 failed
>>
>> subsequently any other operation with MPI fails
>> strangely enough however IPoIB works very well and i can ping and connect
>> with no problems
>>
>> the two machines are identical and they use a crossover cable (point to
>> point)
>> lspci identifies the boards as
>> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe
>> 2.0 2.5GT/s] (rev a0)
>>
>> what can be the cause of all of this? am i forgetting something?
>> any help is greatly appreciated
>> Thank you
>> Vittorio
>>
> I suggest that you will execute the ibv_rc_pingpong  and see that the IB
> connectivity is o.k..
> Then try to execute rping to check that the ib_cma is o.k..
>
> Those will be a good start point to find the problem
> (do it for all of the active ports that you have).
>
>
> Dotan
>

From hnrose at comcast.net  Sat Feb 14 05:53:08 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:53:08 -0500
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: Add port
	counters to __osm_disp_msg_str
Message-ID: <20090214135308.GB25402@comcast.net>


>From d9c17a8251b874c33542a19a51d1332ea3196713 Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Thu, 12 Feb 2009 09:27:46 -0500
Subject: [PATCH] opensm/osm_helper.c: Add port counters to  __osm_disp_msg_str

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_helper.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index e2ad4e7..c56f5b2 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -2101,6 +2101,7 @@ static const char *const __osm_disp_msg_str[] = {
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
 	"OSM_MSG_MAD_MULTIPATH_RECORD",
 #endif
+	"OSM_MSG_MAD_PORT_COUNTERS",
 	"UNKNOWN!!"
 };
 
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 05:51:39 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:51:39 -0500
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for
	some OSM_LOG prin
Message-ID: <20090214135139.GA25402@comcast.net>


>From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Tue, 10 Feb 2009 07:14:32 -0500
Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_ucast_mgr.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 7232fbc..e404c91 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -786,7 +786,7 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t *m)
 	int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl);
 	void **s = malloc(num * sizeof(*s));
 	if (!s) {
-		OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: "
+		OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR 3A0C: "
 			"No memory, skip by switch load sorting.\n");
 		return;
 	}
@@ -814,7 +814,7 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 
 		if (parse_node_map(p_mgr->p_subn->opt.guid_routing_order_file,
 				   add_guid_to_order_list, p_mgr))
-			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : "
+			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0D: "
 				"cannot parse guid routing order file \'%s\'\n",
 				p_mgr->p_subn->opt.guid_routing_order_file);
 	} else
@@ -825,7 +825,7 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 				   clear_prof_ignore_flag, NULL);
 		if (parse_node_map(p_mgr->p_subn->opt.port_prof_ignore_file,
 				   mark_ignored_port, p_mgr)) {
-			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : "
+			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0E: "
 				"cannot parse port prof ignore file \'%s\'\n",
 				p_mgr->p_subn->opt.port_prof_ignore_file);
 		}
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 05:55:50 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:55:50 -0500
Subject: [ofa-general] [PATCH] opensm/osm_console.c: Add missing command in
	help_perfmgr
Message-ID: <20090214135550.GE25402@comcast.net>


>From 7faaf4e757c42a8f57fd5b02f425266f2eb853b2 Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Fri, 13 Feb 2009 13:32:43 -0500
Subject: [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_console.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index fe5994b..a66a7d3 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -204,7 +204,7 @@ static void help_dump_conf(FILE *out, int detail)
 static void help_perfmgr(FILE * out, int detail)
 {
 	fprintf(out,
-		"perfmgr [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n");
+		"perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n");
 	if (detail) {
 		fprintf(out,
 			"perfmgr -- print the performance manager state\n");
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 05:57:00 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:57:00 -0500
Subject: [ofa-general] [PATCH] ibsim/sim_net.c: In new_node,
	fix nodetype in nodeinfo for router nodes
Message-ID: <20090214135700.GF25402@comcast.net>


>From 17350f5a17ec5ec821607aae7bf94a88b84d6e74 Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Thu, 12 Feb 2009 10:57:20 -0500
Subject: [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 ibsim/sim_net.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c
index 7a42cb6..f0628ec 100644
--- a/ibsim/sim_net.c
+++ b/ibsim/sim_net.c
@@ -322,6 +322,8 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports)
 		guids[type]++;	// reserve single guid;
 	} else {
 		memcpy(nd->nodeinfo, hcanodeinfo, sizeof(nd->nodeinfo));
+		if (type == ROUTER_NODE)
+			mad_set_field(nd->nodeinfo, 0, IB_NODE_TYPE_F, ROUTER_NODE);
 		guids[type] += nodeports + 1;	// reserve guids;
 	}
 
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 05:54:09 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:54:09 -0500
Subject: [ofa-general] [PATCH] opensm/osm_console.c: Add list of
	SMs to status command
Message-ID: <20090214135409.GC25402@comcast.net>


>From debc6e1f5bd225449ca897264948b08ccf69de38 Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Fri, 13 Feb 2009 09:49:36 -0500
Subject: [PATCH] opensm/osm_console.c: Add list of SMs to status command

Also, add SM priority into status command

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_console.c |   38 ++++++++++++++++++++++++++++++++++----
 1 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 5bc1079..f06eb52 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2005-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -303,13 +304,13 @@ static char *sm_state_str(int state)
 	case IB_SMINFO_STATE_DISCOVERING:
 		return ("Discovering");
 	case IB_SMINFO_STATE_STANDBY:
-		return ("Standby");
+		return ("Standby    ");
 	case IB_SMINFO_STATE_NOTACTIVE:
-		return ("Not Active");
+		return ("Not Active ");
 	case IB_SMINFO_STATE_MASTER:
-		return ("Master");
+		return ("Master     ");
 	}
-	return ("UNKNOWN");
+	return ("UNKNOWN    ");
 }
 
 static char *sa_state_str(osm_sa_state_t state)
@@ -323,6 +324,32 @@ static char *sa_state_str(osm_sa_state_t state)
 	return ("UNKNOWN");
 }
 
+static void dump_sms(osm_opensm_t * p_osm, FILE * out)
+{
+	osm_subn_t *p_subn = &p_osm->subn;
+	osm_remote_sm_t *p_rsm;
+
+	fprintf(out, "\n   Known SMs\n"
+		     "   ---------\n");
+	fprintf(out, "   Port GUID       SM State    Priority\n");
+	fprintf(out, "   ---------       --------    --------\n");
+	fprintf(out, "   0x%" PRIx64 " %s %d        SELF\n",
+		cl_ntoh64(p_subn->sm_port_guid),
+		sm_state_str(p_subn->sm_state),
+		p_subn->opt.sm_priority);
+
+	CL_PLOCK_ACQUIRE(p_osm->sm.p_lock);
+	p_rsm = (osm_remote_sm_t *) cl_qmap_head(&p_subn->sm_guid_tbl);
+	while (p_rsm != (osm_remote_sm_t *) cl_qmap_end(&p_subn->sm_guid_tbl)) {
+		fprintf(out, "   0x%" PRIx64 " %s %d\n",
+			cl_ntoh64(p_rsm->smi.guid),
+			sm_state_str(ib_sminfo_get_state(&p_rsm->smi)),
+			ib_sminfo_get_priority(&p_rsm->smi));
+		p_rsm = (osm_remote_sm_t *) cl_qmap_next(&p_rsm->map_item);
+	}
+	CL_PLOCK_RELEASE(p_osm->sm.p_lock);
+}
+
 static void print_status(osm_opensm_t * p_osm, FILE * out)
 {
 	cl_list_item_t *item;
@@ -332,6 +359,8 @@ static void print_status(osm_opensm_t * p_osm, FILE * out)
 		fprintf(out, "   OpenSM Version       : %s\n", p_osm->osm_version);
 		fprintf(out, "   SM State             : %s\n",
 			sm_state_str(p_osm->subn.sm_state));
+		fprintf(out, "   SM Priority          : %d\n",
+			p_osm->subn.opt.sm_priority);
 		fprintf(out, "   SA State             : %s\n",
 			sa_state_str(p_osm->sa.state));
 		fprintf(out, "   Routing Engine       : %s\n",
@@ -391,6 +420,7 @@ static void print_status(osm_opensm_t * p_osm, FILE * out)
 			p_osm->subn.in_sweep_hop_0,
 			p_osm->subn.first_time_master_sweep,
 			p_osm->subn.coming_out_of_standby);
+		dump_sms(p_osm, out);
 		fprintf(out, "\n");
 		cl_plock_release(&p_osm->lock);
 	}
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 05:55:04 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 08:55:04 -0500
Subject: [ofa-general] [PATCH] opensm/osm_console.c: Eliminate some
	extraneous parentheses
Message-ID: <20090214135504.GD25402@comcast.net>


>From 8d6c1b61e43059ed80885131c0bbce51baf4eddf Mon Sep 17 00:00:00 2001
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Fri, 13 Feb 2009 10:35:39 -0500
Subject: [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_console.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index f06eb52..fe5994b 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -381,8 +381,8 @@ static void print_status(osm_opensm_t * p_osm, FILE * out)
 
 #ifdef ENABLE_OSM_PERF_MGR
 		fprintf(out, "\n   PerfMgr state/sweep state : %s/%s\n",
-			osm_perfmgr_get_state_str(&(p_osm->perfmgr)),
-			osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr)));
+			osm_perfmgr_get_state_str(&p_osm->perfmgr),
+			osm_perfmgr_get_sweep_state_str(&p_osm->perfmgr));
 #endif
 		fprintf(out, "\n   MAD stats\n"
 			"   ---------\n"
@@ -1135,26 +1135,26 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 	p_cmd = next_token(p_last);
 	if (p_cmd) {
 		if (strcmp(p_cmd, "enable") == 0) {
-			osm_perfmgr_set_state(&(p_osm->perfmgr),
+			osm_perfmgr_set_state(&p_osm->perfmgr,
 					      PERFMGR_STATE_ENABLED);
 		} else if (strcmp(p_cmd, "disable") == 0) {
-			osm_perfmgr_set_state(&(p_osm->perfmgr),
+			osm_perfmgr_set_state(&p_osm->perfmgr,
 					      PERFMGR_STATE_DISABLE);
 		} else if (strcmp(p_cmd, "clear_counters") == 0) {
-			osm_perfmgr_clear_counters(&(p_osm->perfmgr));
+			osm_perfmgr_clear_counters(&p_osm->perfmgr);
 		} else if (strcmp(p_cmd, "dump_counters") == 0) {
 			p_cmd = next_token(p_last);
 			if (p_cmd && (strcmp(p_cmd, "mach") == 0)) {
-				osm_perfmgr_dump_counters(&(p_osm->perfmgr),
+				osm_perfmgr_dump_counters(&p_osm->perfmgr,
 							  PERFMGR_EVENT_DB_DUMP_MR);
 			} else {
-				osm_perfmgr_dump_counters(&(p_osm->perfmgr),
+				osm_perfmgr_dump_counters(&p_osm->perfmgr,
 							  PERFMGR_EVENT_DB_DUMP_HR);
 			}
 		} else if (strcmp(p_cmd, "print_counters") == 0) {
 			p_cmd = next_token(p_last);
 			if (p_cmd) {
-				osm_perfmgr_print_counters(&(p_osm->perfmgr),
+				osm_perfmgr_print_counters(&p_osm->perfmgr,
 							   p_cmd, out);
 			} else {
 				fprintf(out,
@@ -1164,7 +1164,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 			p_cmd = next_token(p_last);
 			if (p_cmd) {
 				uint16_t time_s = atoi(p_cmd);
-				osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr),
+				osm_perfmgr_set_sweep_time_s(&p_osm->perfmgr,
 							     time_s);
 			} else {
 				fprintf(out,
@@ -1179,9 +1179,9 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 			"sweep state             : %s\n"
 			"sweep time              : %us\n"
 			"outstanding queries/max : %d/%u\n",
-			osm_perfmgr_get_state_str(&(p_osm->perfmgr)),
-			osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr)),
-			osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)),
+			osm_perfmgr_get_state_str(&p_osm->perfmgr),
+			osm_perfmgr_get_sweep_state_str(&p_osm->perfmgr),
+			osm_perfmgr_get_sweep_time_s(&p_osm->perfmgr),
 			p_osm->perfmgr.outstanding_queries,
 			p_osm->perfmgr.max_outstanding_queries);
 	}
-- 
1.5.6.4


From hal.rosenstock at gmail.com  Sat Feb 14 06:08:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 14 Feb 2009 09:08:36 -0500
Subject: Re: [ofa-general] [PATCH] libibmad/rpc.c: In
	mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response
In-Reply-To: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>
References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
	<15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>
Message-ID: <f0e08f230902140608ma96d0acuca6b7d6665527e34@mail.gmail.com>

Or,

On Fri, Feb 6, 2009 at 6:02 PM, Or Gerlitz <or.gerlitz at gmail.com> wrote:
> On Fri, Feb 6, 2009 at 1:47 AM, Hal Rosenstock
> <halr at obsidianresearch.com> wrote:
>> Sasha,
>> This patch sets the attribute ID based on what is in the response.
>
> Hal,
>
> Your patches can't really be reviewed when being sent as attachment,

Yes, it is more work for the reviewer in this case.

> any reason not
> to send them embedded within the email message?

Sendmail is just more fun than one should be allowed to have. FWIW, I
think I have this resolved now but we'll see...

-- Hal

> Or.


From sashak at voltaire.com  Sat Feb 14 07:25:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:25:33 +0200
Subject: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
	<20090212200025.GC14416@sashak.voltaire.com>
	<f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
Message-ID: <20090214152533.GG14416@sashak.voltaire.com>

Hi Hal,

On 19:41 Thu 12 Feb     , Hal Rosenstock wrote:
> >
> > It is already supplied by libibumad - by umad_get_ca()
> > (ca.ports[i]->pkeys). I think you just need to copy this to
> > ib_port_attr_t structure.
> 
> Yes but rather than using supplied pointers (as inputs for the per
> port pkey/guid tables), the other vendor layers require a large enough
> buffer for these tables and set the port pointers appropriately (on
> output) rather than supplying these pointers as input parameters. So
> if we use these as input, then we definitely break the other vendor
> layers.

Ok, if you already have a usage example, this is even simpler - just
alloc mem and copy the pkey table.
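
Something along the lines of this sketch, in plain C. dup_pkey_table is
a hypothetical helper name, and the source pointer and entry count are
assumed to come from the umad CA port data; the caller owns the copy
and frees it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a buffer and copy 'num' 16-bit pkeys into it.
 * Returns NULL on allocation failure or when there is nothing to copy.
 */
static uint16_t *dup_pkey_table(const uint16_t *src, unsigned int num)
{
	uint16_t *dst;

	if (!src || !num)
		return NULL;
	dst = malloc(num * sizeof(*dst));
	if (!dst)
		return NULL;
	memcpy(dst, src, num * sizeof(*dst));
	return dst;
}
```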

Sasha


From sashak at voltaire.com  Sat Feb 14 07:26:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:26:18 +0200
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <f0e08f230902130358g23e4d8ddqf896ab24eb97390d@mail.gmail.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
	<20090212200025.GC14416@sashak.voltaire.com>
	<f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
	<f0e08f230902130358g23e4d8ddqf896ab24eb97390d@mail.gmail.com>
Message-ID: <20090214152618.GH14416@sashak.voltaire.com>

On 06:58 Fri 13 Feb     , Hal Rosenstock wrote:
> 
> Another choice is to ifdef these differences between Linux and Windows
> at least until umad is used there.

Please try to avoid #ifdef(s).

Sasha


From sashak at voltaire.com  Sat Feb 14 07:28:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:28:04 +0200
Subject: [ofa-general] [ib-mgmt] ibdiag_common.h question
In-Reply-To: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com>
References: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com>
Message-ID: <20090214152804.GI14416@sashak.voltaire.com>

Hi Sean,

On 16:56 Thu 12 Feb     , Sean Hefty wrote:
> I noticed the following in ibdiag_common.h:
> 
> #define	DEBUG	if (ibdebug || ibverbose) IBWARN
> #define	VERBOSE	if (ibdebug || ibverbose > 1) IBWARN
> 
> This allows for else statements to mismatch when defined.

Sure, we can wrap it with 'do { ... } while (0)'.
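
A small self-contained illustration of the mismatch, in plain C. The
names here (LOG_BAD, LOG_OK, log_enabled, note) are made up for the
example and are not the ibdiag ones.

```c
/* Hypothetical stand-ins for ibdebug/ibverbose and IBWARN. */
static int log_enabled;
static int notes;

static void note(const char *msg)
{
	(void)msg;
	notes++;
}

/* Bare 'if' form: a following 'else' binds to the macro's hidden 'if'. */
#define LOG_BAD	if (log_enabled) note

/* Wrapped form: the macro expands to one complete statement. */
#define LOG_OK(msg) do { \
	if (log_enabled) note(msg); \
} while (0)

static int else_taken_bad(int cond)
{
	int else_taken = 0;

	if (cond)
		LOG_BAD("cond is true");
	else	/* binds to the hidden 'if (log_enabled)', not 'if (cond)' */
		else_taken = 1;
	return else_taken;
}

static int else_taken_ok(int cond)
{
	int else_taken = 0;

	if (cond)
		LOG_OK("cond is true");
	else	/* binds to 'if (cond)' as written */
		else_taken = 1;
	return else_taken;
}
```

With log_enabled set to 0, else_taken_bad(1) runs the else branch even
though cond is true, and else_taken_bad(0) skips it; else_taken_ok
behaves the way the source reads. Compilers may warn about the
ambiguous 'else' in the bad variant.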

Sasha


From sashak at voltaire.com  Sat Feb 14 07:37:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:37:34 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/common: wrap debug macros
	with do {} while (0)
In-Reply-To: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
Message-ID: <20090214153734.GJ14416@sashak.voltaire.com>


Wrap debug macros which use 'if () {}' with 'do { .. } while (0)' to
prevent potential 'else' statement mismatching. Also use portable
__VA_ARGS__ macro.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/include/ibdiag_common.h |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h
index 4783b8e..52fd147 100644
--- a/infiniband-diags/include/ibdiag_common.h
+++ b/infiniband-diags/include/ibdiag_common.h
@@ -50,9 +50,13 @@ extern int ibd_timeout;
 /*========================================================*/
 
 #undef DEBUG
-#define	DEBUG	if (ibdebug || ibverbose) IBWARN
-#define	VERBOSE	if (ibdebug || ibverbose > 1) IBWARN
-#define IBERROR(fmt, args...)	iberror(__FUNCTION__, fmt, ## args)
+#define DEBUG(fmt, ...) do { \
+	if (ibdebug || ibverbose) IBWARN(fmt, ## __VA_ARGS__); \
+} while (0)
+#define VERBOSE(fmt, ...) do { \
+	if (ibdebug || ibverbose > 1) IBWARN(fmt, ## __VA_ARGS__); \
+} while (0)
+#define IBERROR(fmt, ...) iberror(__FUNCTION__, fmt, ## __VA_ARGS__)
 
 struct ibdiag_opt {
 	const char *name;
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sat Feb 14 07:40:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:40:45 +0200
Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
Message-ID: <20090214154045.GK14416@sashak.voltaire.com>

On 10:39 Fri 13 Feb     , Sean Hefty wrote:
> >diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-
> >diags/src/ibdiag_common.c
> >index bda1efa..154e00c 100644
> >--- a/infiniband-diags/src/ibdiag_common.c
> >+++ b/infiniband-diags/src/ibdiag_common.c
> >@@ -43,15 +43,14 @@
> > #include <stdlib.h>
> > #include <stdarg.h>
> > #include <sys/types.h>
> >-#include <unistd.h>
> > #include <ctype.h>
> >-#include <config.h>
> > #include <getopt.h>
> >
> > #include <infiniband/umad.h>
> > #include <infiniband/mad.h>
> > #include <ibdiag_common.h>
> > #include <ibdiag_version.h>
> >+#include "ibdiag_osd.h"
> 
> I think it'll be easier to just put this include in ibdiag_common.h...

What about adding inttypes.h and unistd.h files to the WinOF tree? They
could be wrappers similar to ibdiag_osd.h.

Sasha

> 
> >diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
> >index e96c782..7767668 100644
> >--- a/infiniband-diags/src/sminfo.c
> >+++ b/infiniband-diags/src/sminfo.c
> >@@ -37,14 +37,13 @@
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> >-#include <unistd.h>
> >-#include <inttypes.h>
> > #include <getopt.h>
> >
> > #include <infiniband/umad.h>
> > #include <infiniband/mad.h>
> >
> > #include "ibdiag_common.h"
> >+#include "ibdiag_osd.h"
> 
> ...and avoid adding it to all the source files.  I'll update my patches, but
> wait for comments against this patch before re-submitting.
> 
> - Sean
> 


From sashak at voltaire.com  Sat Feb 14 07:56:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:56:01 +0200
Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
Message-ID: <20090214155601.GL14416@sashak.voltaire.com>

On 23:21 Thu 12 Feb     , Sean Hefty wrote:
> Allow sminfo to build and run on both Linux and Windows.  Windows
> build files are maintained in the WinOF repository.  These changes
> allow dropping the infiniband-diags into the WinOF build environment.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
> Would there be any objection to including the windows source files (.c and .h)
> in the mgmt tree?

Which files? Basically I prefer to not have unrelated things in my tree,
but let's see specific needs.

> 
>  infiniband-diags/Makefile.am                |    2 +
>  infiniband-diags/include/ibdiag_common.h    |    2 +
>  infiniband-diags/include/linux/ibdiag_osd.h |   43 +++++++++++++++++++++++++++
>  infiniband-diags/src/ibdiag_common.c        |   13 ++++----
>  infiniband-diags/src/sminfo.c               |   15 ++++-----
>  5 files changed, 58 insertions(+), 17 deletions(-)
> 
> diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
> index f9cc5bd..0d32abd 100644
> --- a/infiniband-diags/Makefile.am
> +++ b/infiniband-diags/Makefile.am
> @@ -1,5 +1,5 @@
>  
> -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
> +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband -I$(srcdir)/include/linux
>  
>  if DEBUG
>  DBGFLAGS = -ggdb -D_DEBUG_
> diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h
> index 4783b8e..2dea873 100644
> --- a/infiniband-diags/include/ibdiag_common.h
> +++ b/infiniband-diags/include/ibdiag_common.h
> @@ -52,7 +52,7 @@ extern int ibd_timeout;
>  #undef DEBUG
>  #define	DEBUG	if (ibdebug || ibverbose) IBWARN
>  #define	VERBOSE	if (ibdebug || ibverbose > 1) IBWARN
> -#define IBERROR(fmt, args...)	iberror(__FUNCTION__, fmt, ## args)
> +#define IBERROR(fmt, ...)	iberror(__FUNCTION__, fmt, ## __VA_ARGS__)
>  
>  struct ibdiag_opt {
>  	const char *name;
> diff --git a/infiniband-diags/include/linux/ibdiag_osd.h b/infiniband-diags/include/linux/ibdiag_osd.h
> new file mode 100644
> index 0000000..5c6faa9
> --- /dev/null
> +++ b/infiniband-diags/include/linux/ibdiag_osd.h
> @@ -0,0 +1,43 @@
> +/*
> + * Copyright (c) 2009 Intel Corp, Inc.  All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#ifndef _IBDIAG_OSD_H_
> +#define _IBDIAG_OSD_H_
> +
> +#include <unistd.h>
> +#include <inttypes.h>
> +#include <config.h>
> +
> +#define CDECL
> +
> +#endif /* _IBDIAG_OSD_H_ */
> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
> index bda1efa..154e00c 100644
> --- a/infiniband-diags/src/ibdiag_common.c
> +++ b/infiniband-diags/src/ibdiag_common.c
> @@ -43,15 +43,14 @@
>  #include <stdlib.h>
>  #include <stdarg.h>
>  #include <sys/types.h>
> -#include <unistd.h>
>  #include <ctype.h>
> -#include <config.h>
>  #include <getopt.h>
>  
>  #include <infiniband/umad.h>
>  #include <infiniband/mad.h>
>  #include <ibdiag_common.h>
>  #include <ibdiag_version.h>
> +#include "ibdiag_osd.h"

Wouldn't it be easier (at least for linux developers :)) to put such
files under the winof tree instead of filtering out pretty standard
header files? (That includes config.h: this file is generated by
autotools and, as far as I can see, it is not used in WinOF, so it
should be easy to keep it as an "osd" file.)

>  
>  int ibdebug;
>  int ibverbose;
> @@ -204,7 +203,7 @@ static const struct ibdiag_opt common_opts[] = {
>  	{ "usage", 'u', 0, NULL, "usage message" },
>  	{ "help", 'h', 0, NULL, "help message" },
>  	{ "version", 'V', 0, NULL, "show version" },
> -	{}
> +	{ 0 }
>  };
>  
>  static void make_opt(struct option *l, const struct ibdiag_opt *o,
> @@ -254,11 +253,11 @@ static struct option *make_long_opts(const char *exclude_str,
>  
>  static void make_str_opts(const struct option *o, char *p, unsigned size)
>  {
> -	int i, n = 0;
> +	unsigned i, n = 0;
>  
>  	for (n = 0; o->name  && n + 2 + o->has_arg < size; o++) {
> -		p[n++] = o->val;
> -		for (i = 0; i < o->has_arg; i++)
> +		p[n++] = (char) o->val;
> +		for (i = 0; i < (unsigned) o->has_arg; i++)
>  			p[n++] = ':';
>  	}
>  	p[n] = '\0';
> @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt,
>  	char str_opts[1024];
>  	const struct ibdiag_opt *o;
>  
> -	memset(opts_map, 0, sizeof(opts_map));
> +	memset((void *) opts_map, 0, sizeof(opts_map));

Hmm, why is this cast needed?

>  
>  	prog_name = argv[0];
>  	prog_args = usage_args;
> diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
> index e96c782..7767668 100644
> --- a/infiniband-diags/src/sminfo.c
> +++ b/infiniband-diags/src/sminfo.c
> @@ -37,14 +37,13 @@
>  
>  #include <stdio.h>
>  #include <stdlib.h>
> -#include <unistd.h>
> -#include <inttypes.h>
>  #include <getopt.h>
>  
>  #include <infiniband/umad.h>
>  #include <infiniband/mad.h>
>  
>  #include "ibdiag_common.h"
> +#include "ibdiag_osd.h"
>  
>  static uint8_t sminfo[1024];
>  
> @@ -59,10 +58,10 @@ enum {
>  };
>  
>  char *statestr[] = {
> -	[SMINFO_NOTACT] "SMINFO_NOTACT",
> -	[SMINFO_DISCOVER] "SMINFO_DISCOVER",
> -	[SMINFO_STANDBY] "SMINFO_STANDBY",
> -	[SMINFO_MASTER] "SMINFO_MASTER",
> +	"SMINFO_NOTACT",
> +	"SMINFO_DISCOVER",
> +	"SMINFO_STANDBY",
> +	"SMINFO_MASTER",
>  };
>  
>  #define STATESTR(s)	(((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???")
> @@ -88,7 +87,7 @@ static int process_opt(void *context, int ch, char *optarg)
>  	return 0;
>  }
>  
> -int main(int argc, char **argv)
> +int CDECL main(int argc, char **argv)

Would compiler flag /Gd do the same without code modification?

(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx)
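
For illustration, the CDECL macro approach from the patch could look
roughly like the sketch below. This is not the WinOF header: the _WIN32
test and the cmp_int()/sort_ints() helpers are assumptions made up for
the example. It uses a qsort() callback instead of main(), since a
callback is another place where the calling convention has to match what
the C runtime expects once the compiler default is changed.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: on Windows the macro expands to __cdecl; elsewhere it is
 * empty, as in the linux/ibdiag_osd.h from the patch. */
#if defined(_WIN32)
#define CDECL __cdecl
#else
#define CDECL
#endif

/* Hypothetical comparator: qsort() calls it back using the C convention. */
static int CDECL cmp_int(const void *a, const void *b)
{
	int x = *(const int *) a;
	int y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Hypothetical helper: sorts n ints in ascending order. */
void sort_ints(int *v, size_t n)
{
	assert(v != NULL || n == 0);
	qsort(v, n, sizeof(*v), cmp_int);
}
```

If /Gd (or whatever the build uses to select the default convention)
can be applied to the whole build, the annotation becomes unnecessary,
which is the point of the question above.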

Sasha

>  {
>  	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
>  	int mod = 0;
> @@ -100,7 +99,7 @@ int main(int argc, char **argv)
>  		{ "state", 's', 1, "<0-3>", "set SM state"},
>  		{ "priority", 'p', 1, "<0-15>", "set SM priority"},
>  		{ "activity", 'a', 1, NULL, "set activity count"},
> -		{ }
> +		{ 0 }
>  	};
>  	char usage_args[] = "<sm_lid|sm_dr_path> [modifier]";
>  
> 
> 
> 


From dotanba at gmail.com  Sat Feb 14 07:53:14 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Sat, 14 Feb 2009 17:53:14 +0200
Subject: Re: [ofa-general] troubleshooting with infinband
In-Reply-To: <4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com>
References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com>	
	<4996717C.8000005@gmail.com>
	<4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com>
Message-ID: <4996E8EA.1000102@gmail.com>

Vittorio wrote:
> thanks for the suggestion, but i can't understand which kind of 
> address i should put for the two commands
> i tried ibping with the server (like suggested) and it works with -G 
> <port> or with lid
>
> but what should i put as argument of ibv_rc_pingpong and rping?
>
> thanks a lot
> Vittorio
Both of them have man pages, so you can check them out.

In ibv_rc_pingpong:
Server side: ibv_rc_pingpong
Client: ibv_rc_pingpong <servers_ip>

Sorry, but I don't remember the rping parameters ...

Dotan


From sashak at voltaire.com  Sat Feb 14 07:58:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 17:58:17 +0200
Subject: [ofa-general] Re: [ibmad] libibmad: add MAD_EXPORT to exported calls
In-Reply-To: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com>
References: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com>
Message-ID: <20090214155817.GM14416@sashak.voltaire.com>

On 23:31 Thu 12 Feb     , Sean Hefty wrote:
> From: Stan Smith <stan.smith at intel.com>
> 
> ibtracert and ibroute need xdump and smp_query_via exported
> from the library.  Add MAD_EXPORT to the calls for Windows support.
> 
> Signed-off-by: Stan Smith <stan.smith at intel.com>
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:11:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:11:11 +0200
Subject: [ofa-general] Re: [PATCH] opensm: fix structure definition for trap
	257-258
In-Reply-To: <1234553462.3948.31.camel@chromite.mv.qlogic.com>
References: <1234553462.3948.31.camel@chromite.mv.qlogic.com>
Message-ID: <20090214161111.GN14416@sashak.voltaire.com>

On 11:31 Fri 13 Feb     , Ralph Campbell wrote:
> I was looking at a structure definition for trap messages in the opensm
> code and noticed this minor bug.
> Here is a patch to correct the problem.
> 
> Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:22:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:22:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers
	for some OSM_LOG prin
In-Reply-To: <20090214135139.GA25402@comcast.net>
References: <20090214135139.GA25402@comcast.net>
Message-ID: <20090214162254.GO14416@sashak.voltaire.com>

Hi Hal,

On 08:51 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001

Please don't put this line ("From ...") in the patch message body - it
marks the start of a message in the mbox file format and breaks things
like 'git rebase' and similar. (At least mask this line with a '>'
character.)
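
A minimal sketch of the masking mentioned above, assuming the simple
convention where only lines beginning with "From " (with a trailing
space) are treated as separators. The function names are invented for
the example and are not part of git or any mail tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns nonzero if a reader of mbox files would take this body line
 * for the start of a new message. */
int looks_like_mbox_separator(const char *line)
{
	assert(line != NULL);
	return strncmp(line, "From ", 5) == 0;
}

/* Copies line into out, prefixing it with '>' when it would otherwise
 * look like a separator. Returns 0 on success, -1 if out is too small. */
int quote_from_line(const char *line, char *out, size_t size)
{
	int n = snprintf(out, size, "%s%s",
			 looks_like_mbox_separator(line) ? ">" : "", line);

	return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

A header-style line such as "From: Hal Rosenstock" is left alone, since
its fifth character is ':' and not a space.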

> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Tue, 10 Feb 2009 07:14:32 -0500
> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints

Actually there is no reason to repeat the email headers in a commit message.

> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:31:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:31:55 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Add port counters to
	__osm_disp_msg_str
In-Reply-To: <20090214135308.GB25402@comcast.net>
References: <20090214135308.GB25402@comcast.net>
Message-ID: <20090214163155.GP14416@sashak.voltaire.com>

On 08:53 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From d9c17a8251b874c33542a19a51d1332ea3196713 Mon Sep 17 00:00:00 2001
> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Thu, 12 Feb 2009 09:27:46 -0500
> Subject: [PATCH] opensm/osm_helper.c: Add port counters to  __osm_disp_msg_str
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:36:58 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:36:58 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Add list of SMs to
	status command
In-Reply-To: <20090214135409.GC25402@comcast.net>
References: <20090214135409.GC25402@comcast.net>
Message-ID: <20090214163658.GQ14416@sashak.voltaire.com>

On 08:54 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From debc6e1f5bd225449ca897264948b08ccf69de38 Mon Sep 17 00:00:00 2001
> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Fri, 13 Feb 2009 09:49:36 -0500
> Subject: [PATCH] opensm/osm_console.c: Add list of SMs to status command
> 
> Also, add SM priority into status command
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:38:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:38:32 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Eliminate some
	extraneous parentheses
In-Reply-To: <20090214135504.GD25402@comcast.net>
References: <20090214135504.GD25402@comcast.net>
Message-ID: <20090214163832.GR14416@sashak.voltaire.com>

On 08:55 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From 8d6c1b61e43059ed80885131c0bbce51baf4eddf Mon Sep 17 00:00:00 2001
> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Fri, 13 Feb 2009 10:35:39 -0500
> Subject: [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:40:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:40:51 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Add missing command
	in help_perfmgr
In-Reply-To: <20090214135550.GE25402@comcast.net>
References: <20090214135550.GE25402@comcast.net>
Message-ID: <20090214164051.GS14416@sashak.voltaire.com>

On 08:55 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From 7faaf4e757c42a8f57fd5b02f425266f2eb853b2 Mon Sep 17 00:00:00 2001
> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Fri, 13 Feb 2009 13:32:43 -0500
> Subject: [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Feb 14 08:44:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 18:44:59 +0200
Subject: [ofa-general] Re: [PATCH] ibsim/sim_net.c: In new_node,
	fix nodetype in nodeinfo for router nodes
In-Reply-To: <20090214135700.GF25402@comcast.net>
References: <20090214135700.GF25402@comcast.net>
Message-ID: <20090214164459.GT14416@sashak.voltaire.com>

On 08:57 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> From 17350f5a17ec5ec821607aae7bf94a88b84d6e74 Mon Sep 17 00:00:00 2001
> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> Date: Thu, 12 Feb 2009 10:57:20 -0500
> Subject: [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Sat Feb 14 09:03:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 14 Feb 2009 12:03:12 -0500
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error 
	numbers for some OSM_LOG prin
In-Reply-To: <20090214162254.GO14416@sashak.voltaire.com>
References: <20090214135139.GA25402@comcast.net>
	<20090214162254.GO14416@sashak.voltaire.com>
Message-ID: <f0e08f230902140903x5b8095f2t3bf538059ebf8d2a@mail.gmail.com>

Hi Sasha,

On Sat, Feb 14, 2009 at 11:22 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 08:51 Sat 14 Feb     , hnrose at comcast.net wrote:
>>
>> From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001
>
> Please don't put this line ("From ...") in patch message body - it marks
> start of message in mbox file format and breaks things like 'git rebase'
> and similar. (At least mask this line with '> ' character).

Looks to me like it was >From but I'll try to remember to strip this.

>> From: Hal Rosenstock <hal.rosenstock at gmail.com>
>> Date: Tue, 10 Feb 2009 07:14:32 -0500
>> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints
>
> Actually there is no reason to repeat email header in a commit message.

So you just want the email subject and that stripped from the commit log ?

-- Hal

>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>
> Applied. Thanks.
>
> Sasha


From sashak at voltaire.com  Sat Feb 14 09:46:22 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 19:46:22 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error
	numbers for some OSM_LOG prin
In-Reply-To: <f0e08f230902140903x5b8095f2t3bf538059ebf8d2a@mail.gmail.com>
References: <20090214135139.GA25402@comcast.net>
	<20090214162254.GO14416@sashak.voltaire.com>
	<f0e08f230902140903x5b8095f2t3bf538059ebf8d2a@mail.gmail.com>
Message-ID: <20090214174622.GU14416@sashak.voltaire.com>

On 12:03 Sat 14 Feb     , Hal Rosenstock wrote:
> >
> > Please don't put this line ("From ...") in patch message body - it marks
> > start of message in mbox file format and breaks things like 'git rebase'
> > and similar. (At least mask this line with '> ' character).
> 
> Looks to me like it was >From but I'll try to remember to strip this.

I added '>' to 'From ...' by hand during commit using 'git commit --amend'
(for each patch).

> 
> >> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> >> Date: Tue, 10 Feb 2009 07:14:32 -0500
> >> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints
> >
> > Actually there is no reason to repeat email header in a commit message.
> 
> So you just want the email subject and that stripped from the commit log ?

Normally the email subject is used as the patch description and the email
body up to the '---' line as the commit message. You can put any text
which is not part of the commit message under '---' and before the
diffstat lines.

You may want to look at

http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches;hb=HEAD

(or the similar document in the kernel source tree) for more detailed explanations.

Sasha


From sashak at voltaire.com  Sat Feb 14 10:05:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 20:05:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error
	numbers for some OSM_LOG prin
In-Reply-To: <20090214174622.GU14416@sashak.voltaire.com>
References: <20090214135139.GA25402@comcast.net>
	<20090214162254.GO14416@sashak.voltaire.com>
	<f0e08f230902140903x5b8095f2t3bf538059ebf8d2a@mail.gmail.com>
	<20090214174622.GU14416@sashak.voltaire.com>
Message-ID: <20090214180554.GV14416@sashak.voltaire.com>

On 19:46 Sat 14 Feb     , Sasha Khapyorsky wrote:
> > 
> > >> From: Hal Rosenstock <hal.rosenstock at gmail.com>
> > >> Date: Tue, 10 Feb 2009 07:14:32 -0500
> > >> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints
> > >
> > > Actually there is no reason to repeat email header in a commit message.
> > 
> > So you just want the email subject and that stripped from the commit log ?
> 
> Normally email subject is used as patch description and email up to '---'
> line as commit message. You can put any text which is not part of
> commit message under '---' and before diffstat lines.

And if you need to change the patch authorship, put the line:

From: Author Name <author at email.address>

(with ":") as the first non-empty line of the email message body.

Sasha

> 
> You may want to look at
> 
> http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches;hb=HEAD
> 
> (or similar paper in kernel source tree) for more detailed explanations.
> 
> Sasha


From sean.hefty at intel.com  Sat Feb 14 10:21:20 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 10:21:20 -0800
Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <20090214154045.GK14416@sashak.voltaire.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
	<20090214154045.GK14416@sashak.voltaire.com>
Message-ID: <53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com>

>> >+#include "ibdiag_osd.h"
>>
>> I think it'll be easier to just put this include in ibdiag_common.h...
>
>What about to add files inttypes.h and unistd.h in winof tree? It could
>be wrapper similars to ibdiag_osd.h.

That could be done.  The files would just be empty.  As a thought, if you think
of the porting going the reverse direction, would you want to add a windows.h to
the linux side?

- Sean


From sean.hefty at intel.com  Sat Feb 14 10:40:57 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 10:40:57 -0800
Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <20090214155601.GL14416@sashak.voltaire.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
Message-ID: <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>

>> Would there be any objection to including the windows source files (.c and .h)
>> in the mgmt tree?
>
>Which files? Basically I prefer to not have unrelated things in my tree,
>but let's see specific needs.

So far, I have windows/ibdiag_osd.h, ibdiag_windows.c, and
windows/cl_nodenamemap.h.

My goal is to have the ib-diags support both Windows and Linux, so Windows files
are related in that respect.  Making an exception for the build files is
reasonable IMO, given the WinOF build environment.

>> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
>> index bda1efa..154e00c 100644
>> --- a/infiniband-diags/src/ibdiag_common.c
>> +++ b/infiniband-diags/src/ibdiag_common.c
>> @@ -43,15 +43,14 @@
>>  #include <stdlib.h>
>>  #include <stdarg.h>
>>  #include <sys/types.h>
>> -#include <unistd.h>
>>  #include <ctype.h>
>> -#include <config.h>
>>  #include <getopt.h>
>>
>>  #include <infiniband/umad.h>
>>  #include <infiniband/mad.h>
>>  #include <ibdiag_common.h>
>>  #include <ibdiag_version.h>
>> +#include "ibdiag_osd.h"
>
>Wouldn't it be easier (at least for linux developers :)) instead
>of filtering out pretty standard header files to put such files under
>winof tree? (Including config.h, this file is generated by autotools,
>as far as I could see it is not used in WinOF, so it should be easy to
>keep this as "osd" file).

unistd.h is an 'osd' type file, so I think it makes more sense to isolate it to
an osd related area.  But if you really prefer, I can abstract these.  (Windows
provides an errno.h file, so at least there's some precedent.)

>> @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt,
>>       char str_opts[1024];
>>       const struct ibdiag_opt *o;
>>
>> -     memset(opts_map, 0, sizeof(opts_map));
>> +     memset((void *) opts_map, 0, sizeof(opts_map));
>
>Hmm, why is this casting needed?

opts_map is declared as const - (i.e. my compiler whined at me)

>> -int main(int argc, char **argv)
>> +int CDECL main(int argc, char **argv)
>
>Would compiler flag /Gd do the same without code modification?
>
>(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx)

I'll see if I can get this to work.  My quick test gave me compiler option
conflicts, so I'll have to look into this more. 

- Sean


From sean.hefty at intel.com  Sat Feb 14 10:46:51 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 10:46:51 -0800
Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <20090214154045.GK14416@sashak.voltaire.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
	<20090214154045.GK14416@sashak.voltaire.com>
Message-ID: <D6E5B44CACCA4B08ABBF41EA16803F59@amr.corp.intel.com>

>What about to add files inttypes.h and unistd.h in winof tree? It could
>be wrapper similars to ibdiag_osd.h.

One advantage of using your approach is that the source files end up only
including those headers that they need.  Moving everything into ibdiag_osd.h
means that the source files pick up other includes.  Anyway, just let me know
your preference, and I'll update the patches.

- Sean


From sashak at voltaire.com  Sat Feb 14 11:02:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 21:02:28 +0200
Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
	<20090214154045.GK14416@sashak.voltaire.com>
	<53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com>
Message-ID: <20090214190228.GW14416@sashak.voltaire.com>

On 10:21 Sat 14 Feb     , Sean Hefty wrote:
> >> >+#include "ibdiag_osd.h"
> >>
> >> I think it'll be easier to just put this include in ibdiag_common.h...
> >
> >What about to add files inttypes.h and unistd.h in winof tree? It could
> >be wrapper similars to ibdiag_osd.h.
> 
> That could be done.  The files would just be empty.

The files could be empty or, as an alternative, contain logically related
stuff. (For example inttypes.h can contain the PRI* macro definitions.)
Another (at least hypothetical) advantage of this method is that if WinOF
ever decides to use things like cygwin, the "porting" will be pretty
trivial.
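
As a rough illustration of such a wrapper, the sketch below shows what a
WinOF-side inttypes.h replacement might provide and how common code
would use it unchanged. Everything in the _MSC_VER branch is an
assumption for the example (the version check, the typedef and the
"I64" length modifiers), and print_guid() is a made-up helper, not an
existing infiniband-diags function.

```c
#include <assert.h>
#include <stdio.h>

#if defined(_MSC_VER) && _MSC_VER < 1800
/* Hypothetical contents of a winof-tree inttypes.h for older compilers
 * that ship without one. */
typedef unsigned __int64 uint64_t;
#define PRIx64 "I64x"
#define PRIu64 "I64u"
#else
#include <inttypes.h>	/* the real header; it also pulls in stdint.h */
#endif

/* Common code: prints a 64-bit GUID as 0x followed by 16 hex digits and
 * returns the number of characters written (as fprintf() does). */
int print_guid(FILE *f, uint64_t guid)
{
	assert(f != NULL);
	return fprintf(f, "0x%016" PRIx64, guid);
}
```

The common source only ever says #include <inttypes.h> and uses PRIx64;
which file satisfies that include is decided by the include path of each
build environment.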

> As a thought, if you think
> of the porting going the reverse direction, would you want to add a windows.h to
> the linux side?

Only if I were to start porting a win-centric project. :)

About windows.h - I guess this file is actually included in all (or
almost all) user space *.c files; assuming so, we can put it in config.h.

Sasha


From sashak at voltaire.com  Sat Feb 14 11:04:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 21:04:10 +0200
Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <D6E5B44CACCA4B08ABBF41EA16803F59@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com>
	<20090214154045.GK14416@sashak.voltaire.com>
	<D6E5B44CACCA4B08ABBF41EA16803F59@amr.corp.intel.com>
Message-ID: <20090214190410.GX14416@sashak.voltaire.com>

On 10:46 Sat 14 Feb     , Sean Hefty wrote:
> 
> One advantage of using your approach is that the source files end up only
> including those headers that it needs.  Moving everything into ibdiag_osd.h
> means that the source files pick up other includes.  Anyway, just let me know
> your preference, and I'll update the patches.

I would prefer to have *nix/posix style files and to minimize the needed
changes in common sources.

Sasha


From sashak at voltaire.com  Sat Feb 14 11:11:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 21:11:01 +0200
Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
Message-ID: <20090214191101.GY14416@sashak.voltaire.com>

On 10:40 Sat 14 Feb     , Sean Hefty wrote:
> >> Would there be any objection to including the windows source files (.c and .h)
> >> in the mgmt tree?
> >
> >Which files? Basically I prefer to not have unrelated things in my tree,
> >but let's see specific needs.
> 
> So far, I have windows/ibdiag_osd.h, ibdiag_windows.c, and
> windows/cl_nodenamemap.h.

Isn't cl_nodenamemap.h part of complib?

> 
> My goal is to have the ib-diags support both Windows and Linux, so Windows files
> are related in that respect.  Making an exception for the build files is
> reasonable IMO, given the WinOF build environment.
> 
> >> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
> >> index bda1efa..154e00c 100644
> >> --- a/infiniband-diags/src/ibdiag_common.c
> >> +++ b/infiniband-diags/src/ibdiag_common.c
> >> @@ -43,15 +43,14 @@
> >>  #include <stdlib.h>
> >>  #include <stdarg.h>
> >>  #include <sys/types.h>
> >> -#include <unistd.h>
> >>  #include <ctype.h>
> >> -#include <config.h>
> >>  #include <getopt.h>
> >>
> >>  #include <infiniband/umad.h>
> >>  #include <infiniband/mad.h>
> >>  #include <ibdiag_common.h>
> >>  #include <ibdiag_version.h>
> >> +#include "ibdiag_osd.h"
> >
> >Wouldn't it be easier (at least for linux developers :)) instead
> >of filtering out pretty standard header files to put such files under
> >winof tree? (Including config.h, this file is generated by autotools,
> >as far as I could see it is not used in WinOF, so it should be easy to
> >keep this as "osd" file).
> 
> unistd.h is an 'osd' type file, so I think it makes more sense to isolate it to
> an osd related area.  But if you really prefer, I can abstract these.  (Windows
> provides an errno.h file, so at least there's some precedence.)
> 
> >> @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt,
> >>       char str_opts[1024];
> >>       const struct ibdiag_opt *o;
> >>
> >> -     memset(opts_map, 0, sizeof(opts_map));
> >> +     memset((void *) opts_map, 0, sizeof(opts_map));
> >
> >Hmm, why is this casting needed?
> 
> opts_map is declared as const - (i.e. my compiler whined at me)

Probably it is reasonable to just drop const then. I don't see what this
const really does.

Sasha

> 
> >> -int main(int argc, char **argv)
> >> +int CDECL main(int argc, char **argv)
> >
> >Would compiler flag /Gd do the same without code modification?
> >
> >(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx)
> 
> I'll see if I can get this to work.  My quick test gave me compiler option
> conflicts, so I'll have to look into this more. 
> 
> - Sean
> 


From sean.hefty at intel.com  Sat Feb 14 11:26:39 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 11:26:39 -0800
Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <20090214191101.GY14416@sashak.voltaire.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
	<20090214191101.GY14416@sashak.voltaire.com>
Message-ID: <C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>

>Isn't cl_nodenamemap.h part of complib?

It's not available in windows.  (Yes, sadly, even the OS abstraction code
doesn't share a common codebase between the two platforms...)  I'm not even sure
nodenamemap is really at the same level of abstraction as other complib items,
but I didn't want to try changing that area of the code at this time.  (It seems
like adding a cl_map_insert_copy() type operation would provide the desired
functionality.)

I guess I can try adding nodenamemap to the windows version of complib for now.
I didn't because I'm not convinced that it should be in complib.

>> opts_map is declared as const - (i.e. my compiler whined at me)
>
>Probably it is reasonable to just drop const then. I don't see what this
>const really does.

If I remember correctly, I tried that and heard a different whine out of the
compiler.  I'll re-examine what the problem was.


From sashak at voltaire.com  Sat Feb 14 12:04:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 14 Feb 2009 22:04:08 +0200
Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
	<20090214191101.GY14416@sashak.voltaire.com>
	<C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>
Message-ID: <20090214200408.GZ14416@sashak.voltaire.com>

On 11:26 Sat 14 Feb     , Sean Hefty wrote:
> >Isn't cl_nodenamemap.h part of complib?
> 
> It's not available in windows.  (Yes, sadly, even the OS abstraction code
> doesn't share a common codebase between the two platforms...)  I'm not even sure
> nodenamemap is really at the same level of abstraction as other complib items,
> but I didn't want to try changing that area of the code at this time.  (It seems
> like adding a cl_map_insert_copy() type operation would provide the desired
> functionality.)
> 
> I guess I can try adding nodenamemap to the windows version of complib for now.
> I didn't because I'm not convinced that it should be in complib.
> 
> >> opts_map is declared as const - (i.e. my compiler whined at me)
> >
> >Probably it is reasonable to just drop const then. I don't see what this
> >const really does.
> 
> If I remember correctly, I tried that and heard a different whine out of the
> compiler.  I'll re-examine what the problem was.

Ok, I'm starting to understand (again :)) why 'const' is there:

static const struct ibdiag_opt *opts_map[256];

and later:

	memset(opts_map, 0, sizeof(opts_map));

opts_map is an array of pointers that should refer to read-only areas;
memset() initializes the array itself. As far as I understand, there
should not be any "const violations".

Sasha
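
A minimal sketch of the distinction described above, assuming a
stand-in struct (ro_opt, ro_map and the helpers are hypothetical names,
not the real ibdiag_opt code): the array slots are ordinary writable
pointers, and only the structs they point at are const.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ibdiag_opt; the real layout is not shown here. */
struct ro_opt {
	const char *name;
	char letter;
};

/* Array of non-const pointers to const structs: slots are writable, targets are not. */
static const struct ro_opt *ro_map[256];

static const struct ro_opt ro_verbose = { "verbose", 'v' };

/* Writing a slot is allowed because the array itself is not const. */
static void ro_register(const struct ro_opt *o)
{
	ro_map[(unsigned char)o->letter] = o;
}

/* Returns NULL when no option was registered for this letter. */
static const struct ro_opt *ro_lookup(char letter)
{
	return ro_map[(unsigned char)letter];
}

/* By contrast, ro_lookup('v')->letter = 'x'; would be rejected by the compiler. */
```

Calling ro_register(&ro_verbose) and then ro_lookup('v') returns the
address of ro_verbose; nothing const is modified along the way.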


From hnrose at comcast.net  Sat Feb 14 12:36:03 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 15:36:03 -0500
Subject: [ofa-general] [PATCH] ibsim/sim_client.c: Eliminate
	unneeded qp param from sim_init
Message-ID: <20090214203603.GC32660@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 umad2sim/sim_client.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
index 3fffd24..59f81d5 100644
--- a/umad2sim/sim_client.c
+++ b/umad2sim/sim_client.c
@@ -202,7 +202,7 @@ static int sim_disconnect(struct sim_client *sc)
 	return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0);
 }
 
-static int sim_init(struct sim_client *sc, int qp, char *nodeid)
+static int sim_init(struct sim_client *sc, char *nodeid)
 {
 	union name_t name;
 	socklen_t size;
@@ -222,8 +222,7 @@ static int sim_init(struct sim_client *sc, int qp, char *nodeid)
 	if (connect_host && *connect_host)
 		remote_mode = 1;
 
-	DEBUG("init client pid=%d, qp=%d nodeid=%s",
-	      pid, qp, nodeid ? nodeid : "none");
+	DEBUG("init client pid=%d, nodeid=%s", pid, nodeid ? nodeid : "none");
 
 	if ((fd = socket(remote_mode ? PF_INET : PF_LOCAL, SOCK_DGRAM, 0)) < 0)
 		IBPANIC("can't get socket (fd)");
@@ -257,7 +256,7 @@ static int sim_init(struct sim_client *sc, int qp, char *nodeid)
 		IBPANIC("can't read data from bound socket");
 	port = ntohs(name.name_i.sin_port);
 
-	sc->clientid = sim_connect(sc, remote_mode ? port : pid, qp, nodeid);
+	sc->clientid = sim_connect(sc, remote_mode ? port : pid, 0, nodeid);
 	if (sc->clientid < 0)
 		IBPANIC("connect failed");
 
@@ -289,7 +288,7 @@ int sim_client_init(struct sim_client *sc)
 	char *nodeid;
 
 	nodeid = getenv("SIM_HOST");
-	if (sim_init(sc, 0, nodeid) < 0)
+	if (sim_init(sc, nodeid) < 0)
 		return -1;
 	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
 		    sizeof(sc->vendor)) < 0)
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 12:37:03 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 15:37:03 -0500
Subject: [ofa-general] [PATCH] ibsim/sim_client.c: In sim_client_init,
	return -1 on error
Message-ID: <20090214203703.GD32660@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 umad2sim/sim_client.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
index 59f81d5..06bb7a8 100644
--- a/umad2sim/sim_client.c
+++ b/umad2sim/sim_client.c
@@ -309,7 +309,7 @@ int sim_client_init(struct sim_client *sc)
   _exit:
 	sim_disconnect(sc);
 	sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1;
-	return 0;
+	return -1;
 }
 
 void sim_client_exit(struct sim_client *sc)
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 12:35:03 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 15:35:03 -0500
Subject: [ofa-general] [PATCH] ibsim: Eliminate unneeded argument
	in sim_client_init
Message-ID: <20090214203503.GB32660@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 umad2sim/sim_client.c |   14 ++++++++------
 umad2sim/sim_client.h |    2 +-
 umad2sim/umad2sim.c   |    3 +--
 3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
index d86de7c..3fffd24 100644
--- a/umad2sim/sim_client.c
+++ b/umad2sim/sim_client.c
@@ -284,19 +284,21 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm)
 	return sim_ctl(sc, SIM_CTL_SET_ISSM, &issm, sizeof(int));
 }
 
-int sim_client_init(struct sim_client *sc, char *nodeid)
+int sim_client_init(struct sim_client *sc)
 {
-	if (!nodeid)
-		nodeid = getenv("SIM_HOST");
+	char *nodeid;
+
+	nodeid = getenv("SIM_HOST");
 	if (sim_init(sc, 0, nodeid) < 0)
 		return -1;
-	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, sizeof(sc->vendor)) <
-	    0)
+	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
+		    sizeof(sc->vendor)) < 0)
 		goto _exit;
 	if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo,
 		    sizeof(sc->nodeinfo)) < 0)
 		goto _exit;
-	sc->portinfo[0] = 0;
+
+	sc->portinfo[0] = 0;	// portno requested
 	if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo,
 		    sizeof(sc->portinfo)) < 0)
 		goto _exit;
diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h
index 605b305..80ed442 100644
--- a/umad2sim/sim_client.h
+++ b/umad2sim/sim_client.h
@@ -47,7 +47,7 @@ struct sim_client {
 };
 
 extern int sim_client_set_sm(struct sim_client *sc, unsigned issm);
-extern int sim_client_init(struct sim_client *sc, char *nodeid);
+extern int sim_client_init(struct sim_client *sc);
 extern void sim_client_exit(struct sim_client *sc);
 
 #endif				/* _SIM_CLIENT_H_ */
diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
index 1236b8f..8d83a24 100644
--- a/umad2sim/umad2sim.c
+++ b/umad2sim/umad2sim.c
@@ -53,7 +53,6 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
-#include <ibsim.h>
 #include <sim_client.h>
 
 #ifdef UMAD2SIM_NOISY_DEBUG
@@ -562,7 +561,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
 	dev->num = num;
 	strncpy(dev->name, name, sizeof(dev->name) - 1);
 
-	if (sim_client_init(&dev->sim_client, NULL) < 0)
+	if (sim_client_init(&dev->sim_client) < 0)
 		goto _error;
 
 	dev->port = mad_get_field(&dev->sim_client.portinfo, 0,
-- 
1.5.6.4


From hnrose at comcast.net  Sat Feb 14 12:37:53 2009
From: hnrose at comcast.net (hnrose at comcast.net)
Date: Sat, 14 Feb 2009 15:37:53 -0500
Subject: [ofa-general] [PATCH] ibsim: Add better end port
	simulation support
Message-ID: <20090214203753.GE32660@comcast.net>


Add SIM_PORT environment variable to allow for end port selection

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 ibsim/ibsim.c         |    6 +-
 include/ibsim.h       |    2 +
 umad2sim/sim_client.c |   49 +++++++++-
 umad2sim/sim_client.h |    4 +-
 umad2sim/umad2sim.c   |  254 ++++++++++++++++++++++++++-----------------------
 5 files changed, 189 insertions(+), 126 deletions(-)

diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
index f48e1f0..6a35fdc 100644
--- a/ibsim/ibsim.c
+++ b/ibsim/ibsim.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -187,7 +188,8 @@ static int sm_exists(Node * node)
 	return 0;
 }
 
-static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from)
+static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl,
+			      union name_t *from)
 {
 	union name_t name;
 	size_t size;
@@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f
 			ctl->type = SIM_CTL_ERROR;
 			return -1;
 		}
-		cl->port = node_get_port(node, 0);
+		cl->port = node_get_port(node, scl->portnum);
 		VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64,
 		     i, node->nodeid, cl->port->portguid);
 	} else {
diff --git a/include/ibsim.h b/include/ibsim.h
index 15fc37c..66ba6f9 100644
--- a/include/ibsim.h
+++ b/include/ibsim.h
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -100,6 +101,7 @@ struct sim_client_info {
 	uint32_t qp;
 	uint32_t issm;		/* accept request for qp 0 & 1 */
 	char nodeid[32];
+	uint32_t portnum;
 };
 
 union name_t {
diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
index 06bb7a8..1c35109 100644
--- a/umad2sim/sim_client.c
+++ b/umad2sim/sim_client.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid)
 	info.id = id;
 	info.issm = 0;
 	info.qp = qp;
+	info.portnum = sc->portnum;
 
 	if (nodeid)
 		strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1);
@@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc)
 	return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0);
 }
 
-static int sim_init(struct sim_client *sc, char *nodeid)
+static int sim_init(struct sim_client *sc, char *nodeid, int portnum)
 {
 	union name_t name;
 	socklen_t size;
@@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid)
 	DEBUG("init %d: opened ctl fd %d as \'%s\'",
 	      pid, ctlfd, get_name(&name));
 
+	sc->portnum = portnum;
 	port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT;
 	size = make_name(&name, connect_host, port, "%s:ctl", socket_basename);
 
@@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm)
 int sim_client_init(struct sim_client *sc)
 {
 	char *nodeid;
+	char *portno;
+	int i, j = 0, portnum = 0, startport = 1, endport;
+	uint8_t numports, nodetype;
+	uint8_t *portinfo;
 
 	nodeid = getenv("SIM_HOST");
-	if (sim_init(sc, nodeid) < 0)
+	portno = getenv("SIM_PORT");
+	if (portno)
+		portnum = atoi(portno);
+
+	if (sim_init(sc, nodeid, portnum) < 0)
 		return -1;
 	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
 		    sizeof(sc->vendor)) < 0)
@@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc)
 	if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo,
 		    sizeof(sc->nodeinfo)) < 0)
 		goto _exit;
+	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
+	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
+	if (nodetype == 2) { // switch
+		startport = 0;
+		endport = 0;
+	} else {
+		if (portnum == 0) {
+			IBWARN("portnum 0 is not valid end port on non switch node");
+			goto _exit;
+		}
+		endport = numports;
+	}
+	if (portnum > endport) {
+		IBWARN("portnum %d is not a valid end port number (%d)",
+		       portnum, endport);
+		goto _exit;
+	}
 
-	sc->portinfo[0] = 0;	// portno requested
-	if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo,
-		    sizeof(sc->portinfo)) < 0)
+	sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1));	// portinfo size x number of ports starting at 0
+	if (!sc->portinfo)
 		goto _exit;
+
+	// loop through end ports
+	for (i = startport; i <= endport ; i++, j++) {
+		portinfo = sc->portinfo + 64 * j;
+		*portinfo = i + 1; // portno requested
+		if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0)
+			goto _exit;
+	}
+
+	// although pkeys also per port, current config same on all end ports
 	if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0)
 		goto _exit;
 	if (getenv("SIM_SET_ISSM"))
@@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc)
 void sim_client_exit(struct sim_client *sc)
 {
 	sim_disconnect(sc);
+	if (sc->portinfo)
+		free(sc->portinfo);
 	sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1;
 }
diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h
index 80ed442..0faca80 100644
--- a/umad2sim/sim_client.h
+++ b/umad2sim/sim_client.h
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -41,8 +42,9 @@ struct sim_client {
 	int clientid;
 	int fd_pktin, fd_pktout, fd_ctl;
 	struct sim_vendor vendor;
+	int portnum;
 	uint8_t nodeinfo[64];
-	uint8_t portinfo[64];
+	uint8_t *portinfo;
 	uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)];
 };
 
diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
index 8d83a24..6e3c269 100644
--- a/umad2sim/umad2sim.c
+++ b/umad2sim/umad2sim.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
 	struct sim_client *sc = &dev->sim_client;
 	char *str;
 	uint8_t *portinfo;
-	int i;
+	char *ports_path_end;
+	int i, j;
+	int startport = 1, endport;
+	uint8_t numports, nodetype;
 
 	/* /sys/class/infiniband_mad/abi_version */
 	snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir);
@@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
 	strncat(path, "/ports", sizeof(path) - 1);
 	make_path(path);
 
-	portinfo = sc->portinfo;
-
-	/* /sys/class/infiniband/mthca0/ports/1/ */
-	val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
-	snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
-	make_path(path);
-
-	/* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */
-	val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
-	file_printf(path, SYS_PORT_LMC, "%d", val);
-
-	/* /sys/class/infiniband/mthca0/ports/1/sm_lid */
-	val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
-	file_printf(path, SYS_PORT_SMLID, "0x%x", val);
-
-	/* /sys/class/infiniband/mthca0/ports/1/sm_sl */
-	val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
-	file_printf(path, SYS_PORT_SMSL, "%d", val);
-
-	/* /sys/class/infiniband/mthca0/ports/1/lid */
-	val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
-	file_printf(path, SYS_PORT_LID, "0x%x", val);
-
-	/* /sys/class/infiniband/mthca0/ports/1/state */
-	val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
-	if (val == 0)
-		str = "NOP";
-	else if (val == 1)
-		str = "DOWN";
-	else if (val == 2)
-		str = "INIT";
-	else if (val == 3)
-		str = "ARMED";
-	else if (val == 4)
-		str = "ACTIVE";
-	else if (val == 5)
-		str = "ACTIVE_DEFER";
-	else
-		str = "<unknown>";
-	file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
-
-	/* /sys/class/infiniband/mthca0/ports/1/phys_state */
-	val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
-	if (val == 1)
-		str = "Sleep";
-	else if (val == 2)
-		str = "Polling";
-	else if (val == 3)
-		str = "Disabled";
-	else if (val == 4)
-		str = "PortConfigurationTraining";
-	else if (val == 5)
-		str = "LinkUp";
-	else if (val == 6)
-		str = "LinkErrorRecovery";
-	else if (val == 7)
-		str = "Phy Test";
-	else
-		str = "<unknown>";
-	file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
-
-	/* /sys/class/infiniband/mthca0/ports/1/rate */
-	val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
-	speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
-	if (val == 1)
-		val = 1;
-	else if (val == 2)
-		val = 4;
-	else if (val == 4)
-		val = 8;
-	else if (val == 8)
-		val = 12;
-	else
-		val = 0;
-	if (speed == 2)
-		str = " DDR";
-	else if (speed == 4)
-		str = " QDR";
-	else
-		str = "";
-	file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
-		    (val * speed * 25) / 10,
-		    (val * speed * 25) % 10 ? ".5" : "", val, str);
-
-	/* /sys/class/infiniband/mthca0/ports/1/cap_mask */
-	val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
-	file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
-
-	/* /sys/class/infiniband/mthca0/ports/1/gids/0 */
-	str = path + strlen(path);
-	strncat(path, "/gids", sizeof(path) - 1);
-	make_path(path);
-	*str = '\0';
-	gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
-	guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) +
-	    mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
-	file_printf(path, SYS_PORT_GID,
-		    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
-		    (uint16_t) ((gid >> 48) & 0xffff),
-		    (uint16_t) ((gid >> 32) & 0xffff),
-		    (uint16_t) ((gid >> 16) & 0xffff),
-		    (uint16_t) ((gid >> 0) & 0xffff),
-		    (uint16_t) ((guid >> 48) & 0xffff),
-		    (uint16_t) ((guid >> 32) & 0xffff),
-		    (uint16_t) ((guid >> 16) & 0xffff),
-		    (uint16_t) ((guid >> 0) & 0xffff));
+	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
+	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
+        if (nodetype == 2) { // switch
+		startport = 0;
+		endport = 0;
+	} else
+		endport = numports;
+
+	ports_path_end = path + strlen(path);
+
+	// loop through end ports
+	for (j = startport; j <= endport; j++) {
+
+		portinfo = sc->portinfo + 64 * j;
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/ */
+		val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
+		snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
+		make_path(path);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/lid_mask_count */
+		val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
+		file_printf(path, SYS_PORT_LMC, "%d", val);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/sm_lid */
+		val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
+		file_printf(path, SYS_PORT_SMLID, "0x%x", val);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/sm_sl */
+		val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
+		file_printf(path, SYS_PORT_SMSL, "%d", val);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/lid */
+		val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
+		file_printf(path, SYS_PORT_LID, "0x%x", val);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/state */
+		val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
+		if (val == 0)
+			str = "NOP";
+		else if (val == 1)
+			str = "DOWN";
+		else if (val == 2)
+			str = "INIT";
+		else if (val == 3)
+			str = "ARMED";
+		else if (val == 4)
+			str = "ACTIVE";
+		else if (val == 5)
+			str = "ACTIVE_DEFER";
+		else
+			str = "<unknown>";
+		file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/phys_state */
+		val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
+		if (val == 1)
+			str = "Sleep";
+		else if (val == 2)
+			str = "Polling";
+		else if (val == 3)
+			str = "Disabled";
+		else if (val == 4)
+			str = "PortConfigurationTraining";
+		else if (val == 5)
+			str = "LinkUp";
+		else if (val == 6)
+			str = "LinkErrorRecovery";
+		else if (val == 7)
+			str = "Phy Test";
+		else
+			str = "<unknown>";
+		file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/rate */
+		val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
+		speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
+		if (val == 1)
+			val = 1;
+		else if (val == 2)
+			val = 4;
+		else if (val == 4)
+			val = 8;
+		else if (val == 8)
+			val = 12;
+		else
+			val = 0;
+		if (speed == 2)
+			str = " DDR";
+		else if (speed == 4)
+			str = " QDR";
+		else
+			str = "";
+		file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
+			    (val * speed * 25) / 10,
+			    (val * speed * 25) % 10 ? ".5" : "", val, str);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/cap_mask */
+		val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
+		file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/gids/0 */
+		str = path + strlen(path);
+		strncat(path, "/gids", sizeof(path) - 1);
+		make_path(path);
+		*str = '\0';
+		gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
+		guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j;
+		file_printf(path, SYS_PORT_GID,
+			    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
+			    (uint16_t) ((gid >> 48) & 0xffff),
+			    (uint16_t) ((gid >> 32) & 0xffff),
+			    (uint16_t) ((gid >> 16) & 0xffff),
+			    (uint16_t) ((gid >> 0) & 0xffff),
+			    (uint16_t) ((guid >> 48) & 0xffff),
+			    (uint16_t) ((guid >> 32) & 0xffff),
+			    (uint16_t) ((guid >> 16) & 0xffff),
+			    (uint16_t) ((guid >> 0) & 0xffff));
+
+		/* /sys/class/infiniband/mthca0/ports/<n>/pkeys/0 */
+		str = path + strlen(path);
+		strncat(path, "/pkeys", sizeof(path) - 1);
+		make_path(path);
+		for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
+			char name[8];
+			snprintf(name, sizeof(name), "%u", i);
+			file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
+		}
+		*str = '\0';
 
-	/* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */
-	str = path + strlen(path);
-	strncat(path, "/pkeys", sizeof(path) - 1);
-	make_path(path);
-	for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
-		char name[8];
-		snprintf(name, sizeof(name), "%u", i);
-		file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
+		*ports_path_end = '\0';
 	}
-	*str = '\0';
 
 	/* /sys/class/infiniband_mad/umad0/ */
 	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
@@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
 	if (sim_client_init(&dev->sim_client) < 0)
 		goto _error;
 
-	dev->port = mad_get_field(&dev->sim_client.portinfo, 0,
-				  IB_PORT_LOCAL_PORT_F);
+	dev->port = dev->sim_client.portnum;
 	for (i = 0; i < arrsize(dev->agents); i++)
 		dev->agents[i].id = (uint32_t)(-1);
 	for (i = 0; i < arrsize(dev->agent_idx); i++)
-- 
1.5.6.4
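
The patch above reads SIM_PORT with atoi(), which cannot tell "0" from
garbage input. A stricter parse could be sketched as follows; the helper
name demo_parse_portnum and its return convention are assumptions for
illustration, not part of ibsim.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal port number from an environment
 * string. Returns 0 when the string is unset or empty (the patch's
 * default), the port number when it is within 0..maxport, and -1 for
 * anything else (trailing junk, negative values, out of range).
 */
static int demo_parse_portnum(const char *s, int maxport)
{
	char *end;
	long v;

	if (!s || !*s)
		return 0;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || end == s || *end != '\0' || v < 0 || v > maxport)
		return -1;

	return (int)v;
}
```

A caller might use it as demo_parse_portnum(getenv("SIM_PORT"), numports)
and treat a negative result as a configuration error.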


From sean.hefty at intel.com  Sat Feb 14 17:59:47 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 17:59:47 -0800
Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <20090214200408.GZ14416@sashak.voltaire.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
	<20090214191101.GY14416@sashak.voltaire.com>
	<C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>
	<20090214200408.GZ14416@sashak.voltaire.com>
Message-ID: <EAEC9757911B43429446DA4BE0E7F904@amr.corp.intel.com>

>static const struct ibdiag_opt *opts_map[256];
>
>and later:
>
>	memset(opts_map, 0, sizeof(opts_map));

With the above:

warnings in directory c:\mshefty\scm\winof\branches\winverbs\tools\infiniband-diags\src\sminfo
c:\mshefty\scm\winof\branches\winverbs\tools\infiniband-diags\src\ibdiag_common.c(276) : warning C4090: 'function' : different 'const' qualifiers

(line 276 is the memset)

With const removed, Windows builds fine, but then Linux build gives:

src/ibdiag_common.c: In function `ibdiag_process_opts':
src/ibdiag_common.c:282: warning: passing arg 3 of `make_long_opts' from incompatible pointer type

So, I gave up and added a cast at the memset.

- Sean
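
For reference, a cast at the memset could look roughly like the sketch
below. The exact form of the cast used in the WinOF change is not shown
in this message, so (void *) is an assumption, and cast_opt, cast_map
and cast_clear_map are hypothetical names.

```c
#include <string.h>

/* Hypothetical stand-in for struct ibdiag_opt. */
struct cast_opt {
	const char *name;
};

/* Same shape as opts_map: writable slots, const targets. */
static const struct cast_opt *cast_map[256];

static const struct cast_opt cast_help = { "help" };

/*
 * Clear every slot. The explicit (void *) cast spells out the conversion
 * that memset() would otherwise perform implicitly. All-bits-zero is a
 * null pointer on mainstream platforms, though ISO C does not strictly
 * guarantee it.
 */
static void cast_clear_map(void)
{
	memset((void *)cast_map, 0, sizeof(cast_map));
}
```

The const declaration stays in place, so the Linux-side prototype that
takes the array keeps matching.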


From sean.hefty at intel.com  Sat Feb 14 18:04:33 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 14 Feb 2009 18:04:33 -0800
Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <EAEC9757911B43429446DA4BE0E7F904@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>	<20090214155601.GL14416@sashak.voltaire.com>	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>	<20090214191101.GY14416@sashak.voltaire.com>	<C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>	<20090214200408.GZ14416@sashak.voltaire.com>
	<EAEC9757911B43429446DA4BE0E7F904@amr.corp.intel.com>
Message-ID: <A39F0F6124B54C4C92A40FAE3B81E98D@amr.corp.intel.com>

>>static const struct ibdiag_opt *opts_map[256];
>>
>>and later:
>>
>>	memset(opts_map, 0, sizeof(opts_map));

And on second thought... can we just drop the memset completely?


From sashak at voltaire.com  Sat Feb 14 21:07:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Feb 2009 07:07:25 +0200
Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF
In-Reply-To: <A39F0F6124B54C4C92A40FAE3B81E98D@amr.corp.intel.com>
References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com>
	<20090214155601.GL14416@sashak.voltaire.com>
	<4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com>
	<20090214191101.GY14416@sashak.voltaire.com>
	<C6EF6813561D428B9D7E75D07F710B41@amr.corp.intel.com>
	<20090214200408.GZ14416@sashak.voltaire.com>
	<EAEC9757911B43429446DA4BE0E7F904@amr.corp.intel.com>
	<A39F0F6124B54C4C92A40FAE3B81E98D@amr.corp.intel.com>
Message-ID: <20090215050725.GA14416@sashak.voltaire.com>

On 18:04 Sat 14 Feb     , Sean Hefty wrote:
> >>static const struct ibdiag_opt *opts_map[256];
> >>
> >>and later:
> >>
> >>	memset(opts_map, 0, sizeof(opts_map));
> 
> And on second thought... can we just drop the memset completely?

Yes, this is static and therefore already zero-initialized.

Sasha
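
A small sketch of why the memset is redundant: in C, objects with static
storage duration and no explicit initializer start out with every
pointer member null. The names zero_opt, zero_map and zero_count_set
are hypothetical and only illustrate the point.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ibdiag_opt. */
struct zero_opt {
	const char *name;
};

/* Static storage, no initializer: every slot starts as a null pointer. */
static const struct zero_opt *zero_map[256];

/* Count the slots that hold a non-null pointer. */
static size_t zero_count_set(void)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(zero_map) / sizeof(zero_map[0]); i++)
		if (zero_map[i])
			n++;
	return n;
}
```

Before anything is stored in the array, zero_count_set() reports no
populated slots, which is the state the memset was trying to establish.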


From sashak at voltaire.com  Sat Feb 14 23:45:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Feb 2009 09:45:42 +0200
Subject: [ofa-general] [PATCH RESEND] ibutils: remove -libcommon linkage flag
Message-ID: <20090215074542.GA7189@sashak.voltaire.com>


Remove the -libcommon linkage flag: libibumad no longer depends on
libibcommon, and libibcommon will be removed from the management tree
soon.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

libibcommon has already been removed, and the ibutils build against
OpenSM master is broken.

 config/osm.m4          |    2 +-
 ibis/config/osm.m4     |    2 +-
 ibmgtsim/config/osm.m4 |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/config/osm.m4 b/config/osm.m4
index da9ae81..f8d92d7 100644
--- a/config/osm.m4
+++ b/config/osm.m4
@@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then
       elif test -L $with_osm_libs/libopensm.so; then
          OSM_VENDOR=openib
          osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB "
-         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon"
+         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad"
       else
          AC_MSG_ERROR([OSM: Fail to recognize vendor type])
       fi
diff --git a/ibis/config/osm.m4 b/ibis/config/osm.m4
index da9ae81..f8d92d7 100644
--- a/ibis/config/osm.m4
+++ b/ibis/config/osm.m4
@@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then
       elif test -L $with_osm_libs/libopensm.so; then
          OSM_VENDOR=openib
          osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB "
-         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon"
+         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad"
       else
          AC_MSG_ERROR([OSM: Fail to recognize vendor type])
       fi
diff --git a/ibmgtsim/config/osm.m4 b/ibmgtsim/config/osm.m4
index da9ae81..f8d92d7 100644
--- a/ibmgtsim/config/osm.m4
+++ b/ibmgtsim/config/osm.m4
@@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then
       elif test -L $with_osm_libs/libopensm.so; then
          OSM_VENDOR=openib
          osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB "
-         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon"
+         OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad"
       else
          AC_MSG_ERROR([OSM: Fail to recognize vendor type])
       fi
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sat Feb 14 23:47:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Feb 2009 09:47:01 +0200
Subject: [ofa-general] [PATCH] ibutils: use osm_config.h file instead of
	osm_build_id.h
Message-ID: <20090215074701.GB7189@sashak.voltaire.com>


Use the standard osm_config.h file for OpenSM build mode detection rather
than the invalid osm_build_id.h junk file, which will be removed from the
OpenSM tree very soon.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 config/osm.m4          |    7 ++++---
 ibis/config/osm.m4     |    7 ++++---
 ibmgtsim/config/osm.m4 |    7 ++++---
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/config/osm.m4 b/config/osm.m4
index f8d92d7..cc50fdf 100644
--- a/config/osm.m4
+++ b/config/osm.m4
@@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then
 
 
    dnl validate the defined path - so the build id header is there
-   AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,,
-      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h]))
+   AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,,
+      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h]))
 
    dnl now figure out somehow if the build was for debug or not
-   if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then
+   grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null
+   if test $? -eq 0 ; then
       dnl why did they need so many ???
       osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG'
       AC_MSG_NOTICE(OSM: compiled in DEBUG mode)
diff --git a/ibis/config/osm.m4 b/ibis/config/osm.m4
index f8d92d7..cc50fdf 100644
--- a/ibis/config/osm.m4
+++ b/ibis/config/osm.m4
@@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then
 
 
    dnl validate the defined path - so the build id header is there
-   AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,,
-      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h]))
+   AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,,
+      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h]))
 
    dnl now figure out somehow if the build was for debug or not
-   if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then
+   grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null
+   if test $? -eq 0 ; then
       dnl why did they need so many ???
       osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG'
       AC_MSG_NOTICE(OSM: compiled in DEBUG mode)
diff --git a/ibmgtsim/config/osm.m4 b/ibmgtsim/config/osm.m4
index f8d92d7..cc50fdf 100644
--- a/ibmgtsim/config/osm.m4
+++ b/ibmgtsim/config/osm.m4
@@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then
 
 
    dnl validate the defined path - so the build id header is there
-   AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,,
-      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h]))
+   AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,,
+      AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h]))
 
    dnl now figure out somehow if the build was for debug or not
-   if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then
+   grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null
+   if test $? -eq 0 ; then
       dnl why did they need so many ???
       osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG'
       AC_MSG_NOTICE(OSM: compiled in DEBUG mode)
-- 
1.6.1.2.319.gbd9e
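
The configure check above greps for '#define OSM_DEBUG 1'. From C code
the same information could be tested at compile time, roughly as below.
The fallback definition only keeps the sketch self-contained; in a real
build the macro would come from <opensm/osm_config.h>, and
demo_is_debug_build is a hypothetical name.

```c
/* Stand-in for the definition that osm_config.h provides in debug builds. */
#ifndef OSM_DEBUG
#define OSM_DEBUG 1
#endif

/* Returns 1 when the tree was configured in debug mode, 0 otherwise. */
static int demo_is_debug_build(void)
{
#if defined(OSM_DEBUG) && OSM_DEBUG
	return 1;
#else
	return 0;
#endif
}
```

With the stand-in definition in place the function returns 1; without
the macro, or with it defined to 0, the #else branch is compiled.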


From sashak at voltaire.com  Sun Feb 15 00:25:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Feb 2009 10:25:40 +0200
Subject: [ofa-general] [PATCH v2] ibutils/ibis: link ibis dynamically
Message-ID: <20090215082540.GC7189@sashak.voltaire.com>


Otherwise, when running against ibsim with libumad2sim.so preloaded, ibis
ends up with two instances of libibumad (one linked statically and one
loaded dynamically as a result of the libumad2sim.so preload), each with
its own internal initialization state, which makes it impossible to use
ibutils in an ibsim environment.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

The difference against previous version of the patch is noinst_LIBRARIES
use, so libibiscom will not be installed.

 ibis/src/Makefile.am |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am
index e0b512f..cfa22f6 100644
--- a/ibis/src/Makefile.am
+++ b/ibis/src/Makefile.am
@@ -54,9 +54,10 @@ AM_CXXFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing -fPIC  -
 LIB_VER_TRIPLET="1:0:0"
 LIB_FILE_TRIPLET=1.0.0
 
-lib_LTLIBRARIES = libibiscom.la libibis.la
+lib_LTLIBRARIES = libibis.la
+noinst_LIBRARIES = libibiscom.a
 
-libibiscom_la_SOURCES = ibbbm.c ibcr.c	ibis.c ibis_gsi_mad_ctrl.c \
+libibiscom_a_SOURCES = ibbbm.c ibcr.c	ibis.c ibis_gsi_mad_ctrl.c \
 	ibpm.c ibsac.c ibsm.c ibvs.c ibcc.c
 
 # client library to be used by IBIS TCL package:
@@ -70,11 +71,9 @@ bin_PROGRAMS = ibis
 
 # this is used for the libraries link
 LDADD = $(OSM_LDFLAGS)
-# AM_LDFLAGS = -static
 
 ibis_SOURCES = ibissh_wrap.cpp
 
-ibis_LDFLAGS = -static
 # note the order of the libraries does matter as we static link
 ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS)
 
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sun Feb 15 00:27:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Feb 2009 10:27:12 +0200
Subject: [ofa-general] [PATCH] opensm/Makefile.am: remove osm_build_id.h junk
	file generation
Message-ID: <20090215082712.GD7189@sashak.voltaire.com>


osm_build_id.h is not a valid C file. This is only used for OpenSM debug
mode build determination, which is now available using OSM_DEBUG macro
from osm_config.h.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/Makefile.am |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index 2287edd..75b6dc5 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -7,12 +7,6 @@ ACLOCAL_AMFLAGS = -I config
 
 # we should provide a hint for other apps about the build mode of this project
 install-exec-hook:
-	mkdir -p $(DESTDIR)/$(includedir)
-if DEBUG
-	echo "define osm_build_type \"debug\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h
-else
-	echo "define osm_build_type \"free\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h
-endif
 	$(top_srcdir)/config/install-sh -m 755 -d $(DESTDIR)/$(sysconfdir)/init.d
 	cp $(top_builddir)/scripts/opensm.init $(DESTDIR)/$(sysconfdir)/init.d/opensmd
 	chmod 755 $(DESTDIR)/$(sysconfdir)/init.d/opensmd
-- 
1.6.1.2.319.gbd9e
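
For external build scripts that used to read osm_build_id.h, the replacement check described in the commit message can be sketched as follows. This is only an illustration, assuming that debug builds of OpenSM leave a literal `#define OSM_DEBUG 1` line in opensm/osm_config.h (the same pattern the ibutils osm.m4 patches grep for). The function name `osm_is_debug_build` and its include-directory argument are hypothetical and not part of any OpenSM or ibutils script.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the OpenSM headers under the given
# include directory come from a debug build, 1 when they do not, and
# 2 when opensm/osm_config.h cannot be found at all.
# Assumption: debug builds contain the line '#define OSM_DEBUG 1'.
osm_is_debug_build() {
    osm_include_dir=$1
    config_h="$osm_include_dir/opensm/osm_config.h"

    # Report a missing header with its own status instead of "not debug".
    [ -f "$config_h" ] || return 2

    # Same pattern as the osm.m4 check; only the exit status is used.
    grep '#define OSM_DEBUG 1' "$config_h" > /dev/null
}
```

The function would be called with whichever directory contains opensm/osm_config.h on the build host, and its exit status used in place of the old `grep debug ... | wc -l` test.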


From vlad at lists.openfabrics.org  Sun Feb 15 03:15:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 15 Feb 2009 03:15:13 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090215-0200 daily build status
Message-ID: <20090215111514.11A0EE301A9@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp

Failed:
Build failed on ia64 with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.16.21-0.8-default
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16.21-0.8-default'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.17'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.21.1
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.23
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.23'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.22
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.25
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.25'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.26
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.26'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.17'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.18-8.el5
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18-8.el5'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown':
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree'
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------


From tziporet at dev.mellanox.co.il  Sun Feb 15 07:56:30 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Feb 2009 17:56:30 +0200
Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp
In-Reply-To: <ada63jf6znf.fsf@cisco.com>
References: <4994A1FD.2060704@oracle.com>	<EC7160704069456F96A1BB2504513BDB@amr.corp.intel.com>	<4994A625.9060008@oracle.com>
	<ada63jf6znf.fsf@cisco.com>
Message-ID: <49983B2E.6000802@mellanox.co.il>

Roland Dreier wrote:
> Is SQD really not supported by ConnectX?  If so it is likely a temporary
> firmware issue I would think.
>
>
>   
It's FW, but we do not plan to add it in the near future.

Tziporet


From neutronsharc at gmail.com  Sun Feb 15 14:40:36 2009
From: neutronsharc at gmail.com (neutron)
Date: Sun, 15 Feb 2009 17:40:36 -0500
Subject: [ofa-general] IB function calls in kernel module fail
Message-ID: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>

Hi all,

I'm writing a kernel module that makes use of basic IB verbs to
communicate, like:
ib_register_client,  ib_unregister_client,  ib_alloc_pd,
ib_create_qp,  ib_reg_phys_mr,  etc.

I can compile the code into a kernel module:  ib_rdma_lat.ko.   This
module tests RDMA write latency from a kernel module.

But when I "insmod" it, I get error reports in /var/log/messages:

Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_unregister_client
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_unregister_client
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_create_cq
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_reg_phys_mr
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_dereg_mr
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_register_client
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_register_client
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_destroy_cq
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_query_port
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_alloc_pd
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_create_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_modify_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_destroy_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_dealloc_pd
Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd

I'm running RHEL5.  I have rebooted the node many times, but that didn't
help at all.

[wci11-oib:~/dist_lock/ib_kernel]uname -a
Linux wci11-oib 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:46 EST 2008
x86_64 x86_64 x86_64 GNU/Linux


"ofed_info" is:
[wci11-oib:~/dist_lock/ib_kernel]/usr/bin/ofed_info
OFED-1.3.1
libibverbs:
git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
libmthca:
git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
commit 9501e698d257949acfab2edc90812602966dbcc9
libmlx4:
git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
......


I'm pretty sure all IB modules are loaded already:
[wci11-oib:~/dist_lock/ib_kernel]lsmod | grep ib
ib_sdp                125020  0
rdma_cm                67348  2 rdma_ucm,ib_sdp
ib_addr                41992  1 rdma_cm
ib_ipoib              113248  0
ib_cm                  67368  3 qlgc_vnic,rdma_cm,ib_ipoib
ib_sa                  74632  4 qlgc_vnic,rdma_cm,ib_ipoib,ib_cm
ib_uverbs              75568  1 rdma_ucm
ib_umad                50600  0
ib_ipath              346316  0
mlx4_ib                95932  0
mlx4_core             109008  1 mlx4_ib
ib_mthca              159044  0
ib_mad                 70948  5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca
ib_core                97664  15 rdma_ucm,qlgc_vnic,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,iw_cxgb3,ib_ipath,mlx4_ib,ib_mthca,ib_mad
libiscsi               61952  1 iscsi_tcp
scsi_transport_iscsi    67344  3 iscsi_tcp,libiscsi
ipoib_helper           35728  2 ib_ipoib
ipv6                  411425  43 ib_ipoib
libata                160849  1 ata_piix
scsi_mod              186361  6 iscsi_tcp,libiscsi,scsi_transport_iscsi,sg,libata,sd_mod


"service openibd status" reports the status is OK:
[wci11-oib:~/dist_lock/ib_kernel]sudo service openibd status

  HCA driver loaded

Configured devices:
ib0 ib1 ib2 ib3

Currently active devices:
ib0
ib2

The following OFED modules are loaded:

  rdma_ucm
  qlgc_vnic
  ib_sdp
  rdma_cm
  ib_addr
  ib_ipoib
  ib_ipath
  mlx4_core
  mlx4_ib
  ib_mthca
  ib_uverbs
  ib_umad
  ib_sa
  ib_cm
  ib_mad
  ib_core
  iw_cxgb3


I have no idea what's going on.    Any suggestions?
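
The messages above come in pairs, so the list of affected symbols is easier to read once reduced to unique names. A minimal sketch, assuming the "Unknown symbol" wording shown above and that each message sits on one line in /var/log/messages; the function name `unknown_symbols` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print, sorted and
# de-duplicated, the symbols reported as "Unknown symbol" for one module.
# Assumption: lines look like
#   "... kernel: <module>: Unknown symbol <name>"
unknown_symbols() {
    module=$1
    # Keep only the symbol name from matching lines, then remove repeats.
    sed -n "s/.*kernel: ${module}: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p" | sort -u
}
```

For the log in this message it would be run as `unknown_symbols ib_rdma_lat < /var/log/messages`.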


From wangwhao at cn.ibm.com  Sun Feb 15 17:29:30 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Mon, 16 Feb 2009 09:29:30 +0800
Subject: ***SPAM*** Re: [ofa-general] sminfo report iberror in the
	first	configuration	on RHEL5.3
In-Reply-To: <1234541612.751.1.camel@firewall.xsintricity.com>
Message-ID: <OFE05F3CA8.254F9530-ON4825755F.00075E72-4825755F.0008310B@cn.ibm.com>


Wen Hao Wang (王文昊)

Software Engineer
IBM China Software Development Laboratory
Email: wangwhao at cn.ibm.com
Tel: 86-10-82451055
Fax: 86-10-82782244 ext. 2312
Address: 1/F, IBM ZGC Campus. Ring Building 28,ZhongGuanCun Software
Park,No.8 Dong Bei Wang West Road, Haidian District Beijing, 100193,
P.R.China


Doug Ledford <dledford at redhat.com> wrote on 2009-02-14 00:13:32:

> On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote:
> > Doug Ledford <dledford at redhat.com> wrote on 2009-02-12 21:20:30:
> >
> > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> > > > Wen Hao Wang wrote:
> > > > >
> > > > > Hi all:
> > > > >
> > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped
> > > > > in RHEL5.3 image) by "yum groupinstall". Then I loaded some drivers and
> > > > > wrote network interface configuration file ifcfg-ib0. ifup ib0 also
> > > > > succeeded. But IB utilities report "Connection timed out".
> > > > >
> > > > >
> > > > > [root at xblade06 network-scripts]# sminfo
> > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > > > > sminfo: iberror: failed: query
> > > > >
> > > > > I had to reboot the blade and rerun "openibd start". Then sminfo
> > > > > reported correct contents. I do not suppose this reboot is required.
> > > > > Did I miss any configuration step?
> > >
> > > There was an unintentional bug in the rhel5.2 openibd init script in
> > > that it automatically turned itself on during install (generally, most
> > > init scripts should default to *not* turning themselves on during
> > > install of the package, nor should they start themselves during install
> > > of the package...this is for security reasons, imagine if you installed
> > > the bind name server on your box and it automatically started up before
> > > you had a chance to configure it).  In rhel5.3 we fixed that bug.  So,
> >
> > Yeah. I heard of this bug.
> >
> > > you may need to 'chkconfig --level 2345 openibd on' to make sure openibd
> > > starts up each time.  The error you list above is consistent with not
> > > all of the kernel modules being loaded when you tried to use the sminfo
> > > program.
> >
> > Even after reboot, service openibd is not started automatically.
> > [root at xblade06 ~]# chkconfig --list openibd
> > openibd         0:off   1:off   2:off   3:off   4:off   5:off   6:off
>
> That's because you have to run the command I listed in my first email to
> turn it on.
>

I totally agree with this. But I am still confused about why sminfo gave
errors before the reboot, and which steps I should take before a reboot when
using OFED for the first time. As far as I can see, whether the service is
registered in the system runlevel database is unrelated to the sminfo error.
Please correct me if that is not the case.

> > I agree with you that maybe some modules were not loaded. But which ones?
> > Before reboot, I ran "/etc/init.d/openibd start" and "/etc/init.d/network
> > restart". No error was reported. "openibd status" also looked good.
>
> Running start on a service does not enable that service at the next
> reboot.  You must specifically enable the service in order for it to
> start automatically.
>
> > >
> > > > > Moreover, "openibd start" reports one warning message about hwconf.
> > > > > Does anyone have comments about this?
> > > > >
> > > > > [root at xblade07 ~]# /etc/init.d/openibd start
> > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such
> > > > > file or directory
> > > > > [ OK ]
> > >
> > > Can you see if the kudzu package is installed on your machine?  The
> > > openib package uses this config file written by kudzu to determine what
> > > hardware drivers to load.  I suppose I should put a specific requires in
> > > the rpm for that.
> >
> > kudzu is installed.
> > [root at xblade06 ~]# rpm -q kudzu
> > kudzu-1.2.57.1.21-1
>
> Make sure kudzu has been run at least once then (it would appear to be
> turned off on your machine or else /etc/sysconfig/hwconf would exist).
> You can run it manually from the command line and that should be
> sufficient for the openibd init script's needs.
>
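
A small sketch of the check described above, in case it helps someone else. The function name hwconf_state is hypothetical; the only thing taken from this thread is the path /etc/sysconfig/hwconf and the advice to run kudzu once by hand if the file is absent.

```shell
# Sketch: report whether the hardware config file written by kudzu exists.
# $1 is optional and defaults to the path the openibd init script greps.
hwconf_state() {
    if [ -f "${1:-/etc/sysconfig/hwconf}" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# On the affected host (not run here):
#   [ "$(hwconf_state)" = "missing" ] && kudzu   # run kudzu once by hand
#   /etc/init.d/openibd start
```
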

Yes. After kudzu created the file on my machine, the openibd script reported
no error this time. In my scenario, is "openibd restart" still
needed?

Many thanks!

Wen Hao Wang
Email: wangwhao at cn.ibm.com

> --
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband
>
> [attachment "signature.asc" deleted by Wen Hao Wang/China/IBM]

From subbukl at gmail.com  Mon Feb 16 01:51:13 2009
From: subbukl at gmail.com (subbu kl)
Date: Mon, 16 Feb 2009 15:21:13 +0530
Subject: ***SPAM*** Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
	working
In-Reply-To: <f3b32c250902120020y5d73f054nd38d00e3063f67b3@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com>
	<f3b32c250902112152l3c04efc5m39fa4ce0b2f71385@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969A15@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112242m151a48f4pe71d3a46bb5c34fc@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AB2@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112258xd4c11fha7748129d0367907@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969AC4@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902112345v2e46dc93g9ff086d8159ceb6@mail.gmail.com>
	<E2263E4A5B2284449EEBD0AAB751098401C7969B96@PDSMSX501.ccr.corp.intel.com>
	<f3b32c250902120020y5d73f054nd38d00e3063f67b3@mail.gmail.com>
Message-ID: <f3b32c250902160151w7f7c1ee4qb00004bfec15ef4b@mail.gmail.com>

Does anyone have any clue on this?
I am seeing the same issue with a CentOS 5.2 HVM guest on Xen 3.4
unstable as well!
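
For reference, the "(0140 -> 0142)" pairs in the dom0 log quoted below are the old and new values of the PCI command register, and bit 1 (mask 0x2) of that register is Memory Space Enable. The sketch below only decodes that bit; the helper name is made up, and treating this bit as the reason lspci prints "[disabled]" is my reading, not something confirmed in this thread.

```shell
# Sketch: decode the Memory Space Enable bit (bit 1, mask 0x2) of a PCI
# command register value given as hex, e.g. the "0140" in "(0140 -> 0142)".
mem_space_enabled() {
    if [ $(( 0x$1 & 0x2 )) -ne 0 ]; then
        echo "memory decoding enabled"
    else
        echo "memory decoding disabled"
    fi
}

# Values taken from the dom0 log quoted below:
mem_space_enabled 0140
mem_space_enabled 0142

# On a live system the current value could be read with setpci from pciutils
# (not run here; the device address is the one from this thread):
#   mem_space_enabled "$(setpci -s 0e:00.0 COMMAND)"
```

If the bit is clear in the guest when ib_mthca posts QUERY_FW, an MMIO write to the HCR would not reach the card, which would fit the failure, but that is a guess.
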

~subbu

On Thu, Feb 12, 2009 at 1:50 PM, subbu kl <subbukl at gmail.com> wrote:

> I did a quick search, and I believe it is MMIO.
>
> In http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_main.c
> the call to mthca_QUERY_FW() results in
> mthca_cmd_post_dbell()/mthca_cmd_post_hcr(), which in turn results in
>
> __raw_writel((__force u32) cpu_to_be32(in_param >> 32),           ptr + offs[0]);
>
> in the file http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_cmd.c
>
> The OFED people are better placed to comment on whether I have missed
> something. Roland, any clue?
>
> ~subbu
>
>
> On Thu, Feb 12, 2009 at 1:31 PM, Jiang, Yunhong <yunhong.jiang at intel.com>wrote:
>
>>  Can you please share more information on how ib_mthca does QUERY_FW?
>> Through config space access? Through MMIO access? I think more information
>> would be helpful. The only thing that seems strange to me is that, from "Memory at
>> fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]", the MMIO
>> seems to be disabled.
>>
>> Thanks
>> Yunhong Jiang
>>
>>  ------------------------------
>> *From:* subbu kl [mailto:subbukl at gmail.com]
>> *Sent:* February 12, 2009 15:46
>>
>> *To:* Jiang, Yunhong
>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>> general at lists.openfabrics.org
>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>> working
>>
>> So, back to square one?
>> Why should QUERY_FW fail in domU?
>>
>> ~subbu
>>
>> On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong <yunhong.jiang at intel.com
>> > wrote:
>>
>>>  DomU accesses config space through pcibackend, so that message is OK.
>>>
>>>  ------------------------------
>>>  *From:* subbu kl [mailto:subbukl at gmail.com]
>>> *Sent:* February 12, 2009 14:59
>>>
>>> *To:* Jiang, Yunhong
>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>> general at lists.openfabrics.org
>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>> working
>>>
>>>   So will getting PCI config space access in domU solve the problem? If
>>> so, how can I achieve that?
>>>
>>> ~subbu
>>>
>>> On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong <
>>> yunhong.jiang at intel.com> wrote:
>>>
>>>>  Sorry, it seems the original mail already tried permissive mode :$
>>>> So how does the card issue the QUERY_FW command? Through config
>>>> space or through MMIO? The following information looks strange: why are all
>>>> the MMIO ranges disabled?
>>>>
>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>
>>>> As for the following message, I think it should be harmless, since
>>>> domU has no config space access method.
>>>>   PCI: Fatal: No PCI config space access function found
>>>>
>>>> Thanks
>>>> Yunhong Jiang
>>>>
>>>>  ------------------------------
>>>>  *From:* subbu kl [mailto:subbukl at gmail.com]
>>>> *Sent:* February 12, 2009 14:43
>>>>
>>>> *To:* Jiang, Yunhong
>>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>>> general at lists.openfabrics.org
>>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>> working
>>>>
>>>>   Oops, I missed it.
>>>>
>>>> Well, now I don't see that enable permissive... message. Here are the
>>>> messages I got in dom0 while booting domU:
>>>>
>>>> tap tap-1-51712: 2 getting info
>>>> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
>>>> device vif1.0 entered promiscuous mode
>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>>> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
>>>> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>> PCI: Setting latency timer of device 0000:0e:00.0 to 64
>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
>>>> xenbr0: topology change detected, propagating
>>>> xenbr0: port 3(vif1.0) entering forwarding state
>>>>
>>>> Is there any suspicious message here?
>>>> Any idea why I get these:
>>>>  PCI: Fatal: No PCI config space access function found
>>>> rtc: IRQ 8 is not free.
>>>>
>>>> messages in the domU bootup output?
>>>>
>>>> ~subbu
>>>>
>>>> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong <
>>>> yunhong.jiang at intel.com> wrote:
>>>>
>>>>>  So any changes in dom0's dmesg?
>>>>>
>>>>>
>>>>>  ------------------------------
>>>>> *From:* subbu kl [mailto:subbukl at gmail.com]
>>>>> *Sent:* February 12, 2009 13:52
>>>>> *To:* Jiang, Yunhong
>>>>> *Cc:* David Brown; xen-devel at lists.xensource.com;
>>>>> general at lists.openfabrics.org
>>>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>>> working
>>>>>
>>>>>   no luck !
>>>>>  dmesg in XEN PV guest shows :
>>>>>
>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>>
>>>>> even after executing the following in dom0:
>>>>>
>>>>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive
>>>>>
>>>>> I am getting the following messages on the console as part of the
>>>>> initial bootup messages of the guest:
>>>>>
>>>>> Started domain rhel52_64_3
>>>>> PCI: Fatal: No PCI config space access function found
>>>>> rtc: IRQ 8 is not free.
>>>>> i8042.c: No controller found.
>>>>>
>>>>> after executing the following in dom0 :
>>>>> #xm create -c rhel52_64_3
>>>>>
>>>>>
>>>>> so the problem persists,
>>>>>
>>>>> ~subbu
>>>>>
>>>>>
>>>>> 2009/2/12 Jiang, Yunhong <yunhong.jiang at intel.com>
>>>>>
>>>>>>  It seems this is because the PCI frontend tries to write a configuration
>>>>>> space field for which PCIback has no config_field entry.
>>>>>> I think you can first try what dom0's dmesg suggests: "see
>>>>>> permissive attribute in sysfs" (it should be "set permissive attribute...",
>>>>>> I think).
>>>>>>
>>>>>> BTW, where did you get the following log? It seems to suggest that no config
>>>>>> space access function was found.
>>>>>>
>>>>>> PCI: Fatal: No PCI config space access function found
>>>>>> rtc: IRQ 8 is not free.
>>>>>> i8042.c: No controller found.
>>>>>>
>>>>>> -- Yunhong Jiang
>>>>>>
>>>>>>  ------------------------------
>>>>>> *From:* xen-devel-bounces at lists.xensource.com [mailto:
>>>>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl
>>>>>> *Sent:* February 11, 2009 22:18
>>>>>> *To:* David Brown
>>>>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org
>>>>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not
>>>>>> working
>>>>>>
>>>>>>   I am getting the same QUERY_FW failure on RHEL5.2 with a Xen
>>>>>> paravirtualized guest and the pciback module.
>>>>>>
>>>>>> No one seems to have tried answering this question on the list, let me
>>>>>> ping xen-devel and ofed people again.
>>>>>>
>>>>>> after executing in dom0
>>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
>>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
>>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind
>>>>>>
>>>>>> #dmesg
>>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>>>> tap tap-1-51712: 2 getting info
>>>>>> tap tap-2-51712: 2 getting info
>>>>>> pciback 0000:0e:00.0: seizing device
>>>>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
>>>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled
>>>>>>
>>>>>> #xm create -c rhel52_64_3
>>>>>>
>>>>>> PCI: Fatal: No PCI config space access function found
>>>>>> rtc: IRQ 8 is not free.
>>>>>> i8042.c: No controller found.
>>>>>>
>>>>>>
>>>>>> GUEST dmesg:
>>>>>>
>>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>>>
>>>>>> in dom0:
>>>>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
>>>>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to
>>>>>> virtual slot 0
>>>>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
>>>>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not
>>>>>> ready
>>>>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9,
>>>>>> protocol 1 (x86_64-abi)
>>>>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to
>>>>>> write to a read-only configuration space field at offset 0x44, size 2. This
>>>>>> may be harmless, but if you have problems with your device:
>>>>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
>>>>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel
>>>>>> mailing list along with details of your device obtained from lspci.
>>>>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000
>>>>>> -> 0002)
>>>>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] ->
>>>>>> GSI 16 (level, low) -> IRQ 16
>>>>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device
>>>>>> 0000:0e:00.0 disabled
>>>>>>
>>>>>>
>>>>>>
>>>>>> some more details - [root at p128 ~]# rpm -qa | grep xen
>>>>>> kernel-xen-2.6.18-92.1.22.el5
>>>>>> xen-3.0.3-64.el5_2.9
>>>>>> xen-libs-3.0.3-64.el5_2.9
>>>>>> xen-libs-3.0.3-64.el5_2.9
>>>>>>
>>>>>> [root at p128 ~]# ibv_devinfo
>>>>>> hca_id: mthca0
>>>>>>         fw_ver:                         5.3.0
>>>>>>         node_guid:                      0002:c902:0022:cd48
>>>>>>         sys_image_guid:                 0002:c902:0022:cd4b
>>>>>>         vendor_id:                      0x02c9
>>>>>>         vendor_part_id:                 25218
>>>>>>         hw_ver:                         0x20
>>>>>>         board_id:                       MT_0370130002
>>>>>>         phys_port_cnt:                  2
>>>>>>                 port:   1
>>>>>>                         state:                  PORT_INIT (2)
>>>>>>                         max_mtu:                2048 (4)
>>>>>>                         active_mtu:             512 (2)
>>>>>>                         sm_lid:                 0
>>>>>>                         port_lid:               0
>>>>>>                         port_lmc:               0x00
>>>>>>
>>>>>>                 port:   2
>>>>>>                         state:                  PORT_DOWN (1)
>>>>>>                         max_mtu:                2048 (4)
>>>>>>                         active_mtu:             512 (2)
>>>>>>                         sm_lid:                 0
>>>>>>                         port_lid:               0
>>>>>>                         port_lmc:               0x00
>>>>>>
>>>>>>
>>>>>> any help greatly appreciated.
>>>>>>
>>>>>> ~subbu
>>>>>>
>>>>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown <dmlb2000 at gmail.com>wrote:
>>>>>>
>>>>>>> Okay so my question to the openfabrics guys is, why would the OFED
>>>>>>> drivers fail to read the firmware?
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - David Brown
>>>>>>>
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: David Brown <dmlb2000 at gmail.com>
>>>>>>> Date: Thu, Sep 11, 2008 at 2:24 PM
>>>>>>> Subject: pciback module not working
>>>>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>>>>>>>
>>>>>>>
>>>>>>> This issue was brought up about a year and a half ago. So I'll bring
>>>>>>> it up again and see if anything happens.
>>>>>>>
>>>>>>> I've got an infiniband network and am attempting to pass the
>>>>>>> infiniband card through the host and give it to the guest.
>>>>>>> I'm working with standard CentOS 5.2 on both guest and host with
>>>>>>> their
>>>>>>> provided xen (3.0.3 ish). I've also attempted to install the newest
>>>>>>> Xen 3.3 and use their standard host kernel and that did the same
>>>>>>> thing. The guest dmesg output in the guest is similar on both
>>>>>>> permissive and normal mode.
>>>>>>>
>>>>>>> I'm getting issues with detecting the firmware on the card for some
>>>>>>> reason...
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - David Brown
>>>>>>>
>>>>>>> === GUEST dmesg output ===
>>>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>>>>> ib_mthca: Initializing 0000:00:00.0
>>>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
>>>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64
>>>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11
>>>>>>> =======================
>>>>>>>
>>>>>>> === Host modprobe.conf ===
>>>>>>> alias eth0 bnx2
>>>>>>> alias eth1 bnx2
>>>>>>> alias scsi_hostadapter cciss
>>>>>>> options pciback hide=(41:00.0)
>>>>>>> =====================
>>>>>>>
>>>>>>> === Host lspci output ===
>>>>>>> # lspci -vs 41:00.0
>>>>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>>>> HCA] (rev 20)
>>>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>>>       Flags: fast devsel, IRQ 16
>>>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>>>> [size=1M]
>>>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>>>       Capabilities: [40] Power Management version 2
>>>>>>>       Capabilities: [48] Vital Product Data
>>>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>>>>> Queue=0/5 Enable-
>>>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>>>> =====================
>>>>>>>
>>>>>>> This makes sure it get loaded first off before anything else.
>>>>>>> === Host mkinitrd cmd ===
>>>>>>> # mkinitrd -f --with=pciback --preload pciback
>>>>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
>>>>>>> ====================
>>>>>>>
>>>>>>> === Host pciback dmesg ===
>>>>>>> pciback 0000:41:00.0: Driver tried to write to a read-only
>>>>>>> configuration space field at offset 0x44, size 2. This may be
>>>>>>> harmless, but if you have problems with your device:
>>>>>>> 1) see permissive attribute in sysfs
>>>>>>> 2) report problems to the xen-devel mailing list along with details
>>>>>>> of
>>>>>>> your device obtained from lspci.
>>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>>>> ======================
>>>>>>>
>>>>>>> === Host pciback dmesg (after setting it permissive) ===
>>>>>>> pciback 0000:41:00.0: enabling permissive mode configuration space
>>>>>>> accesses!
>>>>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe!
>>>>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
>>>>>>> device vif1.0 entered promiscuous mode
>>>>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
>>>>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
>>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
>>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
>>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64
>>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled
>>>>>>> =========================================
>>>>>>>
>>>>>>> === Guest lspci output ===
>>>>>>> # lspci -v
>>>>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
>>>>>>> HCA] (rev 20)
>>>>>>>       Subsystem: Hewlett-Packard Company Unknown device 170a
>>>>>>>       Flags: fast devsel, IRQ 16
>>>>>>>       Memory at fdc00000 (64-bit, non-prefetchable) [disabled]
>>>>>>> [size=1M]
>>>>>>>       Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
>>>>>>>       Capabilities: [40] Power Management version 2
>>>>>>>       Capabilities: [48] Vital Product Data
>>>>>>>       Capabilities: [90] Message Signalled Interrupts: 64bit+
>>>>>>> Queue=0/5 Enable-
>>>>>>>       Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>>>>>>>       Capabilities: [60] Express Endpoint IRQ 0
>>>>>>> =====================
>>>>>>> _______________________________________________
>>>>>>> general mailing list
>>>>>>> general at lists.openfabrics.org
>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>>>
>>>>>>> To unsubscribe, please visit
>>>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> . . . s u b b u
>>>>>> "You've got to be original, because if you're like someone else, what
>>>>>> do they need you for?"


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"

From vlad at lists.openfabrics.org  Mon Feb 16 03:18:51 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 16 Feb 2009 03:18:51 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090216-0200 daily build status
Message-ID: <20090216111851.95C9AE6106E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at dev.mellanox.co.il  Mon Feb 16 03:19:14 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 16 Feb 2009 13:19:14 +0200
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
Message-ID: <49994BB2.3010206@mellanox.co.il>

neutron wrote:
> Hi all,
>
> I'm writing a kernel module that make use of basic IB verbs to
> communicate, like:
> ib_register_client,  ib_unregister_client,  ib_alloc_pd,
> ib_create_qp,  ib_reg_phys_mr,  etc.
>
> I can compile the code into a kernel module:  ib_rdma_lat.ko.   This
> module is to test the RDMA write latency from kernel module.
>
> But when I "insmod", I got error reports at /var/log/messages:
>
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_unregister_client
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_unregister_client
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_create_cq
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_reg_phys_mr
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_dereg_mr
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_register_client
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_register_client
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_destroy_cq
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_query_port
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_alloc_pd
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_create_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_modify_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_destroy_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
> symbol ib_dealloc_pd
> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd
>
> I'm running rhel5.  I have rebooted the node many times but didn't
> help at all.
>
>   
 From OFED_tips:
4. External Module Compilation Over OFED-1.4
===============================================================================

To build kernel modules depending on OFED's modules, take the Module.symvers
file from <prefix>/src/openib/Module.symvers (part of the kernel-ib-devel RPM),
copy it to your module's subdir, and then compile your module.

If <prefix>/src/openib/Module.symvers does not exist or it is empty, use the
create_Module.symvers.sh (a part of the ofed-docs RPM) script to create the
Module.symvers file.

See "Module versioning & Module.symvers" in the modules.txt from kernel
documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt).
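
As a rough sketch of the steps above: the helper below just looks up the CRC recorded for one symbol in a Module.symvers file, so you can compare it with what your .ko expects. The helper name, the OFED_PREFIX variable and the /path/to/ib_rdma_lat directory are placeholders I made up; the cp, make and modprobe lines are left as comments because they depend on your installation.

```shell
# Sketch: print the CRC recorded for an exported symbol in a Module.symvers file.
# Lines are tab-separated: CRC, symbol, module (plus export type on newer kernels).
symvers_crc() {
    # $1 = symbol name, $2 = path to Module.symvers
    awk -v sym="$1" '$2 == sym { print $1 }' "$2"
}

# Typical use on the build host (paths are assumptions, not run here):
#   cp "$OFED_PREFIX/src/openib/Module.symvers" /path/to/ib_rdma_lat/
#   make -C "/lib/modules/$(uname -r)/build" M=/path/to/ib_rdma_lat modules
#   symvers_crc ib_alloc_pd /path/to/ib_rdma_lat/Module.symvers
#   modprobe --dump-modversions /path/to/ib_rdma_lat/ib_rdma_lat.ko | grep ib_alloc_pd
```

If the two CRCs for a symbol differ, the module was built against the wrong Module.symvers, which is what the "disagrees about version of symbol" messages point at.
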


Tziporet


From john.russo at qlogic.com  Mon Feb 16 05:59:18 2009
From: john.russo at qlogic.com (John Russo)
Date: Mon, 16 Feb 2009 07:59:18 -0600
Subject: [ofa-general] ***SPAM*** Clearing port counters
Message-ID: <A331668DC876334996266B5A7756A013134E2D8F71@MNEXMB2.qlogic.org>

 "When accessing port counters through /sys/class/infiniband/<hca_name>/ports/<port_number>/counters/<counter_name> is there a way to clear the value in a counter (or the values in multiple counters)?"

__________________________
John F. Russo
Manager, Engineering
QLogic Corporation
780 Fifth Avenue, Suite 140
King of Prussia, PA 19406
Direct: 610-233-4866
Main: 610-233-4800
Fax: 610-233-4777
Cell: 610-246-9903
Email: John.Russo at qlogic.com<mailto:John.Russo at qlogic.com>
www.qlogic.com<http://www.qlogic.com>

True success is the undeniable truth that we have proved ourselves.
-Joe Luppino-Esposito

From hal.rosenstock at gmail.com  Mon Feb 16 06:09:56 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 16 Feb 2009 09:09:56 -0500
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Clearing port counters
In-Reply-To: <A331668DC876334996266B5A7756A013134E2D8F71@MNEXMB2.qlogic.org>
References: <AcmQPsK0PFL9rr4qS+mgo75L6N44pQ==>
	<A331668DC876334996266B5A7756A013134E2D8F71@MNEXMB2.qlogic.org>
Message-ID: <f0e08f230902160609v305b80caw9783116169649bbf@mail.gmail.com>

On Mon, Feb 16, 2009 at 8:59 AM, John Russo <john.russo at qlogic.com> wrote:
>  "When accessing port counters through
> /sys/class/infiniband/<hca_name>/ports/<port_number>/counters/<counter_name>
> is there a way to clear the value in a counter (or the values in multiple
> counters)?"

Only via MADs AFAIK.
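
One way to send that MAD from the command line is perfquery from infiniband-diags, whose -R option resets the counters without reading them. A minimal sketch, assuming that tool is installed; the wrapper name, the DRY_RUN switch and the LID/port values are placeholders:

```shell
# Sketch: reset the PMA counters of one port via a PerfMgt MAD using perfquery.
# With DRY_RUN set, only print the command instead of running it.
reset_port_counters() {
    # $1 = LID of the node, $2 = port number
    if [ -n "${DRY_RUN:-}" ]; then
        echo "perfquery -R $1 $2"
    else
        perfquery -R "$1" "$2"
    fi
}

# Dry run for LID 4, port 1 (both values are placeholders):
DRY_RUN=1 reset_port_counters 4 1
```

After a reset, the values under /sys/class/infiniband/<hca_name>/ports/<port_number>/counters/ should read from zero again, since they are fetched from the same port counters.
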

-- Hal

> __________________________
> John F. Russo
> Manager, Engineering
> QLogic Corporation
> 780 Fifth Avenue, Suite 140
> King of Prussia, PA 19406
> Direct: 610-233-4866
> Main: 610-233-4800
> Fax: 610-233-4777
> Cell: 610-246-9903
> Email: John.Russo at qlogic.com
> www.qlogic.com
>
>
>
> True success is the undeniable truth that we have proved ourselves.
>
> -Joe Luppino-Esposito
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>


From hal.rosenstock at gmail.com  Mon Feb 16 06:52:37 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 16 Feb 2009 09:52:37 -0500
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which
	use pthread
In-Reply-To: <20081231170413.GD21950@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
Message-ID: <f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>

Sasha,

On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> I looked at the implementation of the safe_*() functions (safe_smp_query,
> safe_smp_set and safe_sa_call) and found that they are not actually
> "safe" as declared by their names. The only thread-unsafe thing which
> is used there is the static 'mad_portid' structure (from rpc.c),

I'm not sure that the only thread unsafe thing in the mad rpc
mechanism is the portid.

> but modification of this structure is not protected by same mutex (actually
> not protected at all).

A first step would be removing the portid as static. If so, portid
would need to be a supplied parameter to various mad routines and the
existing ones relying on madrpc_portid would be deprecated. Does this
make sense to do ? Would you accept such a patch ?

-- Hal

> As far as I know nothing outside libibmad uses those safe_*() primitives
> right now, so I think it is better to remove these confusing functions from
> the API (with changing library version, etc.).
>
> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to
> hidden static pthread mutex which is not controlled by caller
> application. I think that it will be more robust for multithreaded
> application to use its own synchronization methods (pthread mutex or any
> other) for better control. So let's remove madrpc_lock/unlock() too.
>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  libibmad/include/infiniband/mad.h |   41 -------------------------------------
>  libibmad/libibmad.ver             |    2 +-
>  libibmad/src/libibmad.map         |    2 -
>  libibmad/src/rpc.c                |   15 -------------
>  libibmad/src/sa.c                 |    5 ++-
>  5 files changed, 4 insertions(+), 61 deletions(-)
>
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index eff6738..89b4be5 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -703,8 +703,6 @@ void *  madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp,
>  void   madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
>                    int num_classes);
>  void   madrpc_save_mad(void *madbuf, int len);
> -void   madrpc_lock(void);
> -void   madrpc_unlock(void);
>  void   madrpc_show_errors(int set);
>
>  void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid,
>  uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod,
>                      unsigned timeout, const void *srcport);
>
> -inline static uint8_t *
> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod,
> -              unsigned timeout)
> -{
> -       uint8_t *p;
> -
> -       madrpc_lock();
> -       p = smp_query(rcvbuf, portid, attrid, mod, timeout);
> -       madrpc_unlock();
> -
> -       return p;
> -}
> -
> -inline static uint8_t *
> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod,
> -            unsigned timeout)
> -{
> -       uint8_t *p;
> -
> -       madrpc_lock();
> -       p = smp_set(rcvbuf, portid, attrid, mod, timeout);
> -       madrpc_unlock();
> -
> -       return p;
> -}
> -
>  /* sa.c */
>  uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
>                  unsigned timeout);
> @@ -761,19 +733,6 @@ int        ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id,
>  int    ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
>                          ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf);
>
> -inline static uint8_t *
> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
> -            unsigned timeout)
> -{
> -       uint8_t *p;
> -
> -       madrpc_lock();
> -       p = sa_call(rcvbuf, portid, sa, timeout);
> -       madrpc_unlock();
> -
> -       return p;
> -}
> -
>  /* resolve.c */
>  int    ib_resolve_smlid(ib_portid_t *sm_id, int timeout);
>  int    ib_resolve_guid(ib_portid_t *portid, uint64_t *guid,
> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver
> index 7e93c16..23d2dc2 100644
> --- a/libibmad/libibmad.ver
> +++ b/libibmad/libibmad.ver
> @@ -6,4 +6,4 @@
>  # API_REV - advance on any added API
>  # RUNNING_REV - advance any change to the vendor files
>  # AGE - number of backward versions the API still supports
> -LIBVERSION=5:0:4
> +LIBVERSION=2:0:0
> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> index 927e51c..f944d86 100644
> --- a/libibmad/src/libibmad.map
> +++ b/libibmad/src/libibmad.map
> @@ -72,14 +72,12 @@ IBMAD_1.3 {
>                madrpc;
>                madrpc_def_timeout;
>                madrpc_init;
> -               madrpc_lock;
>                madrpc_portid;
>                madrpc_rmpp;
>                madrpc_save_mad;
>                madrpc_set_retries;
>                madrpc_set_timeout;
>                madrpc_show_errors;
> -               madrpc_unlock;
>                ib_path_query;
>                sa_call;
>                sa_rpc_call;
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index 5226540..670a936 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -38,7 +38,6 @@
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
> -#include <pthread.h>
>  #include <string.h>
>  #include <errno.h>
>
> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data)
>        return mad_rpc_rmpp(&port, rpc, dport, rmpp, data);
>  }
>
> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER;
> -
> -void
> -madrpc_lock(void)
> -{
> -       pthread_mutex_lock(&rpclock);
> -}
> -
> -void
> -madrpc_unlock(void)
> -{
> -       pthread_mutex_unlock(&rpclock);
> -}
> -
>  void
>  madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
>  {
> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> index 27b9d52..c601254 100644
> --- a/libibmad/src/sa.c
> +++ b/libibmad/src/sa.c
> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid,
>        if (srcport) {
>                p = sa_rpc_call (srcport, buf, sm_id, &sa, 0);
>        } else {
> -               p = safe_sa_call(buf, sm_id, &sa, 0);
> +               p = sa_call(buf, sm_id, &sa, 0);
>        }
>        if (!p) {
>                IBWARN("sa call path_query failed");
> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid,
>        mad_decode_field(p, IB_SA_PR_DLID_F, &dlid);
>        return dlid;
>  }
> +
>  int
>  ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf)
>  {
> -       return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf);
> +       return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf);
>  }
> --
> 1.6.0.4.766.g6fc4a
>


From tzachid at mellanox.co.il  Mon Feb 16 05:17:22 2009
From: tzachid at mellanox.co.il (Tzachi Dar)
Date: Mon, 16 Feb 2009 15:17:22 +0200
Subject: [ofa-general] RE: [ofw] ib_create_qp and ib_get_err_str weirdness
In-Reply-To: <01fa01c98df0$47baed30$0100000a@DIEGO>
References: <01fa01c98df0$47baed30$0100000a@DIEGO>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01C93FE9@mtlexch01.mtl.com>

Hi Diego,

If you know the hardware you are working with, you can find the maximum
number of SGEs you can actually use by experiment (probably around 29).
You can then limit your work requests to that number of SGEs.
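
As a small illustration of that limit (a sketch only; wr_count and
wr_sge_count are made-up helper names, and the limit itself still has to
be found by experiment), this is how a set of buffers would split into
work requests once a usable per-WR SGE limit is known:

```c
#include <assert.h>
#include <stddef.h>

/* Number of work requests needed to carry nbufs buffers when each
 * work request may hold at most max_sge scatter-gather entries. */
size_t wr_count(size_t nbufs, size_t max_sge)
{
	assert(max_sge > 0);
	return (nbufs + max_sge - 1) / max_sge;
}

/* Number of scatter-gather entries in the work request with index
 * wr_idx (0-based); 0 if wr_idx is past the last work request. */
size_t wr_sge_count(size_t nbufs, size_t max_sge, size_t wr_idx)
{
	size_t first;

	assert(max_sge > 0);
	first = wr_idx * max_sge;   /* index of the first buffer in this WR */
	if (first >= nbufs)
		return 0;
	return (nbufs - first < max_sge) ? nbufs - first : max_sge;
}
```

For 32 buffers and a limit of 29, this gives two work requests, one with
29 entries and one with 3.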

Depending on whether you are in user mode or in the kernel, you may also be
able to use buffers that are contiguous in memory, so that several buffers
fit in one SGE. (I don't know how much control you have over the buffers, so
this is just a suggestion.)
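
A sketch of what I mean by that, under the assumption that all buffers
are covered by the same memory registration and are listed in the order
they should be sent. struct example_sge and coalesce_buffers are
hypothetical names for illustration, not WinOF types or calls:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter-gather entry: start address and length only. */
struct example_sge {
	uint64_t addr;
	uint32_t length;
};

/* Merge buffers that directly follow each other in memory into a single
 * entry. Writes the result to out (room for n entries is required) and
 * returns the number of entries written. */
size_t coalesce_buffers(const struct example_sge *in, size_t n,
			struct example_sge *out)
{
	size_t i, m = 0;

	assert(n == 0 || (in != NULL && out != NULL));
	for (i = 0; i < n; i++) {
		if (m > 0 &&
		    out[m - 1].addr + out[m - 1].length == in[i].addr &&
		    out[m - 1].length <= UINT32_MAX - in[i].length) {
			/* in[i] starts where the previous entry ends: extend it */
			out[m - 1].length += in[i].length;
		} else {
			/* gap (or length overflow): start a new entry */
			out[m++] = in[i];
		}
	}
	return m;
}
```

If the audio driver happens to hand out adjacent buffers, the number of
SGEs needed can drop well below the number of buffers.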

Thanks
Tzachi

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Diego Guella
> Sent: Friday, February 13, 2009 5:32 PM
> To: ofw at lists.openfabrics.org; OpenIB General
> Subject: [ofw] ib_create_qp and ib_get_err_str weirdness
> 
> Hello,
> 
> I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR single-port card.
> I noticed a strange behavior of the ib_create_qp function:
> 
> -----
> memset(&qp_create, 0, sizeof(qp_create));
> qp_create.qp_type = IB_QPT_RELIABLE_CONN; // Reliable Connected
> qp_create.sq_depth = ctx->qdepth;
> qp_create.rq_depth = ctx->qdepth;
> qp_create.sq_sge = ctx->hca_attr->max_sges;
> qp_create.rq_sge = ctx->hca_attr->max_sges;
> qp_create.h_sq_cq = ctx->cq_h;
> qp_create.h_rq_cq = ctx->cq_h;
> qp_create.h_srq = NULL;
> qp_create.sq_signaled = 1;
> ctx->qp_h = 0;
> rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h);
> -----
> return value ("rc") is 3 (=IB_INVALID_PARAMETER).
> 
> I spent some time figuring out the problem was the SQ SGE value:
> http://lists.openfabrics.org/pipermail/general/2006-June/023417.html
> 
> According to iba/ib_al.h:
> -----
> * IB_INVALID_MAX_SGE
> * The requested maximum number of scatter-gather entries for the send or
> * receive queue could not be supported.
> -----
> So why isn't the return value 22 (=IB_INVALID_MAX_SGE)?
> 
> In the discussion I mentioned, it turned out that ib_create_qp can fail
> even when using hca_attr->max_sges, which is my case.
> I need to send some audio buffers (32 or more) from an IO node to a
> computing node using RDMA WRITE.
> The audio driver owns the buffers, and I have no guarantee that they are
> contiguous.
> I was trying to send them using the lowest possible number of WRs, each
> with the highest possible number of SGEs.
> But given the unreliability of hca_attr->max_sges, how do you recommend
> achieving this goal?
> Should I post a WR for each buffer I want to send through RDMA WRITE?
> 
> 
> Another less-related problem:
> ib_get_err_str is not correct for every input value. For example, I noticed
> that ib_get_err_str(IB_INVALID_PD_HANDLE) returns the string
> IB_INVALID_MR_HANDLE.
>
>
> I don't know if these problems apply to Linux too, so I'm including the
> general list.
> 
> Thanks and best regards,
> Diego
> 


From neutronsharc at gmail.com  Mon Feb 16 07:11:21 2009
From: neutronsharc at gmail.com (neutron)
Date: Mon, 16 Feb 2009 10:11:21 -0500
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <49994BB2.3010206@mellanox.co.il>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
	<49994BB2.3010206@mellanox.co.il>
Message-ID: <7d5928b30902160711rc24d11epd5827ad548a2256b@mail.gmail.com>

The problem was solved by following your advice.  Thanks a ton!!


On Mon, Feb 16, 2009 at 6:19 AM, Tziporet Koren
<tziporet at dev.mellanox.co.il> wrote:
> neutron wrote:
>>
>> Hi all,
>>
>> I'm writing a kernel module that makes use of basic IB verbs to
>> communicate, like:
>> ib_register_client,  ib_unregister_client,  ib_alloc_pd,
>> ib_create_qp,  ib_reg_phys_mr,  etc.
>>
>> I can compile the code into a kernel module:  ib_rdma_lat.ko.   This
>> module is to test the RDMA write latency from a kernel module.
>>
>> But when I "insmod" it, I get error reports in /var/log/messages:
>>
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_unregister_client
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol
>> ib_unregister_client
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_create_cq
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_reg_phys_mr
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_dereg_mr
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_register_client
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol
>> ib_register_client
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_destroy_cq
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_query_port
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_alloc_pd
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_create_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_modify_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_destroy_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of
>> symbol ib_dealloc_pd
>> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd
>>
>> I'm running rhel5.  I have rebooted the node many times but it didn't
>> help at all.
>>
>>
>
> From OFED_tips:
> 4. External Module Compilation Over OFED-1.4
> ===============================================================================
>
> To build kernel modules depending on OFED's modules, take the Module.symvers
> file from <prefix>/src/openib/Module.symvers (part of the kernel-ib-devel
> RPM), copy it to your module's subdir, and then compile your module.
>
> If <prefix>/src/openib/Module.symvers does not exist or is empty, use the
> create_Module.symvers.sh script (part of the ofed-docs RPM) to create the
> Module.symvers file.
>
> See "Module versioning & Module.symvers" in the modules.txt from kernel
> documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt).
>
>
> Tziporet
>
>


From neutronsharc at gmail.com  Mon Feb 16 07:32:50 2009
From: neutronsharc at gmail.com (neutron)
Date: Mon, 16 Feb 2009 10:32:50 -0500
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <49994BB2.3010206@mellanox.co.il>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
	<49994BB2.3010206@mellanox.co.il>
Message-ID: <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>

One remaining question.

In my kernel module code, do I need to #include the header files
from <ofed-prefix>/src/openib/include/....
or can I just include the header files from <kernel_src_dir>/include/..... ?

Thanks!




From dledford at redhat.com  Mon Feb 16 09:49:33 2009
From: dledford at redhat.com (Doug Ledford)
Date: Mon, 16 Feb 2009 12:49:33 -0500
Subject: [ofa-general] sminfo report iberror in the
	first	configuration	on RHEL5.3
In-Reply-To: <OFE05F3CA8.254F9530-ON4825755F.00075E72-4825755F.0008310B@cn.ibm.com>
References: <OFE05F3CA8.254F9530-ON4825755F.00075E72-4825755F.0008310B@cn.ibm.com>
Message-ID: <1234806573.751.74.camel@firewall.xsintricity.com>

On Mon, 2009-02-16 at 09:29 +0800, Wen Hao Wang wrote:
> 
> Wen Hao Wang (王文昊)
> 
> Software Engineer
> IBM China Software Development Laboratory
> Email: wangwhao at cn.ibm.com
> Tel: 86-10-82451055
> Fax: 86-10-82782244 ext. 2312
> Address: 1/F, IBM ZGC Campus. Ring Building 28,ZhongGuanCun Software
> Park,No.8 Dong Bei Wang West Road, Haidian District Beijing, 100193,
> P.R.China
> 
> 
> Doug Ledford <dledford at redhat.com> wrote on 2009-02-14 00:13:32:
> 
> > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote:
> > > Doug Ledford <dledford at redhat.com> wrote on 2009-02-12 21:20:30:
> > > 
> > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> > > > > Wen Hao Wang wrote:
> > > > > >
> > > > > > Hi all:
> > > > > >
> > > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED
> > > > > > (shipped in RHEL5.3 image) by "yum groupinstall". Then I loaded
> > > > > > some drivers and wrote the network interface configuration file
> > > > > > ifcfg-ib0. ifup ib0 also succeeded. But the IB utilities report
> > > > > > "Connection timed out".
> > > > > >
> > > > > >
> > > > > > [root at xblade06 network-scripts]# sminfo
> > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > > > > > sminfo: iberror: failed: query
> > > > > >
> > > > > > I had to reboot the blade and rerun "openibd start". Then sminfo
> > > > > > reported correct contents. I do not suppose this reboot is
> > > > > > required. Did I miss any configuration step?
> > > > There was an unintentional bug in the rhel5.2 openibd init
> script in
> > > > that it automatically turned itself on during install
> (generally,
> > > most
> > > > init scripts should default to *not* turning themselves on
> during
> > > > install of the package, nor should they start themselves during
> > > install
> > > > of the package...this is for security reasons, imagine if you
> > > installed
> > > > the bind name server on your box and it automatically started up
> > > before
> > > > you had a chance to configure it).  In rhel5.3 we fixed that
> bug.
> > >  So,
> > > 
> > > Yeah. I heard of this bug.
> > > 
> > > > you may need to 'chkconfig --level 2345 openibd on' to make sure
> > > openibd
> > > > starts up each time.  The error you list above is consistent
> with
> > > not
> > > > all of the kernel modules being loaded when you tried to use the
> > > sminfo
> > > > program.
> > > 
> > > Even after reboot, service openibd is not started automatically.
> > > [root at xblade06 ~]# chkconfig --list openibd
> > > openibd         0:off   1:off   2:off   3:off   4:off   5:off   6:off
> >
> > That's because you have to run the command I listed in my first email to
> > turn it on.
> >
> 
> I totally agree with this. But I am still confused why sminfo gave errors
> before reboot, or which steps I should take for the first OFED usage before
> reboot. As far as I can see, whether the service is added into system
> runlevel DB is not related to the sminfo error. Please correct me if that
> is not the case.

It is related.  The runlevel db is only consulted on boot up.  If the
openibd service was not enabled at startup, then adding it to the
runlevel startup does *not* start it at that time.  You have to both add
it to the runlevel startup and also start it manually if you want things
to work properly prior to reboot.  The sminfo errors you first posted
are consistent with some of the modules not being loaded, and they went
away after you started the openibd service, which is also consistent
with the problem.

> > > I agree with you that maybe some modules were not loaded. But which
> > > ones?
> > > Before reboot, I ran "/etc/init.d/openibd start" and
> > > "/etc/init.d/network restart". No error was reported. "openibd status"
> > > also looked good.
> > 
> > Running start on a service does not enable that service at the next
> > reboot.  You must specifically enable the service in order for it to
> > start automatically.
> > 
> > > > 
> > > > > > Moreover, "openibd start" reports one warning message about
> > > > > > hwconf. Does anyone have comments about this?
> > > > > >
> > > > > > [root at xblade07 ~]# /etc/init.d/openibd start
> > > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or directory
> > > > > > [ OK ]
> > > >
> > > > Can you see if the kudzu package is installed on your machine?  The
> > > > openib package uses this config file written by kudzu to determine what
> > > > hardware drivers to load.  I suppose I should put a specific requires in
> > > > the rpm for that.
> > >
> > > kudzu is installed.
> > > [root at xblade06 ~]# rpm -q kudzu
> > > kudzu-1.2.57.1.21-1
> >
> > Make sure kudzu has been run at least once then (it would appear to be
> > turned off on your machine or else /etc/sysconfig/hwconf would exist).
> > You can run it manually from the command line and that should be
> > sufficient for the openibd init script's needs.
> > 
> 
> Yes. After kudzu created the file on my machine, the openibd script had no
> error this time. I want to know whether, in my scenario, "openibd restart"
> is needed.

It would probably be advisable, but only if you haven't rebooted since
running kudzu for the first time.  If you've rebooted since then, then
it doesn't matter.

> Many thanks!
> 
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
> 
> > -- 
> > Doug Ledford <dledford at redhat.com>
> >               GPG KeyID: CFBFF194
> >               http://people.redhat.com/dledford
> > 
> > Infiniband specific RPMs available at
> >               http://people.redhat.com/dledford/Infiniband
> > 
> > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM]
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband


From wangwhao at cn.ibm.com  Mon Feb 16 16:31:57 2009
From: wangwhao at cn.ibm.com (Wen Hao Wang)
Date: Tue, 17 Feb 2009 08:31:57 +0800
Subject: Re: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3
In-Reply-To: <1234806573.751.74.camel@firewall.xsintricity.com>
Message-ID: <OF4B2428FA.EC006E5E-ON48257560.00029E92-48257560.0002EC55@cn.ibm.com>

OK, Doug:

Thanks a lot for your detailed explanation! So if I do not want to reboot
the machine, I need to run "chkconfig", "kudzu" and "openibd start".

Wen Hao Wang
Email: wangwhao at cn.ibm.com



From dledford at redhat.com  Mon Feb 16 18:40:08 2009
From: dledford at redhat.com (Doug Ledford)
Date: Mon, 16 Feb 2009 21:40:08 -0500
Subject: [ofa-general] sminfo report iberror in
	the	first	configuration	on RHEL5.3
In-Reply-To: <OF4B2428FA.EC006E5E-ON48257560.00029E92-48257560.0002EC55@cn.ibm.com>
References: <OF4B2428FA.EC006E5E-ON48257560.00029E92-48257560.0002EC55@cn.ibm.com>
Message-ID: <1234838408.751.96.camel@firewall.xsintricity.com>

On Tue, 2009-02-17 at 08:31 +0800, Wen Hao Wang wrote:
> OK, Doug:
> 
> Thanks a lot for your detailed explanation! So if I donot want to
> reboot the machine, I need run "chkconfig", "kudzu" and "openibd
> start".

Correct.

> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
> 
> 
> Doug Ledford <dledford at redhat.com> wrote on 2009-02-17 01:49:33:
> 
> > On Mon, 2009-02-16 at 09:29 +0800, Wen Hao Wang wrote:
> > > 
> > > Wen Hao Wang
> > > 
> > > Software Engineer
> > > IBM China Software Development Laboratory
> > > Email: wangwhao at cn.ibm.com
> > > Tel: 86-10-82451055
> > > Fax: 86-10-82782244 ext. 2312
> > > Address: 1/F, IBM ZGC Campus. Ring Building 28,ZhongGuanCun
> Software
> > > Park,No.8 Dong Bei Wang West Road, Haidian District Beijing,
> 100193,
> > > P.R.China
> > > 
> > > 
> > > Doug Ledford <dledford at redhat.com> wrote on 2009-02-14 00:13:32:
> > > 
> > > > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote:
> > > > > Doug Ledford <dledford at redhat.com> wrote on 2009-02-12 21:20:30:
> > > > > 
> > > > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote:
> > > > > > > Wen Hao Wang wrote:
> > > > > > > >
> > > > > > > > Hi all:
> > > > > > > >
> > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED
> > > > > > > > (shipped in RHEL5.3 image) by "yum groupinstall". Then I loaded
> > > > > > > > some drivers and wrote the network interface configuration file
> > > > > > > > ifcfg-ib0. ifup ib0 also succeeded. But the IB utilities report
> > > > > > > > "Connection timed out".
> > > > > > > >
> > > > > > > >
> > > > > > > > [root at xblade06 network-scripts]# sminfo
> > > > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out
> > > > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9)
> > > > > > > > sminfo: iberror: failed: query
> > > > > > > >
> > > > > > > > I had to reboot the blade and rerun "openibd start". Then sminfo
> > > > > > > > reported correct contents. I do not suppose this reboot is
> > > > > > > > required. Did I miss any configuration step?
> > > > > > 
> > > > > > > There was an unintentional bug in the rhel5.2 openibd init script in
> > > > > > > that it automatically turned itself on during install (generally, most
> > > > > > > init scripts should default to *not* turning themselves on during
> > > > > > > install of the package, nor should they start themselves during install
> > > > > > > of the package...this is for security reasons, imagine if you installed
> > > > > > > the bind name server on your box and it automatically started up before
> > > > > > > you had a chance to configure it).  In rhel5.3 we fixed that bug.  So,
> > > > > 
> > > > > Yeah. I heard of this bug.
> > > > > 
> > > > > > > you may need to 'chkconfig --level 2345 openibd on' to make sure openibd
> > > > > > > starts up each time.  The error you list above is consistent with not
> > > > > > > all of the kernel modules being loaded when you tried to use the sminfo
> > > > > > > program.
> > > > > 
> > > > > > Even after reboot, service openibd is not started automatically.
> > > > > > [root at xblade06 ~]# chkconfig --list openibd
> > > > > > openibd         0:off   1:off   2:off   3:off   4:off   5:off   6:off
> > > > 
> > > > > That's because you have to run the command I listed in my first email to
> > > > > turn it on.
> > > >
> > > 
> > > > I totally agree with this. But I am still confused why sminfo gave errors
> > > > before reboot, or which steps I should take for the first OFED usage before
> > > > reboot. As far as I can see, whether the service is added into system
> > > > runlevel DB is not related to the sminfo error. Please correct me if that
> > > > is not the case.
> > 
> > > It is related.  The runlevel db is only consulted on boot up.  If the
> > > openibd service was not enabled at startup, then adding it to the
> > > runlevel startup does *not* start it at that time.  You have to both add
> > > it to the runlevel startup and also start it manually if you want things
> > > to work properly prior to reboot.  The sminfo errors you first posted
> > > are consistent with some of the modules not being loaded, and it went
> > > away after you started the openibd service, which is also consistent
> > > with the problem.
> > 
> > > > > > I agree with you that maybe some modules were not loaded. But what's
> > > > > > that?
> > > > > > Before reboot, I ran "/etc/init.d/openibd start" and
> > > > > > "/etc/init.d/network restart". No error was reported. "openibd status"
> > > > > > also looked good.
> > > > 
> > > > > Running start on a service does not enable that service at the next
> > > > > reboot.  You must specifically enable the service in order for it to
> > > > > start automatically.
> > > > 
> > > > > > 
> > > > > > > > Moreover, "openibd start" reports one warning message about hwconf.
> > > > > > > > Does anyone have comments about this?
> > > > > > > >
> > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start
> > > > > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such
> > > > > > > > file or directory
> > > > > > > > [ OK ]
> > > > > > 
> > > > > > > Can you see if the kudzu package is installed on your machine?  The
> > > > > > > openib package uses this config file written by kudzu to determine what
> > > > > > > hardware drivers to load.  I suppose I should put a specific requires in
> > > > > > > the rpm for that.
> > > > > 
> > > > > kudzu is installed.
> > > > > [root at xblade06 ~]# rpm -q kudzu
> > > > > kudzu-1.2.57.1.21-1
> > > > 
> > > > > Make sure kudzu has been run at least once then (it would appear to be
> > > > > turned off on your machine or else /etc/sysconfig/hwconf would exist).
> > > > > You can run it manually from the command line and that should be
> > > > > sufficient for the openibd init script's needs.
> > > > 
> > > 
> > > > Yes. After kudzu created the file on my machine, the openibd script had no
> > > > error this time. I want to know whether, in my scenario, "openibd restart"
> > > > is needed/required?
> > 
> > > It would probably be advisable, but only if you haven't rebooted since
> > > running kudzu for the first time.  If you've rebooted since then, then
> > > it doesn't matter.
> > 
> > > Many thanks!
> > > 
> > > Wen Hao Wang
> > > Email: wangwhao at cn.ibm.com
> > > 
> > > > -- 
> > > > Doug Ledford <dledford at redhat.com>
> > > >               GPG KeyID: CFBFF194
> > > >               http://people.redhat.com/dledford
> > > > 
> > > > Infiniband specific RPMs available at
> > > >               http://people.redhat.com/dledford/Infiniband
> > > > 
> > > > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM]
> > > 
> > -- 
> > Doug Ledford <dledford at redhat.com>
> >               GPG KeyID: CFBFF194
> >               http://people.redhat.com/dledford
> > 
> > Infiniband specific RPMs available at
> >               http://people.redhat.com/dledford/Infiniband
> > 
> > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM]
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband


From sfr at canb.auug.org.au  Mon Feb 16 23:44:03 2009
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Tue, 17 Feb 2009 18:44:03 +1100
Subject: [ofa-general] linux-next: infiniband tree build warning
Message-ID: <20090217184403.7e1f18c5.sfr@canb.auug.org.au>

Hi Roland,

Today's linux-next build (x86_64 allmodconfig) produced these warnings:

drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'build_rdma_recv':
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift
drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift

Caused by commit 1557b4f052cb739a4ae1dd9641249b3e69fb6a0d ("RDMA/cxgb3:
Remove modulo math from build_rdma_recv()").
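
For context, a sketch of what this gcc warning is about. The helper
names below are hypothetical and are not taken from iwch_qp.c. In C,
+ and - bind more tightly than << and >>, so an unparenthesized
"x << a + b" is parsed as "x << (a + b)". Writing the intended grouping
explicitly silences the warning and makes the intent visible:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Shift first, then add: (x << a) + b. */
static inline uint32_t shift_then_add(uint32_t x, unsigned a, unsigned b)
{
	return (x << a) + b;
}

/* Shift by the sum: x << (a + b).  This is also how the compiler
 * reads "x << a + b" when no parentheses are written. */
static inline uint32_t shift_by_sum(uint32_t x, unsigned a, unsigned b)
{
	return x << (a + b);
}
```

With x = 1, a = 2 and b = 3 the two groupings give 7 and 32, which is
why the compiler asks for the choice to be spelled out.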

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From tziporet at dev.mellanox.co.il  Tue Feb 17 01:57:52 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 17 Feb 2009 11:57:52 +0200
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>	
	<49994BB2.3010206@mellanox.co.il>
	<7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>
Message-ID: <499A8A20.1090507@mellanox.co.il>

neutron wrote:
> One remaining question.
>
> In my code of kernel module,   do I need to #include the header files
> from <ofed-prefix>/src/openib/include/....
> Or I just include the header files from  <kernel_src_dir>/include/.....
>
>   
You should use the headers from ofed if you wish to use OFED kernel modules.

Tziporet


From vlad at lists.openfabrics.org  Tue Feb 17 03:19:45 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 17 Feb 2009 03:19:45 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090217-0200 daily build status
Message-ID: <20090217111945.49D8AE61047@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From Bert.Wiegers at t-systems-sfr.com  Tue Feb 17 04:23:21 2009
From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert)
Date: Tue, 17 Feb 2009 13:23:21 +0100
Subject: [ofa-general] opensm logoutput
Message-ID: <F9BD5A2A5CEEEE4FB738EC67475D7BEF0242DB2B@sfrexbe01.acds.t-systems-sfr.com>


Hi,

we are using OFED 1.4 w/ OpenSM 3.2.5_20081207 with a switch from
SUN.
As we are debugging our system, I'm trying to understand the
opensm.log files.
(Where can I find any documentation about them?)


We see frequent messages as follows:

Feb 17 10:25:34 134964 [41802940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:111 TID:0x000000000000006e
Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:111 GID:fe80::14:4fa4:cff8:50
Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:336 GID:fe80::3:ba00:100:3341
Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port: Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of node:MT25408 ConnectX Mellanox Technologies
Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
Feb 17 10:25:46 662611 [41802940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:111 TID:0x000000000000006f
Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:111 GID:fe80::14:4fa4:cff8:50
Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Feb 17 10:25:52 476653 [44007940] 0x01 -> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x81 (SubnGetResp)
                                D bit...................0x1
                                status..................0x1C00
                                hop_ptr.................0x0
                                hop_count...............0x4
                                trans_id................0x18c08de
                                attr_id.................0x15 (PortInfo)
                                resv....................0x0
                                attr_mod................0x6
                                m_key...................0x0000000000000000
                                dr_slid.................65535
                                dr_dlid.................65535

                                Initial path: 0,1,10,15,23
                                Return path:  0,23,20,12,17
                                Reserved:     [0][0][0][0][0][0][0]

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 11 03 03 02

                                34 52 00 23 40 40 00 08   08 04 F0 4C 00 00 00 00

                                00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00


Other issues I see with messages similar to the following ones:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies) po

__osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)

osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5 (Invalid argument)


I'm still googling, but hopefully someone can give me some answers.


Thanks and best regards
Bert


From kliteyn at dev.mellanox.co.il  Tue Feb 17 04:41:12 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 17 Feb 2009 14:41:12 +0200
Subject: [ofa-general] [PATCH] opensm/osm_node_info_rcv.c: create physp for
 the newly discovered port of the known node
Message-ID: <499AB068.2020205@dev.mellanox.co.il>

Hi Sasha,

This patch fixes bugzilla issue #1515:

Topology:
                 |---------------|
                 |      SW2      |
                 |---------------|
                   |x |y    |z |v
              |----|  |     |  |----|
              |       |     |       |
              |  |----|     |----|  |
              |  |               |  |
             a| b|              c| d|
      |---------------|     |---------------|
      |       SW1     |     |     SW3       |
      |---------------|     |---------------|
          |                             |
          |                             |
       HCA with SM                      HCA

During the discovery:

SM sends NodeInfo request to SW1
SM sends NodeInfo request to SW2 through link a->x
SM discovers new node SW2:
  - updates DR to SW2 to go through link a->x
  - creates physp x
SM sends NodeInfo request to SW2 through link b->y
SM discovers a known node SW2
  - DOES NOT create physp y
  - updates DR to SW2 to go through link b->y

The DR to SW2 goes through port y from now on, so OpenSM won't deal with
port y any more, leaving it uninitialized (no physp object for this port).

The fix is to create physp for the newly discovered port of the known
switch node, same way as it is done for HCAs.
I also added one log message for the case that showed the problem - when
one of the link sides is uninitialized (no valid ports check). Perhaps
this log message should be an error message instead?
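
The discovery sequence above can be illustrated with a small user-space
model. This is only a sketch: toy_node, toy_discover_new and
toy_discover_known are made-up names, and real OpenSM keeps far more
state per node than a per-port flag.

```c
#include <stdbool.h>

#define TOY_MAX_PORTS 8

/* Toy stand-in for a switch node: which ports have a physp object,
 * and which local port the current directed route (DR) enters through.
 * Callers must pass port numbers below TOY_MAX_PORTS. */
struct toy_node {
	bool has_physp[TOY_MAX_PORTS];
	unsigned dr_port;
};

/* First NodeInfo response from this node: create physp, record the DR. */
static void toy_discover_new(struct toy_node *n, unsigned port)
{
	n->has_physp[port] = true;
	n->dr_port = port;
}

/* NodeInfo response from an already known node, arriving on 'port'.
 * With create_missing == false this mimics the behaviour described
 * above (DR updated, no physp created); with true it mimics the fix. */
static void toy_discover_known(struct toy_node *n, unsigned port,
			       bool create_missing)
{
	if (create_missing && !n->has_physp[port])
		n->has_physp[port] = true;
	n->dr_port = port;
}
```

Replaying the a->x then b->y sequence with create_missing == false
leaves the second port without a physp even though the DR now uses it;
with create_missing == true both ports end up initialized.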

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index c52c0d5..7da3103 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
 	 */
 	if (!osm_node_link_has_valid_ports(p_node, port_num,
 					   p_neighbor_node,
-					   p_ni_context->port_num))
+					   p_ni_context->port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
 		goto _exit;
+	}

 	if (osm_node_link_exists(p_node, port_num,
 				 p_neighbor_node, p_ni_context->port_num)) {
@@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 				     IN osm_node_t * const p_node,
 				     IN const osm_madw_t * const p_madw)
 {
+
+	ib_smp_t *p_smp;
+	ib_node_info_t *p_ni;
+	uint8_t port_num;
+
 	OSM_LOG_ENTER(sm->p_log);

+	p_smp = osm_madw_get_smp_ptr(p_madw);
+	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	port_num = ib_node_info_get_local_port_num(p_ni);
+
+	if (!osm_node_get_physp_ptr(p_node, port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Creating physp for node GUID:0x%"
+			PRIx64 ", port %u\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)),
+			port_num);
+		osm_node_init_physp(p_node, p_madw);
+	}
+
 	/*
 	   If this switch has already been probed during this sweep,
 	   then don't bother reprobing it.
-- 
1.5.1.4


From Bert.Wiegers at t-systems-sfr.com  Tue Feb 17 05:54:00 2009
From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert)
Date: Tue, 17 Feb 2009 14:54:00 +0100
Subject: [ofa-general] osmtest fails
Message-ID: <F9BD5A2A5CEEEE4FB738EC67475D7BEF0242DB59@sfrexbe01.acds.t-systems-sfr.com>

Hi.
I can't start osmtest (using ofed 3.2.5 with opensm running on one node
- no other subnet managers)

# osmtest  -f c

Command Line Arguments
Done with args
        Flow = Create Inventory
Feb 17 14:46:49 769646 [AB76BF80] 0x7f -> Setting log level to: 0x03
Feb 17 14:46:49 769830 [AB76BF80] 0x02 -> osm_vendor_init: 1000 pending umads specified
Feb 17 14:46:49 783700 [AB76BF80] 0x02 -> osm_vendor_bind: Binding to port 0x3ba0001003341
Feb 17 14:46:49 801051 [AB76BF80] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
 base_ver:1
 class_ver:2
 cap_mask:0x2602
 cap_mask2:0x0
 resp_time_val:0x10
-----------------------------
Feb 17 14:46:53 952476 [41001940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x12 attr=0x35 trans_id=0x7300000004) -- dropping
Feb 17 14:46:53 952521 [41001940] 0x01 -> umad_receiver: ERR 5410: class 0x3 LID 0x150
Feb 17 14:46:53 952535 [41001940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT)
Feb 17 14:46:53 956429 [AB76BF80] 0x01 -> osmtest_get_all_recs: ERR 0004: ib_query failed (IB_TIMEOUT)
Feb 17 14:46:53 956475 [AB76BF80] 0x01 -> osmtest_write_all_path_recs: ERR 0025: osmtest_get_all_recs failed (IB_TIMEOUT)
Feb 17 14:46:53 956543 [AB76BF80] 0x01 -> osmtest_run: ERR 0139: Inventory file create failed (IB_TIMEOUT)
OSMTEST: TEST "Create Inventory" FAIL


In the log file opensm.log I can see:

Feb 17 14:46:54 412573 [42804940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x8688560 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:54 420846 [42804940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:46:55 534830 [42003940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x2aaab7f791d0 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:55 546577 [42003940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:46:56 483555 [41001940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x4b97a10 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:56 493298 [41001940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:47:02 042134 [44007940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x92 attr=0x35 trans_id=0x7300000004) -- dropping
Feb 17 14:47:02 042168 [44007940] 0x01 -> umad_receiver: ERR 5410: class 0x3 LID 0x150
Feb 17 14:47:02 042187 [44007940] 0x01 -> umad_receiver: ERR 5412: Failed to obtain request madw for timed out MAD(method=0x92 attr=0x35 tid=0x7300000004) -- dropping


Why can't it be initialized?

Best regards,
Bert


From neutronsharc at gmail.com  Tue Feb 17 06:50:21 2009
From: neutronsharc at gmail.com (neutron)
Date: Tue, 17 Feb 2009 09:50:21 -0500
Subject: [ofa-general] ib_reg_phys_mr( ) results in crash
Message-ID: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>

Hi all,

In my kernel module program,  a call to  ib_reg_phys_mr( ) always
results in a system crash.

My code is like:

   buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
                            &dma_addr, GFP_KERNEL);
   iovstart = (u64) send_buf;

   mr = ib_reg_phys_mr(ctx->pd, dma_addr, 1,
                       IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ |
                       IB_ACCESS_LOCAL_WRITE, &iovstart);


Before calling ib_reg_phys_mr(), printk() shows that all its arguments
are valid.  But the system always crashes immediately after entering
ib_reg_phys_mr().  Any possible reasons?  Thanks!

I'm using kernel 2.6.18-53.1.14.el5. My kernel module is built using
OFED-1.3.1 modules.


From jackm at dev.mellanox.co.il  Tue Feb 17 07:01:35 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Feb 2009 17:01:35 +0200
Subject: [ofa-general] [PATCH] IPoIB: In unicast_arp,
	do path_free only for newly-created paths
Message-ID: <200902171701.36107.jackm@dev.mellanox.co.il>

If path_rec_start() returns error, call path_free() only if the path
was newly-created.  If we free an existing path whose valid flag was zero,
(but do not detach it from the list) we cause corruption of the
path list (of which it is a member), and get a kernel crash.

The simplest solution is to not free an existing path -- just leave it in the
list as-is (i.e., with its valid flag cleared).
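
The ownership rule described above can be sketched in user space as
follows. Everything here is a simplified, hypothetical model (toy_path,
toy_dev and toy_unicast_send are not IPoIB names, and a plain linked
list stands in for the driver's path bookkeeping); it only illustrates
"free on failure only what this call created".

```c
#include <stdlib.h>

/* Toy stand-ins for a path record and the device that owns the list. */
struct toy_path {
	int key;
	int valid;
	struct toy_path *next;
};

struct toy_dev {
	struct toy_path *paths;   /* list of known paths */
	int start_fails;          /* nonzero: starting a path query fails */
};

static struct toy_path *toy_path_find(struct toy_dev *dev, int key)
{
	struct toy_path *p;

	for (p = dev->paths; p; p = p->next)
		if (p->key == key)
			return p;
	return NULL;
}

/* Returns 0 if the path is usable or a query was started, -1 otherwise. */
static int toy_unicast_send(struct toy_dev *dev, int key)
{
	struct toy_path *path = toy_path_find(dev, key);
	int new_path = 0;

	if (path && path->valid)
		return 0;

	if (!path) {
		path = calloc(1, sizeof(*path));
		if (!path)
			return -1;
		path->key = key;
		new_path = 1;
	}

	if (dev->start_fails) {
		/* Only a path that never reached the list may be freed
		 * here; an existing one stays linked with valid == 0. */
		if (new_path)
			free(path);
		return -1;
	}

	if (new_path) {
		path->next = dev->paths;
		dev->paths = path;
	}
	return 0;
}
```

Freeing an existing entry in the failure branch would leave the list
pointing at released memory, which is the corruption the patch avoids.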

Thanks to Yossi Etigin of Voltaire for identifying the problem flow
which caused the kernel crash.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
Signed-off-by: Moni Shua <monis at voltaire.com>

---

Roland,
I ran checkpatch.pl on this, and compiled it with Sparse.  However, I would still like to continue
using KMail.  If you have any editing/formatting problems with the patch, please let me know.
The patch was generated by git diff against your kernel git/master branch.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 0bd2a4f..2c8b15f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -660,8 +660,11 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
 	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path || !path->valid) {
-		if (!path)
+		int new_path = 0;
+		if (!path) {
 			path = path_rec_create(dev, phdr->hwaddr + 4);
+			new_path = 1;
+		}
 		if (path) {
 			/* put pseudoheader back on for next time */
 			skb_push(skb, sizeof *phdr);
@@ -669,7 +672,8 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
 			if (!path->query && path_rec_start(dev, path)) {
 				spin_unlock_irqrestore(&priv->lock, flags);
-				path_free(dev, path);
+				if (new_path)
+					path_free(dev, path);
 				return;
 			} else
 				__path_add(dev, path);


From jackm at dev.mellanox.co.il  Tue Feb 17 07:42:38 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Feb 2009 17:42:38 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
Message-ID: <200902171742.38223.jackm@dev.mellanox.co.il>

We have found a race condition in sysfs.c which occurs when unloading low-level modules
(e.g., mlx4_ib) in the driver.
Specifically:

Although the kernel takes reference counts on sysfs files, it does not take such counts
on modules which implement attribute reads.

For example, we have:
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;
====>race condition HERE *****
	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

The sysfs file /sys/class/infiniband/<device>/ports/1/pkey/<pkey number> is protected
from destruction while we are in show_port_pkey.
However, the underlying module which implements ib_query_pkey (in this case, mlx4_ib) is not.

Thus, if another process is busy unloading mlx4_ib, and the time-slice of the process
which is reading sysfs expires at the point indicated above in the code, ib_query_pkey()
will fail with a page-fault (kernel panic), since it will not find the code page which implements
ib_query_pkey() (inlined to the query_pkey() function in the low-level driver virtual function table).

Now, when a low-level driver is unloaded, the following procedure (in sysfs.c) is called:
void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

After this call, the kernel continues with unloading the low-level module.
However, until device_unregister(&device->dev) is invoked, the
sysfs attribute path for the low-level device is still valid.  Hence the race condition -- 

Process A                           Process B
---------                           ---------
1. starts unloading low-level mod
                                    2. cat /sys/class/infiniband/...
                                    3. Time slice runs out just before accessing low-level
                                       module for requested info.
4. Low-level module is fully unloaded
                                    5. Page-fault panic when trying to access a procedure in
                                       the just-unloaded module.

Some attempt was made for some (but not all) of the "show" procedures to check if the module is alive:
	if (!ibdev_is_alive(p->ibdev))
		return -ENODEV;

This narrows the race window considerably, but does not eliminate it. (I put this fix in show_port_pkey(),
and was still able to generate the kernel panic).

The only way I was able to eliminate the kernel panic entirely was via a mutex (declaration and init not shown):
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;

	mutex_lock(&p->mutex);
==>	if (p->valid)
		ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	else
		ret = -EINVAL;
==>	mutex_unlock(&p->mutex);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

and:
void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
==>		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
==>		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

This approach is fine for the port-based groups.  What about class-device attributes themselves?
I believe that the best approach is to add a sysfs_mutex to ib_device, and lock that for ALL "show" methods
in this file.
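
A user-space sketch of that per-device locking idea, using a pthread
mutex in place of a kernel mutex. All names (fake_device,
fake_show_pkey, fake_unregister) are hypothetical, and returning
-ENODEV once the device is gone is an assumption made for this sketch
rather than something settled above:

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Toy device: one lock shared by every "show" method, a valid flag,
 * and a function pointer standing in for code in the low-level module. */
struct fake_device {
	pthread_mutex_t sysfs_mutex;
	int valid;
	int (*query_pkey)(int index, unsigned short *pkey);
};

/* Stand-in for the low-level driver's query routine. */
static int fake_query_pkey(int index, unsigned short *pkey)
{
	(void)index;
	*pkey = 0xffff;
	return 0;
}

/* "show" method: only calls into the driver while holding the lock
 * and after checking that the device is still valid. */
static int fake_show_pkey(struct fake_device *dev, int index,
			  char *buf, size_t len)
{
	unsigned short pkey = 0;
	int ret;

	pthread_mutex_lock(&dev->sysfs_mutex);
	if (dev->valid)
		ret = dev->query_pkey(index, &pkey);
	else
		ret = -ENODEV;
	pthread_mutex_unlock(&dev->sysfs_mutex);
	if (ret)
		return ret;

	return snprintf(buf, len, "0x%04x\n", (unsigned)pkey);
}

/* Unregister path: takes the same lock, so it cannot complete while a
 * "show" method is in the middle of a driver call. */
static void fake_unregister(struct fake_device *dev)
{
	pthread_mutex_lock(&dev->sysfs_mutex);
	dev->valid = 0;
	dev->query_pkey = NULL;   /* models the module code going away */
	pthread_mutex_unlock(&dev->sysfs_mutex);
}
```

A reader that arrives after fake_unregister() gets an error instead of
jumping through a stale pointer; a reader already inside the driver
call holds the lock, so the unregister path waits for it to finish.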

Opinions?

- Jack


From rdreier at cisco.com  Tue Feb 17 09:03:15 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Feb 2009 09:03:15 -0800
Subject: [ofa-general] linux-next: infiniband tree build warning
In-Reply-To: <20090217184403.7e1f18c5.sfr@canb.auug.org.au> (Stephen
	Rothwell's message of "Tue, 17 Feb 2009 18:44:03 +1100")
References: <20090217184403.7e1f18c5.sfr@canb.auug.org.au>
Message-ID: <adavdr9573g.fsf@cisco.com>

 > Today's linux-next build (x86_64 allmodconfig) produced these warnings:

 > drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'build_rdma_recv':
 > drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift

Thanks, should be fixed in the next pull of for-next.

 - R.


From arlin.r.davis at intel.com  Tue Feb 17 09:06:18 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 17 Feb 2009 09:06:18 -0800
Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
In-Reply-To: <6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>
References: <6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A65F149@orsmsx506.amr.corp.intel.com>

Thanks, applied.

Since this was in the CM code I had to do some regression testing before accepting.


>-----Original Message-----
>From: general-bounces at lists.openfabrics.org
>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
>Sent: Friday, February 13, 2009 2:55 PM
>To: Davis, Arlin R; general at lists.openfabrics.org;
>ofw at lists.openfabrics.org
>Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
>
>Modify the openib_scm provider to support both OFED and WinOF releases.
>This takes advantage of having a libibverbs compatibility library.*
>
>Signed-off-by: Sean Hefty <sean.hefty at intel.com>
>---
>* If only there were a sockets compatibility layer... gurgle
>This is only build tested for windows, but does run on Linux.
>
>diff --git a/Makefile.am b/Makefile.am
>index bfc93f7..5044e36 100755
>--- a/Makefile.am
>+++ b/Makefile.am
>@@ -49,7 +49,8 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG
>                                 -DOPENIB -DCQ_WAIT_OBJECT \
>                                 -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \
>                                 -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \
>-                                -I$(srcdir)/dapl/openib_scm
>+                                -I$(srcdir)/dapl/openib_scm \
>+                              -I$(srcdir)/dapl/openib_scm/linux
>
> if HAVE_LD_VERSION_SCRIPT
>     dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map
>diff --git a/dapl/openib_scm/README b/dapl/openib_scm/README
>deleted file mode 100644
>index 239dfe6..0000000
>--- a/dapl/openib_scm/README
>+++ /dev/null
>@@ -1,40 +0,0 @@
>-
>-OpenIB uDAPL provider using socket-based CM, in leiu of
>uCM/uAT, to setup QP/channels.
>-
>-to build:
>-
>-cd dapl/udapl
>-make VERBS=openib_scm clean
>-make VERBS=openib_scm
>-
>-
>-Modifications to common code:
>-
>-- added dapl/openib_scm directory
>-
>-      dapl/udapl/Makefile
>-
>-New files for openib_scm provider
>-
>-      dapl/openib/dapl_ib_cq.c
>-      dapl/openib/dapl_ib_dto.h
>-      dapl/openib/dapl_ib_mem.c
>-      dapl/openib/dapl_ib_qp.c
>-      dapl/openib/dapl_ib_util.c
>-      dapl/openib/dapl_ib_util.h
>-      dapl/openib/dapl_ib_cm.c
>-
>-A simple dapl test just for openib_scm testing...
>-
>-      test/dtest/dtest.c
>-      test/dtest/makefile
>-
>-      server: dtest -s
>-      client: dtest -h hostname
>-
>-known issues:
>-
>-      no memory windows support in ibverbs, dat_create_rmr fails.
>-
>-
>-
>diff --git a/dapl/openib_scm/dapl_ib_cm.c
>b/dapl/openib_scm/dapl_ib_cm.c
>index 80a7d5e..9a15e42 100644
>--- a/dapl/openib_scm/dapl_ib_cm.c
>+++ b/dapl/openib_scm/dapl_ib_cm.c
>@@ -52,26 +52,169 @@
> #include "dapl_cr_util.h"
> #include "dapl_name_service.h"
> #include "dapl_ib_util.h"
>-
>-#include <stdio.h>
>-#include <unistd.h>
>-#include <fcntl.h>
>-#include <netinet/tcp.h>
>-#include <byteswap.h>
>-#include <poll.h>
>-
>-#include <sys/socket.h>
>-#include <netinet/in.h>
>-#include <arpa/inet.h>
>-
>-#if __BYTE_ORDER == __LITTLE_ENDIAN
>-static inline uint64_t cpu_to_be64(uint64_t x) {return bswap_64(x);}
>-#elif __BYTE_ORDER == __BIG_ENDIAN
>-static inline uint64_t cpu_to_be64(uint64_t x) {return x;}
>-#endif
>+#include "dapl_osd.h"
>
> extern int g_scm_pipe[2];
>
>+#if defined(_WIN32) || defined(_WIN64)
>+enum DAPL_FD_EVENTS {
>+      DAPL_FD_READ    = 0x1,
>+      DAPL_FD_WRITE   = 0x2,
>+      DAPL_FD_ERROR   = 0x4
>+};
>+
>+static int dapl_config_socket(DAPL_SOCKET s)
>+{
>+      unsigned long nonblocking = 1;
>+      return ioctlsocket(s, FIONBIO, &nonblocking);
>+}
>+
>+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr,
>+                             int addrlen)
>+{
>+      int err;
>+
>+      connect(s, addr, addrlen);
>+      err = WSAGetLastError();
>+      return (err == WSAEWOULDBLOCK) ? EAGAIN : err;
>+}
>+
>+struct dapl_fd_set {
>+      struct fd_set set[3];
>+};
>+
>+static struct dapl_fd_set *dapl_alloc_fd_set(void)
>+{
>+      return dapl_os_alloc(sizeof(struct dapl_fd_set));
>+}
>+
>+static void dapl_fd_zero(struct dapl_fd_set *set)
>+{
>+      FD_ZERO(&set->set[0]);
>+      FD_ZERO(&set->set[1]);
>+      FD_ZERO(&set->set[2]);
>+}
>+
>+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
>+                      enum DAPL_FD_EVENTS event)
>+{
>+      FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 0 : 1]);
>+      FD_SET(s, &set->set[2]);
>+      return 0;
>+}
>+
>+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum
>DAPL_FD_EVENTS event)
>+{
>+      struct fd_set rw_fds;
>+      struct fd_set err_fds;
>+      struct timeval tv;
>+      int ret;
>+
>+      FD_ZERO(&rw_fds);
>+      FD_ZERO(&err_fds);
>+      FD_SET(s, &rw_fds);
>+      FD_SET(s, &err_fds);
>+
>+      tv.tv_sec = 0;
>+      tv.tv_usec = 0;
>+
>+      if (event == DAPL_FD_READ)
>+              ret = select(1, &rw_fds, NULL, &err_fds, &tv);
>+      else
>+              ret = select(1, NULL, &rw_fds, &err_fds, &tv);
>+
>+      if (ret == 0)
>+              return 0;
>+      else if (FD_ISSET(s, &rw_fds))
>+              return event;
>+      else if (FD_ISSET(s, &err_fds))
>+              return DAPL_FD_ERROR;
>+      else
>+              return WSAGetLastError();
>+}
>+
>+static int dapl_select(struct dapl_fd_set *set)
>+{
>+      return select(0, &set->set[0], &set->set[1],
>&set->set[2], NULL);
>+}
>+#else // _WIN32 || _WIN64
>+enum DAPL_FD_EVENTS {
>+      DAPL_FD_READ    = POLLIN,
>+      DAPL_FD_WRITE   = POLLOUT,
>+      DAPL_FD_ERROR   = POLLERR
>+};
>+
>+static int dapl_config_socket(DAPL_SOCKET s)
>+{
>+      int ret;
>+
>+      ret = fcntl(s, F_GETFL);
>+      if (ret >= 0)
>+              ret = fcntl(s, F_SETFL, ret | O_NONBLOCK);
>+      return ret;
>+}
>+
>+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr
>*addr, int addrlen)
>+{
>+      int ret;
>+
>+      ret = connect(s, addr, addrlen);
>+
>+      return (errno == EINPROGRESS) ? EAGAIN : ret;
>+}
>+
>+struct dapl_fd_set {
>+      int index;
>+      struct pollfd set[DAPL_FD_SETSIZE];
>+};
>+
>+static struct dapl_fd_set *dapl_alloc_fd_set(void)
>+{
>+      return dapl_os_alloc(sizeof(struct dapl_fd_set));
>+}
>+
>+static void dapl_fd_zero(struct dapl_fd_set *set)
>+{
>+      set->index = 0;
>+}
>+
>+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
>+                      enum DAPL_FD_EVENTS event)
>+{
>+      if (set->index == DAPL_FD_SETSIZE - 1) {
>+              dapl_log(DAPL_DBG_TYPE_ERR,
>+                       "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n",
>+                       set->index + 1);
>+              return -1;
>+      }
>+
>+      set->set[set->index].fd = s;
>+      set->set[set->index].revents = 0;
>+      set->set[set->index++].events = event;
>+      return 0;
>+}
>+
>+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum
>DAPL_FD_EVENTS event)
>+{
>+      struct pollfd fds;
>+      int ret;
>+
>+      fds.fd = s;
>+      fds.events = event;
>+      fds.revents = 0;
>+      ret = poll(&fds, 1, 0);
>+      if (ret <= 0)
>+              return ret;
>+
>+      return fds.revents;
>+}
>+
>+static int dapl_select(struct dapl_fd_set *set)
>+{
>+      return poll(set->set, set->index, -1);
>+}
>+#endif
>+
> static struct ib_cm_handle *dapli_cm_create(void)
> {
>       struct ib_cm_handle *cm_ptr;
>@@ -85,7 +228,7 @@ static struct ib_cm_handle *dapli_cm_create(void)
>
>       (void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr));
>       cm_ptr->dst.ver = htons(DSCM_VER);
>-      cm_ptr->socket = -1;
>+      cm_ptr->socket = DAPL_INVALID_SOCKET;
>       return cm_ptr;
> bail:
>       dapl_os_free(cm_ptr, sizeof(*cm_ptr));
>@@ -100,8 +243,8 @@ static void dapli_cm_destroy(struct
>ib_cm_handle *cm_ptr)
>
>       /* cleanup, never made it to work queue */
>       if (cm_ptr->state == SCM_INIT) {
>-              if (cm_ptr->socket >= 0)
>-                      close(cm_ptr->socket);
>+              if (cm_ptr->socket != DAPL_INVALID_SOCKET)
>+                      closesocket(cm_ptr->socket);
>               dapl_os_free(cm_ptr, sizeof(*cm_ptr));
>               return;
>       }
>@@ -112,9 +255,9 @@ static void dapli_cm_destroy(struct
>ib_cm_handle *cm_ptr)
>               cm_ptr->ep->cm_handle = IB_INVALID_HANDLE;
>
>       /* close socket if still active */
>-      if (cm_ptr->socket >= 0) {
>-              close(cm_ptr->socket);
>-              cm_ptr->socket = -1;
>+      if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
>+              closesocket(cm_ptr->socket);
>+              cm_ptr->socket = DAPL_INVALID_SOCKET;
>       }
>       dapl_os_unlock(&cm_ptr->lock);
>
>@@ -172,14 +315,14 @@
>dapli_socket_disconnect(dp_ib_cm_handle_t      cm_ptr)
>               return DAT_SUCCESS;
>       } else {
>               /* send disc date, close socket, schedule destroy */
>-              if (cm_ptr->socket >= 0) {
>-                      if (write(cm_ptr->socket,
>-                                &disc_data, sizeof(disc_data)) == -1)
>+              if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
>+                      if (send(cm_ptr->socket, (char *) &disc_data,
>+                                      sizeof(disc_data), 0) == -1)
>                               dapl_log(DAPL_DBG_TYPE_WARN,
>                                        " cm_disc: write error
>= %s\n",
>                                        strerror(errno));
>-                      close(cm_ptr->socket);
>-                      cm_ptr->socket = -1;
>+                      closesocket(cm_ptr->socket);
>+                      cm_ptr->socket = DAPL_INVALID_SOCKET;
>               }
>               cm_ptr->state = SCM_DISCONNECTED;
>       }
>@@ -211,7 +354,7 @@ void
> dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err)
> {
>       int             len, opt = 1;
>-      struct iovec    iovec[2];
>+      struct iovec iov[2];
>       struct dapl_ep  *ep_ptr = cm_ptr->ep;
>
>       if (err) {
>@@ -226,18 +369,21 @@ dapli_socket_connected(dp_ib_cm_handle_t
>cm_ptr, int err)
>                    " socket connected, write QP and private data\n");
>
>       /* no delay for small packets */
>-
>setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt));
>+      setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY,
>+              (char *) &opt, sizeof(opt));
>
>       /* send qp info and pdata to remote peer */
>-      iovec[0].iov_base = &cm_ptr->dst;
>-      iovec[0].iov_len  = sizeof(ib_qp_cm_t);
>+      iov[0].iov_base = (void *) &cm_ptr->dst;
>+      iov[0].iov_len = sizeof(ib_qp_cm_t);
>       if (cm_ptr->dst.p_size) {
>-              iovec[1].iov_base = cm_ptr->p_data;
>-              iovec[1].iov_len  = ntohl(cm_ptr->dst.p_size);
>+              iov[1].iov_base = cm_ptr->p_data;
>+              iov[1].iov_len = ntohl(cm_ptr->dst.p_size);
>+              len = writev(cm_ptr->socket, iov, 2);
>+      } else {
>+              len = writev(cm_ptr->socket, iov, 1);
>       }
>
>-      len = writev(cm_ptr->socket, iovec, (cm_ptr->dst.p_size ? 2:1));
>-      if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) {
>+      if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n",
>                        strerror(errno), len,
>@@ -253,9 +399,9 @@ dapli_socket_connected(dp_ib_cm_handle_t
>cm_ptr, int err)
>         dapl_dbg_log(DAPL_DBG_TYPE_CM,
>                      " connected: sending SRC GID subnet
>%016llx id %016llx\n",
>                      (unsigned long long)
>-
>cpu_to_be64(cm_ptr->dst.gid.global.subnet_prefix),
>+                      htonll(cm_ptr->dst.gid.global.subnet_prefix),
>                      (unsigned long long)
>-
>cpu_to_be64(cm_ptr->dst.gid.global.interface_id));
>+                      htonll(cm_ptr->dst.gid.global.interface_id));
>
>       /* queue up to work thread to avoid blocking consumer */
>       cm_ptr->state = SCM_RTU_PENDING;
>@@ -290,25 +436,23 @@ dapli_socket_connect(DAPL_EP             *ep_ptr,
>               return DAT_INSUFFICIENT_RESOURCES;
>
>       /* create, connect, sockopt, and exchange QP information */
>-      if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) {
>+      if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) ==
>DAPL_INVALID_SOCKET) {
>               dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
>               return DAT_INSUFFICIENT_RESOURCES;
>       }
>
>-      /* non-blocking */
>-      ret = fcntl(cm_ptr->socket, F_GETFL);
>-        if (ret < 0 || fcntl(cm_ptr->socket,
>-                              F_SETFL, ret | O_NONBLOCK) < 0) {
>-                dapl_log(DAPL_DBG_TYPE_ERR,
>-                         " socket connect: fcntl on socket %d
>ERR %d %s\n",
>-                         cm_ptr->socket, ret,
>-                         strerror(errno));
>-                goto bail;
>-        }
>+      ret = dapl_config_socket(cm_ptr->socket);
>+      if (ret < 0) {
>+              dapl_log(DAPL_DBG_TYPE_ERR,
>+                      " socket connect: config socket %d ERR %d %s\n",
>+                      cm_ptr->socket, ret, strerror(errno));
>+              goto bail;
>+      }
>
>       ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual);
>-      ret = connect(cm_ptr->socket, r_addr, sizeof(*r_addr));
>-      if (ret && errno != EINPROGRESS) {
>+      ret = dapl_connect_socket(cm_ptr->socket, (struct
>sockaddr *) r_addr,
>+                              sizeof(*r_addr));
>+      if (ret && ret != EAGAIN) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " socket connect ERROR: %s -> %s r_qual %d\n",
>                        strerror(errno),
>@@ -391,16 +535,13 @@
>dapli_socket_connect_rtu(dp_ib_cm_handle_t     cm_ptr)
> {
>       DAPL_EP         *ep_ptr = cm_ptr->ep;
>       int             len;
>-      struct iovec    iovec[2];
>       short           rtu_data = htons(0x0E0F);
>       ib_cm_events_t  event = IB_CME_DESTINATION_REJECT;
>
>       /* read DST information into cm_ptr, overwrite SRC info */
>       dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer
>QP data\n");
>
>-      iovec[0].iov_base = &cm_ptr->dst;
>-      iovec[0].iov_len  = sizeof(ib_qp_cm_t);
>-      len = readv(cm_ptr->socket, iovec, 1);
>+      len = recv(cm_ptr->socket, (char *) &cm_ptr->dst,
>sizeof(ib_qp_cm_t), 0);
>       if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver)
>!= DSCM_VER) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                    " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n",
>@@ -456,9 +597,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
>       /* read private data into cm_handle if any present */
>       dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read
>private data\n");
>       if (cm_ptr->dst.p_size) {
>-              iovec[0].iov_base = cm_ptr->p_data;
>-              iovec[0].iov_len  = cm_ptr->dst.p_size;
>-              len = readv(cm_ptr->socket, iovec, 1);
>+              len = recv(cm_ptr->socket, cm_ptr->p_data,
>cm_ptr->dst.p_size, 0);
>               if (len != cm_ptr->dst.p_size) {
>                       dapl_log(DAPL_DBG_TYPE_ERR,
>                           " CONN_RTU read pdata: ERR %s,
>rcnt=%d -> %s\n",
>@@ -495,7 +634,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr)
>       dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n");
>
>       /* complete handshake after final QP state change */
>-      if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) {
>+      if (send(cm_ptr->socket, (char *) &rtu_data,
>sizeof(rtu_data), 0) == -1) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " CONN_RTU: write error = %s\n",
>strerror(errno));
>               goto bail;
>@@ -564,7 +703,7 @@ dapli_socket_listen(DAPL_IA                *ia_ptr,
>       cm_ptr->hca = ia_ptr->hca_ptr;
>
>       /* bind, listen, set sockopt, accept, exchange data */
>-      if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
>+      if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0))
>== DAPL_INVALID_SOCKET) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " ERR: listen socket create: %s\n",
>                        strerror(errno));
>@@ -572,7 +711,8 @@ dapli_socket_listen(DAPL_IA                *ia_ptr,
>               goto bail;
>       }
>
>-
>setsockopt(cm_ptr->socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt));
>+      setsockopt(cm_ptr->socket, SOL_SOCKET, SO_REUSEADDR,
>+              (char *) &opt, sizeof(opt));
>       addr.sin_port        = htons(serviceID);
>       addr.sin_family      = AF_INET;
>       addr.sin_addr.s_addr = INADDR_ANY;
>@@ -625,7 +765,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
>
>       (void) dapl_os_memzero(acm_ptr, sizeof(*acm_ptr));
>
>-      acm_ptr->socket = -1;
>+      acm_ptr->socket = DAPL_INVALID_SOCKET;
>       acm_ptr->sp = cm_ptr->sp;
>       acm_ptr->hca = cm_ptr->hca;
>
>@@ -633,7 +773,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
>       acm_ptr->socket = accept(cm_ptr->socket,
>                               (struct
>sockaddr*)&acm_ptr->dst.ia_address,
>                               (socklen_t*)&len);
>-      if (acm_ptr->socket < 0) {
>+      if (acm_ptr->socket == DAPL_INVALID_SOCKET) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                       " accept: ERR %s on FD %d l_cr %p\n",
>                       strerror(errno),cm_ptr->socket,cm_ptr);
>@@ -664,7 +804,7 @@
>dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
>       dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read
>QP data\n");
>
>       /* read in DST QP info, IA address. check for private data */
>-      len = read(acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t));
>+      len = recv(acm_ptr->socket, (char *) &acm_ptr->dst,
>sizeof(ib_qp_cm_t), 0);
>       if (len != sizeof(ib_qp_cm_t) ||
>           ntohs(acm_ptr->dst.ver) != DSCM_VER) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>@@ -700,8 +840,7 @@
>dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr)
>
>       /* read private data into cm_handle if any present */
>       if (acm_ptr->dst.p_size) {
>-              len = read( acm_ptr->socket,
>-                          acm_ptr->p_data, acm_ptr->dst.p_size);
>+              len = recv(acm_ptr->socket, acm_ptr->p_data,
>acm_ptr->dst.p_size, 0);
>               if (len != acm_ptr->dst.p_size) {
>                       dapl_log(DAPL_DBG_TYPE_ERR,
>                                    " accept read pdata: ERR
>%s, rcnt=%d\n",
>@@ -757,14 +896,14 @@ dapli_socket_accept_usr(DAPL_EP          *ep_ptr,
>       DAPL_IA         *ia_ptr = ep_ptr->header.owner_ia;
>       dp_ib_cm_handle_t  cm_ptr = cr_ptr->ib_cm_handle;
>       ib_qp_cm_t      local;
>-      struct iovec    iovec[2];
>+      struct iovec    iov[2];
>       int             len;
>
>       if (p_size > IB_MAX_REP_PDATA_SIZE)
>               return DAT_LENGTH_ERROR;
>
>       /* must have a accepted socket */
>-      if (cm_ptr->socket < 0)
>+      if (cm_ptr->socket == DAPL_INVALID_SOCKET)
>               return DAT_INTERNAL_ERROR;
>
>       dapl_dbg_log(DAPL_DBG_TYPE_EP,
>@@ -844,14 +983,17 @@ dapli_socket_accept_usr(DAPL_EP          *ep_ptr,
>
>       local.ia_address = ia_ptr->hca_ptr->hca_address;
>       local.p_size = htonl(p_size);
>-      iovec[0].iov_base = &local;
>-      iovec[0].iov_len  = sizeof(ib_qp_cm_t);
>+      iov[0].iov_base = (void *) &local;
>+      iov[0].iov_len = sizeof(ib_qp_cm_t);
>       if (p_size) {
>-              iovec[1].iov_base = p_data;
>-              iovec[1].iov_len  = p_size;
>+              iov[1].iov_base = p_data;
>+              iov[1].iov_len = p_size;
>+              len = writev(cm_ptr->socket, iov, 2);
>+      } else {
>+              len = writev(cm_ptr->socket, iov, 1);
>       }
>-      len = writev(cm_ptr->socket, iovec, (p_size ? 2:1));
>-      if (len != (p_size + sizeof(ib_qp_cm_t))) {
>+
>+      if (len != (p_size + sizeof(ib_qp_cm_t))) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n",
>                        strerror(errno), len,
>@@ -859,6 +1001,7 @@ dapli_socket_accept_usr(DAPL_EP           *ep_ptr,
>                            &cm_ptr->dst.ia_address)->sin_addr));
>               goto bail;
>       }
>+
>       dapl_dbg_log(DAPL_DBG_TYPE_CM,
>                    " ACCEPT_USR: local port=0x%x lid=0x%x"
>                    " qpn=0x%x psize=%d\n",
>@@ -867,9 +1010,9 @@ dapli_socket_accept_usr(DAPL_EP           *ep_ptr,
>         dapl_dbg_log(DAPL_DBG_TYPE_CM,
>                      " ACCEPT_USR SRC GID subnet %016llx id
>%016llx\n",
>                      (unsigned long long)
>-                      cpu_to_be64(local.gid.global.subnet_prefix),
>+                      htonll(local.gid.global.subnet_prefix),
>                      (unsigned long long)
>-                      cpu_to_be64(local.gid.global.interface_id));
>+                      htonll(local.gid.global.interface_id));
>
>       /* save state and reference to EP, queue for RTU data */
>       cm_ptr->ep = ep_ptr;
>@@ -894,7 +1037,7 @@ dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr)
>       short           rtu_data = 0;
>
>       /* complete handshake after final QP state change */
>-      len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data));
>+      len = recv(cm_ptr->socket, (char *) &rtu_data,
>sizeof(rtu_data), 0);
>       if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) {
>               dapl_log(DAPL_DBG_TYPE_ERR,
>                        " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n",
>@@ -1108,9 +1251,9 @@ dapls_ib_remove_conn_listener (
>
>       /* close accepted socket, free cm_srvc_handle and return */
>       if (cm_ptr != NULL) {
>-              if (cm_ptr->socket >= 0) {
>-                      close(cm_ptr->socket );
>-                      cm_ptr->socket = -1;
>+              if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
>+                      closesocket(cm_ptr->socket);
>+                      cm_ptr->socket = DAPL_INVALID_SOCKET;
>               }
>               /* cr_thread will free */
>               cm_ptr->state = SCM_DESTROY;
>@@ -1195,27 +1338,29 @@ dapls_ib_reject_connection(
>       IN DAT_COUNT psize,
>       IN const DAT_PVOID pdata)
> {
>-      struct iovec iovec[2];
>+      struct iovec iov[2];
>
>       dapl_dbg_log (DAPL_DBG_TYPE_EP,
>                     " reject(cm %p reason %x, pdata %p, psize %d)\n",
>                     cm_ptr, reason, pdata, psize);
>
>       /* write reject data to indicate reject */
>-      if (cm_ptr->socket >= 0) {
>+      if (cm_ptr->socket != DAPL_INVALID_SOCKET) {
>               cm_ptr->dst.rej = (uint16_t)reason;
>               cm_ptr->dst.rej = htons(cm_ptr->dst.rej);
>-              iovec[0].iov_base = &cm_ptr->dst;
>-              iovec[0].iov_len  = sizeof(ib_qp_cm_t);
>+
>+              iov[0].iov_base = (void *) &cm_ptr->dst;
>+              iov[0].iov_len = sizeof(ib_qp_cm_t);
>               if (psize) {
>-                      iovec[1].iov_base = pdata;
>-                      iovec[2].iov_len = psize;
>-                      writev(cm_ptr->socket, &iovec[0], 2);
>-              } else
>-                      writev(cm_ptr->socket, &iovec[0], 1);
>-
>-              close(cm_ptr->socket);
>-              cm_ptr->socket = -1;
>+                      iov[1].iov_base = pdata;
>+                      iov[1].iov_len = psize;
>+                      writev(cm_ptr->socket, iov, 2);
>+              } else {
>+                      writev(cm_ptr->socket, iov, 1);
>+              }
>+
>+              closesocket(cm_ptr->socket);
>+              cm_ptr->socket = DAPL_INVALID_SOCKET;
>       }
>
>       /* cr_thread will destroy CR */
>@@ -1444,138 +1589,141 @@ dapls_ib_get_cm_event (
> }
>
> /* outbound/inbound CR processing thread to avoid blocking
>applications */
>-#define SCM_MAX_CONN 8192
> void cr_thread(void *arg)
> {
>-    struct dapl_hca   *hca_ptr = arg;
>-    dp_ib_cm_handle_t cr, next_cr;
>-    int               opt,ret,idx;
>-    socklen_t         opt_len;
>-    char              rbuf[2];
>-    struct pollfd     ufds[SCM_MAX_CONN];
>-
>-    dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca
>%p\n",hca_ptr);
>-
>-    dapl_os_lock( &hca_ptr->ib_trans.lock );
>-    hca_ptr->ib_trans.cr_state = IB_THREAD_RUN;
>-    while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) {
>-      idx=0;
>-      ufds[idx].fd = g_scm_pipe[0]; /* wakeup and process work */
>-        ufds[idx].events = POLLIN;
>-      ufds[idx].revents = 0;
>-
>-      if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
>-            next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list);
>-      else
>-          next_cr = NULL;
>-
>-      while (next_cr) {
>-          cr = next_cr;
>-          if ((cr->socket == -1 && cr->state == SCM_DESTROY) ||
>-              hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
>-
>-              dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: Free
>%p\n", cr);
>-              next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list,
>-
>(DAPL_LLIST_ENTRY*)&cr->entry );
>-              dapl_llist_remove_entry(&hca_ptr->ib_trans.list,
>-                                      (DAPL_LLIST_ENTRY*)&cr->entry);
>-              dapl_os_free(cr, sizeof(*cr));
>-              continue;
>-          }
>-
>-          if (idx==SCM_MAX_CONN-1) {
>-              dapl_dbg_log(DAPL_DBG_TYPE_ERR,
>-                           "SCM ERR: cm_thread exceeded
>FD_SETSIZE %d\n",idx+1);
>-              continue;
>-          }
>-
>-          /* Add to ufds for poll, check for immediate work */
>-          ufds[++idx].fd = cr->socket; /* add listen or cr */
>-          ufds[idx].revents = 0;
>-          if (cr->state == SCM_CONN_PENDING)
>-              ufds[idx].events = POLLOUT;
>-          else
>-              ufds[idx].events = POLLIN;
>-
>-          /* check socket for event, accept in or connect out */
>-          dapl_dbg_log(DAPL_DBG_TYPE_CM," poll cr=%p, fd=%d,%d\n",
>-                              cr, cr->socket, ufds[idx].fd);
>-          dapl_os_unlock(&hca_ptr->ib_trans.lock);
>-          ret = poll(&ufds[idx],1,0);
>-          dapl_dbg_log(DAPL_DBG_TYPE_CM,
>-                       " poll wakeup ret=%d cr->st=%d"
>-                       " ev=0x%x fd=%d\n",
>-                       ret,cr->state,ufds[idx].revents,ufds[idx].fd);
>-
>-          /* data on listen, qp exchange, and on disconnect request */
>-          if ((ret == 1) && ufds[idx].revents == POLLIN) {
>-              if (cr->socket > 0) {
>-                      if (cr->state == SCM_LISTEN)
>-                              dapli_socket_accept(cr);
>-                      else if (cr->state == SCM_ACCEPTING)
>-                              dapli_socket_accept_data(cr);
>-                      else if (cr->state == SCM_ACCEPTED)
>-                              dapli_socket_accept_rtu(cr);
>-                      else if (cr->state == SCM_RTU_PENDING)
>-                              dapli_socket_connect_rtu(cr);
>-                      else if (cr->state == SCM_CONNECTED)
>-                              dapli_socket_disconnect(cr);
>+      struct dapl_hca *hca_ptr = arg;
>+      dp_ib_cm_handle_t cr, next_cr;
>+      int opt, ret;
>+      socklen_t opt_len;
>+      char rbuf[2];
>+      struct dapl_fd_set *set;
>+      enum DAPL_FD_EVENTS event;
>+
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca
>%p\n", hca_ptr);
>+      set = dapl_alloc_fd_set();
>+      if (!set)
>+              goto out;
>+
>+      dapl_os_lock(&hca_ptr->ib_trans.lock);
>+      hca_ptr->ib_trans.cr_state = IB_THREAD_RUN;
>+
>+      while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) {
>+              dapl_fd_zero(set);
>+              dapl_fd_set(g_scm_pipe[0], set, DAPL_FD_READ);
>+
>+              if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
>+                      next_cr =
>dapl_llist_peek_head(&hca_ptr->ib_trans.list);
>+              else
>+                      next_cr = NULL;
>+
>+              while (next_cr) {
>+                      cr = next_cr;
>+                      if ((cr->socket == DAPL_INVALID_SOCKET
>&& cr->state == SCM_DESTROY) ||
>+                              hca_ptr->ib_trans.cr_state !=
>IB_THREAD_RUN) {
>+                              next_cr =
>dapl_llist_next_entry(&hca_ptr->ib_trans.list,
>+
>(DAPL_LLIST_ENTRY*)&cr->entry);
>+
>dapl_llist_remove_entry(&hca_ptr->ib_trans.list,
>+
>(DAPL_LLIST_ENTRY*)&cr->entry);
>+                              dapl_os_free(cr, sizeof(*cr));
>+                              continue;
>+                      }
>+
>+                      event = (cr->state == SCM_CONN_PENDING) ?
>+                              DAPL_FD_WRITE : DAPL_FD_READ;
>+                      if (dapl_fd_set(cr->socket, set, event)) {
>+                              dapl_log(DAPL_DBG_TYPE_ERR,
>+                                       " cr_thread: DESTROY
>CR st=%d fd %d"
>+                                       " -> %s\n", cr->state,
>cr->socket,
>+                                       inet_ntoa(((struct
>sockaddr_in*)
>+
>&cr->dst.ia_address)->sin_addr));
>+                              dapli_cm_destroy(cr);
>+                              continue;
>+                      }
>+
>+                      dapl_dbg_log(DAPL_DBG_TYPE_CM, " poll
>cr=%p, fd=%d\n",
>+                              cr, cr->socket);
>+                      dapl_os_unlock(&hca_ptr->ib_trans.lock);
>+
>+                      ret = dapl_poll(cr->socket, event);
>+
>+                      dapl_dbg_log(DAPL_DBG_TYPE_CM,
>+                              " poll wakeup ret=%d cr->st=%d fd=%d\n",
>+                              ret, cr->state, cr->socket);
>+
>+                      /* data on listen, qp exchange, and on
>disconnect request */
>+                      if (ret == DAPL_FD_READ) {
>+                              if (cr->socket != DAPL_INVALID_SOCKET) {
>+                                      switch (cr->state) {
>+                                      case SCM_LISTEN:
>+                                              dapli_socket_accept(cr);
>+                                              break;
>+                                      case SCM_ACCEPTING:
>+
>dapli_socket_accept_data(cr);
>+                                              break;
>+                                      case SCM_ACCEPTED:
>+
>dapli_socket_accept_rtu(cr);
>+                                              break;
>+                                      case SCM_RTU_PENDING:
>+
>dapli_socket_connect_rtu(cr);
>+                                              break;
>+                                      case SCM_CONNECTED:
>+
>dapli_socket_disconnect(cr);
>+                                              break;
>+                                      default:
>+                                              break;
>+                                      }
>+                              }
>+                      /* connect socket is writable, check status */
>+                      } else if (ret == DAPL_FD_WRITE || ret
>== DAPL_FD_ERROR) {
>+                              if (cr->state == SCM_CONN_PENDING) {
>+                                      opt = 0;
>+                                      ret =
>getsockopt(cr->socket, SOL_SOCKET,
>+                                              SO_ERROR, (char
>*) &opt, &opt_len);
>+                                      if (!ret)
>+
>dapli_socket_connected(cr, opt);
>+                                      else
>+
>dapli_socket_connected(cr, errno);
>+                              } else {
>+                                      dapl_log(DAPL_DBG_TYPE_CM,
>+                                              " CM poll ERR,
>wrong state(%d) -> %s SKIP\n", cr->state,
>+
>inet_ntoa(((struct sockaddr_in*)&cr->dst.ia_address)->sin_addr));
>+                              }
>+                      } else if (ret != 0) {
>+                              dapl_log(DAPL_DBG_TYPE_CM,
>+                                      " CM poll warning %s,
>ret=%d st=%d -> %s\n",
>+                                      strerror(errno), ret, cr->state,
>+                                      inet_ntoa(((struct sockaddr_in*)
>+
>&cr->dst.ia_address)->sin_addr));
>+
>+                              /* POLLUP, NVAL, or poll error,
>issue event if connected */
>+                              if (cr->state == SCM_CONNECTED)
>+                                      dapli_socket_disconnect(cr);
>+                      }
>+
>+                      dapl_os_lock(&hca_ptr->ib_trans.lock);
>+                      next_cr =
>dapl_llist_next_entry(&hca_ptr->ib_trans.list,
>+                              (DAPL_LLIST_ENTRY*)&cr->entry);
>               }
>-          /* connect socket is writable, check status */
>-          } else if ((ret == 1) &&
>-                      (ufds[idx].revents & POLLOUT ||
>-                       ufds[idx].revents & POLLERR)) {
>-              if (cr->state == SCM_CONN_PENDING) {
>-                      opt = 0;
>-                      ret = getsockopt(cr->socket, SOL_SOCKET,
>-                                       SO_ERROR, &opt, &opt_len);
>-                      if (!ret)
>-                              dapli_socket_connected(cr,opt);
>-                      else
>-                              dapli_socket_connected(cr,errno);
>-              } else {
>-                      dapl_log(DAPL_DBG_TYPE_CM,
>-                               " CM poll ERR, wrong state(%d)
>-> %s SKIP\n",
>-                               cr->state,
>-                               inet_ntoa(((struct sockaddr_in*)
>-
>&cr->dst.ia_address)->sin_addr));
>+
>+              dapl_os_unlock(&hca_ptr->ib_trans.lock);
>+              dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread:
>sleep, fds=%d\n",
>+                           set->index+1);
>+              dapl_select(set);
>+              dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n");
>+
>+              /* if pipe used to wakeup, consume */
>+              if (dapl_poll(g_scm_pipe[0], DAPL_FD_READ) ==
>DAPL_FD_READ) {
>+                      if (read(g_scm_pipe[0], rbuf, 2) == -1)
>+                              dapl_log(DAPL_DBG_TYPE_CM,
>+                                       " cr_thread: read pipe
>error = %s\n",
>+                                       strerror(errno));
>               }
>-          } else if (ret != 0) {
>-              dapl_log(DAPL_DBG_TYPE_CM,
>-                       " CM poll warning %s, ret=%d revnt=%x
>st=%d -> %s\n",
>-                       strerror(errno), ret,
>ufds[idx].revents, cr->state,
>-                       inet_ntoa(((struct sockaddr_in*)
>-                              &cr->dst.ia_address)->sin_addr));
>-
>-              /* POLLUP, NVAL, or poll error, issue event if
>connected */
>-              if (cr->state == SCM_CONNECTED)
>-                      dapli_socket_disconnect(cr);
>-          }
>-          dapl_os_lock(&hca_ptr->ib_trans.lock);
>-          next_cr =  dapl_llist_next_entry(&hca_ptr->ib_trans.list,
>-
>(DAPL_LLIST_ENTRY*)&cr->entry);
>+              dapl_os_lock(&hca_ptr->ib_trans.lock);
>       }
>+
>       dapl_os_unlock(&hca_ptr->ib_trans.lock);
>-      dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, %d\n", idx+1);
>-      poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */
>-      /* if pipe used to wakeup, consume */
>-      if (ufds[0].revents == POLLIN)
>-              if (read(g_scm_pipe[0], rbuf, 2) == -1)
>-                      dapl_log(DAPL_DBG_TYPE_CM,
>-                               " cr_thread: read pipe error = %s\n",
>-                               strerror(errno));
>-      dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n");
>-      dapl_os_lock(&hca_ptr->ib_trans.lock);
>-    }
>-    dapl_os_unlock(&hca_ptr->ib_trans.lock);
>-    hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT;
>-    dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p)
>exit\n",hca_ptr);
>+      free(set);
>+out:
>+      hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT;
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p)
>exit\n",hca_ptr);
> }
>-
>-/*
>- * Local variables:
>- *  c-indent-level: 4
>- *  c-basic-offset: 4
>- *  tab-width: 8
>- * End:
>- */
>diff --git a/dapl/openib_scm/dapl_ib_cq.c
>b/dapl/openib_scm/dapl_ib_cq.c
>index 7d6bd4f..59fff11 100644
>--- a/dapl/openib_scm/dapl_ib_cq.c
>+++ b/dapl/openib_scm/dapl_ib_cq.c
>@@ -46,97 +46,111 @@
>  *
>
>***************************************************************
>***********/
>
>+#include "openib_osd.h"
> #include "dapl.h"
> #include "dapl_adapter_util.h"
> #include "dapl_lmr_util.h"
> #include "dapl_evd_util.h"
> #include "dapl_ring_buffer_util.h"
>-#include <sys/poll.h>
>-#include <signal.h>
>
>-int dapli_cq_thread_init(struct dapl_hca *hca_ptr)
>+#if defined(_WIN64) || defined(_WIN32)
>+void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
> {
>-        DAT_RETURN dat_status;
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_destroy(%p)\n", hca_ptr);
>
>-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_init(%p)\n", hca_ptr);
>+      if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN)
>+              return;
>
>-        /* create thread to process inbound connect request */
>-      hca_ptr->ib_trans.cq_state = IB_THREAD_INIT;
>-        dat_status = dapl_os_thread_create(cq_thread,
>(void*)hca_ptr, &hca_ptr->ib_trans.cq_thread);
>-        if (dat_status != DAT_SUCCESS)
>-        {
>-                dapl_dbg_log(DAPL_DBG_TYPE_ERR,
>-                             " cq_thread_init: failed to
>create thread\n");
>-                return 1;
>-        }
>+      /* destroy cr_thread and lock */
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
>+      SetEvent(hca_ptr->ib_trans.ib_cq->event);
>+      dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p)
>cancel\n",hca_ptr);
>+      while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
>+              dapl_os_sleep_usec(20000);
>+      }
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d)
>exit\n",dapl_os_getpid());
>+}
>+
>+static void cq_thread(void *arg)
>+{
>+      struct dapl_hca *hca_ptr = arg;
>+      struct dapl_evd *evd_ptr;
>+      struct ibv_cq   *ibv_cq = NULL;
>+
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_RUN;
>+
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca
>%p\n",hca_ptr);
>
>-      /* wait for thread to start */
>-      while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) {
>-                struct timespec sleep, remain;
>-                sleep.tv_sec = 0;
>-                sleep.tv_nsec = 20000000; /* 20 ms */
>-                dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>-                             " cq_thread_init: waiting for
>cq_thread\n");
>-                nanosleep (&sleep, &remain);
>-        }
>-      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d)
>exit\n",getpid());
>-        return 0;
>+      /* wait on DTO event, or signal to abort */
>+      while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
>+              if (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq,
>&ibv_cq, (void*)&evd_ptr)) {
>+
>+                      if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) {
>+                              ibv_ack_cq_events(ibv_cq, 1);
>+                              return;
>+                      }
>+
>+                      /* process DTO event via callback */
>+
>dapl_evd_dto_callback(hca_ptr->ib_hca_handle, evd_ptr->ib_cq_handle,
>+                              (void*)evd_ptr );
>+
>+                      ibv_ack_cq_events(ibv_cq, 1);
>+              }
>+      }
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca
>%p \n", hca_ptr);
> }
>
>+#else // _WIN32 || _WIN64
>+
> void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
> {
>-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_destroy(%p)\n", hca_ptr);
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_destroy(%p)\n", hca_ptr);
>
>       if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN)
>               return;
>
>-        /* destroy cr_thread and lock */
>-        hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
>-        pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1);
>-        dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p)
>cancel\n",hca_ptr);
>-        while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
>-                struct timespec sleep, remain;
>-                sleep.tv_sec = 0;
>-                sleep.tv_nsec = 2000000; /* 2 ms */
>-                dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>-                             " cq_thread_destroy: waiting for
>cq_thread\n");
>-                nanosleep (&sleep, &remain);
>-        }
>-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_destroy(%d) exit\n",getpid());
>+      /* destroy cr_thread and lock */
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL;
>+      pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1);
>+      dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p)
>cancel\n",hca_ptr);
>+      while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) {
>+              dapl_os_sleep_usec(20000);
>+      }
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d)
>exit\n",dapl_os_getpid());
> }
>
> /* catch the signal */
> static void ib_cq_handler(int signum)
> {
>-        return;
>+      return;
> }
>
>-void cq_thread( void *arg )
>+static void cq_thread(void *arg)
> {
>-        struct dapl_hca *hca_ptr = arg;
>-        struct dapl_evd *evd_ptr;
>-        struct ibv_cq   *ibv_cq = NULL;
>+      struct dapl_hca *hca_ptr = arg;
>+      struct dapl_evd *evd_ptr;
>+      struct ibv_cq   *ibv_cq = NULL;
>       sigset_t        sigset;
>
>       sigemptyset(&sigset);
>-        sigaddset(&sigset,SIGUSR1);
>-        pthread_sigmask(SIG_UNBLOCK, &sigset, NULL);
>-        signal(SIGUSR1, ib_cq_handler);
>+      sigaddset(&sigset,SIGUSR1);
>+      pthread_sigmask(SIG_UNBLOCK, &sigset, NULL);
>+      signal(SIGUSR1, ib_cq_handler);
>
>       hca_ptr->ib_trans.cq_state = IB_THREAD_RUN;
>-
>+
>       dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca
>%p\n",hca_ptr);
>
>-        /* wait on DTO event, or signal to abort */
>-        while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
>-                struct pollfd cq_fd = {
>-                        .fd      = hca_ptr->ib_trans.ib_cq->fd,
>-                        .events  = POLLIN,
>-                        .revents = 0
>-                };
>+      /* wait on DTO event, or signal to abort */
>+      while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) {
>+              struct pollfd cq_fd = {
>+                      .fd      = hca_ptr->ib_trans.ib_cq->fd,
>+                      .events  = POLLIN,
>+                      .revents = 0
>+              };
>               if ((poll(&cq_fd, 1, -1) == 1) &&
>-                      (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq,
>-                                 &ibv_cq, (void*)&evd_ptr))) {
>+
>(!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq,
>(void*)&evd_ptr))) {
>
>                       if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) {
>                               ibv_ack_cq_events(ibv_cq, 1);
>@@ -144,15 +158,40 @@ void cq_thread( void *arg )
>                       }
>
>                       /* process DTO event via callback */
>-                      dapl_evd_dto_callback ( hca_ptr->ib_hca_handle,
>-                                              evd_ptr->ib_cq_handle,
>-                                              (void*)evd_ptr );
>+                      dapl_evd_dto_callback(hca_ptr->ib_hca_handle,
>+                              evd_ptr->ib_cq_handle, (void*)evd_ptr );
>
>                       ibv_ack_cq_events(ibv_cq, 1);
>               }
>-        }
>-        hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
>-        dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT:
>hca %p \n", hca_ptr);
>+      }
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT;
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca
>%p \n", hca_ptr);
>+}
>+
>+#endif // _WIN32 || _WIN64
>+
>+
>+int dapli_cq_thread_init(struct dapl_hca *hca_ptr)
>+{
>+      DAT_RETURN dat_status;
>+
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL,"
>cq_thread_init(%p)\n", hca_ptr);
>+
>+      /* create thread to process inbound connect request */
>+      hca_ptr->ib_trans.cq_state = IB_THREAD_INIT;
>+      dat_status = dapl_os_thread_create(cq_thread,
>(void*)hca_ptr, &hca_ptr->ib_trans.cq_thread);
>+      if (dat_status != DAT_SUCCESS) {
>+              dapl_dbg_log(DAPL_DBG_TYPE_ERR,
>+                      " cq_thread_init: failed to create thread\n");
>+              return 1;
>+      }
>+
>+      /* wait for thread to start */
>+      while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) {
>+              dapl_os_sleep_usec(20000);
>+      }
>+      dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d)
>exit\n",dapl_os_getpid());
>+      return 0;
> }
>
>
>@@ -308,11 +347,11 @@ dapls_ib_cq_alloc (
>       IN  DAPL_EVD            *evd_ptr,
>       IN  DAT_COUNT           *cqlen )
> {
>+      struct ibv_comp_channel *channel =
>ia_ptr->hca_ptr->ib_trans.ib_cq;
>+
>       dapl_dbg_log ( DAPL_DBG_TYPE_UTIL,
>               "dapls_ib_cq_alloc: evd %p cqlen=%d \n",
>evd_ptr, *cqlen );
>
>-      struct ibv_comp_channel *channel =
>ia_ptr->hca_ptr->ib_trans.ib_cq;
>-
> #ifdef CQ_WAIT_OBJECT
>       if (evd_ptr->cq_wait_obj_handle)
>               channel = evd_ptr->cq_wait_obj_handle;
>diff --git a/dapl/openib_scm/dapl_ib_dto.h
>b/dapl/openib_scm/dapl_ib_dto.h
>index 45000b9..fa19d01 100644
>--- a/dapl/openib_scm/dapl_ib_dto.h
>+++ b/dapl/openib_scm/dapl_ib_dto.h
>@@ -147,12 +147,6 @@ dapls_ib_post_send (
>       IN  const DAT_RMR_TRIPLET       *remote_iov,
>       IN  DAT_COMPLETION_FLAGS        completion_flags)
> {
>-      dapl_dbg_log(DAPL_DBG_TYPE_EP,
>-                   " post_snd: ep %p op %d ck %p sgs",
>-                   "%d l_iov %p r_iov %p f %d\n",
>-                   ep_ptr, op_type, cookie, segments, local_iov,
>-                   remote_iov, completion_flags);
>-
>       ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES];
>       ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL;
>       struct ibv_send_wr wr;
>@@ -163,6 +157,12 @@ dapls_ib_post_send (
>       int ret;
>
>       dapl_dbg_log(DAPL_DBG_TYPE_EP,
>+                   " post_snd: ep %p op %d ck %p sgs",
>+                   "%d l_iov %p r_iov %p f %d\n",
>+                   ep_ptr, op_type, cookie, segments, local_iov,
>+                   remote_iov, completion_flags);
>+
>+      dapl_dbg_log(DAPL_DBG_TYPE_EP,
>                    " post_snd: ep %p cookie %p segs %d l_iov %p\n",
>                    ep_ptr, cookie, segments, local_iov);
>
>@@ -317,12 +317,6 @@ dapls_ib_post_ext_send (
>       IN  DAT_COMPLETION_FLAGS        completion_flags,
>       IN  DAT_IB_ADDR_HANDLE          *remote_ah)
> {
>-      dapl_dbg_log(DAPL_DBG_TYPE_EP,
>-                   " post_ext_snd: ep %p op %d ck %p sgs",
>-                   "%d l_iov %p r_iov %p f %d\n",
>-                   ep_ptr, op_type, cookie, segments, local_iov,
>-                   remote_iov, completion_flags, remote_ah);
>-
>       ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES];
>       ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL;
>       struct ibv_send_wr wr;
>@@ -331,6 +325,12 @@ dapls_ib_post_ext_send (
>       int ret;
>
>       dapl_dbg_log(DAPL_DBG_TYPE_EP,
>+                   " post_ext_snd: ep %p op %d ck %p sgs",
>+                   "%d l_iov %p r_iov %p f %d\n",
>+                   ep_ptr, op_type, cookie, segments, local_iov,
>+                   remote_iov, completion_flags, remote_ah);
>+
>+      dapl_dbg_log(DAPL_DBG_TYPE_EP,
>                    " post_snd: ep %p cookie %p segs %d l_iov %p\n",
>                    ep_ptr, cookie, segments, local_iov);
>
>diff --git a/dapl/openib_scm/dapl_ib_mem.c
>b/dapl/openib_scm/dapl_ib_mem.c
>index 54340ed..9a97e5e 100644
>--- a/dapl/openib_scm/dapl_ib_mem.c
>+++ b/dapl/openib_scm/dapl_ib_mem.c
>@@ -1,4 +1,4 @@
>-/*
>+      /*
>  * Copyright (c) 2005-2007 Intel Corporation.  All rights reserved.
>  *
>  * This Software is licensed under one of the following licenses:
>@@ -35,13 +35,6 @@
>  *
>
>**********************************************************************/
>
>-#include <sys/ioctl.h>  /* for IOCTL's */
>-#include <sys/types.h>  /* for socket(2) and related bits and
>pieces */
>-#include <sys/socket.h> /* for socket(2) */
>-#include <net/if.h>     /* for struct ifreq */
>-#include <net/if_arp.h> /* for ARPHRD_ETHER */
>-#include <unistd.h>           /* for _SC_CLK_TCK */
>-
> #include "dapl.h"
> #include "dapl_adapter_util.h"
> #include "dapl_lmr_util.h"
>@@ -215,10 +208,9 @@ dapls_ib_mr_register(IN  DAPL_IA *ia_ptr,
>       lmr->param.registered_address = (DAT_VADDR)(uintptr_t)virt_addr;
>
>       dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>-                   " mr_register: mr=%p addr=%p h %x pd %p ctx %p "
>+                   " mr_register: mr=%p addr=%p pd %p ctx %p "
>                    "lkey=0x%x rkey=0x%x priv=%x\n",
>                    lmr->mr_handle, lmr->mr_handle->addr,
>-                   lmr->mr_handle->handle,
>                    lmr->mr_handle->pd, lmr->mr_handle->context,
>                    lmr->mr_handle->lkey, lmr->mr_handle->rkey,
>                    length, dapls_convert_privileges(privileges));
>diff --git a/dapl/openib_scm/dapl_ib_util.c
>b/dapl/openib_scm/dapl_ib_util.c
>index 92b45d5..d82d3f5 100644
>--- a/dapl/openib_scm/dapl_ib_util.c
>+++ b/dapl/openib_scm/dapl_ib_util.c
>@@ -49,17 +49,13 @@
> static const char rcsid[] = "$Id:  $";
> #endif
>
>+#include "openib_osd.h"
> #include "dapl.h"
> #include "dapl_adapter_util.h"
> #include "dapl_ib_util.h"
>+#include "dapl_osd.h"
>
> #include <stdlib.h>
>-#include <netinet/tcp.h>
>-#include <sys/utsname.h>
>-#include <sys/socket.h>
>-#include <arpa/inet.h>
>-#include <unistd.h>
>-#include <fcntl.h>
>
> int g_dapl_loopback_connection = 0;
> int g_scm_pipe[2];
>@@ -88,52 +84,43 @@ char *dapl_ib_mtu_str(enum ibv_mtu mtu)
>       }
> }
>
>-/* just get IP address for hostname */
>-DAT_RETURN getipaddr( char *addr, int addr_len)
>+static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len)
> {
>-      struct sockaddr_in      *ipv4_addr = (struct sockaddr_in*)addr;
>-      struct hostent          *h_ptr;
>-      struct utsname          ourname;
>+      struct sockaddr_in *sin;
>+      struct addrinfo *res, hint, *ai;
>+      int ret;
>+      char hostname[256];
>
>-      if (uname(&ourname) < 0)  {
>-               dapl_log(DAPL_DBG_TYPE_ERR,
>-                        " open_hca: uname err=%s\n", strerror(errno));
>+      if (addr_len < sizeof(*sin)) {
>               return DAT_INTERNAL_ERROR;
>       }
>
>-      h_ptr = gethostbyname(ourname.nodename);
>-      if (h_ptr == NULL) {
>-               dapl_log(DAPL_DBG_TYPE_ERR,
>-                        " open_hca: gethostbyname err=%s\n",
>-                        strerror(errno));
>-              return DAT_INTERNAL_ERROR;
>+      ret = gethostname(hostname,256);
>+      if (ret)
>+              return ret;
>+
>+      memset(&hint, 0, sizeof hint);
>+      hint.ai_flags = AI_PASSIVE;
>+      hint.ai_family = AF_INET;
>+      hint.ai_socktype = SOCK_STREAM;
>+      hint.ai_protocol = IPPROTO_TCP;
>+
>+      ret = getaddrinfo(hostname, NULL, &hint, &res);
>+      if (ret)
>+              return ret;
>+
>+      ret = DAT_INVALID_ADDRESS;
>+      for (ai = res; ai; ai = ai->ai_next) {
>+              sin = (struct sockaddr_in *) ai->ai_addr;
>+              if (*((uint32_t *) &sin->sin_addr) !=
>htonl(0x7f000001)) {
>+                      *((struct sockaddr_in *) addr) = *sin;
>+                      ret = DAT_SUCCESS;
>+                      break;
>+              }
>       }
>
>-      if (h_ptr->h_addrtype == AF_INET) {
>-              int i;
>-              struct in_addr  **alist =
>-                      (struct in_addr **)h_ptr->h_addr_list;
>-
>-              *(uint32_t*)&ipv4_addr->sin_addr = 0;
>-              ipv4_addr->sin_family = AF_INET;
>-
>-              /* Walk the list of addresses for host */
>-              for (i=0; alist[i] != NULL; i++) {
>-                     /* first non-loopback address */
>-                     if (*(uint32_t*)alist[i] != htonl(0x7f000001)) {
>-                               dapl_os_memcpy(&ipv4_addr->sin_addr,
>-                                              h_ptr->h_addr_list[i],
>-                                              4);
>-                               break;
>-                       }
>-               }
>-               /* if no acceptable address found */
>-               if (*(uint32_t*)&ipv4_addr->sin_addr == 0)
>-                      return DAT_INVALID_ADDRESS;
>-      } else
>-              return DAT_INVALID_ADDRESS;
>-
>-      return DAT_SUCCESS;
>+      freeaddrinfo(res);
>+      return ret;
> }
>
> /*
>@@ -165,6 +152,28 @@ int32_t dapls_ib_release (void)
>       return 0;
> }
>
>+#if defined(_WIN64) || defined(_WIN32)
>+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
>+{
>+      return 0;
>+}
>+#else // _WIN64 || WIN32
>+int dapls_config_comp_channel(struct ibv_comp_channel *channel)
>+{
>+      int opts;
>+
>+      opts = fcntl(channel->fd, F_GETFL); /* uCQ */
>+      if (opts < 0 || fcntl(channel->fd, F_SETFL, opts |
>O_NONBLOCK) < 0) {
>+              dapl_log(DAPL_DBG_TYPE_ERR,
>+                       " dapls_create_comp_channel: fcntl on
>ib_cq->fd %d ERR %d %s\n",
>+                       channel->fd, opts, strerror(errno));
>+              return errno;
>+      }
>+
>+      return 0;
>+}
>+#endif
>+
> /*
>  * dapls_ib_open_hca
>  *
>@@ -187,7 +196,6 @@ DAT_RETURN dapls_ib_open_hca (
>         IN   DAPL_HCA         *hca_ptr)
> {
>       struct ibv_device **dev_list;
>-      int             opts;
>       int             i;
>       DAT_RETURN      dat_status = DAT_SUCCESS;
>
>@@ -219,7 +227,7 @@ found:
>       dapl_dbg_log(DAPL_DBG_TYPE_UTIL," open_hca: Found dev
>%s %016llx\n",
>                    ibv_get_device_name(hca_ptr->ib_trans.ib_dev),
>                    (unsigned long long)
>-
>bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev)));
>+
>ntohll(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev)));
>
>       hca_ptr->ib_hca_handle =
>ibv_open_device(hca_ptr->ib_trans.ib_dev);
>       if (!hca_ptr->ib_hca_handle) {
>@@ -268,13 +276,7 @@ found:
>               goto bail;
>       }
>
>-      opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */
>-      if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd,
>-                            F_SETFL, opts | O_NONBLOCK) < 0) {
>-              dapl_log(DAPL_DBG_TYPE_ERR,
>-                       " open_hca: fcntl on ib_cq->fd %d ERR
>%d %s\n",
>-                       hca_ptr->ib_trans.ib_cq->fd, opts,
>-                       strerror(errno));
>+      if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) {
>               goto bail;
>       }
>
>@@ -309,16 +311,11 @@ found:
>
>       /* wait for thread */
>       while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) {
>-              struct timespec sleep, remain;
>-              sleep.tv_sec = 0;
>-              sleep.tv_nsec = 2000000; /* 2 ms */
>-              dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>-                           " open_hca: waiting for cr_thread\n");
>-              nanosleep (&sleep, &remain);
>+              dapl_os_sleep_usec(20000);
>       }
>
>       /* get the IP address of the device */
>-      dat_status = getipaddr((char*)&hca_ptr->hca_address,
>+      dat_status = getlocalipaddr((DAT_SOCK_ADDR*)
>&hca_ptr->hca_address,
>                               sizeof(DAT_SOCK_ADDR6));
>
>       dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>@@ -376,16 +373,13 @@ DAT_RETURN dapls_ib_close_hca (  IN
>DAPL_HCA       *hca_ptr )
>                        " thread_destroy: thread wakeup err = %s\n",
>                        strerror(errno));
>       while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) {
>-              struct timespec sleep, remain;
>-              sleep.tv_sec = 0;
>-              sleep.tv_nsec = 2000000; /* 2 ms */
>               dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
>                            " close_hca: waiting for cr_thread\n");
>               if (write(g_scm_pipe[1], "w", sizeof "w") == -1)
>                       dapl_log(DAPL_DBG_TYPE_UTIL,
>                                " thread_destroy: thread
>wakeup err = %s\n",
>                                strerror(errno));
>-              nanosleep (&sleep, &remain);
>+              dapl_os_sleep_usec(20000);
>       }
>       dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
>
>diff --git a/dapl/openib_scm/dapl_ib_util.h
>b/dapl/openib_scm/dapl_ib_util.h
>index 863da2b..fd1c24e 100644
>--- a/dapl/openib_scm/dapl_ib_util.h
>+++ b/dapl/openib_scm/dapl_ib_util.h
>@@ -49,8 +49,8 @@
> #ifndef _DAPL_IB_UTIL_H_
> #define _DAPL_IB_UTIL_H_
>
>+#include "openib_osd.h"
> #include <infiniband/verbs.h>
>-#include <byteswap.h>
>
> #ifdef DAT_EXTENSIONS
> #include <dat2/dat_ib_extensions.h>
>@@ -73,8 +73,6 @@ typedef      struct ibv_wc
>ib_work_completion_t;
> typedef       struct ibv_context      *ib_hca_handle_t;
> typedef ib_hca_handle_t               dapl_ibal_ca_t;
>
>-/* CM mappings, user CM not complete use SOCKETS */
>-
> /* destination info to exchange, define wire protocol version */
> #define DSCM_VER 3
> typedef struct _ib_qp_cm
>@@ -86,7 +84,7 @@ typedef struct _ib_qp_cm
>       uint32_t                qpn;
>       uint32_t                p_size;
>       DAT_SOCK_ADDR6          ia_address;
>-        union ibv_gid         gid;
>+      union ibv_gid           gid;
>       uint16_t                qp_type;
> } ib_qp_cm_t;
>
>@@ -110,20 +108,18 @@ struct ib_cm_handle
>       struct dapl_llist_entry entry;
>       DAPL_OS_LOCK            lock;
>       SCM_STATE               state;
>-      int                     socket;
>+      DAPL_SOCKET             socket;
>       struct dapl_hca         *hca;
>       struct dapl_sp          *sp;
>-      struct dapl_ep          *ep;
>+      struct dapl_ep          *ep;
>       ib_qp_cm_t              dst;
>-      unsigned char           p_data[256];
>+      unsigned char           p_data[256];    /* must follow
>ib_qp_cm_t */
>       struct ibv_ah           *ah;
> };
>
> typedef struct ib_cm_handle   *dp_ib_cm_handle_t;
> typedef dp_ib_cm_handle_t     ib_cm_srvc_handle_t;
>
>-DAT_RETURN getipaddr(char *addr, int addr_len);
>-
> /* CM events */
> typedef enum
> {
>@@ -141,9 +137,6 @@ typedef enum
>
> } ib_cm_events_t;
>
>-/* prototype for cm thread */
>-void cr_thread (void *arg);
>-
> /* Operation and state mappings */
> typedef enum  ibv_send_flags  ib_send_op_type_t;
> typedef       struct  ibv_sge         ib_data_segment_t;
>@@ -289,7 +282,7 @@ typedef struct _ib_hca_transport
>       DAPL_OS_LOCK            cq_lock;
>       int                     max_inline_send;
>       ib_thread_state_t       cq_state;
>-      DAPL_OS_THREAD          cq_thread;
>+      DAPL_OS_THREAD                  cq_thread;
>       struct ibv_comp_channel *ib_cq;
>       int                     cr_state;
>       DAPL_OS_THREAD          thread;
>@@ -317,7 +310,6 @@ typedef uint32_t ib_shm_transport_t;
> /* prototypes */
> int32_t       dapls_ib_init (void);
> int32_t       dapls_ib_release (void);
>-void cq_thread (void *arg);
> void cr_thread(void *arg);
> int dapli_cq_thread_init(struct dapl_hca *hca_ptr);
> void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr);
>@@ -349,7 +341,7 @@ dapl_convert_errno( IN int err, IN const
>char *str )
>     if (!err) return DAT_SUCCESS;
>
> #if DAPL_DBG
>-    if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT))
>+    if ((err != EAGAIN) && (err != ETIMEDOUT))
>       dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err));
> #endif
>
>@@ -357,24 +349,15 @@ dapl_convert_errno( IN int err, IN const
>char *str )
>     {
>       case EOVERFLOW  : return DAT_LENGTH_ERROR;
>       case EACCES     : return DAT_PRIVILEGES_VIOLATION;
>-      case ENXIO      :
>-      case ERANGE     :
>       case EPERM      : return DAT_PROTECTION_VIOLATION;
>
>-      case EINVAL     :
>-        case EBADF    :
>-      case ENOENT     :
>-      case ENOTSOCK   : return DAT_INVALID_HANDLE;
>+      case EINVAL     : return DAT_INVALID_HANDLE;
>       case EISCONN    : return DAT_INVALID_STATE |
>DAT_INVALID_STATE_EP_CONNECTED;
>       case ECONNREFUSED : return DAT_INVALID_STATE |
>DAT_INVALID_STATE_EP_NOTREADY;
>-      case ETIME      :
>       case ETIMEDOUT  : return DAT_TIMEOUT_EXPIRED;
>       case ENETUNREACH: return DAT_INVALID_ADDRESS |
>DAT_INVALID_ADDRESS_UNREACHABLE;
>       case EADDRINUSE : return DAT_CONN_QUAL_IN_USE;
>       case EALREADY   : return DAT_INVALID_STATE |
>DAT_INVALID_STATE_EP_ACTCONNPENDING;
>-        case ENOSPC   :
>-      case ENOMEM     :
>-        case E2BIG    :
>-        case EDQUOT   : return DAT_INSUFFICIENT_RESOURCES;
>+      case ENOMEM     : return DAT_INSUFFICIENT_RESOURCES;
>         case EAGAIN   : return DAT_QUEUE_EMPTY;
>       case EINTR      : return DAT_INTERRUPTED_CALL;
>       case EAFNOSUPPORT : return DAT_INVALID_ADDRESS |
>DAT_INVALID_ADDRESS_MALFORMED;
>diff --git a/dapl/openib_scm/linux/openib_osd.h
>b/dapl/openib_scm/linux/openib_osd.h
>new file mode 100644
>index 0000000..235a82e
>--- /dev/null
>+++ b/dapl/openib_scm/linux/openib_osd.h
>@@ -0,0 +1,21 @@
>+#ifndef OPENIB_OSD_H
>+#define OPENIB_OSD_H
>+
>+#include <endian.h>
>+#include <netinet/in.h>
>+
>+#if __BYTE_ORDER == __BIG_ENDIAN
>+#define htonll(x) (x)
>+#define ntohll(x) (x)
>+#elif __BYTE_ORDER == __LITTLE_ENDIAN
>+#define htonll(x)  bswap_64(x)
>+#define ntohll(x)  bswap_64(x)
>+#endif
>+
>+#define DAPL_SOCKET int
>+#define DAPL_INVALID_SOCKET -1
>+#define DAPL_FD_SETSIZE 8192
>+
>+#define closesocket close
>+
>+#endif // OPENIB_OSD_H
>diff --git a/dapl/openib_scm/windows/openib_osd.h
>b/dapl/openib_scm/windows/openib_osd.h
>new file mode 100644
>index 0000000..67c70ec
>--- /dev/null
>+++ b/dapl/openib_scm/windows/openib_osd.h
>@@ -0,0 +1,39 @@
>+#ifndef OPENIB_OSD_H
>+#define OPENIB_OSD_H
>+
>+#ifndef FD_SETSIZE
>+#define FD_SETSIZE 1024 /* Set before including winsock2 -
>see select help */
>+#define DAPL_FD_SETSIZE FD_SETSIZE
>+#endif
>+
>+#include <winsock2.h>
>+#include <ws2tcpip.h>
>+#include <io.h>
>+#include <fcntl.h>
>+
>+#define ntohll _byteswap_uint64
>+#define htonll _byteswap_uint64
>+
>+#define pipe(x) _pipe(x, 4096, _O_TEXT)
>+#define read _read
>+#define write _write
>+#define DAPL_SOCKET SOCKET
>+#define DAPL_INVALID_SOCKET INVALID_SOCKET
>+
>+/* allow casting to WSABUF */
>+struct iovec
>+{
>+       u_long iov_len;
>+       char FAR* iov_base;
>+};
>+
>+static int writev(DAPL_SOCKET s, struct iovec *vector, int count)
>+{
>+       int len, ret;
>+
>+       ret = WSASend(s, (WSABUF *) vector, count, &len, 0,
>NULL, NULL);
>+       return ret ? ret : len;
>+}
>+
>+#endif // OPENIB_OSD_H
>+
>diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
>index 6fef9af..ae02944 100644
>--- a/dapl/udapl/linux/dapl_osd.h
>+++ b/dapl/udapl/linux/dapl_osd.h
>@@ -302,6 +302,15 @@ dapl_os_thread_create (
>       IN  void                        *data,
>       OUT DAPL_OS_THREAD              *thread_id );
>
>+STATIC _INLINE_ void
>+dapl_os_sleep_usec(int usec)
>+{
>+      struct timespec sleep, remain;
>+
>+      sleep.tv_sec = 0;
>+      sleep.tv_nsec = usec * 1000;
>+      nanosleep(&sleep, &remain);
>+}
>
> /*
>  * Lock Functions
>
>
>
>


From weiny2 at llnl.gov  Tue Feb 17 09:19:55 2009
From: weiny2 at llnl.gov (weiny2 at llnl.gov)
Date: Tue, 17 Feb 2009 09:19:55 -0800
Subject: Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
Message-ID: <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>

Quoting Hal Rosenstock <hal.rosenstock at gmail.com>:

> Sasha,
>
> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>>
>> I looked at implementation of safe_*() functions (safe_smp_query,
>> safe_smp_set and safe_ca_call) and found that they are not actually
>> "safe" as declared by its names. The only thread-unsafe thing which
>> is used there is static 'mad_portid' structure (from rpc.c),
>
> I'm not sure that the only thread unsafe thing in the mad rpc
> mechanism is the portid.
>
>> but modification of this structure is not protected by same mutex (actually
>> not protected at all).
>
> A first step would be removing the portid as static. If so, portid
> would need to be a supplied parameter to various mad routines and the
> existing ones relying on madrpc_portid would be deprecated. Does this
> make sense to do ? Would you accept such a patch ?
>

Don't we already have an interface like this with mad_rpc_open_port?

I don't like the void * return, but it is a "struct ibmad_port" under
the hood.  Are the calls which use it not thread safe?
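
To make the question concrete, here is a minimal sketch of the handle style in plain C with pthreads. It is not libibmad code: struct demo_port, demo_open_port(), demo_query(), demo_close_port() and demo_run() are hypothetical stand-ins for an opaque per-caller handle of the kind mad_rpc_open_port returns. Whether the real calls are thread safe depends on what state they share internally, which this sketch does not show.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for an opaque per-caller port handle. */
struct demo_port {
	int port_id;
	unsigned long queries;	/* state touched by every call */
};

static void *demo_open_port(int port_id)
{
	struct demo_port *p = calloc(1, sizeof(*p));

	if (p)
		p->port_id = port_id;
	return p;		/* void * hides the struct from callers */
}

static int demo_query(void *handle)
{
	struct demo_port *p = handle;

	p->queries++;		/* only this handle's state is modified */
	return p->port_id;
}

static void demo_close_port(void *handle)
{
	free(handle);
}

#define NTHREADS 4
#define NQUERIES 100000

static void *worker(void *arg)
{
	int i;

	for (i = 0; i < NQUERIES; i++)
		demo_query(arg);
	return NULL;
}

/* Each thread owns its own handle. Returns 0 if no count was lost. */
static int demo_run(void)
{
	pthread_t tid[NTHREADS];
	void *port[NTHREADS] = { NULL };
	int started = 0, i, ret = 0;

	for (i = 0; i < NTHREADS; i++) {
		port[i] = demo_open_port(i + 1);
		if (!port[i] ||
		    pthread_create(&tid[i], NULL, worker, port[i])) {
			ret = -1;
			break;
		}
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	for (i = 0; i < started; i++)
		if (((struct demo_port *)port[i])->queries != NQUERIES)
			ret = -1;
	for (i = 0; i < NTHREADS; i++)
		demo_close_port(port[i]);	/* free(NULL) is a no-op */
	return ret;
}
```

Calling demo_run() from main() should return 0 because no state is shared between threads. If all threads were handed the same handle, or the state lived in a file-scope static as with the portid discussed above, the unprotected increments would race and the caller would need its own lock.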

Ira


> -- Hal
>
>> As far as I know nothing uses those safe_*() primitives right now outside
>> libibmad, so I think it is better to remove these confusing functions from
>> the API (with a change of library version, etc.).
>>
>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers around a
>> hidden static pthread mutex which is not controlled by the calling
>> application. I think that it will be more robust for a multithreaded
>> application to use its own synchronization methods (pthread mutex or any
>> other) for better control. So let's remove madrpc_lock/unlock() too.
>>
>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>> ---
>>  libibmad/include/infiniband/mad.h |   41 -------------------------------------
>>  libibmad/libibmad.ver             |    2 +-
>>  libibmad/src/libibmad.map         |    2 -
>>  libibmad/src/rpc.c                |   15 -------------
>>  libibmad/src/sa.c                 |    5 ++-
>>  5 files changed, 4 insertions(+), 61 deletions(-)
>>
>> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
>> index eff6738..89b4be5 100644
>> --- a/libibmad/include/infiniband/mad.h
>> +++ b/libibmad/include/infiniband/mad.h
>> @@ -703,8 +703,6 @@ void *  madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp,
>>  void   madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
>>                    int num_classes);
>>  void   madrpc_save_mad(void *madbuf, int len);
>> -void   madrpc_lock(void);
>> -void   madrpc_unlock(void);
>>  void   madrpc_show_errors(int set);
>>
>>  void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid,
>>  uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod,
>>                      unsigned timeout, const void *srcport);
>>
>> -inline static uint8_t *
>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod,
>> -              unsigned timeout)
>> -{
>> -       uint8_t *p;
>> -
>> -       madrpc_lock();
>> -       p = smp_query(rcvbuf, portid, attrid, mod, timeout);
>> -       madrpc_unlock();
>> -
>> -       return p;
>> -}
>> -
>> -inline static uint8_t *
>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod,
>> -            unsigned timeout)
>> -{
>> -       uint8_t *p;
>> -
>> -       madrpc_lock();
>> -       p = smp_set(rcvbuf, portid, attrid, mod, timeout);
>> -       madrpc_unlock();
>> -
>> -       return p;
>> -}
>> -
>>  /* sa.c */
>>  uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
>>                  unsigned timeout);
>> @@ -761,19 +733,6 @@ int        ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id,
>>  int    ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
>>                          ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf);
>>
>> -inline static uint8_t *
>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
>> -            unsigned timeout)
>> -{
>> -       uint8_t *p;
>> -
>> -       madrpc_lock();
>> -       p = sa_call(rcvbuf, portid, sa, timeout);
>> -       madrpc_unlock();
>> -
>> -       return p;
>> -}
>> -
>>  /* resolve.c */
>>  int    ib_resolve_smlid(ib_portid_t *sm_id, int timeout);
>>  int    ib_resolve_guid(ib_portid_t *portid, uint64_t *guid,
>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver
>> index 7e93c16..23d2dc2 100644
>> --- a/libibmad/libibmad.ver
>> +++ b/libibmad/libibmad.ver
>> @@ -6,4 +6,4 @@
>>  # API_REV - advance on any added API
>>  # RUNNING_REV - advance any change to the vendor files
>>  # AGE - number of backward versions the API still supports
>> -LIBVERSION=5:0:4
>> +LIBVERSION=2:0:0
>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
>> index 927e51c..f944d86 100644
>> --- a/libibmad/src/libibmad.map
>> +++ b/libibmad/src/libibmad.map
>> @@ -72,14 +72,12 @@ IBMAD_1.3 {
>>                madrpc;
>>                madrpc_def_timeout;
>>                madrpc_init;
>> -               madrpc_lock;
>>                madrpc_portid;
>>                madrpc_rmpp;
>>                madrpc_save_mad;
>>                madrpc_set_retries;
>>                madrpc_set_timeout;
>>                madrpc_show_errors;
>> -               madrpc_unlock;
>>                ib_path_query;
>>                sa_call;
>>                sa_rpc_call;
>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
>> index 5226540..670a936 100644
>> --- a/libibmad/src/rpc.c
>> +++ b/libibmad/src/rpc.c
>> @@ -38,7 +38,6 @@
>>  #include <stdio.h>
>>  #include <stdlib.h>
>>  #include <unistd.h>
>> -#include <pthread.h>
>>  #include <string.h>
>>  #include <errno.h>
>>
>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data)
>>        return mad_rpc_rmpp(&port, rpc, dport, rmpp, data);
>>  }
>>
>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER;
>> -
>> -void
>> -madrpc_lock(void)
>> -{
>> -       pthread_mutex_lock(&rpclock);
>> -}
>> -
>> -void
>> -madrpc_unlock(void)
>> -{
>> -       pthread_mutex_unlock(&rpclock);
>> -}
>> -
>>  void
>>  madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
>>  {
>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
>> index 27b9d52..c601254 100644
>> --- a/libibmad/src/sa.c
>> +++ b/libibmad/src/sa.c
>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid,
>>        if (srcport) {
>>                p = sa_rpc_call (srcport, buf, sm_id, &sa, 0);
>>        } else {
>> -               p = safe_sa_call(buf, sm_id, &sa, 0);
>> +               p = sa_call(buf, sm_id, &sa, 0);
>>        }
>>        if (!p) {
>>                IBWARN("sa call path_query failed");
>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid,
>>        mad_decode_field(p, IB_SA_PR_DLID_F, &dlid);
>>        return dlid;
>>  }
>> +
>>  int
>>  ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf)
>>  {
>> -       return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf);
>> +       return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf);
>>  }
>> --
>> 1.6.0.4.766.g6fc4a


From brian at sun.com  Tue Feb 17 09:52:23 2009
From: brian at sun.com (Brian J. Murrell)
Date: Tue, 17 Feb 2009 12:52:23 -0500
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <499A8A20.1090507@mellanox.co.il>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
	<49994BB2.3010206@mellanox.co.il>
	<7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>
	<499A8A20.1090507@mellanox.co.il>
Message-ID: <1234893143.21802.96.camel@pc.interlinx.bc.ca>

On Tue, 2009-02-17 at 11:57 +0200, Tziporet Koren wrote:
> neutron wrote:
> > One remaining question.
> >
> > In my code of kernel module,   do I need to #include the header files
> > from <ofed-prefix>/src/openib/include/....
> > Or I just include the header files from  <kernel_src_dir>/include/.....
> >
> >   
> You should use the headers from ofed if you wish to use OFED kernel modules.

Ahhh.  But should he include just <ofed-prefix>/src/openib/include/, or
also
<ofed-prefix>/src/openib/kernel_addons/backport/<kernel_ver>/include/
(as described in <ofed-prefix>/src/openib/ofed_patch.mk)?

And in what order should these be specified?

b.



From sashak at voltaire.com  Tue Feb 17 10:50:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 20:50:27 +0200
Subject: [ofa-general] [PATCH] opensm/console: dump_portguid command fixes
Message-ID: <20090217185027.GJ7189@sashak.voltaire.com>


Don't try to match invalid expressions, so things like 'dump_portguid *'
will not crash.

Free memory allocated by regcomp() and for regexp list.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_console.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index a66a7d3..0f26e51 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1247,6 +1247,8 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 					fprintf(out,
 						"Couldn't parse regular expression %s. Skipping it.\n",
 						p_cmd);
+					free(p_regexp);
+					continue;
 				}
 				p_regexp->next = p_head_regexp;
 				p_head_regexp = p_regexp;
@@ -1292,6 +1294,11 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 	if (output != out)
 		fclose(output);
 
+	for (; p_head_regexp; p_head_regexp = p_regexp) {
+		p_regexp = p_head_regexp->next;
+		regfree(&p_head_regexp->exp);
+		free(p_head_regexp);
+	}
 }
 
 static void help_dump_portguid(FILE * out, int detail)
-- 
1.6.1.2.319.gbd9e
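
For readers who want the two fixes above outside the context of the
diff, the following is a small standalone sketch of the same pattern
using POSIX regcomp()/regfree(). The names regexp_item,
compile_patterns and free_patterns are made up for illustration and are
not taken from osm_console.c; the regcomp() flags are an assumption as
well.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical list node holding one compiled expression. */
struct regexp_item {
	regex_t exp;
	struct regexp_item *next;
};

/*
 * Compile each pattern and link it into a list. A pattern that regcomp()
 * rejects is reported, its node is freed, and it is never linked in, so
 * regexec() is only ever called on valid expressions.
 */
static struct regexp_item *compile_patterns(const char *const *patterns,
					    size_t n, size_t *skipped)
{
	struct regexp_item *head = NULL, *item;
	size_t i;

	assert(skipped != NULL);
	*skipped = 0;
	for (i = 0; i < n; i++) {
		item = malloc(sizeof(*item));
		if (!item)
			break;
		if (regcomp(&item->exp, patterns[i],
			    REG_NOSUB | REG_EXTENDED) != 0) {
			fprintf(stderr,
				"Couldn't parse regular expression %s. Skipping it.\n",
				patterns[i]);
			free(item);
			(*skipped)++;
			continue;
		}
		item->next = head;
		head = item;
	}
	return head;
}

/* Release both the compiled expression and the node; returns nodes freed. */
static size_t free_patterns(struct regexp_item *head)
{
	struct regexp_item *next;
	size_t freed = 0;

	for (; head; head = next) {
		next = head->next;
		regfree(&head->exp);
		free(head);
		freed++;
	}
	return freed;
}
```

The point of the sketch is the pairing: every node that survives
compilation gets exactly one regfree() and one free(), and a node that
fails compilation gets only free().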


From sashak at voltaire.com  Tue Feb 17 10:51:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 20:51:09 +0200
Subject: [ofa-general] [PATCH] opensm/console: dump_portguid - don't
	duplicate matched guids
Message-ID: <20090217185109.GK7189@sashak.voltaire.com>


Don't repeat port GUIDs when more than one regular expression matches.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_console.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 0f26e51..0c3cdbf 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1285,9 +1285,11 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 		     p_regexp = p_regexp->next)
 			if (regexec
 			    (&(p_regexp->exp), p_port->p_node->print_desc, 0,
-			     NULL, 0) == 0)
+			     NULL, 0) == 0) {
 				fprintf(output, "0x%" PRIxLEAST64 "\n",
 					cl_ntoh64(p_port->p_physp->port_guid));
+				break;
+			}
 	}
 
 	CL_PLOCK_RELEASE(p_osm->sm.p_lock);
-- 
1.6.1.2.319.gbd9e
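
The effect of the added break can be shown with a small standalone
sketch: stop at the first expression that matches, so a name is reported
once no matter how many expressions it satisfies. The helper names
(matches_any, count_reported), the MAX_EXPS limit and the regcomp()
flags are assumptions made for the example and do not come from opensm.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_EXPS 8	/* arbitrary limit for this sketch */

/* Returns 1 as soon as one expression matches s, 0 if none does. */
static int matches_any(const regex_t *exps, size_t n, const char *s)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (regexec(&exps[i], s, 0, NULL, 0) == 0)
			return 1;	/* first match is enough */
	return 0;
}

/* Counts how many names would be reported; each name counts at most once. */
static size_t count_reported(const char *const *patterns, size_t npat,
			     const char *const *names, size_t nnames)
{
	regex_t exps[MAX_EXPS];
	size_t nexp = 0, reported = 0, i;

	assert(npat <= MAX_EXPS);
	for (i = 0; i < npat; i++)
		if (regcomp(&exps[nexp], patterns[i],
			    REG_NOSUB | REG_EXTENDED) == 0)
			nexp++;	/* invalid patterns are simply skipped */

	for (i = 0; i < nnames; i++)
		if (matches_any(exps, nexp, names[i]))
			reported++;

	for (i = 0; i < nexp; i++)
		regfree(&exps[i]);
	return reported;
}
```

Without the early return, a name matching two expressions would be
counted (or printed) twice, which is the behaviour the patch removes.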


From hal.rosenstock at gmail.com  Tue Feb 17 13:12:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 16:12:12 -0500
Subject: Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
Message-ID: <f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>

On Tue, Feb 17, 2009 at 12:19 PM,  <weiny2 at llnl.gov> wrote:
> Quoting Hal Rosenstock <hal.rosenstock at gmail.com>:
>
>> Sasha,
>>
>> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky  <sashak at voltaire.com>
>> wrote:
>>>
>>> I looked at the implementation of the safe_*() functions (safe_smp_query,
>>> safe_smp_set and safe_sa_call) and found that they are not actually
>>> "safe" as declared by their names. The only thread-unsafe thing which
>>> is used there is the static 'mad_portid' structure (from rpc.c),
>>
>> I'm not sure that the only thread unsafe thing in the mad rpc
>> mechanism is the portid.
>>
>>> but modification of this structure is not protected by the same mutex
>>> (actually it is not protected at all).
>>
>> A first step would be removing the portid as static. If so, portid
>> would need to be a supplied parameter to various mad routines and the
>> existing ones relying on madrpc_portid would be deprecated. Does this
>> make sense to do ? Would you accept such a patch ?
>>

> Don't we already have an interface like this with mad_rpc_open_port?

I'm not sure this was carried all the way through (the basic building
blocks are there, but I think some additional routines are needed).

Shouldn't the in-tree clients be converted over and the old routines
deprecated?

> I don't like the void * return but it is "struct ibmad_port" under the hood.

Do the clients of the library need access into that currently opaque
struct for something?

> Are those calls which use it not thread safe?

They look OK but I'm not 100% sure yet.
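
Until that is settled, an application can serialize the calls itself
with a lock it owns, which is what the patch description recommends in
place of madrpc_lock()/madrpc_unlock(). A minimal sketch of that
approach follows; demo_unsafe_call() is a hypothetical stand-in for any
library routine that touches unprotected shared state, and none of the
names below are libibmad functions.

```c
#include <assert.h>
#include <pthread.h>

/* Stands in for unprotected state inside a library (hypothetical). */
static unsigned long demo_shared_calls;

/* Stands in for a library call that is not thread safe (hypothetical). */
static void demo_unsafe_call(void)
{
	demo_shared_calls++;
}

/* The lock belongs to the application, not to the library. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

static void *demo_worker(void *arg)
{
	unsigned long n, i;

	assert(arg != NULL);
	n = *(unsigned long *)arg;
	for (i = 0; i < n; i++) {
		pthread_mutex_lock(&app_lock);
		demo_unsafe_call();
		pthread_mutex_unlock(&app_lock);
	}
	return NULL;
}

/* Runs two workers and returns the total number of calls that were counted. */
static unsigned long demo_run(unsigned long calls_per_thread)
{
	pthread_t tids[2];
	int i, started = 0;

	demo_shared_calls = 0;
	for (i = 0; i < 2; i++) {
		if (pthread_create(&tids[i], NULL, demo_worker,
				   &calls_per_thread) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return demo_shared_calls;
}
```

Because the application owns app_lock, it can choose the granularity,
for example holding the lock across a whole query sequence rather than
around a single call.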

-- Hal



From sashak at voltaire.com  Tue Feb 17 13:18:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 23:18:48 +0200
Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation
	support
In-Reply-To: <20090214203753.GE32660@comcast.net>
References: <20090214203753.GE32660@comcast.net>
Message-ID: <20090217211848.GP7189@sashak.voltaire.com>

Hi Hal,

On 15:37 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> Add SIM_PORT environment variable to allow for end port selection

How would this handle the case when SIM_PORT=N but the program tries to
work via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)?

In other words, should port number selection be initiated natively by the
program rather than by using environment variables?

> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
>  ibsim/ibsim.c         |    6 +-
>  include/ibsim.h       |    2 +
>  umad2sim/sim_client.c |   49 +++++++++-
>  umad2sim/sim_client.h |    4 +-
>  umad2sim/umad2sim.c   |  254 ++++++++++++++++++++++++++-----------------------
>  5 files changed, 189 insertions(+), 126 deletions(-)
> 
> diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
> index f48e1f0..6a35fdc 100644
> --- a/ibsim/ibsim.c
> +++ b/ibsim/ibsim.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -187,7 +188,8 @@ static int sm_exists(Node * node)
>  	return 0;
>  }
>  
> -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from)
> +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl,
> +			      union name_t *from)
>  {
>  	union name_t name;
>  	size_t size;
> @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f
>  			ctl->type = SIM_CTL_ERROR;
>  			return -1;
>  		}
> -		cl->port = node_get_port(node, 0);
> +		cl->port = node_get_port(node, scl->portnum);
>  		VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64,
>  		     i, node->nodeid, cl->port->portguid);
>  	} else {
> diff --git a/include/ibsim.h b/include/ibsim.h
> index 15fc37c..66ba6f9 100644
> --- a/include/ibsim.h
> +++ b/include/ibsim.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -100,6 +101,7 @@ struct sim_client_info {
>  	uint32_t qp;
>  	uint32_t issm;		/* accept request for qp 0 & 1 */
>  	char nodeid[32];
> +	uint32_t portnum;
>  };
>  
>  union name_t {
> diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
> index 06bb7a8..1c35109 100644
> --- a/umad2sim/sim_client.c
> +++ b/umad2sim/sim_client.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid)
>  	info.id = id;
>  	info.issm = 0;
>  	info.qp = qp;
> +	info.portnum = sc->portnum;
>  
>  	if (nodeid)
>  		strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1);
> @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc)
>  	return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0);
>  }
>  
> -static int sim_init(struct sim_client *sc, char *nodeid)
> +static int sim_init(struct sim_client *sc, char *nodeid, int portnum)
>  {
>  	union name_t name;
>  	socklen_t size;
> @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid)
>  	DEBUG("init %d: opened ctl fd %d as \'%s\'",
>  	      pid, ctlfd, get_name(&name));
>  
> +	sc->portnum = portnum;
>  	port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT;
>  	size = make_name(&name, connect_host, port, "%s:ctl", socket_basename);
>  
> @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm)
>  int sim_client_init(struct sim_client *sc)
>  {
>  	char *nodeid;
> +	char *portno;
> +	int i, j = 0, portnum = 0, startport = 1, endport;
> +	uint8_t numports, nodetype;
> +	uint8_t *portinfo;
>  
>  	nodeid = getenv("SIM_HOST");
> -	if (sim_init(sc, nodeid) < 0)
> +	portno = getenv("SIM_PORT");
> +	if (portno)
> +		portnum = atoi(portno);
> +
> +	if (sim_init(sc, nodeid, portnum) < 0)
>  		return -1;
>  	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
>  		    sizeof(sc->vendor)) < 0)
> @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc)
>  	if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo,
>  		    sizeof(sc->nodeinfo)) < 0)
>  		goto _exit;
> +	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
> +	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
> +	if (nodetype == 2) { // switch
> +		startport = 0;
> +		endport = 0;
> +	} else {
> +		if (portnum == 0) {
> +			IBWARN("portnum 0 is not valid end port on non switch node");
> +			goto _exit;
> +		}

This makes exporting the SIM_PORT environment variable mandatory,
which doesn't look like a good idea to me (personally, I would need to
rewrite a number of my scripts).

I think that SIM_PORT should be optional and the default behavior
should be preserved.
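
Something along these lines would keep the variable optional. This is
only a sketch: port_from_env_value() is a hypothetical helper, the 0-255
range check is an assumption, and the fallback port is left to the
caller so that the existing behavior can be kept.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an optional port number. A missing, empty, malformed or
 * out-of-range value yields default_port instead of an error.
 */
static int port_from_env_value(const char *value, int default_port)
{
	char *end;
	long n;

	assert(default_port >= 0);
	if (!value || !*value)
		return default_port;

	errno = 0;
	n = strtol(value, &end, 10);
	if (errno || *end != '\0' || n < 0 || n > 255)
		return default_port;
	return (int)n;
}
```

A caller would use it as port_from_env_value(getenv("SIM_PORT"), def),
where def is whatever port the code used before the variable existed.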

> +		endport = numports;
> +	}
> +	if (portnum > endport) {
> +		IBWARN("portnum %d is not a valid end port number (%d)",
> +		       portnum, endport);
> +		goto _exit;
> +	}
>  
> -	sc->portinfo[0] = 0;	// portno requested
> -	if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo,
> -		    sizeof(sc->portinfo)) < 0)
> +	sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1));	// portinfo size x number of ports starting at 0
> +	if (!sc->portinfo)
>  		goto _exit;
> +
> +	// loop through end ports
> +	for (i = startport; i <= endport ; i++, j++) {
> +		portinfo = sc->portinfo + 64 * j;

You don't need 'j'; just advance the portinfo pointer.
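
Roughly like this (a standalone sketch with a hypothetical
fill_port_records() helper and a fixed 64-byte record size, not the
actual sim_client code; it stores the port number in the first byte
only to make the layout visible):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RECORD_SIZE 64

/*
 * Fill one record per port in [startport, endport]. buf must hold
 * (endport - startport + 1) * RECORD_SIZE bytes. The record pointer is
 * advanced in the loop header, so no second index is needed.
 * Returns the number of records written.
 */
static size_t fill_port_records(uint8_t *buf, int startport, int endport)
{
	uint8_t *rec = buf;
	int i;

	assert(buf != NULL);
	for (i = startport; i <= endport; i++, rec += RECORD_SIZE) {
		memset(rec, 0, RECORD_SIZE);
		rec[0] = (uint8_t)i;	/* port number requested */
	}
	return (size_t)(rec - buf) / RECORD_SIZE;
}
```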

> +		*portinfo = i + 1; // portno requested
> +		if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0)
> +			goto _exit;
> +	}
> +
> +	// although pkeys also per port, current config same on all end ports

Which is not really correct.

Sasha

>  	if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0)
>  		goto _exit;
>  	if (getenv("SIM_SET_ISSM"))
> @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc)
>  void sim_client_exit(struct sim_client *sc)
>  {
>  	sim_disconnect(sc);
> +	if (sc->portinfo)
> +		free(sc->portinfo);
>  	sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1;
>  }
> diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h
> index 80ed442..0faca80 100644
> --- a/umad2sim/sim_client.h
> +++ b/umad2sim/sim_client.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -41,8 +42,9 @@ struct sim_client {
>  	int clientid;
>  	int fd_pktin, fd_pktout, fd_ctl;
>  	struct sim_vendor vendor;
> +	int portnum;
>  	uint8_t nodeinfo[64];
> -	uint8_t portinfo[64];
> +	uint8_t *portinfo;
>  	uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)];
>  };
>  
> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
> index 8d83a24..6e3c269 100644
> --- a/umad2sim/umad2sim.c
> +++ b/umad2sim/umad2sim.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>  	struct sim_client *sc = &dev->sim_client;
>  	char *str;
>  	uint8_t *portinfo;
> -	int i;
> +	char *ports_path_end;
> +	int i, j;
> +	int startport = 1, endport;
> +	uint8_t numports, nodetype;
>  
>  	/* /sys/class/infiniband_mad/abi_version */
>  	snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir);
> @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>  	strncat(path, "/ports", sizeof(path) - 1);
>  	make_path(path);
>  
> -	portinfo = sc->portinfo;
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/ */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> -	snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
> -	make_path(path);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
> -	file_printf(path, SYS_PORT_LMC, "%d", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/sm_lid */
> -	val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
> -	file_printf(path, SYS_PORT_SMLID, "0x%x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/sm_sl */
> -	val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
> -	file_printf(path, SYS_PORT_SMSL, "%d", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/lid */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
> -	file_printf(path, SYS_PORT_LID, "0x%x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/state */
> -	val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
> -	if (val == 0)
> -		str = "NOP";
> -	else if (val == 1)
> -		str = "DOWN";
> -	else if (val == 2)
> -		str = "INIT";
> -	else if (val == 3)
> -		str = "ARMED";
> -	else if (val == 4)
> -		str = "ACTIVE";
> -	else if (val == 5)
> -		str = "ACTIVE_DEFER";
> -	else
> -		str = "<unknown>";
> -	file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/phys_state */
> -	val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
> -	if (val == 1)
> -		str = "Sleep";
> -	else if (val == 2)
> -		str = "Polling";
> -	else if (val == 3)
> -		str = "Disabled";
> -	else if (val == 4)
> -		str = "PortConfigurationTraining";
> -	else if (val == 5)
> -		str = "LinkUp";
> -	else if (val == 6)
> -		str = "LinkErrorRecovery";
> -	else if (val == 7)
> -		str = "Phy Test";
> -	else
> -		str = "<unknown>";
> -	file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/rate */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
> -	speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
> -	if (val == 1)
> -		val = 1;
> -	else if (val == 2)
> -		val = 4;
> -	else if (val == 4)
> -		val = 8;
> -	else if (val == 8)
> -		val = 12;
> -	else
> -		val = 0;
> -	if (speed == 2)
> -		str = " DDR";
> -	else if (speed == 4)
> -		str = " QDR";
> -	else
> -		str = "";
> -	file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
> -		    (val * speed * 25) / 10,
> -		    (val * speed * 25) % 10 ? ".5" : "", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/cap_mask */
> -	val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
> -	file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/gids/0 */
> -	str = path + strlen(path);
> -	strncat(path, "/gids", sizeof(path) - 1);
> -	make_path(path);
> -	*str = '\0';
> -	gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
> -	guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) +
> -	    mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> -	file_printf(path, SYS_PORT_GID,
> -		    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> -		    (uint16_t) ((gid >> 48) & 0xffff),
> -		    (uint16_t) ((gid >> 32) & 0xffff),
> -		    (uint16_t) ((gid >> 16) & 0xffff),
> -		    (uint16_t) ((gid >> 0) & 0xffff),
> -		    (uint16_t) ((guid >> 48) & 0xffff),
> -		    (uint16_t) ((guid >> 32) & 0xffff),
> -		    (uint16_t) ((guid >> 16) & 0xffff),
> -		    (uint16_t) ((guid >> 0) & 0xffff));
> +	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
> +	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
> +        if (nodetype == 2) { // switch
> +		startport = 0;
> +		endport = 0;
> +	} else
> +		endport = numports;
> +
> +	ports_path_end = path + strlen(path);
> +
> +	// loop through end ports
> +	for (j = startport; j <= endport; j++) {
> +
> +		portinfo = sc->portinfo + 64 * j;
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/ */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> +		snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
> +		make_path(path);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/lid_mask_count */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
> +		file_printf(path, SYS_PORT_LMC, "%d", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/sm_lid */
> +		val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
> +		file_printf(path, SYS_PORT_SMLID, "0x%x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/sm_sl */
> +		val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
> +		file_printf(path, SYS_PORT_SMSL, "%d", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/lid */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
> +		file_printf(path, SYS_PORT_LID, "0x%x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/state */
> +		val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
> +		if (val == 0)
> +			str = "NOP";
> +		else if (val == 1)
> +			str = "DOWN";
> +		else if (val == 2)
> +			str = "INIT";
> +		else if (val == 3)
> +			str = "ARMED";
> +		else if (val == 4)
> +			str = "ACTIVE";
> +		else if (val == 5)
> +			str = "ACTIVE_DEFER";
> +		else
> +			str = "<unknown>";
> +		file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/phys_state */
> +		val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
> +		if (val == 1)
> +			str = "Sleep";
> +		else if (val == 2)
> +			str = "Polling";
> +		else if (val == 3)
> +			str = "Disabled";
> +		else if (val == 4)
> +			str = "PortConfigurationTraining";
> +		else if (val == 5)
> +			str = "LinkUp";
> +		else if (val == 6)
> +			str = "LinkErrorRecovery";
> +		else if (val == 7)
> +			str = "Phy Test";
> +		else
> +			str = "<unknown>";
> +		file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/rate */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
> +		speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
> +		if (val == 1)
> +			val = 1;
> +		else if (val == 2)
> +			val = 4;
> +		else if (val == 4)
> +			val = 8;
> +		else if (val == 8)
> +			val = 12;
> +		else
> +			val = 0;
> +		if (speed == 2)
> +			str = " DDR";
> +		else if (speed == 4)
> +			str = " QDR";
> +		else
> +			str = "";
> +		file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
> +			    (val * speed * 25) / 10,
> +			    (val * speed * 25) % 10 ? ".5" : "", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/cap_mask */
> +		val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
> +		file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/gids/0 */
> +		str = path + strlen(path);
> +		strncat(path, "/gids", sizeof(path) - 1);
> +		make_path(path);
> +		*str = '\0';
> +		gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
> +		guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j;
> +		file_printf(path, SYS_PORT_GID,
> +			    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> +			    (uint16_t) ((gid >> 48) & 0xffff),
> +			    (uint16_t) ((gid >> 32) & 0xffff),
> +			    (uint16_t) ((gid >> 16) & 0xffff),
> +			    (uint16_t) ((gid >> 0) & 0xffff),
> +			    (uint16_t) ((guid >> 48) & 0xffff),
> +			    (uint16_t) ((guid >> 32) & 0xffff),
> +			    (uint16_t) ((guid >> 16) & 0xffff),
> +			    (uint16_t) ((guid >> 0) & 0xffff));
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/pkeys/0 */
> +		str = path + strlen(path);
> +		strncat(path, "/pkeys", sizeof(path) - 1);
> +		make_path(path);
> +		for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
> +			char name[8];
> +			snprintf(name, sizeof(name), "%u", i);
> +			file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
> +		}
> +		*str = '\0';
>  
> -	/* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */
> -	str = path + strlen(path);
> -	strncat(path, "/pkeys", sizeof(path) - 1);
> -	make_path(path);
> -	for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
> -		char name[8];
> -		snprintf(name, sizeof(name), "%u", i);
> -		file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
> +		*ports_path_end = '\0';
>  	}
> -	*str = '\0';
>  
>  	/* /sys/class/infiniband_mad/umad0/ */
>  	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
> @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
>  	if (sim_client_init(&dev->sim_client) < 0)
>  		goto _error;
>  
> -	dev->port = mad_get_field(&dev->sim_client.portinfo, 0,
> -				  IB_PORT_LOCAL_PORT_F);
> +	dev->port = dev->sim_client.portnum;
>  	for (i = 0; i < arrsize(dev->agents); i++)
>  		dev->agents[i].id = (uint32_t)(-1);
>  	for (i = 0; i < arrsize(dev->agent_idx); i++)
> -- 
> 1.5.6.4
> 


From sashak at voltaire.com  Tue Feb 17 13:27:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 23:27:42 +0200
Subject: [ofa-general] Re: [PATCH] ibsim/sim_client.c: In sim_client_init,
	return -1 on error
In-Reply-To: <20090214203703.GD32660@comcast.net>
References: <20090214203703.GD32660@comcast.net>
Message-ID: <20090217212729.GQ7189@sashak.voltaire.com>

On 15:37 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Those three patches are applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Tue Feb 17 13:28:40 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 16:28:40 -0500
Subject: Re: [ofa-general] Re: [PATCH] ibsim: Add better end port
	simulation support
In-Reply-To: <20090217211848.GP7189@sashak.voltaire.com>
References: <20090214203753.GE32660@comcast.net>
	<20090217211848.GP7189@sashak.voltaire.com>
Message-ID: <f0e08f230902171328m3fcc074ew65c978b3f8f81520@mail.gmail.com>

Sasha,

On Tue, Feb 17, 2009 at 4:18 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 15:37 Sat 14 Feb     , hnrose at comcast.net wrote:
>>
>> Add SIM_PORT environment variable to allow for end port selection
>
> How would this handle the case where SIM_PORT=N but the program tries to work
> via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)?

That's a configuration error. SIM_PORT needs to be set to the same port that
the program intends to use.

> IOW, should port number selection be initiated natively by the program rather
> than by using environment variables?

That would've been nice, but AFAICT the simulation layer needs the port
number earlier than the program can supply it. Maybe that could be
changed, but I didn't dig into that.
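
For illustration only, the mismatch described above (SIM_PORT=2 while the
program asks for port 1) could at least be detected and reported. The sketch
below is not part of the patch; sim_port_from_env() and sim_port_check() are
hypothetical names, and the upper bound of 255 is an arbitrary sanity limit
chosen for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of ibsim: returns the port number from
 * SIM_PORT, or 0 when the variable is unset or not a valid number. */
static int sim_port_from_env(void)
{
	const char *s = getenv("SIM_PORT");
	char *end;
	long v;

	if (!s || !*s)
		return 0;
	v = strtol(s, &end, 10);
	/* 255 is an arbitrary sanity bound for this sketch */
	if (*end != '\0' || v < 0 || v > 255)
		return 0;
	return (int)v;
}

/* Returns 0 if the port requested by the program is compatible with
 * SIM_PORT, -1 on mismatch. requested == 0 means "no explicit port". */
static int sim_port_check(int requested)
{
	int configured = sim_port_from_env();

	if (requested && configured && requested != configured) {
		fprintf(stderr, "SIM_PORT=%d but program requested port %d\n",
			configured, requested);
		return -1;
	}
	return 0;
}
```

With SIM_PORT=2 in the environment, sim_port_check(1) reports the conflict and
returns -1, while sim_port_check(2) returns 0.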

>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> ---
>>  ibsim/ibsim.c         |    6 +-
>>  include/ibsim.h       |    2 +
>>  umad2sim/sim_client.c |   49 +++++++++-
>>  umad2sim/sim_client.h |    4 +-
>>  umad2sim/umad2sim.c   |  254 ++++++++++++++++++++++++++-----------------------
>>  5 files changed, 189 insertions(+), 126 deletions(-)
>>
>> diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
>> index f48e1f0..6a35fdc 100644
>> --- a/ibsim/ibsim.c
>> +++ b/ibsim/ibsim.c
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -187,7 +188,8 @@ static int sm_exists(Node * node)
>>       return 0;
>>  }
>>
>> -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from)
>> +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl,
>> +                           union name_t *from)
>>  {
>>       union name_t name;
>>       size_t size;
>> @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f
>>                       ctl->type = SIM_CTL_ERROR;
>>                       return -1;
>>               }
>> -             cl->port = node_get_port(node, 0);
>> +             cl->port = node_get_port(node, scl->portnum);
>>               VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64,
>>                    i, node->nodeid, cl->port->portguid);
>>       } else {
>> diff --git a/include/ibsim.h b/include/ibsim.h
>> index 15fc37c..66ba6f9 100644
>> --- a/include/ibsim.h
>> +++ b/include/ibsim.h
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -100,6 +101,7 @@ struct sim_client_info {
>>       uint32_t qp;
>>       uint32_t issm;          /* accept request for qp 0 & 1 */
>>       char nodeid[32];
>> +     uint32_t portnum;
>>  };
>>
>>  union name_t {
>> diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
>> index 06bb7a8..1c35109 100644
>> --- a/umad2sim/sim_client.c
>> +++ b/umad2sim/sim_client.c
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid)
>>       info.id = id;
>>       info.issm = 0;
>>       info.qp = qp;
>> +     info.portnum = sc->portnum;
>>
>>       if (nodeid)
>>               strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1);
>> @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc)
>>       return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0);
>>  }
>>
>> -static int sim_init(struct sim_client *sc, char *nodeid)
>> +static int sim_init(struct sim_client *sc, char *nodeid, int portnum)
>>  {
>>       union name_t name;
>>       socklen_t size;
>> @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid)
>>       DEBUG("init %d: opened ctl fd %d as \'%s\'",
>>             pid, ctlfd, get_name(&name));
>>
>> +     sc->portnum = portnum;
>>       port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT;
>>       size = make_name(&name, connect_host, port, "%s:ctl", socket_basename);
>>
>> @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm)
>>  int sim_client_init(struct sim_client *sc)
>>  {
>>       char *nodeid;
>> +     char *portno;
>> +     int i, j = 0, portnum = 0, startport = 1, endport;
>> +     uint8_t numports, nodetype;
>> +     uint8_t *portinfo;
>>
>>       nodeid = getenv("SIM_HOST");
>> -     if (sim_init(sc, nodeid) < 0)
>> +     portno = getenv("SIM_PORT");
>> +     if (portno)
>> +             portnum = atoi(portno);
>> +
>> +     if (sim_init(sc, nodeid, portnum) < 0)
>>               return -1;
>>       if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
>>                   sizeof(sc->vendor)) < 0)
>> @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc)
>>       if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo,
>>                   sizeof(sc->nodeinfo)) < 0)
>>               goto _exit;
>> +     numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
>> +     nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
>> +     if (nodetype == 2) { // switch
>> +             startport = 0;
>> +             endport = 0;
>> +     } else {
>> +             if (portnum == 0) {
>> +                     IBWARN("portnum 0 is not valid end port on non switch node");
>> +                     goto _exit;
>> +             }
>
> This makes exporting the SIM_PORT environment variable mandatory,
> which doesn't look like a good idea to me (personally, I would need to
> rewrite a fair number of my scripts).
>
> I think that SIM_PORT should be optional and the default behavior
> should be preserved.
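
One possible way to keep SIM_PORT optional is to fall back to a default port
when it is unset instead of failing. The sketch below is only an illustration:
select_port() is a hypothetical helper, and defaulting to port 1 on non-switch
nodes is an assumption about what "default behavior" should mean, not
something confirmed in this thread.

```c
#define NODE_TYPE_SWITCH 2	/* same value the patch tests with nodetype == 2 */

/* Hypothetical helper: choose the port a client attaches to.
 * portnum is the SIM_PORT value, or 0 when SIM_PORT is unset.
 * Returns the selected port number, or -1 if portnum is out of range. */
static int select_port(int nodetype, int numports, int portnum)
{
	if (portnum < 0)
		return -1;
	if (nodetype == NODE_TYPE_SWITCH)
		return portnum == 0 ? 0 : -1;	/* switch: only port 0 is an end port */
	if (portnum == 0)
		return 1;	/* assumed default: first port of the node */
	if (portnum > numports)
		return -1;
	return portnum;
}
```

On a two-port non-switch node, select_port(1, 2, 0) yields port 1, so existing
scripts that never set SIM_PORT would keep working under this assumption.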
>
>> +             endport = numports;
>> +     }
>> +     if (portnum > endport) {
>> +             IBWARN("portnum %d is not a valid end port number (%d)",
>> +                    portnum, endport);
>> +             goto _exit;
>> +     }
>>
>> -     sc->portinfo[0] = 0;    // portno requested
>> -     if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo,
>> -                 sizeof(sc->portinfo)) < 0)
>> +     sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1)); // portinfo size x number of ports starting at 0
>> +     if (!sc->portinfo)
>>               goto _exit;
>> +
>> +     // loop through end ports
>> +     for (i = startport; i <= endport ; i++, j++) {
>> +             portinfo = sc->portinfo + 64 * j;
>
> You don't need 'j' - just move portinfo pointer.

OK.
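
A minimal sketch of that suggestion, advancing the destination pointer rather
than keeping a second index. fetch_portinfo() is a hypothetical stand-in for
the sim_ctl(sc, SIM_CTL_GET_PORTINFO, ...) call, and packing the records from
offset 0 is an assumption of this example.

```c
#include <stdint.h>
#include <stdlib.h>

#define PORTINFO_SIZE 64	/* per-port record size used in the patch */

/* Hypothetical stand-in for the SIM_CTL_GET_PORTINFO request:
 * it just echoes the requested port number into the record. */
static int fetch_portinfo(uint8_t *rec, int portno)
{
	rec[0] = (uint8_t)portno;
	return 0;
}

/* Fill one record per port in [startport, endport], moving the
 * destination pointer forward by one record on each iteration.
 * Returns the buffer (caller frees) or NULL on failure. */
static uint8_t *fetch_all_portinfo(int startport, int endport)
{
	int i, nports = endport - startport + 1;
	uint8_t *buf, *portinfo;

	if (nports <= 0)
		return NULL;
	buf = calloc(nports, PORTINFO_SIZE);
	if (!buf)
		return NULL;
	portinfo = buf;
	for (i = startport; i <= endport; i++, portinfo += PORTINFO_SIZE) {
		if (fetch_portinfo(portinfo, i) < 0) {
			free(buf);
			return NULL;
		}
	}
	return buf;
}
```

Whatever layout is chosen, the code that later reads the buffer needs to use
the same offsets as the code that fills it.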

>> +             *portinfo = i + 1; // portno requested
>> +             if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0)
>> +                     goto _exit;
>> +     }
>> +
>> +     // although pkeys also per port, current config same on all end ports
>
> Which is not correct really.

What are you referring to? Is there some config for end port pkeys in
the simulator?

-- Hal

> Sasha
>
>>       if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0)
>>               goto _exit;
>>       if (getenv("SIM_SET_ISSM"))
>> @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc)
>>  void sim_client_exit(struct sim_client *sc)
>>  {
>>       sim_disconnect(sc);
>> +     if (sc->portinfo)
>> +             free(sc->portinfo);
>>       sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1;
>>  }
>> diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h
>> index 80ed442..0faca80 100644
>> --- a/umad2sim/sim_client.h
>> +++ b/umad2sim/sim_client.h
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -41,8 +42,9 @@ struct sim_client {
>>       int clientid;
>>       int fd_pktin, fd_pktout, fd_ctl;
>>       struct sim_vendor vendor;
>> +     int portnum;
>>       uint8_t nodeinfo[64];
>> -     uint8_t portinfo[64];
>> +     uint8_t *portinfo;
>>       uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)];
>>  };
>>
>> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
>> index 8d83a24..6e3c269 100644
>> --- a/umad2sim/umad2sim.c
>> +++ b/umad2sim/umad2sim.c
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>>       struct sim_client *sc = &dev->sim_client;
>>       char *str;
>>       uint8_t *portinfo;
>> -     int i;
>> +     char *ports_path_end;
>> +     int i, j;
>> +     int startport = 1, endport;
>> +     uint8_t numports, nodetype;
>>
>>       /* /sys/class/infiniband_mad/abi_version */
>>       snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir);
>> @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>>       strncat(path, "/ports", sizeof(path) - 1);
>>       make_path(path);
>>
>> -     portinfo = sc->portinfo;
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/ */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
>> -     snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
>> -     make_path(path);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
>> -     file_printf(path, SYS_PORT_LMC, "%d", val);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/sm_lid */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
>> -     file_printf(path, SYS_PORT_SMLID, "0x%x", val);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/sm_sl */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
>> -     file_printf(path, SYS_PORT_SMSL, "%d", val);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/lid */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
>> -     file_printf(path, SYS_PORT_LID, "0x%x", val);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/state */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
>> -     if (val == 0)
>> -             str = "NOP";
>> -     else if (val == 1)
>> -             str = "DOWN";
>> -     else if (val == 2)
>> -             str = "INIT";
>> -     else if (val == 3)
>> -             str = "ARMED";
>> -     else if (val == 4)
>> -             str = "ACTIVE";
>> -     else if (val == 5)
>> -             str = "ACTIVE_DEFER";
>> -     else
>> -             str = "<unknown>";
>> -     file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/phys_state */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
>> -     if (val == 1)
>> -             str = "Sleep";
>> -     else if (val == 2)
>> -             str = "Polling";
>> -     else if (val == 3)
>> -             str = "Disabled";
>> -     else if (val == 4)
>> -             str = "PortConfigurationTraining";
>> -     else if (val == 5)
>> -             str = "LinkUp";
>> -     else if (val == 6)
>> -             str = "LinkErrorRecovery";
>> -     else if (val == 7)
>> -             str = "Phy Test";
>> -     else
>> -             str = "<unknown>";
>> -     file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/rate */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
>> -     speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
>> -     if (val == 1)
>> -             val = 1;
>> -     else if (val == 2)
>> -             val = 4;
>> -     else if (val == 4)
>> -             val = 8;
>> -     else if (val == 8)
>> -             val = 12;
>> -     else
>> -             val = 0;
>> -     if (speed == 2)
>> -             str = " DDR";
>> -     else if (speed == 4)
>> -             str = " QDR";
>> -     else
>> -             str = "";
>> -     file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
>> -                 (val * speed * 25) / 10,
>> -                 (val * speed * 25) % 10 ? ".5" : "", val, str);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/cap_mask */
>> -     val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
>> -     file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
>> -
>> -     /* /sys/class/infiniband/mthca0/ports/1/gids/0 */
>> -     str = path + strlen(path);
>> -     strncat(path, "/gids", sizeof(path) - 1);
>> -     make_path(path);
>> -     *str = '\0';
>> -     gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
>> -     guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) +
>> -         mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
>> -     file_printf(path, SYS_PORT_GID,
>> -                 "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
>> -                 (uint16_t) ((gid >> 48) & 0xffff),
>> -                 (uint16_t) ((gid >> 32) & 0xffff),
>> -                 (uint16_t) ((gid >> 16) & 0xffff),
>> -                 (uint16_t) ((gid >> 0) & 0xffff),
>> -                 (uint16_t) ((guid >> 48) & 0xffff),
>> -                 (uint16_t) ((guid >> 32) & 0xffff),
>> -                 (uint16_t) ((guid >> 16) & 0xffff),
>> -                 (uint16_t) ((guid >> 0) & 0xffff));
>> +     numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
>> +     nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
>> +        if (nodetype == 2) { // switch
>> +             startport = 0;
>> +             endport = 0;
>> +     } else
>> +             endport = numports;
>> +
>> +     ports_path_end = path + strlen(path);
>> +
>> +     // loop through end ports
>> +     for (j = startport; j <= endport; j++) {
>> +
>> +             portinfo = sc->portinfo + 64 * j;
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/ */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
>> +             snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
>> +             make_path(path);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/lid_mask_count */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
>> +             file_printf(path, SYS_PORT_LMC, "%d", val);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/sm_lid */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
>> +             file_printf(path, SYS_PORT_SMLID, "0x%x", val);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/sm_sl */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
>> +             file_printf(path, SYS_PORT_SMSL, "%d", val);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/lid */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
>> +             file_printf(path, SYS_PORT_LID, "0x%x", val);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/state */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
>> +             if (val == 0)
>> +                     str = "NOP";
>> +             else if (val == 1)
>> +                     str = "DOWN";
>> +             else if (val == 2)
>> +                     str = "INIT";
>> +             else if (val == 3)
>> +                     str = "ARMED";
>> +             else if (val == 4)
>> +                     str = "ACTIVE";
>> +             else if (val == 5)
>> +                     str = "ACTIVE_DEFER";
>> +             else
>> +                     str = "<unknown>";
>> +             file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/phys_state */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
>> +             if (val == 1)
>> +                     str = "Sleep";
>> +             else if (val == 2)
>> +                     str = "Polling";
>> +             else if (val == 3)
>> +                     str = "Disabled";
>> +             else if (val == 4)
>> +                     str = "PortConfigurationTraining";
>> +             else if (val == 5)
>> +                     str = "LinkUp";
>> +             else if (val == 6)
>> +                     str = "LinkErrorRecovery";
>> +             else if (val == 7)
>> +                     str = "Phy Test";
>> +             else
>> +                     str = "<unknown>";
>> +             file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/rate */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
>> +             speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
>> +             if (val == 1)
>> +                     val = 1;
>> +             else if (val == 2)
>> +                     val = 4;
>> +             else if (val == 4)
>> +                     val = 8;
>> +             else if (val == 8)
>> +                     val = 12;
>> +             else
>> +                     val = 0;
>> +             if (speed == 2)
>> +                     str = " DDR";
>> +             else if (speed == 4)
>> +                     str = " QDR";
>> +             else
>> +                     str = "";
>> +             file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
>> +                         (val * speed * 25) / 10,
>> +                         (val * speed * 25) % 10 ? ".5" : "", val, str);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/cap_mask */
>> +             val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
>> +             file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/gids/0 */
>> +             str = path + strlen(path);
>> +             strncat(path, "/gids", sizeof(path) - 1);
>> +             make_path(path);
>> +             *str = '\0';
>> +             gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
>> +             guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j;
>> +             file_printf(path, SYS_PORT_GID,
>> +                         "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
>> +                         (uint16_t) ((gid >> 48) & 0xffff),
>> +                         (uint16_t) ((gid >> 32) & 0xffff),
>> +                         (uint16_t) ((gid >> 16) & 0xffff),
>> +                         (uint16_t) ((gid >> 0) & 0xffff),
>> +                         (uint16_t) ((guid >> 48) & 0xffff),
>> +                         (uint16_t) ((guid >> 32) & 0xffff),
>> +                         (uint16_t) ((guid >> 16) & 0xffff),
>> +                         (uint16_t) ((guid >> 0) & 0xffff));
>> +
>> +             /* /sys/class/infiniband/mthca0/ports/<n>/pkeys/0 */
>> +             str = path + strlen(path);
>> +             strncat(path, "/pkeys", sizeof(path) - 1);
>> +             make_path(path);
>> +             for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
>> +                     char name[8];
>> +                     snprintf(name, sizeof(name), "%u", i);
>> +                     file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
>> +             }
>> +             *str = '\0';
>>
>> -     /* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */
>> -     str = path + strlen(path);
>> -     strncat(path, "/pkeys", sizeof(path) - 1);
>> -     make_path(path);
>> -     for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
>> -             char name[8];
>> -             snprintf(name, sizeof(name), "%u", i);
>> -             file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
>> +             *ports_path_end = '\0';
>>       }
>> -     *str = '\0';
>>
>>       /* /sys/class/infiniband_mad/umad0/ */
>>       snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
>> @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
>>       if (sim_client_init(&dev->sim_client) < 0)
>>               goto _error;
>>
>> -     dev->port = mad_get_field(&dev->sim_client.portinfo, 0,
>> -                               IB_PORT_LOCAL_PORT_F);
>> +     dev->port = dev->sim_client.portnum;
>>       for (i = 0; i < arrsize(dev->agents); i++)
>>               dev->agents[i].id = (uint32_t)(-1);
>>       for (i = 0; i < arrsize(dev->agent_idx); i++)
>> --
>> 1.5.6.4
>>


From sashak at voltaire.com  Tue Feb 17 13:33:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 23:33:26 +0200
Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation
	support
In-Reply-To: <20090214203753.GE32660@comcast.net>
References: <20090214203753.GE32660@comcast.net>
Message-ID: <20090217213326.GR7189@sashak.voltaire.com>

On 15:37 Sat 14 Feb     , hnrose at comcast.net wrote:
> 
> Add SIM_PORT environment variable to allow for end port selection

Also, this patch looks like a mix of two independent ones: fetching all
node ports and showing them in the sysfs simulation, and SIM_PORT. A more
descriptive commit message would likely be helpful here.

Sasha

> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
>  ibsim/ibsim.c         |    6 +-
>  include/ibsim.h       |    2 +
>  umad2sim/sim_client.c |   49 +++++++++-
>  umad2sim/sim_client.h |    4 +-
>  umad2sim/umad2sim.c   |  254 ++++++++++++++++++++++++++-----------------------
>  5 files changed, 189 insertions(+), 126 deletions(-)
> 
> diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
> index f48e1f0..6a35fdc 100644
> --- a/ibsim/ibsim.c
> +++ b/ibsim/ibsim.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -187,7 +188,8 @@ static int sm_exists(Node * node)
>  	return 0;
>  }
>  
> -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from)
> +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl,
> +			      union name_t *from)
>  {
>  	union name_t name;
>  	size_t size;
> @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f
>  			ctl->type = SIM_CTL_ERROR;
>  			return -1;
>  		}
> -		cl->port = node_get_port(node, 0);
> +		cl->port = node_get_port(node, scl->portnum);
>  		VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64,
>  		     i, node->nodeid, cl->port->portguid);
>  	} else {
> diff --git a/include/ibsim.h b/include/ibsim.h
> index 15fc37c..66ba6f9 100644
> --- a/include/ibsim.h
> +++ b/include/ibsim.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -100,6 +101,7 @@ struct sim_client_info {
>  	uint32_t qp;
>  	uint32_t issm;		/* accept request for qp 0 & 1 */
>  	char nodeid[32];
> +	uint32_t portnum;
>  };
>  
>  union name_t {
> diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c
> index 06bb7a8..1c35109 100644
> --- a/umad2sim/sim_client.c
> +++ b/umad2sim/sim_client.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid)
>  	info.id = id;
>  	info.issm = 0;
>  	info.qp = qp;
> +	info.portnum = sc->portnum;
>  
>  	if (nodeid)
>  		strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1);
> @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc)
>  	return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0);
>  }
>  
> -static int sim_init(struct sim_client *sc, char *nodeid)
> +static int sim_init(struct sim_client *sc, char *nodeid, int portnum)
>  {
>  	union name_t name;
>  	socklen_t size;
> @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid)
>  	DEBUG("init %d: opened ctl fd %d as \'%s\'",
>  	      pid, ctlfd, get_name(&name));
>  
> +	sc->portnum = portnum;
>  	port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT;
>  	size = make_name(&name, connect_host, port, "%s:ctl", socket_basename);
>  
> @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm)
>  int sim_client_init(struct sim_client *sc)
>  {
>  	char *nodeid;
> +	char *portno;
> +	int i, j = 0, portnum = 0, startport = 1, endport;
> +	uint8_t numports, nodetype;
> +	uint8_t *portinfo;
>  
>  	nodeid = getenv("SIM_HOST");
> -	if (sim_init(sc, nodeid) < 0)
> +	portno = getenv("SIM_PORT");
> +	if (portno)
> +		portnum = atoi(portno);
> +
> +	if (sim_init(sc, nodeid, portnum) < 0)
>  		return -1;
>  	if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor,
>  		    sizeof(sc->vendor)) < 0)
> @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc)
>  	if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo,
>  		    sizeof(sc->nodeinfo)) < 0)
>  		goto _exit;
> +	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
> +	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
> +	if (nodetype == 2) { // switch
> +		startport = 0;
> +		endport = 0;
> +	} else {
> +		if (portnum == 0) {
> +			IBWARN("portnum 0 is not valid end port on non switch node");
> +			goto _exit;
> +		}
> +		endport = numports;
> +	}
> +	if (portnum > endport) {
> +		IBWARN("portnum %d is not a valid end port number (%d)",
> +		       portnum, endport);
> +		goto _exit;
> +	}
>  
> -	sc->portinfo[0] = 0;	// portno requested
> -	if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo,
> -		    sizeof(sc->portinfo)) < 0)
> +	sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1));	// portinfo size x number of ports starting at 0
> +	if (!sc->portinfo)
>  		goto _exit;
> +
> +	// loop through end ports
> +	for (i = startport; i <= endport ; i++, j++) {
> +		portinfo = sc->portinfo + 64 * j;
> +		*portinfo = i + 1; // portno requested
> +		if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0)
> +			goto _exit;
> +	}
> +
> +	// although pkeys also per port, current config same on all end ports
>  	if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0)
>  		goto _exit;
>  	if (getenv("SIM_SET_ISSM"))
> @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc)
>  void sim_client_exit(struct sim_client *sc)
>  {
>  	sim_disconnect(sc);
> +	if (sc->portinfo)
> +		free(sc->portinfo);
>  	sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1;
>  }
> diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h
> index 80ed442..0faca80 100644
> --- a/umad2sim/sim_client.h
> +++ b/umad2sim/sim_client.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -41,8 +42,9 @@ struct sim_client {
>  	int clientid;
>  	int fd_pktin, fd_pktout, fd_ctl;
>  	struct sim_vendor vendor;
> +	int portnum;
>  	uint8_t nodeinfo[64];
> -	uint8_t portinfo[64];
> +	uint8_t *portinfo;
>  	uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)];
>  };
>  
> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
> index 8d83a24..6e3c269 100644
> --- a/umad2sim/umad2sim.c
> +++ b/umad2sim/umad2sim.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>  	struct sim_client *sc = &dev->sim_client;
>  	char *str;
>  	uint8_t *portinfo;
> -	int i;
> +	char *ports_path_end;
> +	int i, j;
> +	int startport = 1, endport;
> +	uint8_t numports, nodetype;
>  
>  	/* /sys/class/infiniband_mad/abi_version */
>  	snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir);
> @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>  	strncat(path, "/ports", sizeof(path) - 1);
>  	make_path(path);
>  
> -	portinfo = sc->portinfo;
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/ */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> -	snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
> -	make_path(path);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
> -	file_printf(path, SYS_PORT_LMC, "%d", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/sm_lid */
> -	val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
> -	file_printf(path, SYS_PORT_SMLID, "0x%x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/sm_sl */
> -	val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
> -	file_printf(path, SYS_PORT_SMSL, "%d", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/lid */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
> -	file_printf(path, SYS_PORT_LID, "0x%x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/state */
> -	val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
> -	if (val == 0)
> -		str = "NOP";
> -	else if (val == 1)
> -		str = "DOWN";
> -	else if (val == 2)
> -		str = "INIT";
> -	else if (val == 3)
> -		str = "ARMED";
> -	else if (val == 4)
> -		str = "ACTIVE";
> -	else if (val == 5)
> -		str = "ACTIVE_DEFER";
> -	else
> -		str = "<unknown>";
> -	file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/phys_state */
> -	val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
> -	if (val == 1)
> -		str = "Sleep";
> -	else if (val == 2)
> -		str = "Polling";
> -	else if (val == 3)
> -		str = "Disabled";
> -	else if (val == 4)
> -		str = "PortConfigurationTraining";
> -	else if (val == 5)
> -		str = "LinkUp";
> -	else if (val == 6)
> -		str = "LinkErrorRecovery";
> -	else if (val == 7)
> -		str = "Phy Test";
> -	else
> -		str = "<unknown>";
> -	file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/rate */
> -	val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
> -	speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
> -	if (val == 1)
> -		val = 1;
> -	else if (val == 2)
> -		val = 4;
> -	else if (val == 4)
> -		val = 8;
> -	else if (val == 8)
> -		val = 12;
> -	else
> -		val = 0;
> -	if (speed == 2)
> -		str = " DDR";
> -	else if (speed == 4)
> -		str = " QDR";
> -	else
> -		str = "";
> -	file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
> -		    (val * speed * 25) / 10,
> -		    (val * speed * 25) % 10 ? ".5" : "", val, str);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/cap_mask */
> -	val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
> -	file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
> -
> -	/* /sys/class/infiniband/mthca0/ports/1/gids/0 */
> -	str = path + strlen(path);
> -	strncat(path, "/gids", sizeof(path) - 1);
> -	make_path(path);
> -	*str = '\0';
> -	gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
> -	guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) +
> -	    mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> -	file_printf(path, SYS_PORT_GID,
> -		    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> -		    (uint16_t) ((gid >> 48) & 0xffff),
> -		    (uint16_t) ((gid >> 32) & 0xffff),
> -		    (uint16_t) ((gid >> 16) & 0xffff),
> -		    (uint16_t) ((gid >> 0) & 0xffff),
> -		    (uint16_t) ((guid >> 48) & 0xffff),
> -		    (uint16_t) ((guid >> 32) & 0xffff),
> -		    (uint16_t) ((guid >> 16) & 0xffff),
> -		    (uint16_t) ((guid >> 0) & 0xffff));
> +	numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F);
> +	nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F);
> +        if (nodetype == 2) { // switch
> +		startport = 0;
> +		endport = 0;
> +	} else
> +		endport = numports;
> +
> +	ports_path_end = path + strlen(path);
> +
> +	// loop through end ports
> +	for (j = startport; j <= endport; j++) {
> +
> +		portinfo = sc->portinfo + 64 * j;
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/ */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F);
> +		snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val);
> +		make_path(path);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/lid_mask_count */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LMC_F);
> +		file_printf(path, SYS_PORT_LMC, "%d", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/sm_lid */
> +		val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F);
> +		file_printf(path, SYS_PORT_SMLID, "0x%x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/sm_sl */
> +		val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F);
> +		file_printf(path, SYS_PORT_SMSL, "%d", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/lid */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LID_F);
> +		file_printf(path, SYS_PORT_LID, "0x%x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/state */
> +		val = mad_get_field(portinfo, 0, IB_PORT_STATE_F);
> +		if (val == 0)
> +			str = "NOP";
> +		else if (val == 1)
> +			str = "DOWN";
> +		else if (val == 2)
> +			str = "INIT";
> +		else if (val == 3)
> +			str = "ARMED";
> +		else if (val == 4)
> +			str = "ACTIVE";
> +		else if (val == 5)
> +			str = "ACTIVE_DEFER";
> +		else
> +			str = "<unknown>";
> +		file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/phys_state */
> +		val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F);
> +		if (val == 1)
> +			str = "Sleep";
> +		else if (val == 2)
> +			str = "Polling";
> +		else if (val == 3)
> +			str = "Disabled";
> +		else if (val == 4)
> +			str = "PortConfigurationTraining";
> +		else if (val == 5)
> +			str = "LinkUp";
> +		else if (val == 6)
> +			str = "LinkErrorRecovery";
> +		else if (val == 7)
> +			str = "Phy Test";
> +		else
> +			str = "<unknown>";
> +		file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/rate */
> +		val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F);
> +		speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F);
> +		if (val == 1)
> +			val = 1;
> +		else if (val == 2)
> +			val = 4;
> +		else if (val == 4)
> +			val = 8;
> +		else if (val == 8)
> +			val = 12;
> +		else
> +			val = 0;
> +		if (speed == 2)
> +			str = " DDR";
> +		else if (speed == 4)
> +			str = " QDR";
> +		else
> +			str = "";
> +		file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n",
> +			    (val * speed * 25) / 10,
> +			    (val * speed * 25) % 10 ? ".5" : "", val, str);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/cap_mask */
> +		val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F);
> +		file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val);
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/gids/0 */
> +		str = path + strlen(path);
> +		strncat(path, "/gids", sizeof(path) - 1);
> +		make_path(path);
> +		*str = '\0';
> +		gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F);
> +		guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j;
> +		file_printf(path, SYS_PORT_GID,
> +			    "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> +			    (uint16_t) ((gid >> 48) & 0xffff),
> +			    (uint16_t) ((gid >> 32) & 0xffff),
> +			    (uint16_t) ((gid >> 16) & 0xffff),
> +			    (uint16_t) ((gid >> 0) & 0xffff),
> +			    (uint16_t) ((guid >> 48) & 0xffff),
> +			    (uint16_t) ((guid >> 32) & 0xffff),
> +			    (uint16_t) ((guid >> 16) & 0xffff),
> +			    (uint16_t) ((guid >> 0) & 0xffff));
> +
> +		/* /sys/class/infiniband/mthca0/ports/<n>/pkeys/0 */
> +		str = path + strlen(path);
> +		strncat(path, "/pkeys", sizeof(path) - 1);
> +		make_path(path);
> +		for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
> +			char name[8];
> +			snprintf(name, sizeof(name), "%u", i);
> +			file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
> +		}
> +		*str = '\0';
>  
> -	/* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */
> -	str = path + strlen(path);
> -	strncat(path, "/pkeys", sizeof(path) - 1);
> -	make_path(path);
> -	for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) {
> -		char name[8];
> -		snprintf(name, sizeof(name), "%u", i);
> -		file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i]));
> +		*ports_path_end = '\0';
>  	}
> -	*str = '\0';
>  
>  	/* /sys/class/infiniband_mad/umad0/ */
>  	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
> @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
>  	if (sim_client_init(&dev->sim_client) < 0)
>  		goto _error;
>  
> -	dev->port = mad_get_field(&dev->sim_client.portinfo, 0,
> -				  IB_PORT_LOCAL_PORT_F);
> +	dev->port = dev->sim_client.portnum;
>  	for (i = 0; i < arrsize(dev->agents); i++)
>  		dev->agents[i].id = (uint32_t)(-1);
>  	for (i = 0; i < arrsize(dev->agent_idx); i++)
> -- 
> 1.5.6.4
> 


From sashak at voltaire.com  Tue Feb 17 13:55:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Feb 2009 23:55:40 +0200
Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port
	simulation support
In-Reply-To: <f0e08f230902171328m3fcc074ew65c978b3f8f81520@mail.gmail.com>
References: <20090214203753.GE32660@comcast.net>
	<20090217211848.GP7189@sashak.voltaire.com>
	<f0e08f230902171328m3fcc074ew65c978b3f8f81520@mail.gmail.com>
Message-ID: <20090217215533.GS7189@sashak.voltaire.com>

On 16:28 Tue 17 Feb     , Hal Rosenstock wrote:
> Sasha,
> 
> On Tue, Feb 17, 2009 at 4:18 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > Hi Hal,
> >
> > On 15:37 Sat 14 Feb     , hnrose at comcast.net wrote:
> >>
> >> Add SIM_PORT environment variable to allow for end port selection
> >
> > How this would handle case when SIM_PORT=N, but program tries to work
> > via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)?
> 
> That's a configuration error. SIM_PORT needs to be set to same port as
> program intends to use.

These are different things - the program doesn't have to know about the
simulator at all, so a dependency between '-C' and SIM_PORT is not a good
idea. Actually, I think SIM_PORT is not needed at all - see below.

> > IOW should port number selection be initiated natively by program rather
> > than by using environment variables?
> 
> That would've been nice but AFAIT the simulation layer needs the port
> number earlier than the program can supply it.

This holds only for the current implementation, where the sysfs tree is
generated (simulated) for just one port. Now, if you are going to fetch
all PortInfo(s) anyway, then the application can choose the port number
just by using its regular mechanisms - no need for any SIM_PORT variable.
(Likely you will need an additional sim_ctl() call, triggered by umad
open(), to set the port number on ibsim's client side.)
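
A minimal sketch of that idea, assuming the client keeps one 64-byte
PortInfo record per port and records the port only when the application
opens it. The struct and both function names are hypothetical and are not
part of ibsim; the port range rules follow the checks in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORTINFO_SIZE 64	/* bytes per PortInfo record, as in the patch */

/* Hypothetical client-side state: one PortInfo record per port number. */
struct client_ports {
	unsigned numports;	/* highest valid port number (NodeInfo NumPorts) */
	int is_switch;		/* a switch exposes only port 0 as an end port */
	int active_port;	/* -1 until the application opens a port */
	uint8_t *portinfo;	/* record for port n starts at n * PORTINFO_SIZE */
};

/* Hypothetical hook: would run when the application opens a umad port. */
static int client_select_port(struct client_ports *cp, int portnum)
{
	assert(cp != NULL);
	if (cp->is_switch) {
		if (portnum != 0)
			return -1;
	} else if (portnum < 1 || (unsigned)portnum > cp->numports) {
		return -1;
	}
	cp->active_port = portnum;
	return 0;
}

/* Return the PortInfo record of the selected port, or NULL if none is open. */
static const uint8_t *client_active_portinfo(const struct client_ports *cp)
{
	assert(cp != NULL);
	if (cp->active_port < 0)
		return NULL;
	return cp->portinfo + (size_t)cp->active_port * PORTINFO_SIZE;
}
```

With this shape the port number comes from the application's own open
call, so no environment variable has to agree with the command line.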

> Maybe that could be
> changed but I didn't dig into that.
> 
> >> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

[snip...]

> >> +
> >> +     // although pkeys also per port, current config same on all end ports
> >
> > Which is not correct really.
> 
> What are you referring to ? Is there some config for end port pkeys in
> the simulator ?

Each port on the ibsim side has its own pkey table (it has a default
preset value and can be configured using OpenSM and maybe ibutils, so a
special "out-of-band" config is not needed). And we need to display it
properly for each port.
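
As a rough illustration of per-port display, assuming the client stores a
separate pkey table for each port in host byte order: the table sizes, the
struct and the function name below are hypothetical, and only the
"ports/<n>/pkeys/<i>" layout and the "0x%04x" format come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PORTS	4	/* hypothetical limits, for the sketch only */
#define PKEYS_PER_PORT	4

/* Hypothetical storage: one pkey table per port, indexed by port number. */
struct pkey_tables {
	uint16_t pkeys[MAX_PORTS + 1][PKEYS_PER_PORT];	/* host byte order */
};

/* Build the relative sysfs path and file contents for one pkey entry. */
static int format_pkey_entry(const struct pkey_tables *t,
			     unsigned port, unsigned idx,
			     char *path, size_t path_len,
			     char *value, size_t value_len)
{
	assert(t != NULL);
	if (port > MAX_PORTS || idx >= PKEYS_PER_PORT)
		return -1;
	snprintf(path, path_len, "ports/%u/pkeys/%u", port, idx);
	snprintf(value, value_len, "0x%04x\n", (unsigned)t->pkeys[port][idx]);
	return 0;
}
```

Each port then gets its own pkeys directory filled from its own table,
instead of every port repeating one shared table.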

Sasha


From swise at opengridcomputing.com  Tue Feb 17 14:00:00 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Feb 2009 16:00:00 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for
	active connections.
Message-ID: <20090217215959.16117.17150.stgit@NTAC>

- wrap calls into cxgb3 and fail them if we're in the middle
  of an EEH event.

- correctly unwind and release endpoint and other resources when
  we are in an EEH event.

- post DEVICE_FATAL event on all active QPs when cxgb3 notifies
  iw_cxgb3 of a fatal error.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
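
The wrapper pattern described above can be sketched in user space as
follows. This is only a model under stated assumptions: fake_dev,
lower_send() and guarded_send() are hypothetical stand-ins, with
malloc()/free() playing the role of skb allocation and kfree_skb(). It
shows the ownership rule the wrappers in this patch follow: the buffer is
released both when the device is flagged as failed and when the
lower-layer send reports an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device model; "flags" is nonzero after a fatal error. */
struct fake_dev {
	unsigned flags;
	unsigned sent;		/* buffers accepted by the lower layer */
};

/* Stand-in for the lower-layer send: consumes the buffer on success. */
static int lower_send(struct fake_dev *dev, void *buf)
{
	dev->sent++;
	free(buf);
	return 0;
}

/* Refuse to send while the device is in error; never leak the buffer. */
static int guarded_send(struct fake_dev *dev, void *buf)
{
	int error;

	assert(dev != NULL);
	if (dev->flags) {
		free(buf);
		return -EIO;
	}
	error = lower_send(dev, buf);
	if (error)
		free(buf);
	return error;
}
```

Because the wrapper returns the error, callers can pass it straight back
instead of returning 0 unconditionally.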

 drivers/infiniband/hw/cxgb3/cxio_hal.c |    8 +--
 drivers/infiniband/hw/cxgb3/cxio_hal.h |    1 
 drivers/infiniband/hw/cxgb3/iwch.c     |   26 +++++++++
 drivers/infiniband/hw/cxgb3/iwch.h     |    5 ++
 drivers/infiniband/hw/cxgb3/iwch_cm.c  |   90 +++++++++++++++++++++++---------
 drivers/infiniband/hw/cxgb3/iwch_qp.c  |    4 +
 6 files changed, 101 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index eeae5f5..99d114d 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -152,7 +152,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
 	sge_cmd = qpid << 8 | 3;
 	wqe->sge_cmd = cpu_to_be64(sge_cmd);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
 }
 
 int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
@@ -571,7 +571,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
 	     (unsigned long long) rdev_p->ctrl_qp.dma_addr,
 	     rdev_p->ctrl_qp.workq, 1 << T3_CTRL_QP_SIZE_LOG2);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
 err:
 	kfree_skb(skb);
 	return err;
@@ -858,7 +858,7 @@ int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
 	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
 	wqe->irs = cpu_to_be32(attr->irs);
 	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
 }
 
 void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
@@ -1024,9 +1024,9 @@ void cxio_rdev_close(struct cxio_rdev *rdev_p)
 		cxio_hal_pblpool_destroy(rdev_p);
 		cxio_hal_rqtpool_destroy(rdev_p);
 		list_del(&rdev_p->entry);
-		rdev_p->t3cdev_p->ulp = NULL;
 		cxio_hal_destroy_ctrl_qp(rdev_p);
 		cxio_hal_destroy_resource(rdev_p->rscp);
+		rdev_p->t3cdev_p->ulp = NULL;
 	}
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h
index 9ed65b0..6cbf216 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h
@@ -185,6 +185,7 @@ void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
 void cxio_flush_hw_cq(struct t3_cq *cq);
 int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
 		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb);
 
 #define MOD "iw_cxgb3: "
 #define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 37a4fc2..e5d57fa 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -162,15 +162,37 @@ static void close_rnic_dev(struct t3cdev *tdev)
 	mutex_unlock(&dev_mutex);
 }
 
+static int iwch_post_qp_fatal(int id, void *p, void *data)
+{
+	struct ib_event event;
+	struct iwch_qp *qhp = p;
+
+	event.event = IB_EVENT_DEVICE_FATAL;
+	event.device = qhp->ibqp.device;
+	event.element.qp = &qhp->ibqp;
+	BUG_ON(qhp->rhp != data);
+	BUG_ON(qhp->wq.qpid != id);
+	if (qhp->ibqp.event_handler) {
+		PDBG("%s posting DEVICE_FATAL for qpid %u\n",
+			__func__, qhp->wq.qpid);
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+	}
+	return 0;
+}
+
 static void iwch_err_handler(struct t3cdev *tdev, u32 status, u32 error)
 {
 	struct cxio_rdev *rdev = tdev->ulp;
+	struct iwch_dev *rnicp = rdev_to_iwch_dev(rdev);
 
-	if (status == OFFLOAD_STATUS_DOWN)
+	if (status == OFFLOAD_STATUS_DOWN) {
 		rdev->flags = CXIO_ERROR_FATAL;
+		spin_lock_irq(&rnicp->lock);
+		idr_for_each(&rnicp->qpidr, iwch_post_qp_fatal, rnicp);
+		spin_unlock_irq(&rnicp->lock);
+	}
 
 	return;
-
 }
 
 static int __init iwch_init_module(void)
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index 3773453..8473550 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -117,6 +117,11 @@ static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
 	return container_of(ibdev, struct iwch_dev, ibdev);
 }
 
+static inline struct iwch_dev *rdev_to_iwch_dev(struct cxio_rdev *rdev)
+{
+	return container_of(rdev, struct iwch_dev, rdev);
+}
+
 static inline int t3b_device(const struct iwch_dev *rhp)
 {
 	return rhp->rdev.t3cdev_p->type == T3B;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 8699947..8ef670d 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -139,6 +139,38 @@ static void stop_ep_timer(struct iwch_ep *ep)
 	put_ep(&ep->com);
 }
 
+int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
+{
+	int	error=0;
+	struct cxio_rdev *rdev;
+
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (rdev->flags) {
+		kfree_skb(skb);
+		return -EIO;
+	}
+	error = l2t_send(tdev, skb, l2e);
+	if (error)
+		kfree_skb(skb);
+	return error;
+}
+
+int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb)
+{
+	int	error=0;
+	struct cxio_rdev *rdev;
+
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (rdev->flags) {
+		kfree_skb(skb);
+		return -EIO;
+	}
+	error = cxgb3_ofld_send(tdev, skb);
+	if (error)
+		kfree_skb(skb);
+	return error;
+}
+
 static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
 {
 	struct cpl_tid_release *req;
@@ -150,7 +182,7 @@ static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
 	skb->priority = CPL_PRIORITY_SETUP;
-	cxgb3_ofld_send(tdev, skb);
+	iwch_cxgb3_ofld_send(tdev, skb);
 	return;
 }
 
@@ -172,8 +204,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep)
 	req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE);
 
 	skb->priority = CPL_PRIORITY_DATA;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 int iwch_resume_tid(struct iwch_ep *ep)
@@ -194,8 +225,7 @@ int iwch_resume_tid(struct iwch_ep *ep)
 	req->val = 0;
 
 	skb->priority = CPL_PRIORITY_DATA;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static void set_emss(struct iwch_ep *ep, u16 opt)
@@ -382,7 +412,7 @@ static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
 
 	PDBG("%s t3cdev %p\n", __func__, dev);
 	req->cmd = CPL_ABORT_NO_RST;
-	cxgb3_ofld_send(dev, skb);
+	iwch_cxgb3_ofld_send(dev, skb);
 }
 
 static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
@@ -402,8 +432,7 @@ static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
 	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
@@ -424,8 +453,7 @@ static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
 	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
 	req->cmd = CPL_ABORT_SEND_RST;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_connect(struct iwch_ep *ep)
@@ -469,8 +497,7 @@ static int send_connect(struct iwch_ep *ep)
 	req->opt0l = htonl(opt0l);
 	req->params = 0;
 	req->opt2 = htonl(opt2);
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
@@ -527,7 +554,7 @@ static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
+	iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 	start_ep_timer(ep);
 	state_set(&ep->com, MPA_REQ_SENT);
 	return;
@@ -578,8 +605,7 @@ static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
@@ -630,8 +656,7 @@ static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
 	req->sndseq = htonl(ep->snd_seq);
 	ep->mpa_skb = skb;
 	state_set(&ep->com, MPA_REP_SENT);
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
@@ -795,7 +820,7 @@ static int update_rx_credits(struct iwch_ep *ep, u32 credits)
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
 	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
 	skb->priority = CPL_PRIORITY_ACK;
-	cxgb3_ofld_send(ep->com.tdev, skb);
+	iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 	return credits;
 }
 
@@ -1203,8 +1228,7 @@ static int listen_start(struct iwch_listen_ep *ep)
 	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
 
 	skb->priority = 1;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
@@ -1237,8 +1261,7 @@ static int listen_stop(struct iwch_listen_ep *ep)
 	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
@@ -1286,7 +1309,7 @@ static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
 	rpl->opt2 = htonl(opt2);
 	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
 	skb->priority = CPL_PRIORITY_SETUP;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
+	iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 
 	return;
 }
@@ -1315,7 +1338,7 @@ static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
 		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
 		rpl->opt2 = 0;
 		rpl->rsvd = rpl->opt2;
-		cxgb3_ofld_send(tdev, skb);
+		iwch_cxgb3_ofld_send(tdev, skb);
 	}
 }
 
@@ -1613,7 +1636,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
 	rpl->cmd = CPL_ABORT_NO_RST;
-	cxgb3_ofld_send(ep->com.tdev, rpl_skb);
+	iwch_cxgb3_ofld_send(ep->com.tdev, rpl_skb);
 out:
 	if (release)
 		release_ep_resources(ep);
@@ -2017,8 +2040,11 @@ int iwch_destroy_listen(struct iw_cm_id *cm_id)
 	ep->com.rpl_done = 0;
 	ep->com.rpl_err = 0;
 	err = listen_stop(ep);
+	if (err)
+		goto done;
 	wait_event(ep->com.waitq, ep->com.rpl_done);
 	cxgb3_free_stid(ep->com.tdev, ep->stid);
+done:
 	err = ep->com.rpl_err;
 	cm_id->rem_ref(cm_id);
 	put_ep(&ep->com);
@@ -2030,12 +2056,22 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
 	int ret=0;
 	unsigned long flags;
 	int close = 0;
+	int fatal = 0;
+	struct t3cdev *tdev;
+	struct cxio_rdev *rdev;
 
 	spin_lock_irqsave(&ep->com.lock, flags);
 
 	PDBG("%s ep %p state %s, abrupt %d\n", __func__, ep,
 	     states[ep->com.state], abrupt);
 
+	tdev = (struct t3cdev *)ep->com.tdev;
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (rdev->flags) {
+		fatal = 1;
+		close_complete_upcall(ep);
+		ep->com.state = DEAD;
+	}
 	switch (ep->com.state) {
 	case MPA_REQ_WAIT:
 	case MPA_REQ_SENT:
@@ -2075,7 +2111,11 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
 			ret = send_abort(ep, NULL, gfp);
 		else
 			ret = send_halfclose(ep, gfp);
+		if (ret)
+			fatal = 1;
 	}
+	if (fatal)
+		release_ep_resources(ep);
 	return ret;
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index aa72d18..9324aa1 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -751,7 +751,7 @@ int iwch_post_zb_read(struct iwch_qp *qhp)
 	wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)|
 						V_FW_RIWR_LEN(flit_cnt));
 	skb->priority = CPL_PRIORITY_DATA;
-	return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
+	return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
 }
 
 /*
@@ -783,7 +783,7 @@ int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
 			 V_FW_RIWR_FLAGS(T3_COMPLETION_FLAG | T3_NOTIFY_FLAG));
 	wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid));
 	skb->priority = CPL_PRIORITY_DATA;
-	return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
+	return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
 }
 
 /*


From sashak at voltaire.com  Tue Feb 17 14:09:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 00:09:33 +0200
Subject: [ofa-general] [PATCH] opensm: pre-scan command line for config file
	option
Message-ID: <20090217220933.GT7189@sashak.voltaire.com>


Scan the command line for the config file option and parse the config
file, if found, before processing the other command line options. This
prevents options listed before '-F' from potentially being set multiple
times (the command line was rescanned anyway).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
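
A minimal sketch of the two-pass scan described above, using plain POSIX
getopt() rather than getopt_long_only(). The settings struct, the option
letters other than 'F', and load_config_stub() are hypothetical; the stub
only pretends to read a file. The sketch resets optind to 1, the
traditional value, whereas the patch uses 0, which glibc treats as a full
reinitialization. Silencing opterr during the first pass is an addition of
this sketch, so that a bad option is reported only once.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

struct settings {
	const char *config_file;	/* from -F, NULL if absent */
	int verbose;			/* from -v */
	int priority;			/* from -p <n> */
};

/* Hypothetical stand-in for a config file parser: sets a file default. */
static void load_config_stub(const char *path, struct settings *s)
{
	(void)path;
	s->priority = 1;
}

static int parse_args(int argc, char *argv[], struct settings *s)
{
	static const char optstring[] = "F:p:v";
	int saved_opterr = opterr;
	int c;

	assert(s != NULL);

	/* Pass 1: look only for -F and stay quiet about anything else. */
	opterr = 0;
	optind = 1;
	while ((c = getopt(argc, argv, optstring)) != -1)
		if (c == 'F')
			s->config_file = optarg;
	opterr = saved_opterr;

	if (s->config_file)
		load_config_stub(s->config_file, s);

	/* Pass 2: rescan from the start; the command line overrides the file. */
	optind = 1;
	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
		case 'F':	/* already handled in pass 1 */
			break;
		case 'p':
			s->priority = atoi(optarg);
			break;
		case 'v':
			s->verbose = 1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

With arguments such as "-p 7 -F x.conf", the priority given on the command
line is applied after the file is loaded, even though it precedes '-F'.
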
 opensm/opensm/main.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index a8dc9e6..a632cd7 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -522,9 +522,8 @@ int main(int argc, char *argv[])
 	boolean_t run_once_flag = FALSE;
 	int32_t vendor_debug = 0;
 	uint32_t next_option;
-	char *conf_template = NULL;
+	char *conf_template = NULL, *config_file = NULL;
 	uint32_t val;
-	unsigned config_file_done = 0;
 	const char *const short_option =
 	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:";
 
@@ -608,7 +607,26 @@ int main(int argc, char *argv[])
 	osm_subn_set_default_opt(&opt);
 
 	if (osm_subn_parse_conf_file(OSM_DEFAULT_CONFIG_FILE, &opt) < 0)
-		printf("\nosm_subn_parse_conf_file failed!\n");
+		printf("\nFail to parse config file \'%s\'\n",
+		       OSM_DEFAULT_CONFIG_FILE);
+
+	do {
+		next_option = getopt_long_only(argc, argv, short_option,
+					       long_option, NULL);
+		switch (next_option) {
+		case 'F':
+			config_file = optarg;
+			printf("Config file is `%s`:\n", config_file);
+			break;
+		default:
+			break;
+		}
+	} while (next_option != -1);
+
+	optind = 0; /* reset command line */
+
+	if (config_file && osm_subn_parse_conf_file(config_file, &opt) < 0)
+		printf("\nFail to parse config file \'%s\'\n", config_file);
 
 	printf("Command Line Arguments:\n");
 	do {
@@ -619,16 +637,6 @@ int main(int argc, char *argv[])
 			exit(0);
 			break;
 		case 'F':
-			if (config_file_done)
-				break;
-			printf("Reloading config from `%s`:\n", optarg);
-			if (osm_subn_parse_conf_file(optarg, &opt)) {
-				printf("cannot parse config file.\n");
-				exit(1);
-			}
-			printf("Rescaning command line:\n");
-			config_file_done = 1;
-			optind = 0;
 			break;
 		case 'c':
 			conf_template = optarg;
@@ -936,8 +944,7 @@ int main(int argc, char *argv[])
 		default:	/* something wrong */
 			abort();
 		}
-	}
-	while (next_option != -1);
+	} while (next_option != -1);
 
 	if (opt.log_file != NULL)
 		printf(" Log File: %s\n", opt.log_file);
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Tue Feb 17 14:25:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 00:25:19 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: move parse and setup
	functions
Message-ID: <20090217222519.GU7189@sashak.voltaire.com>


Move the option parse and setup functions above the options rec struct
initialization - this eliminates the need for prototypes, typedefs, etc.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
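
The layout this change aims for can be illustrated with a small
table-driven parser. All names here (struct opts, opt_tbl, set_option and
the two parse helpers) are hypothetical and much simpler than the OpenSM
code. The point is only that once the handlers are defined ahead of the
record table, the table can refer to them without forward prototypes or
function typedefs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical options structure, standing in for a much larger one. */
struct opts {
	uint8_t log_flags;
	uint32_t max_size;
};

/* Handlers come first, so the table below needs no prototypes. */
static void parse_uint8(const char *str, void *p_val)
{
	*(uint8_t *)p_val = (uint8_t)strtoul(str, NULL, 0);
}

static void parse_uint32(const char *str, void *p_val)
{
	*(uint32_t *)p_val = (uint32_t)strtoul(str, NULL, 0);
}

struct opt_rec {
	const char *name;
	size_t offset;				/* field offset in struct opts */
	void (*parse_fn)(const char *str, void *p_val);
};

static const struct opt_rec opt_tbl[] = {
	{ "log_flags", offsetof(struct opts, log_flags), parse_uint8 },
	{ "max_size", offsetof(struct opts, max_size), parse_uint32 },
	{ NULL, 0, NULL }
};

/* Look up key in the table and parse val into the matching field. */
static int set_option(struct opts *o, const char *key, const char *val)
{
	const struct opt_rec *r;

	assert(o != NULL && key != NULL && val != NULL);
	for (r = opt_tbl; r->name; r++) {
		if (strcmp(r->name, key) == 0) {
			r->parse_fn(val, (char *)o + r->offset);
			return 0;
		}
	}
	return -1;
}
```
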
 opensm/opensm/osm_subnet.c |  421 +++++++++++++++++++++-----------------------
 1 files changed, 204 insertions(+), 217 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 69937c1..f12685e 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -73,25 +73,217 @@ static const char null_str[] = "(null)";
 
 #define OPT_OFFSET(opt) offsetof(osm_subn_opt_t, opt)
 
-typedef void (setup_fn_t)(osm_subn_t *p_subn, void *p_val);
-typedef void (parse_fn_t)(osm_subn_t *p_subn, char *p_key, char *p_val_str,
-			  void *p_val, setup_fn_t *f);
-
 typedef struct opt_rec {
 	const char *name;
 	unsigned long opt_offset;
-	parse_fn_t *parse_fn;
-	setup_fn_t *setup_fn;
+	void (*parse_fn)(osm_subn_t *p_subn, char *p_key, char *p_val_str,
+			 void *p_val, void (*)(osm_subn_t *, void *));
+	void (*setup_fn)(osm_subn_t *p_subn, void *p_val);
 	int  can_update;
 } opt_rec_t;
 
-static parse_fn_t opts_parse_uint8, opts_parse_uint16, opts_parse_net16,
-	opts_parse_uint32, opts_parse_int32, opts_parse_net64,
-	opts_parse_charp, opts_parse_boolean;
+static void log_report(const char *fmt, ...)
+{
+	char buf[128];
+	va_list args;
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printf("%s", buf);
+	cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0);
+}
+
+static void log_config_value(char *name, const char *fmt, ...)
+{
+	char buf[128];
+	va_list args;
+	unsigned n;
+	va_start(args, fmt);
+	n = snprintf(buf, sizeof(buf), " Loading Cached Option:%s = ", name);
+	if (n > sizeof(buf))
+		n = sizeof(buf);
+	n += vsnprintf(buf + n, sizeof(buf) - n, fmt, args);
+	if (n > sizeof(buf))
+		n = sizeof(buf);
+	snprintf(buf + n, sizeof(buf) - n, "\n");
+	va_end(args);
+	printf("%s", buf);
+	cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0);
+}
+
+static void opts_setup_log_flags(osm_subn_t *p_subn, void *p_val)
+{
+	p_subn->p_osm->log.level = *((uint8_t *) p_val);
+}
+
+static void opts_setup_force_log_flush(osm_subn_t *p_subn, void *p_val)
+{
+	p_subn->p_osm->log.flush = *((boolean_t *) p_val);
+}
+
+static void opts_setup_accum_log_file(osm_subn_t *p_subn, void *p_val)
+{
+	p_subn->p_osm->log.accum_log_file = *((boolean_t *) p_val);
+}
+
+static void opts_setup_log_max_size(osm_subn_t *p_subn, void *p_val)
+{
+	uint32_t log_max_size = *((uint32_t *) p_val);
+
+	p_subn->p_osm->log.max_size = log_max_size << 20; /* convert from MB to bytes */
+}
+
+static void opts_setup_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val)
+{
+	osm_sm_t *p_sm = &p_subn->p_osm->sm;
+	uint32_t sminfo_polling_timeout = *((uint32_t *) p_val);
+
+	cl_timer_stop(&p_sm->polling_timer);
+	cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout);
+}
+
+static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val)
+{
+	osm_sm_t *p_sm = &p_subn->p_osm->sm;
+	uint8_t sm_priority = *((uint8_t *) p_val);
+
+	osm_set_sm_priority(p_sm, sm_priority);
+}
+
+static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key,
+			     IN char *p_val_str, IN void *p_v,
+			     void (*pfn)(osm_subn_t *, void *))
+{
+	uint64_t *p_val = p_v;
+	uint64_t val = strtoull(p_val_str, NULL, 0);
+
+	if (cl_hton64(val) != *p_val) {
+		log_config_value(p_key, "0x%016" PRIx64, val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = cl_ntoh64(val);
+	}
+}
+
+static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key,
+			      IN char *p_val_str, IN void *p_v,
+			      void (*pfn)(osm_subn_t *, void *))
+{
+	uint32_t *p_val = p_v;
+	uint32_t val = strtoul(p_val_str, NULL, 0);
+
+	if (val != *p_val) {
+		log_config_value(p_key, "%u", val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = val;
+	}
+}
+
+static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key,
+			     IN char *p_val_str, IN void *p_v,
+			     void (*pfn)(osm_subn_t *, void *))
+{
+	int32_t *p_val = p_v;
+	int32_t val = strtol(p_val_str, NULL, 0);
+
+	if (val != *p_val) {
+		log_config_value(p_key, "%d", val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = val;
+	}
+}
+
+static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key,
+			      IN char *p_val_str, IN void *p_v,
+			      void (*pfn)(osm_subn_t *, void *))
+{
+	uint16_t *p_val = p_v;
+	uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0);
+
+	if (val != *p_val) {
+		log_config_value(p_key, "%u", val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = val;
+	}
+}
+
+static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key,
+			     IN char *p_val_str, IN void *p_v,
+			     void (*pfn)(osm_subn_t *, void *))
+{
+	uint16_t *p_val = p_v;
+	uint16_t val = strtoul(p_val_str, NULL, 0);
+
+	CL_ASSERT(val < 0x10000);
+	if (cl_hton16(val) != *p_val) {
+		log_config_value(p_key, "0x%04x", val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = cl_hton16(val);
+	}
+}
+
+static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key,
+			     IN char *p_val_str, IN void *p_v,
+			     void (*pfn)(osm_subn_t *, void *))
+{
+	uint8_t *p_val = p_v;
+	uint8_t val = strtoul(p_val_str, NULL, 0);
+
+	CL_ASSERT(val < 0x100);
+	if (val != *p_val) {
+		log_config_value(p_key, "%u", val);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = val;
+	}
+}
+
+static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key,
+			       IN char *p_val_str, IN void *p_v,
+			       void (*pfn)(osm_subn_t *, void *))
+{
+	boolean_t *p_val = p_v;
+	boolean_t val;
 
-static setup_fn_t opts_setup_log_flags, opts_setup_log_max_size,
-	opts_setup_force_log_flush, opts_setup_accum_log_file,
-	opts_setup_sminfo_polling_timeout, opts_setup_sm_priority;
+	if (!p_val_str)
+		return;
+
+	if (strcmp("TRUE", p_val_str))
+		val = FALSE;
+	else
+		val = TRUE;
+
+	if (val != *p_val) {
+		log_config_value(p_key, "%s", p_val_str);
+		if (pfn)
+			pfn(p_subn, &val);
+		*p_val = val;
+	}
+}
+
+static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key,
+			     IN char *p_val_str, IN void *p_v,
+			     void (*pfn)(osm_subn_t *, void *))
+{
+	char **p_val = p_v;
+	const char *current_str = *p_val ? *p_val : null_str ;
+
+	if (p_val_str && strcmp(p_val_str, current_str)) {
+		char *new;
+		log_config_value(p_key, "%s", p_val_str);
+		/* special case the "(null)" string */
+		new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL;
+		if (pfn)
+			pfn(p_subn, new);
+		if (*p_val)
+			free(*p_val);
+		*p_val = new;
+	}
+}
 
 static const opt_rec_t opt_tbl[] = {
 	{ "guid", OPT_OFFSET(guid), opts_parse_net64, NULL, 0 },
@@ -196,45 +388,6 @@ static const opt_rec_t opt_tbl[] = {
 	{0}
 };
 
-static void opts_setup_log_flags(osm_subn_t *p_subn, void *p_val)
-{
-	p_subn->p_osm->log.level = *((uint8_t *) p_val);
-}
-
-static void opts_setup_force_log_flush(osm_subn_t *p_subn, void *p_val)
-{
-	p_subn->p_osm->log.flush = *((boolean_t *) p_val);
-}
-
-static void opts_setup_accum_log_file(osm_subn_t *p_subn, void *p_val)
-{
-	p_subn->p_osm->log.accum_log_file = *((boolean_t *) p_val);
-}
-
-static void opts_setup_log_max_size(osm_subn_t *p_subn, void *p_val)
-{
-	uint32_t log_max_size = *((uint32_t *) p_val);
-
-	p_subn->p_osm->log.max_size = log_max_size << 20; /* convert from MB to bytes */
-}
-
-static void opts_setup_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val)
-{
-	osm_sm_t *p_sm = &p_subn->p_osm->sm;
-	uint32_t sminfo_polling_timeout = *((uint32_t *) p_val);
-
-	cl_timer_stop(&p_sm->polling_timer);
-	cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout);
-}
-
-static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val)
-{
-	osm_sm_t *p_sm = &p_subn->p_osm->sm;
-	uint8_t sm_priority = *((uint8_t *) p_val);
-
-	osm_set_sm_priority(p_sm, sm_priority);
-}
-
 /**********************************************************************
  **********************************************************************/
 void osm_subn_construct(IN osm_subn_t * const p_subn)
@@ -596,172 +749,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 
 /**********************************************************************
  **********************************************************************/
-static void log_report(const char *fmt, ...)
-{
-	char buf[128];
-	va_list args;
-	va_start(args, fmt);
-	vsnprintf(buf, sizeof(buf), fmt, args);
-	va_end(args);
-	printf("%s", buf);
-	cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0);
-}
-
-static void log_config_value(char *name, const char *fmt, ...)
-{
-	char buf[128];
-	va_list args;
-	unsigned n;
-	va_start(args, fmt);
-	n = snprintf(buf, sizeof(buf), " Loading Cached Option:%s = ", name);
-	if (n > sizeof(buf))
-		n = sizeof(buf);
-	n += vsnprintf(buf + n, sizeof(buf) - n, fmt, args);
-	if (n > sizeof(buf))
-		n = sizeof(buf);
-	snprintf(buf + n, sizeof(buf) - n, "\n");
-	va_end(args);
-	printf("%s", buf);
-	cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0);
-}
-
-static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
-			     IN setup_fn_t pfn)
-{
-	uint64_t *p_val = p_v;
-	uint64_t val = strtoull(p_val_str, NULL, 0);
-
-	if (cl_hton64(val) != *p_val) {
-		log_config_value(p_key, "0x%016" PRIx64, val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = cl_ntoh64(val);
-	}
-}
-
-static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key,
-			      IN char *p_val_str, IN void *p_v,
-			      IN setup_fn_t pfn)
-{
-	uint32_t *p_val = p_v;
-	uint32_t val = strtoul(p_val_str, NULL, 0);
-
-	if (val != *p_val) {
-		log_config_value(p_key, "%u", val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = val;
-	}
-}
-
-static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
-			     IN setup_fn_t pfn)
-{
-	int32_t *p_val = p_v;
-	int32_t val = strtol(p_val_str, NULL, 0);
-
-	if (val != *p_val) {
-		log_config_value(p_key, "%d", val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = val;
-	}
-}
-
-static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key,
-			      IN char *p_val_str, IN void *p_v,
-			      IN setup_fn_t pfn)
-{
-	uint16_t *p_val = p_v;
-	uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0);
-
-	if (val != *p_val) {
-		log_config_value(p_key, "%u", val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = val;
-	}
-}
-
-static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
-			     IN setup_fn_t pfn)
-{
-	uint16_t *p_val = p_v;
-	uint16_t val = strtoul(p_val_str, NULL, 0);
-
-	CL_ASSERT(val < 0x10000);
-	if (cl_hton16(val) != *p_val) {
-		log_config_value(p_key, "0x%04x", val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = cl_hton16(val);
-	}
-}
-
-static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
-			     IN setup_fn_t pfn)
-{
-	uint8_t *p_val = p_v;
-	uint8_t val = strtoul(p_val_str, NULL, 0);
-
-	CL_ASSERT(val < 0x100);
-	if (val != *p_val) {
-		log_config_value(p_key, "%u", val);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = val;
-	}
-}
-
-static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key,
-			       IN char *p_val_str, IN void *p_v,
-			       IN setup_fn_t pfn)
-{
-	boolean_t *p_val = p_v;
-	boolean_t val;
-
-	if (!p_val_str)
-		return;
-
-	if (strcmp("TRUE", p_val_str))
-		val = FALSE;
-	else
-		val = TRUE;
-
-	if (val != *p_val) {
-		log_config_value(p_key, "%s", p_val_str);
-		if (pfn)
-			pfn(p_subn, &val);
-		*p_val = val;
-	}
-}
-
-static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
-			     IN setup_fn_t pfn)
-{
-	char **p_val = p_v;
-	const char *current_str = *p_val ? *p_val : null_str ;
-
-	if (p_val_str && strcmp(p_val_str, current_str)) {
-		char *new;
-		log_config_value(p_key, "%s", p_val_str);
-		/* special case the "(null)" string */
-		new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL;
-		if (pfn)
-			pfn(p_subn, new);
-		if (*p_val)
-			free(*p_val);
-		*p_val = new;
-	}
-}
-
-/**********************************************************************
- **********************************************************************/
 static char *clean_val(char *val)
 {
 	char *p = val;
-- 
1.6.1.2.319.gbd9e


From rdreier at cisco.com  Tue Feb 17 14:27:28 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Feb 2009 14:27:28 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: Inform hardware that
	asynchronous event has been handled
In-Reply-To: <20090213212431.GA7092@ctung-MOBL> (Chien Tung's message of "Fri, 
	13 Feb 2009 15:24:31 -0600")
References: <20090213212431.GA7092@ctung-MOBL>
Message-ID: <adahc2s66nj.fsf@cisco.com>

thanks, applied


From sean.hefty at intel.com  Tue Feb 17 14:27:35 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:27:35 -0800
Subject: [ofa-general] [PATCH 0/8] ib-mgmt: add support for WinOF
Message-ID: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>

Enable IB management diagnostic tools to support both OFED and WinOF
releases.  Only 8 of the diags have been ported to both platforms.  These
changes allow the management.git tree to drop into the WinOF build
environment. 

The following applies only to WinOF.  The WinOF environment adds the following:

src/ibdiag_windows.c - windows specific source file built as part
				of all diags (includes getopt.c)
include/windows/ - directory for windows version of include files
	config.h - included by all diags as an 'OS independent' file
			mainly #defines to map stuff like foo to _foo
	ibdiag_version.h - defines IBDIAG_VERSION
	inttypes.h - empty include file
	unistd.h - empty include file
	netinet/in.h - empty include file

cl_nodenamemap - was added to Windows user complib

I'll submit patches for changes to the WinOF tree that touch areas outside of
the tools/infiniband-diags directory separately to the ofw mailing list.
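
To give a rough idea of the "map foo to _foo" defines mentioned for
config.h, here is a hypothetical excerpt. It is a sketch only, not the
actual WinOF header. The two mappings shown use names that the Microsoft
C runtime does provide (_strdup, _strtoui64), and demo_parse_guid is an
illustrative helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical excerpt in the spirit of include/windows/config.h:
 * map the names used by common code to the underscore-prefixed
 * spellings provided by the Microsoft C runtime. */
#if defined(_MSC_VER)
#define strdup   _strdup
#define strtoull _strtoui64
#endif

/* Common code keeps using the portable spelling on every platform. */
static unsigned long long demo_parse_guid(const char *str)
{
	assert(str != NULL);
	return strtoull(str, NULL, 0);	/* base 0: accepts 0x... and decimal */
}
```

On Linux the #if block is skipped and the standard names are used
unchanged, so the diags themselves need no per-platform #ifdefs for
these calls.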

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Tue Feb 17 14:30:54 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:30:54 -0800
Subject: [ofa-general] [PATCH 1/8] [ib-diag] sminfo: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <875F4D297F9E4C0297A87743053C95C7@amr.corp.intel.com>

Allow sminfo to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Note: all patches are also available at:

git://git.openfabrics.org/~shefty/ib-mgmt.git master
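
Background on the initializer changes, as a sketch: an empty "{}"
initializer is a GNU extension (not ISO C before C23), and the
"[INDEX] value" form without "=" is an obsolete GNU spelling of a
designated initializer, so presumably neither is accepted by the
Microsoft compiler. The portable forms look like this. All demo_* names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
	DEMO_NOTACT,
	DEMO_DISCOVER,
	DEMO_STANDBY,
	DEMO_MASTER,
	DEMO_STATE_LAST
};

/* Positional initializers: entries must stay in enum order. */
static const char *demo_statestr[] = {
	"DEMO_NOTACT",
	"DEMO_DISCOVER",
	"DEMO_STANDBY",
	"DEMO_MASTER",
};

/* Bounds check guards against states the table does not cover. */
#define DEMO_STATESTR(s) \
	(((unsigned)(s)) < DEMO_STATE_LAST ? demo_statestr[s] : "???")

struct demo_opt {
	const char *name;
	int letter;
	int has_arg;
};

static const struct demo_opt demo_opts[] = {
	{ "state", 's', 1 },
	{ "priority", 'p', 1 },
	{ 0 }		/* sentinel: name == NULL, remaining fields zeroed */
};

/* Counts entries up to, but not including, the sentinel. */
static size_t demo_count_opts(const struct demo_opt *o)
{
	size_t n = 0;

	assert(o != NULL);
	while (o[n].name)
		n++;
	return n;
}
```

The trade-off of dropping the index designators is that the string
table silently goes out of step if the enum is ever reordered.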

 infiniband-diags/src/ibdiag_common.c |   10 ++++------
 infiniband-diags/src/sminfo.c        |   10 +++++-----
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index bda1efa..5f2472d 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -204,7 +204,7 @@ static const struct ibdiag_opt common_opts[] = {
 	{ "usage", 'u', 0, NULL, "usage message" },
 	{ "help", 'h', 0, NULL, "help message" },
 	{ "version", 'V', 0, NULL, "show version" },
-	{}
+	{ 0 }
 };
 
 static void make_opt(struct option *l, const struct ibdiag_opt *o,
@@ -254,11 +254,11 @@ static struct option *make_long_opts(const char *exclude_str,
 
 static void make_str_opts(const struct option *o, char *p, unsigned size)
 {
-	int i, n = 0;
+	unsigned i, n = 0;
 
 	for (n = 0; o->name  && n + 2 + o->has_arg < size; o++) {
-		p[n++] = o->val;
-		for (i = 0; i < o->has_arg; i++)
+		p[n++] = (char) o->val;
+		for (i = 0; i < (unsigned) o->has_arg; i++)
 			p[n++] = ':';
 	}
 	p[n] = '\0';
@@ -273,8 +273,6 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt,
 	char str_opts[1024];
 	const struct ibdiag_opt *o;
 
-	memset(opts_map, 0, sizeof(opts_map));
-
 	prog_name = argv[0];
 	prog_args = usage_args;
 	prog_examples = usage_examples;
diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
index e96c782..549cb81 100644
--- a/infiniband-diags/src/sminfo.c
+++ b/infiniband-diags/src/sminfo.c
@@ -59,10 +59,10 @@ enum {
 };
 
 char *statestr[] = {
-	[SMINFO_NOTACT] "SMINFO_NOTACT",
-	[SMINFO_DISCOVER] "SMINFO_DISCOVER",
-	[SMINFO_STANDBY] "SMINFO_STANDBY",
-	[SMINFO_MASTER] "SMINFO_MASTER",
+	"SMINFO_NOTACT",
+	"SMINFO_DISCOVER",
+	"SMINFO_STANDBY",
+	"SMINFO_MASTER",
 };
 
 #define STATESTR(s)	(((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???")
@@ -100,7 +100,7 @@ int main(int argc, char **argv)
 		{ "state", 's', 1, "<0-3>", "set SM state"},
 		{ "priority", 'p', 1, "<0-15>", "set SM priority"},
 		{ "activity", 'a', 1, NULL, "set activity count"},
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<sm_lid|sm_dr_path> [modifier]";
 

From sean.hefty at intel.com  Tue Feb 17 14:31:31 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:31:31 -0800
Subject: [ofa-general] [PATCH 2/8] [ib-diag] vendstat: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <056C849D67044206B1E98248885FAA00@amr.corp.intel.com>

Allow vendstat to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/vendstat.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c
index 7e8b162..db87e38 100644
--- a/infiniband-diags/src/vendstat.c
+++ b/infiniband-diags/src/vendstat.c
@@ -134,7 +134,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "N", 'N', 0, NULL, "show IS3 general information"},
 		{ "w", 'w', 0, NULL, "show IS3 port xmit wait counters"},
-		{}
+		{ 0 }
 	};
 	char usage_args[] = "<lid|guid>";
 	const char *usage_examples[] = {


From sean.hefty at intel.com  Tue Feb 17 14:32:05 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:32:05 -0800
Subject: [ofa-general] [PATCH 3/8] [ib-diag] ibaddr: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <9C13C064B6594BDF8A52393F37267001@amr.corp.intel.com>

Allow ibaddr to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibaddr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
index 88ad904..9098699 100644
--- a/infiniband-diags/src/ibaddr.c
+++ b/infiniband-diags/src/ibaddr.c
@@ -112,7 +112,7 @@ int main(int argc, char **argv)
 		{ "gid_show", 'g', 0, NULL, "show gid address only"},
 		{ "lid_show", 'l', 0, NULL, "show lid range only"},
 		{ "Lid_show", 'L', 0, NULL, "show lid range (in decimal) only"},
-		{}
+		{ 0 }
 	};
 	char usage_args[] = "[<lid|dr_path|guid>]";
 	const char *usage_examples[] = {


From sean.hefty at intel.com  Tue Feb 17 14:32:38 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:32:38 -0800
Subject: [ofa-general] [PATCH 4/8] [ib-diag] perfquery: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <C54CFEC70B6746119DC693626CBDB2E6@amr.corp.intel.com>

Allow perfquery to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/perfquery.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 8786027..6292743 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -353,7 +353,7 @@ int main(int argc, char **argv)
 		{ "loop_ports", 'l', 0, NULL, "iterate through each port" },
 		{ "reset_after_read", 'r', 0, NULL, "reset counters after read" },
 		{ "Reset_only", 'R', 0, NULL, "only reset counters" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = " [<lid|guid> [[port] [reset_mask]]]";
 	const char *usage_examples[] = {


From weiny2 at llnl.gov  Tue Feb 17 14:28:59 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 14:28:59 -0800
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions
	which use pthread
In-Reply-To: <f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
Message-ID: <20090217142859.9e7a7e22.weiny2@llnl.gov>

On Tue, 17 Feb 2009 16:12:12 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Tue, Feb 17, 2009 at 12:19 PM,  <weiny2 at llnl.gov> wrote:
> > Quoting Hal Rosenstock <hal.rosenstock at gmail.com>:
> >
> >> Sasha,
> >>
> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky  <sashak at voltaire.com>
> >> wrote:
> >>>
> >>> I looked at implementation of safe_*() functions (safe_smp_query,
> >>> safe_smp_set and safe_ca_call) and found that they are not actually
> >>> "safe" as declared by its names. The only thread-unsafe thing which
> >>> is used there is static 'mad_portid' structure (from rpc.c),
> >>
> >> I'm not sure that the only thread unsafe thing in the mad rpc
> >> mechanism is the portid.
> >>
> >>> but modification of this structure is not protected by same mutex
> >>> (actually
> >>> not protected at all).
> >>
> >> A first step would be removing the portid as static. If so, portid
> >> would need to be a supplied parameter to various mad routines and the
> >> existing ones relying on madrpc_portid would be deprecated. Does this
> >> make sense to do ? Would you accept such a patch ?
> >>
> 
> > Don't we already have an interface like this with mad_rpc_open_port?
> 
> I'm not sure this was carried all the way through (The basic building
> blocks are there but I think some additional routines are needed).
> 
> Shouldn't the in tree clients be converted over and the old routines
> deprecated ?

For utilities which run once through I think the old functions work just fine.
However, it is pretty confusing which interface to use...  [or even that there
are 2 interfaces, but I digress] (see below)

> 
> > I don't like the void * return but it is "struct ibmadb_port" under the hood.
> 
> Is access into that currently opaque struct needed for something by
> the clients of the library ?

There is nothing the clients need to access but it would be much better to
return some named data type.  This along with some documentation would clarify
what the difference between madrpc and mad_rpc really is.  Furthermore, a
named type will help to "self document" other functions like "mad_rpc".  For
example:

   void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
	      void *payload, void *rcvdata);

Oh now I found it...  Check out smp_[query|set]_via...  Here the interface
changes the parameter name and one has no idea what the type is (without
looking at the code that is! ;-)

   uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
		       unsigned mod, unsigned timeout, const void *srcport);
                                                   ^^^^

   uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
		     unsigned timeout, const void *srcport);
                                   ^^^^
And here is one more...
   int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
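
A minimal sketch of the named-type idea, in case it helps the
discussion. None of these names exist in libibmad; demo_port_t and the
demo_* functions are hypothetical, and the struct body is shown only so
the example is self-contained (in a real library it would stay private
to the .c file).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; not the libibmad API. */
typedef struct demo_port demo_port_t;	/* opaque to callers */

/* A real library would keep this definition out of the public header. */
struct demo_port {
	int port_id;
	int timeout_ms;
};

static demo_port_t *demo_open_port(int port_id)
{
	demo_port_t *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->port_id = port_id;
	p->timeout_ms = 1000;
	return p;
}

static void demo_close_port(demo_port_t *p)
{
	free(p);
}

/* The prototype now documents what srcport is. Passing an unrelated
 * pointer type draws a compiler diagnostic, where a void * parameter
 * would accept it silently. */
static int demo_query_via(const demo_port_t *srcport, unsigned attrid)
{
	assert(srcport != NULL);
	return srcport->port_id * 1000 + (int) attrid;
}
```

Callers still cannot see inside the handle, so nothing about the
library's freedom to change the struct is lost.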

> 
> > Are those calls which use it not thread safe?
> 
> They look OK but I'm not 100% sure yet.

Yea, they look thread safe but I am not sure either.  :-(

I would be in favor of making all the utils use mad_rpc_open_port but it is up
to Sasha if we go down this path.

Ira

> 
> -- Hal
> 
> > Ira
> >
> >
> >> -- Hal
> >>
> >>> As far as I know nothing uses those safe_*() primitives right now outside
> >>> libibmad, so I think it is better to remove this confused functions from
> >>> API (with changing library version, etc.).
> >>>
> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to
> >>> hidden static pthread mutex which is not controlled by caller
> >>> application. I think that it will be more robust for multithreaded
> >>> application to use its own synchronization methods (pthread mutex or any
> >>> other) for better control. So let's remove madrpc_lock/unlock() too.
> >>>
> >>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> >>> ---
> >>>  libibmad/include/infiniband/mad.h |   41
> >>>  -------------------------------------
> >>>  libibmad/libibmad.ver             |    2 +-
> >>>  libibmad/src/libibmad.map         |    2 -
> >>>  libibmad/src/rpc.c                |   15 -------------
> >>>  libibmad/src/sa.c                 |    5 ++-
> >>>  5 files changed, 4 insertions(+), 61 deletions(-)
> >>>
> >>> diff --git a/libibmad/include/infiniband/mad.h
> >>>  b/libibmad/include/infiniband/mad.h
> >>> index eff6738..89b4be5 100644
> >>> --- a/libibmad/include/infiniband/mad.h
> >>> +++ b/libibmad/include/infiniband/mad.h
> >>> @@ -703,8 +703,6 @@ void *  madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t
> >>>  *dport, ib_rmpp_hdr_t *rmpp,
> >>>  void   madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> >>>                   int num_classes);
> >>>  void   madrpc_save_mad(void *madbuf, int len);
> >>> -void   madrpc_lock(void);
> >>> -void   madrpc_unlock(void);
> >>>  void   madrpc_show_errors(int set);
> >>>
> >>>  void * mad_rpc_open_port(char *dev_name, int dev_port, int
> >>> *mgmt_classes,
> >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t  *id,
> >>> unsigned attrid,
> >>>  uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid,
> >>>  unsigned mod,
> >>>                     unsigned timeout, const void *srcport);
> >>>
> >>> -inline static uint8_t *
> >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
> >>>  unsigned mod,
> >>> -              unsigned timeout)
> >>> -{
> >>> -       uint8_t *p;
> >>> -
> >>> -       madrpc_lock();
> >>> -       p = smp_query(rcvbuf, portid, attrid, mod, timeout);
> >>> -       madrpc_unlock();
> >>> -
> >>> -       return p;
> >>> -}
> >>> -
> >>> -inline static uint8_t *
> >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
> >>>  unsigned mod,
> >>> -            unsigned timeout)
> >>> -{
> >>> -       uint8_t *p;
> >>> -
> >>> -       madrpc_lock();
> >>> -       p = smp_set(rcvbuf, portid, attrid, mod, timeout);
> >>> -       madrpc_unlock();
> >>> -
> >>> -       return p;
> >>> -}
> >>> -
> >>>  /* sa.c */
> >>>  uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
> >>>                 unsigned timeout);
> >>> @@ -761,19 +733,6 @@ int        ib_path_query(ibmad_gid_t srcgid,
> >>>  ibmad_gid_t destgid, ib_portid_t *sm_id,
> >>>  int    ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> >>>                         ibmad_gid_t destgid, ib_portid_t *sm_id,  void
> >>> *buf);
> >>>
> >>> -inline static uint8_t *
> >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
> >>> -            unsigned timeout)
> >>> -{
> >>> -       uint8_t *p;
> >>> -
> >>> -       madrpc_lock();
> >>> -       p = sa_call(rcvbuf, portid, sa, timeout);
> >>> -       madrpc_unlock();
> >>> -
> >>> -       return p;
> >>> -}
> >>> -
> >>>  /* resolve.c */
> >>>  int    ib_resolve_smlid(ib_portid_t *sm_id, int timeout);
> >>>  int    ib_resolve_guid(ib_portid_t *portid, uint64_t *guid,
> >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver
> >>> index 7e93c16..23d2dc2 100644
> >>> --- a/libibmad/libibmad.ver
> >>> +++ b/libibmad/libibmad.ver
> >>> @@ -6,4 +6,4 @@
> >>>  # API_REV - advance on any added API
> >>>  # RUNNING_REV - advance any change to the vendor files
> >>>  # AGE - number of backward versions the API still supports
> >>> -LIBVERSION=5:0:4
> >>> +LIBVERSION=2:0:0
> >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> >>> index 927e51c..f944d86 100644
> >>> --- a/libibmad/src/libibmad.map
> >>> +++ b/libibmad/src/libibmad.map
> >>> @@ -72,14 +72,12 @@ IBMAD_1.3 {
> >>>               madrpc;
> >>>               madrpc_def_timeout;
> >>>               madrpc_init;
> >>> -               madrpc_lock;
> >>>               madrpc_portid;
> >>>               madrpc_rmpp;
> >>>               madrpc_save_mad;
> >>>               madrpc_set_retries;
> >>>               madrpc_set_timeout;
> >>>               madrpc_show_errors;
> >>> -               madrpc_unlock;
> >>>               ib_path_query;
> >>>               sa_call;
> >>>               sa_rpc_call;
> >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> >>> index 5226540..670a936 100644
> >>> --- a/libibmad/src/rpc.c
> >>> +++ b/libibmad/src/rpc.c
> >>> @@ -38,7 +38,6 @@
> >>>  #include <stdio.h>
> >>>  #include <stdlib.h>
> >>>  #include <unistd.h>
> >>> -#include <pthread.h>
> >>>  #include <string.h>
> >>>  #include <errno.h>
> >>>
> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport,
> >>>  ib_rmpp_hdr_t *rmpp, void *data)
> >>>       return mad_rpc_rmpp(&port, rpc, dport, rmpp, data);
> >>>  }
> >>>
> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER;
> >>> -
> >>> -void
> >>> -madrpc_lock(void)
> >>> -{
> >>> -       pthread_mutex_lock(&rpclock);
> >>> -}
> >>> -
> >>> -void
> >>> -madrpc_unlock(void)
> >>> -{
> >>> -       pthread_mutex_unlock(&rpclock);
> >>> -}
> >>> -
> >>>  void
> >>>  madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int
> >>>  num_classes)
> >>>  {
> >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> >>> index 27b9d52..c601254 100644
> >>> --- a/libibmad/src/sa.c
> >>> +++ b/libibmad/src/sa.c
> >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport,  ibmad_gid_t
> >>> srcgid, ibmad_gid_t destgid,
> >>>       if (srcport) {
> >>>               p = sa_rpc_call (srcport, buf, sm_id, &sa, 0);
> >>>       } else {
> >>> -               p = safe_sa_call(buf, sm_id, &sa, 0);
> >>> +               p = sa_call(buf, sm_id, &sa, 0);
> >>>       }
> >>>       if (!p) {
> >>>               IBWARN("sa call path_query failed");
> >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport,  ibmad_gid_t
> >>> srcgid, ibmad_gid_t destgid,
> >>>       mad_decode_field(p, IB_SA_PR_DLID_F, &dlid);
> >>>       return dlid;
> >>>  }
> >>> +
> >>>  int
> >>>  ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t
> >>>  *sm_id, void *buf)
> >>>  {
> >>> -       return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf);
> >>> +       return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf);
> >>>  }
> >>> --
> >>> 1.6.0.4.766.g6fc4a
> >>>
> >>> _______________________________________________
> >>> general mailing list
> >>> general at lists.openfabrics.org
> >>> http://  lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>>
> >>> To unsubscribe, please visit http:// 
> >>>  openib.org/mailman/listinfo/openib-general
> >>>
> >>
> >>
> >
> >
> >
> >
> 


-- 
Ira Weiny <weiny2 at llnl.gov>


From sean.hefty at intel.com  Tue Feb 17 14:33:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:33:09 -0800
Subject: [ofa-general] [PATCH 5/8] [ib-diag] ibportstate: add support for
	WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <134CCD11D025456C86E0BB067B25A0CD@amr.corp.intel.com>

Allow ibportstate to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibportstate.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index d1a112b..c0b9b34 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -311,12 +311,12 @@ int main(int argc, char **argv)
 					/* Setup portid for peer port */
 					memcpy(&peerportid, &portid, sizeof(peerportid));
 					peerportid.drpath.cnt = 1;
-					peerportid.drpath.p[1] = portnum;
+					peerportid.drpath.p[1] = (uint8_t) portnum;
 
 					/* Set DrSLID to local lid */
 					if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
 						IBERROR("could not resolve self");
-					peerportid.drpath.drslid = selfportid.lid;
+					peerportid.drpath.drslid = (uint16_t) selfportid.lid;
 					peerportid.drpath.drdlid = 0xffff;
 
 					/* Get peer port NodeInfo to obtain peer port number */

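The two casts in this patch make the narrowing conversions explicit. Some
compilers warn when an int is implicitly assigned to a narrower field; that
is presumably the motivation here, though the patch does not say so. A
minimal sketch of the same idea with a range check added; set_path_port is
a hypothetical helper and not part of infiniband-diags:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: narrow an int port number into the uint8_t slot
 * of a directed-route path. The range is checked first so the cast
 * cannot silently truncate. Returns 0 on success, -1 if out of range. */
static int set_path_port(uint8_t *slot, int portnum)
{
	assert(slot != NULL);
	if (portnum < 0 || portnum > UINT8_MAX)
		return -1;
	*slot = (uint8_t) portnum;	/* explicit cast, as in the patch */
	return 0;
}
```

The cast alone only documents intent; whether a range check like this is
worthwhile depends on where portnum comes from.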

From sean.hefty at intel.com  Tue Feb 17 14:35:40 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:35:40 -0800
Subject: [ofa-general] [PATCH 6/8] [ib-diag] ibstat: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <74CAD5D6EE354A18A32A84C742458C90@amr.corp.intel.com>

Allow ibstat to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Patch is also attached.  Given the lengths of the lines in the code, I'm
guessing that my mailer may wrap the lines.  Patch is also available
through my ib-mgmt.git tree.

 infiniband-diags/src/ibstat.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 5add690..7985be1 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -62,8 +62,8 @@ ca_dump(umad_ca_t *ca)
 {
 	if (!ca->node_type)
 		return;
-	printf("%s '%s'\n", ((uint)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"), ca->ca_name);
-	printf("\t%s type: %s\n", ((uint)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"),ca->ca_type);
+	printf("%s '%s'\n", ((unsigned)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"), ca->ca_name);
+	printf("\t%s type: %s\n", ((unsigned)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"),ca->ca_type);
 	printf("\tNumber of ports: %d\n", ca->numports);
 	printf("\tFirmware version: %s\n", ca->fw_ver);
 	printf("\tHardware version: %s\n", ca->hw_ver);
@@ -105,13 +105,13 @@ port_dump(umad_port_t *port, int alone)
 	}
 
 	printf("%sPort %d:\n", hdrpre, port->portnum);
-	printf("%sState: %s\n", pre, (uint)port->state <= 4 ? port_state_str[port->state] : "???");
-	printf("%sPhysical state: %s\n", pre, (uint)port->state <= 7 ? port_phy_state_str[port->phys_state] : "???");
+	printf("%sState: %s\n", pre, (unsigned)port->state <= 4 ? port_state_str[port->state] : "???");
+	printf("%sPhysical state: %s\n", pre, (unsigned)port->state <= 7 ? port_phy_state_str[port->phys_state] : "???");
 	printf("%sRate: %d\n", pre, port->rate);
 	printf("%sBase lid: %d\n", pre, port->base_lid);
 	printf("%sLMC: %d\n", pre, port->lmc);
 	printf("%sSM lid: %d\n", pre, port->sm_lid);
-	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohl(port->capmask));
+	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
 	printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid));
 	return 0;
 }
@@ -131,11 +131,11 @@ ca_stat(char *ca_name, int portnum, int no_ports)
 	if (!no_ports && portnum >= 0) {
 		if (portnum > ca.numports || !ca.ports[portnum]) {
 			IBWARN("%s: '%s' has no port number %d - max (%d)",
-				((uint)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"),
+				((unsigned)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"),
 				ca_name, portnum, ca.numports);
 			return -1;
 		}
-		printf("%s: '%s'\n", ((uint)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), ca.ca_name);
+		printf("%s: '%s'\n", ((unsigned)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), ca.ca_name);
 		port_dump(ca.ports[portnum], 1);
 		return 0;
 	}
@@ -200,7 +200,7 @@ int main(int argc, char *argv[])
 		{ "list_of_cas", 'l', 0, NULL, "list all IB devices" },
 		{ "short", 's', 0, NULL, "short output" },
 		{ "port_list", 'p', 0, NULL, "show port list" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<ca_name> [portnum]";
 	const char *usage_examples[] = {


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 06-win-ibstat
Type: application/octet-stream
Size: 3330 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090217/1bba9a02/attachment.obj>

From sean.hefty at intel.com  Tue Feb 17 14:36:20 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:36:20 -0800
Subject: [ofa-general] [PATCH 7/8] [ib-diags] smpdump: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <B54048123DBA4D1F8FDB835F9A6FFA74@amr.corp.intel.com>

Allow smpdump to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/smpdump.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index 8618121..6c7f84c 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -102,7 +102,7 @@ drsmp_get_init(void *umad, DRPath *path, int attr, int mod)
 	if (path)
 		memcpy(smp->initial_path, path->path, path->hop_cnt+1);
 
-	smp->hop_cnt = path->hop_cnt;
+	smp->hop_cnt = (uint8_t) path->hop_cnt;
 }
 
 void
@@ -146,7 +146,7 @@ drsmp_set_init(void *umad, DRPath *path, int attr, int mod, void *data)
 	if (data)
 		memcpy(smp->data, data, sizeof smp->data);
 
-	smp->hop_cnt = path->hop_cnt;
+	smp->hop_cnt = (uint8_t) path->hop_cnt;
 }
 
 char *
@@ -172,7 +172,7 @@ str2DRPath(char *str, DRPath *path)
 	while (str && *str) {
 		if ((s = strchr(str, ',')))
 			*s = 0;
-		path->path[++path->hop_cnt] = atoi(str);
+		path->path[++path->hop_cnt] = (char) atoi(str);
 		if (!s)
 			break;
 		str = s+1;
@@ -221,7 +221,7 @@ int main(int argc, char *argv[])
 
 	const struct ibdiag_opt opts[] = {
 		{ "sring", 's', 0, NULL, ""},
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<dlid|dr_path> <attr> [mod]";
 	const char *usage_examples[] = {

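On the "{ }" to "{ 0 }" change: an empty initializer list is a GNU
extension before C23, while "{ 0 }" is standard C and zero-initializes every
member. A minimal sketch of a zero-terminated table follows. struct demo_opt
and count_opts are hypothetical stand-ins, and the assumption that the table
is walked until a NULL name is mine, not taken from the ibdiag sources:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the option records used by the diags. */
struct demo_opt {
	const char *name;
	char letter;
	int has_arg;
};

static const struct demo_opt demo_opts[] = {
	{ "short", 's', 0 },
	{ "port_list", 'p', 0 },
	{ 0 }	/* sentinel: all members zero, so name == NULL */
};

/* Count entries up to, but not including, the zeroed sentinel. */
static size_t count_opts(const struct demo_opt *opts)
{
	size_t n = 0;

	assert(opts != NULL);
	while (opts[n].name != NULL)
		n++;
	return n;
}
```

Calling count_opts(demo_opts) walks the two real entries and stops at the
sentinel.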

From sean.hefty at intel.com  Tue Feb 17 14:37:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 14:37:28 -0800
Subject: [ofa-general] [PATCH 8/8] [ib-diags] smpquery: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>

Allow smpquery to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/smpquery.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
index 44280e1..2d3d91b 100644
--- a/infiniband-diags/src/smpquery.c
+++ b/infiniband-diags/src/smpquery.c
@@ -47,7 +47,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -191,7 +191,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc)
 	} else
 		mad_decode_field(data, IB_NODE_PARTITION_CAP_F, &n);
 
-	for (i = 0; i < (n + 31) / 32; i++) {
+	for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) {
 		mod =  i | (portnum << 16);
 		if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0))
 			return "pkey table query failed";
@@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc)
 		return "port info failed";
 	mad_decode_field(data, IB_PORT_GUID_CAP_F, &n);
 
-	for (i = 0; i < (n + 7) / 8; i++) {
+	for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) {
 		mod =  i;
 		if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0))
 			return "guid info query failed";
@@ -412,7 +412,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "combined", 'c', 0, NULL, "use Combined route address argument"},
 		{ "node-name-map", 1, 1, "<file>", "node name map file"},
-		{}
+		{ 0 }
 	};
 	const char *usage_examples[] = {
 		"portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier",


From sashak at voltaire.com  Tue Feb 17 14:45:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 00:45:27 +0200
Subject: [ofa-general] [PATCH] opensm: proper config file rescan
Message-ID: <20090217224527.GV7189@sashak.voltaire.com>


OpenSM now has more config options (once it was QoS parameters only) that
can be changed in the config file "on the fly". However, this introduces
a problem: unconditionally rescanning config parameters from the config
file overwrites command line and console settings, which should be a
"higher priority" user interface. As a result, things like 'opensm
-F ./config.file -v' may not work as expected; in this example '-v'
takes effect only from OpenSM startup until the first sweep starts.

This patch attempts to address the issue. OpenSM first parses the config
file; command line options and console commands can then overwrite those
settings. When OpenSM rescans the config file, it sets only the config
parameters that were changed in the file (not everything, as it does
now). For this, the last copy of the config parameters parsed from the
config file is stored. So the "last user intervention" becomes the
active one.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_subnet.h |    1 +
 opensm/opensm/main.c               |    1 -
 opensm/opensm/osm_subnet.c         |   92 ++++++++++++++++++++----------------
 3 files changed, 52 insertions(+), 42 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 8863e47..2dfccda 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -217,6 +217,7 @@ typedef struct osm_subn_opt {
 	char *node_name_map_name;
 	char *prefix_routes_file;
 	boolean_t consolidate_ipv6_snm_req;
+	struct osm_subn_opt *file_opts; /* used for update */
 } osm_subn_opt_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index a632cd7..e22c2c4 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -508,7 +508,6 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
 /**********************************************************************
  **********************************************************************/
 #define SET_STR_OPT(opt, val) do { \
-	if (opt) free(opt); \
 	opt = val ? strdup(val) : NULL ; \
 } while (0)
 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index f12685e..01478be 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -77,7 +77,8 @@ typedef struct opt_rec {
 	const char *name;
 	unsigned long opt_offset;
 	void (*parse_fn)(osm_subn_t *p_subn, char *p_key, char *p_val_str,
-			 void *p_val, void (*)(osm_subn_t *, void *));
+			 void *p_val1, void *p_val2,
+			 void (*)(osm_subn_t *, void *));
 	void (*setup_fn)(osm_subn_t *p_subn, void *p_val);
 	int  can_update;
 } opt_rec_t;
@@ -151,102 +152,102 @@ static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val)
 }
 
 static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
+			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
 {
-	uint64_t *p_val = p_v;
+	uint64_t *p_val1 = p_v1, *p_val2 = p_v2;
 	uint64_t val = strtoull(p_val_str, NULL, 0);
 
-	if (cl_hton64(val) != *p_val) {
+	if (cl_hton64(val) != *p_val1) {
 		log_config_value(p_key, "0x%016" PRIx64, val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = cl_ntoh64(val);
+		*p_val1 = *p_val2 = cl_ntoh64(val);
 	}
 }
 
 static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key,
-			      IN char *p_val_str, IN void *p_v,
+			      IN char *p_val_str, void *p_v1, void *p_v2,
 			      void (*pfn)(osm_subn_t *, void *))
 {
-	uint32_t *p_val = p_v;
+	uint32_t *p_val1 = p_v1, *p_val2 = p_v2;
 	uint32_t val = strtoul(p_val_str, NULL, 0);
 
-	if (val != *p_val) {
+	if (val != *p_val1) {
 		log_config_value(p_key, "%u", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = val;
+		*p_val1 = *p_val2 = val;
 	}
 }
 
 static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
+			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
 {
-	int32_t *p_val = p_v;
+	int32_t *p_val1 = p_v1, *p_val2 = p_v2;
 	int32_t val = strtol(p_val_str, NULL, 0);
 
-	if (val != *p_val) {
+	if (val != *p_val1) {
 		log_config_value(p_key, "%d", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = val;
+		*p_val1 = *p_val2 = val;
 	}
 }
 
 static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key,
-			      IN char *p_val_str, IN void *p_v,
+			      IN char *p_val_str, void *p_v1, void *p_v2,
 			      void (*pfn)(osm_subn_t *, void *))
 {
-	uint16_t *p_val = p_v;
+	uint16_t *p_val1 = p_v1, *p_val2 = p_v2;
 	uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0);
 
-	if (val != *p_val) {
+	if (val != *p_val1) {
 		log_config_value(p_key, "%u", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = val;
+		*p_val1 = *p_val2 = val;
 	}
 }
 
 static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
+			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
 {
-	uint16_t *p_val = p_v;
+	uint16_t *p_val1 = p_v1, *p_val2 = p_v2;
 	uint16_t val = strtoul(p_val_str, NULL, 0);
 
 	CL_ASSERT(val < 0x10000);
-	if (cl_hton16(val) != *p_val) {
+	if (cl_hton16(val) != *p_val1) {
 		log_config_value(p_key, "0x%04x", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = cl_hton16(val);
+		*p_val1 = *p_val2 = cl_hton16(val);
 	}
 }
 
 static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
+			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
 {
-	uint8_t *p_val = p_v;
+	uint8_t *p_val1 = p_v1, *p_val2 = p_v2;
 	uint8_t val = strtoul(p_val_str, NULL, 0);
 
 	CL_ASSERT(val < 0x100);
-	if (val != *p_val) {
+	if (val != *p_val1) {
 		log_config_value(p_key, "%u", val);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = val;
+		*p_val1 = *p_val2 = val;
 	}
 }
 
 static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key,
-			       IN char *p_val_str, IN void *p_v,
+			       IN char *p_val_str, void *p_v1, void *p_v2,
 			       void (*pfn)(osm_subn_t *, void *))
 {
-	boolean_t *p_val = p_v;
+	boolean_t *p_val1 = p_v1, *p_val2 = p_v2;
 	boolean_t val;
 
 	if (!p_val_str)
@@ -257,20 +258,20 @@ static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key,
 	else
 		val = TRUE;
 
-	if (val != *p_val) {
+	if (val != *p_val1) {
 		log_config_value(p_key, "%s", p_val_str);
 		if (pfn)
 			pfn(p_subn, &val);
-		*p_val = val;
+		*p_val1 = *p_val2 = val;
 	}
 }
 
 static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key,
-			     IN char *p_val_str, IN void *p_v,
+			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
 {
-	char **p_val = p_v;
-	const char *current_str = *p_val ? *p_val : null_str ;
+	char **p_val1 = p_v1, **p_val2 = p_v2;
+	const char *current_str = *p_val1 ? *p_val1 : null_str ;
 
 	if (p_val_str && strcmp(p_val_str, current_str)) {
 		char *new;
@@ -279,9 +280,11 @@ static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key,
 		new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL;
 		if (pfn)
 			pfn(p_subn, new);
-		if (*p_val)
-			free(*p_val);
-		*p_val = new;
+		if (*p_val1 && *p_val1 != *p_val2)
+			free(*p_val1);
+		if (*p_val2)
+			free(*p_val2);
+		*p_val1 = *p_val2 = new;
 	}
 }
 
@@ -1121,7 +1124,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 	FILE *opts_file;
 	char *p_key, *p_val;
 	const opt_rec_t *r;
-	void *p_field;
+	void *p_field1, *p_field2;
 
 	opts_file = fopen(file_name, "r");
 	if (!opts_file) {
@@ -1136,6 +1139,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 	cl_log_event("OpenSM", CL_LOG_INFO, line, NULL, 0);
 
 	p_opts->config_file = file_name;
+	if (!p_opts->file_opts && !(p_opts->file_opts = malloc(sizeof(*p_opts))))
+		return -1;
+	memcpy(p_opts->file_opts, p_opts, sizeof(*p_opts));
 
 	while (fgets(line, 1023, opts_file) != NULL) {
 		/* get the first token */
@@ -1149,9 +1155,11 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 			if (strcmp(r->name, p_key))
 				continue;
 
-			p_field = (void *)p_opts + r->opt_offset;
+			p_field1 = (void *)p_opts->file_opts + r->opt_offset;
+			p_field2 = (void *)p_opts + r->opt_offset;
 			/* don't call setup function first time */
-			r->parse_fn(NULL, p_key, p_val, p_field, NULL);
+			r->parse_fn(NULL, p_key, p_val, p_field1, p_field2,
+				    NULL);
 			break;
 		}
 	}
@@ -1169,7 +1177,7 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 	const opt_rec_t *r;
 	FILE *opts_file;
 	char *p_key, *p_val;
-	void *p_field;
+	void *p_field1, *p_field2;
 
 	if (!p_opts->config_file)
 		return 0;
@@ -1202,8 +1210,10 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 			if (!r->can_update || strcmp(r->name, p_key))
 				continue;
 
-			p_field = (void *)p_opts + r->opt_offset;
-			r->parse_fn(p_subn, p_key, p_val, p_field, r->setup_fn);
+			p_field1 = (void *)p_opts->file_opts + r->opt_offset;
+			p_field2 = (void *)p_opts + r->opt_offset;
+			r->parse_fn(p_subn, p_key, p_val, p_field1, p_field2,
+				    r->setup_fn);
 			break;
 		}
 	}
-- 
1.6.1.2.319.gbd9e

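The rescan rule described in the patch above can be sketched in isolation:
keep a shadow copy of what the file last said, and touch the active setting
only when the file's value differs from that shadow. This is a simplified
illustration, not the OpenSM code; struct demo_opts, its fields, and
apply_file_u32 are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option set. One instance mirrors the values last read
 * from the config file, another holds the values actually in use. */
struct demo_opts {
	uint32_t sweep_interval;
	uint32_t log_flags;
};

/* Apply one value parsed from the file. The active setting changes only
 * if the parsed value differs from the last value seen in the file, so
 * a command line or console override survives an unchanged rescan.
 * Returns 1 if the setting was updated, 0 otherwise. */
static int apply_file_u32(uint32_t *file_val, uint32_t *active_val,
			  uint32_t parsed)
{
	assert(file_val != NULL && active_val != NULL);
	if (parsed == *file_val)
		return 0;
	*file_val = *active_val = parsed;
	return 1;
}
```

With this rule, a console override of log_flags stays in effect across
rescans until someone edits that option in the file, at which point the
file wins again.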

From rdreier at cisco.com  Tue Feb 17 14:54:36 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Feb 2009 14:54:36 -0800
Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp,
	do path_free only for newly-created paths
In-Reply-To: <200902171701.36107.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 17 Feb 2009 17:01:35 +0200")
References: <200902171701.36107.jackm@dev.mellanox.co.il>
Message-ID: <adad4dg65eb.fsf@cisco.com>

thanks, applied...

 > Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 > Signed-off-by: Moni Shua <monis at voltaire.com>

This doesn't make any sense... Moni was not involved in sending this
patch at all, and in any case since you are sending the patch your s-o-b
should be last.  If you want to give credit to Moni then include it in
the description as you did for Yossi.

 > I ran checkpatch.pl on this, and compiled it with Sparse.  However, I would still like to continue
 > using KMail.  If you have any editing/formatting problems with the patch, please let me know.
 > The patch was generated by git diff against your kernel git/master branch.

Everything came through fine so no problem with your MUA.

 - R.


From hal.rosenstock at gmail.com  Tue Feb 17 14:56:26 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 17:56:26 -0500
Subject: [ofa-general] opensm/osm_inform.c:__match_inf_rec
	question
Message-ID: <f0e08f230902171456l6732e2c6tabf6803013c0a9b3@mail.gmail.com>

In opensm/osm_inform.c:__match_inf_rec, around line 123, there is:

        /* if inform_info.gid is not zero, ignore lid range */
        if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
                    sizeof(p_infr_rec->inform_record.inform_info.gid))) {

Shouldn't this be if (memcmp) rather than if (!memcmp) ?

-- Hal
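
For reference on the question above: memcmp() returns 0 when the two
buffers are equal, so the "!memcmp(...)" condition holds exactly when the
GID is all zero. Whether that agrees with the comment depends on what the
branch body does, which is not shown here. A minimal sketch of the
semantics; demo_gid_t and gid_is_zero are hypothetical stand-ins, not the
OpenSM types:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte GID stand-in; the real type is in the IB headers. */
typedef struct demo_gid {
	uint8_t raw[16];
} demo_gid_t;

/* Returns 1 when every byte of the GID is zero, 0 otherwise.
 * memcmp() returns 0 on equality, so "!memcmp(...)" means "equal". */
static int gid_is_zero(const demo_gid_t *gid)
{
	static const demo_gid_t all_zero_gid = { { 0 } };

	assert(gid != NULL);
	return !memcmp(gid, &all_zero_gid, sizeof(*gid));
}
```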


From hal.rosenstock at gmail.com  Tue Feb 17 15:21:02 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 18:21:02 -0500
Subject: Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090217142859.9e7a7e22.weiny2@llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
Message-ID: <f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>

On 2/17/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Tue, 17 Feb 2009 16:12:12 -0500
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>> On Tue, Feb 17, 2009 at 12:19 PM,  <weiny2 at llnl.gov> wrote:
>> > Quoting Hal Rosenstock <hal.rosenstock at gmail.com>:
>> >
>> >> Sasha,
>> >>
>> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky
>> >> <sashak at voltaire.com>
>> >> wrote:
>> >>>
>> >>> I looked at implementation of safe_*() functions (safe_smp_query,
>> >>> safe_smp_set and safe_ca_call) and found that they are not actually
>> >>> "safe" as declared by its names. The only thread-unsafe thing which
>> >>> is used there is static 'mad_portid' structure (from rpc.c),
>> >>
>> >> I'm not sure that the only thread unsafe thing in the mad rpc
>> >> mechanism is the portid.
>> >>
>> >>> but modification of this structure is not protected by same mutex
>> >>> (actually
>> >>> not protected at all).
>> >>
>> >> A first step would be removing the portid as static. If so, portid
>> >> would need to be a supplied parameter to various mad routines and the
>> >> existing ones relying on madrpc_portid would be deprecated. Does this
>> >> make sense to do ? Would you accept such a patch ?
>> >>
>>
>> > Don't we already have an interface like this with mad_rpc_open_port?
>>
>> I'm not sure this was carried all the way through (The basic building
>> blocks are there but I think some additional routines are needed).
>>
>> Shouldn't the in tree clients be converted over and the old routines
>> deprecated ?
>
> For utilities which run once through I think the old functions work just
> fine.

Well, sort of... Aren't mad_portid "collisions" possible when multiple
programs are run concurrently ?

> However, it is pretty confusing which interface to use...  [or even that
> there are 2 interfaces, but I digress] (see below)

I don't think the newer improved interfaces were ever documented.

>> > I don't like the void * return but it is "struct ibmadb_port" under the
>> > hood.
>>
>> Is access into that currently opaque struct needed for something by
>> the clients of the library ?
>
> There is nothing the clients need to access but it would be much better to
> return some named data type.  This along with some documentation would
> clarify what the difference between madrpc and mad_rpc really is.
> Furthermore, a named type will help to "self document" other functions
> like "mad_rpc".  For example:
>
>    void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc,
> 	      ib_portid_t * dport, void *payload, void *rcvdata);
>
> Oh now I found it...  Check out smp_[query|set]_via...  Here the interface
> changes the parameter name and one has no idea what the type is (without
> looking at the code that is! ;-)
>
>    uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> 		       unsigned mod, unsigned timeout, const void *srcport);
>                                                    ^^^^
>
>    uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid,
> unsigned mod,
> 		     unsigned timeout, const void *srcport);
>                                    ^^^^
> And here is one more...
>    int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);

Are you referring to how srcport is used to select either the old madrpc
or the newer mad_rpc API, and that with the newer one srcport is really a
pointer to a struct ibmad_port ?

-- Hal

>> > Are those calls which use it not thread safe?
>>
>> They look OK but I'm not 100% sure yet.
>
> Yea, they look thread safe but I am not sure either.  :-(
>
> I would be in favor of making all the utils use mad_rpc_open_port but it is
> up to Sasha if we go down this path.
>
> Ira
>
>>
>> -- Hal
>>
>> > Ira
>> >
>> >
>> >> -- Hal
>> >>
>> >>> As far as I know nothing uses those safe_*() primitives right now
>> >>> outside
>> >>> libibmad, so I think it is better to remove this confused functions
>> >>> from
>> >>> API (with changing library version, etc.).
>> >>>
>> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to
>> >>> hidden static pthread mutex which is not controlled by caller
>> >>> application. I think that it will be more robust for multithreaded
>> >>> application to use its own synchronization methods (pthread mutex or
>> >>> any
>> >>> other) for better control. So let's remove madrpc_lock/unlock() too.
>> >>>
>> >>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>> >>> ---
>> >>>  libibmad/include/infiniband/mad.h |   41
>> >>>  -------------------------------------
>> >>>  libibmad/libibmad.ver             |    2 +-
>> >>>  libibmad/src/libibmad.map         |    2 -
>> >>>  libibmad/src/rpc.c                |   15 -------------
>> >>>  libibmad/src/sa.c                 |    5 ++-
>> >>>  5 files changed, 4 insertions(+), 61 deletions(-)
>> >>>
>> >>> diff --git a/libibmad/include/infiniband/mad.h
>> >>>  b/libibmad/include/infiniband/mad.h
>> >>> index eff6738..89b4be5 100644
>> >>> --- a/libibmad/include/infiniband/mad.h
>> >>> +++ b/libibmad/include/infiniband/mad.h
>> >>> @@ -703,8 +703,6 @@ void *  madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t
>> >>>  *dport, ib_rmpp_hdr_t *rmpp,
>> >>>  void   madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
>> >>>                   int num_classes);
>> >>>  void   madrpc_save_mad(void *madbuf, int len);
>> >>> -void   madrpc_lock(void);
>> >>> -void   madrpc_unlock(void);
>> >>>  void   madrpc_show_errors(int set);
>> >>>
>> >>>  void * mad_rpc_open_port(char *dev_name, int dev_port, int
>> >>> *mgmt_classes,
>> >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t
>> >>> *id,
>> >>> unsigned attrid,
>> >>>  uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid,
>> >>>  unsigned mod,
>> >>>                     unsigned timeout, const void *srcport);
>> >>>
>> >>> -inline static uint8_t *
>> >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
>> >>>  unsigned mod,
>> >>> -              unsigned timeout)
>> >>> -{
>> >>> -       uint8_t *p;
>> >>> -
>> >>> -       madrpc_lock();
>> >>> -       p = smp_query(rcvbuf, portid, attrid, mod, timeout);
>> >>> -       madrpc_unlock();
>> >>> -
>> >>> -       return p;
>> >>> -}
>> >>> -
>> >>> -inline static uint8_t *
>> >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
>> >>>  unsigned mod,
>> >>> -            unsigned timeout)
>> >>> -{
>> >>> -       uint8_t *p;
>> >>> -
>> >>> -       madrpc_lock();
>> >>> -       p = smp_set(rcvbuf, portid, attrid, mod, timeout);
>> >>> -       madrpc_unlock();
>> >>> -
>> >>> -       return p;
>> >>> -}
>> >>> -
>> >>>  /* sa.c */
>> >>>  uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t
>> >>> *sa,
>> >>>                 unsigned timeout);
>> >>> @@ -761,19 +733,6 @@ int        ib_path_query(ibmad_gid_t srcgid,
>> >>>  ibmad_gid_t destgid, ib_portid_t *sm_id,
>> >>>  int    ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
>> >>>                         ibmad_gid_t destgid, ib_portid_t *sm_id,  void
>> >>> *buf);
>> >>>
>> >>> -inline static uint8_t *
>> >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
>> >>> -            unsigned timeout)
>> >>> -{
>> >>> -       uint8_t *p;
>> >>> -
>> >>> -       madrpc_lock();
>> >>> -       p = sa_call(rcvbuf, portid, sa, timeout);
>> >>> -       madrpc_unlock();
>> >>> -
>> >>> -       return p;
>> >>> -}
>> >>> -
>> >>>  /* resolve.c */
>> >>>  int    ib_resolve_smlid(ib_portid_t *sm_id, int timeout);
>> >>>  int    ib_resolve_guid(ib_portid_t *portid, uint64_t *guid,
>> >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver
>> >>> index 7e93c16..23d2dc2 100644
>> >>> --- a/libibmad/libibmad.ver
>> >>> +++ b/libibmad/libibmad.ver
>> >>> @@ -6,4 +6,4 @@
>> >>>  # API_REV - advance on any added API
>> >>>  # RUNNING_REV - advance any change to the vendor files
>> >>>  # AGE - number of backward versions the API still supports
>> >>> -LIBVERSION=5:0:4
>> >>> +LIBVERSION=2:0:0
>> >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
>> >>> index 927e51c..f944d86 100644
>> >>> --- a/libibmad/src/libibmad.map
>> >>> +++ b/libibmad/src/libibmad.map
>> >>> @@ -72,14 +72,12 @@ IBMAD_1.3 {
>> >>>               madrpc;
>> >>>               madrpc_def_timeout;
>> >>>               madrpc_init;
>> >>> -               madrpc_lock;
>> >>>               madrpc_portid;
>> >>>               madrpc_rmpp;
>> >>>               madrpc_save_mad;
>> >>>               madrpc_set_retries;
>> >>>               madrpc_set_timeout;
>> >>>               madrpc_show_errors;
>> >>> -               madrpc_unlock;
>> >>>               ib_path_query;
>> >>>               sa_call;
>> >>>               sa_rpc_call;
>> >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
>> >>> index 5226540..670a936 100644
>> >>> --- a/libibmad/src/rpc.c
>> >>> +++ b/libibmad/src/rpc.c
>> >>> @@ -38,7 +38,6 @@
>> >>>  #include <stdio.h>
>> >>>  #include <stdlib.h>
>> >>>  #include <unistd.h>
>> >>> -#include <pthread.h>
>> >>>  #include <string.h>
>> >>>  #include <errno.h>
>> >>>
>> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport,
>> >>>  ib_rmpp_hdr_t *rmpp, void *data)
>> >>>       return mad_rpc_rmpp(&port, rpc, dport, rmpp, data);
>> >>>  }
>> >>>
>> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER;
>> >>> -
>> >>> -void
>> >>> -madrpc_lock(void)
>> >>> -{
>> >>> -       pthread_mutex_lock(&rpclock);
>> >>> -}
>> >>> -
>> >>> -void
>> >>> -madrpc_unlock(void)
>> >>> -{
>> >>> -       pthread_mutex_unlock(&rpclock);
>> >>> -}
>> >>> -
>> >>>  void
>> >>>  madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int
>> >>>  num_classes)
>> >>>  {
>> >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
>> >>> index 27b9d52..c601254 100644
>> >>> --- a/libibmad/src/sa.c
>> >>> +++ b/libibmad/src/sa.c
>> >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport,
>> >>> ibmad_gid_t
>> >>> srcgid, ibmad_gid_t destgid,
>> >>>       if (srcport) {
>> >>>               p = sa_rpc_call (srcport, buf, sm_id, &sa, 0);
>> >>>       } else {
>> >>> -               p = safe_sa_call(buf, sm_id, &sa, 0);
>> >>> +               p = sa_call(buf, sm_id, &sa, 0);
>> >>>       }
>> >>>       if (!p) {
>> >>>               IBWARN("sa call path_query failed");
>> >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport,
>> >>> ibmad_gid_t
>> >>> srcgid, ibmad_gid_t destgid,
>> >>>       mad_decode_field(p, IB_SA_PR_DLID_F, &dlid);
>> >>>       return dlid;
>> >>>  }
>> >>> +
>> >>>  int
>> >>>  ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t
>> >>>  *sm_id, void *buf)
>> >>>  {
>> >>> -       return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf);
>> >>> +       return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf);
>> >>>  }
>> >>> --
>> >>> 1.6.0.4.766.g6fc4a
>> >>>
> --
> Ira Weiny <weiny2 at llnl.gov>
>


From sean.hefty at intel.com  Tue Feb 17 16:05:45 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 16:05:45 -0800
Subject: [ofa-general] [PATCH 9/8] [ib-diag] ibping: add support for WinOF
In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID: <BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>

Allow ibping to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

For portability, use complib to obtain time stamps.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Converted another diag this afternoon.  I was able to build and execute this,
but apparently I don't have anything on my fabric that responds to the pings.
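
For anyone reading along without the old source handy, here is a minimal
sketch of the kind of microsecond time stamp helper this patch removes, plus
the subtraction used for the round-trip time.  The names demo_time_stamp_us
and demo_elapsed_us are made up for illustration, and the sketch assumes a
POSIX system with gettimeofday(); it does not show complib itself.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: wall-clock time in microseconds (POSIX only). */
static uint64_t demo_time_stamp_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000000 + (uint64_t)tv.tv_usec;
}

/* Elapsed microseconds since 'start', the same subtraction used for rtt. */
static uint64_t demo_elapsed_us(uint64_t start)
{
	return demo_time_stamp_us() - start;
}
```

Because gettimeofday() is not available in the Windows build, a helper like
this is the part that has to be swapped for a portable call.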

 infiniband-diags/src/ibping.c |   22 +++++++---------------
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c
index 29c98c2..1994eba 100644
--- a/infiniband-diags/src/ibping.c
+++ b/infiniband-diags/src/ibping.c
@@ -41,24 +41,16 @@
 #include <string.h>
 #include <signal.h>
 #include <getopt.h>
-#include <sys/time.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
+#include <complib/cl_timer.h>
 
 #include "ibdiag_common.h"
 
 static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE];
 static char last_host[IB_VENDOR_RANGE2_DATA_SIZE];
 
-static uint64_t getcurrenttime(void)
-{
-        struct timeval tv;
-
-        gettimeofday(&tv, 0);
-        return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
-}
-
 static void
 get_host_and_domain(char *data, int sz)
 {
@@ -118,7 +110,7 @@ ibping(ib_portid_t *portid, int quiet)
 
 	DEBUG("Ping..");
 
-	start = getcurrenttime();
+	start = cl_get_time_stamp();
 
 	call.method = IB_MAD_METHOD_GET;
 	call.mgmt_class = IB_VENDOR_OPENIB_PING_CLASS;
@@ -129,9 +121,9 @@ ibping(ib_portid_t *portid, int quiet)
 	memset(&call.rmpp, 0, sizeof call.rmpp);
 
 	if (!ib_vendor_call(data, portid, &call))
-		return ~0llu;
+		return ~0ull;
 
-	rtt = getcurrenttime() - start;
+	rtt = cl_get_time_stamp() - start;
 
 	if (!last_host[0])
 		memcpy(last_host, data, sizeof last_host);
@@ -149,7 +141,7 @@ static ib_portid_t portid = {0};
 
 void report(int sig)
 {
-	total_time = getcurrenttime() - start;
+	total_time = cl_get_time_stamp() - start;
 
 	DEBUG("out due signal %d", sig);
 
@@ -203,7 +195,7 @@ int main(int argc, char **argv)
 		{ "flood", 'f', 0, NULL, "flood destination" },
 		{ "oui", 'o', 1, NULL, "use specified OUI number" },
 		{ "Server", 'S', 0, NULL, "start in server mode" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<dest lid|guid>";
 
@@ -238,7 +230,7 @@ int main(int argc, char **argv)
 	signal(SIGINT, report);
 	signal(SIGTERM, report);
 
-	start = getcurrenttime();
+	start = cl_get_time_stamp();
 
 	while (count-- > 0) {
 		ntrans++;


From sashak at voltaire.com  Tue Feb 17 16:28:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 02:28:39 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090217142859.9e7a7e22.weiny2@llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
Message-ID: <20090218002839.GW7189@sashak.voltaire.com>

On 14:28 Tue 17 Feb     , Ira Weiny wrote:
> > 
> > > Are those calls which use it not thread safe?
> > 
> > They look OK but I'm not 100% sure yet.
> 
> Yea, they look thread safe but I am not sure either.  :-(

Could you, Guys, be more explicit? Really... :)

> I would be in favor of making all the utils use mad_rpc_open_port but it is up
> to Sasha if we go down this path.

The idea looks fine to me, let's review the patch.

Sasha


From sashak at voltaire.com  Tue Feb 17 16:33:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 02:33:55 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
Message-ID: <20090218003355.GX7189@sashak.voltaire.com>

On 18:21 Tue 17 Feb     , Hal Rosenstock wrote:
> >
> > For utilities which run once through I think the old functions work just
> > fine.
> 
> Well, sort of... Aren't mad_portid "collisions" possible when multiple
> programs are run concurrently ?

No.

> > However, it is pretty confusing which interface to use...  [or even that
> > there
> > are 2 interfaces, but I digress] (see below)
> 
> I don't think the newer improved interfaces were ever documented.

The old interfaces were not documented either. So it is at least consistent
:).

Sasha


From sashak at voltaire.com  Tue Feb 17 16:39:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 02:39:57 +0200
Subject: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
Message-ID: <20090218003957.GY7189@sashak.voltaire.com>

On 09:52 Mon 16 Feb     , Hal Rosenstock wrote:
> 
> A first step would be removing the portid as static. If so, portid
> would need to be a supplied parameter to various mad routines and the
> existing ones relying on madrpc_portid would be deprecated. Does this
> make sense to do ?

A first step would be converting all clients and internal usage in
libibmad (if any) to use the newer interface. If this goes smoothly
and things do not become overcomplicated, we could move forward and
deprecate the old interface, etc. Nothing new.

Sasha


From sean.hefty at intel.com  Tue Feb 17 16:36:06 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 16:36:06 -0800
Subject: [ofa-general] RE: [PATCH 9/8] [ib-diag] ibping: add support for
	WinOF
In-Reply-To: <BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
Message-ID: <AFDFB3EC988E4CB3B253DE94DD57D6FB@amr.corp.intel.com>

> 	signal(SIGINT, report);
> 	signal(SIGTERM, report);

Btw - I worked around adding cdecl before main by disabling the warning.  Since
main must be cdecl by default, the compiler fixes it, but spits out a warning.
For some reason unknown to me, the warning only occurs when building 32-bit
apps. 

However, signal() requires that the function be cdecl as well.  The above two
calls fail to compile on 32-bit Windows platforms, so I'm still working on this.
The simple approach of changing the compiler options doesn't work as easily as
it looks like it should.  The WDK build environment is 'special'.
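
One possible shape for the handler side is sketched below.  This is an
assumption for illustration, not the fix that went into WinOF: the DEMO_CDECL
macro and the demo_* names are hypothetical, and the only compiler-specific
piece relied on is the MSVC __cdecl keyword.

```c
#include <signal.h>

/* Assumption: MSVC spells the C calling convention __cdecl;
 * on other compilers the macro expands to nothing. */
#ifdef _MSC_VER
#define DEMO_CDECL __cdecl
#else
#define DEMO_CDECL
#endif

static volatile sig_atomic_t demo_last_signal;

/* Handler declared with an explicit calling convention so that its type
 * matches what signal() expects regardless of the default convention. */
static void DEMO_CDECL demo_report(int sig)
{
	demo_last_signal = sig;
}

/* Install the handler for SIGINT and SIGTERM; 0 on success, -1 on failure. */
static int demo_install_handlers(void)
{
	if (signal(SIGINT, demo_report) == SIG_ERR)
		return -1;
	if (signal(SIGTERM, demo_report) == SIG_ERR)
		return -1;
	return 0;
}
```

Marking the handler itself avoids touching the project-wide compiler options,
at the cost of a macro in the source.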

- Sean


From weiny2 at llnl.gov  Tue Feb 17 16:52:26 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 16:52:26 -0800
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions
	which use pthread
In-Reply-To: <f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
Message-ID: <20090217165226.e04949d8.weiny2@llnl.gov>

On Tue, 17 Feb 2009 18:21:02 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On 2/17/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> > On Tue, 17 Feb 2009 16:12:12 -0500
> > Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> >
> >> On Tue, Feb 17, 2009 at 12:19 PM,  <weiny2 at llnl.gov> wrote:
> >> > Quoting Hal Rosenstock <hal.rosenstock at gmail.com>:
> >> >
> >> >> Sasha,
> >> >>
> >> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky
> >> >> <sashak at voltaire.com>
> >> >> wrote:
> >> >>>
> >> >>> I looked at implementation of safe_*() functions (safe_smp_query,
> >> >>> safe_smp_set and safe_ca_call) and found that they are not actually
> >> >>> "safe" as declared by its names. The only thread-unsafe thing which
> >> >>> is used there is static 'mad_portid' structure (from rpc.c),
> >> >>
> >> >> I'm not sure that the only thread unsafe thing in the mad rpc
> >> >> mechanism is the portid.
> >> >>
> >> >>> but modification of this structure is not protected by same mutex
> >> >>> (actually
> >> >>> not protected at all).
> >> >>
> >> >> A first step would be removing the portid as static. If so, portid
> >> >> would need to be a supplied parameter to various mad routines and the
> >> >> existing ones relying on madrpc_portid would be deprecated. Does this
> >> >> make sense to do ? Would you accept such a patch ?
> >> >>
> >>
> >> > Don't we already have an interface like this with mad_rpc_open_port?
> >>
> >> I'm not sure this was carried all the way through (The basic building
> >> blocks are there but I think some additional routines are needed).
> >>
> >> Shouldn't the in tree clients be converted over and the old routines
> >> deprecated ?
> >
> > For utilities which run once through I think the old functions work just
> > fine.
> 
> Well, sort of... Aren't mad_portid "collisions" possible when multiple
> programs are run concurrently ?

I was only thinking of threading but I guess you are right.

> 
> > However, it is pretty confusing which interface to use...  [or even that
> > there
> > are 2 interfaces, but I digress] (see below)
> 
> I don't think the newer improved interfaces were ever documented.
> 
> >> > I don't like the void * return but it is "struct ibmadb_port" under the
> >> > hood.
> >>
> >> Is access into that currently opaque struct needed for something by
> >> the clients of the library ?
> >
> > There is nothing the clients need to access but it would be much better to
> > return some named data type.  This along with some documentation would
> > clarify
> > what the difference between madrpc and mad_rpc really is.  Furthermore, a
> > named type will help to "self document" other functions like "mad_rpc".  For
> > example:
> >
> >    void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc, ib_portid_t
> > * dport,
> > 	      void *payload, void *rcvdata);
> >
> > Oh now I found it...  Check out smp_[query|set]_via...  Here the interface
> > changes the parameter name and one has no idea what the type is (without
> > looking at the code that is! ;-)
> >
> >    uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> > 		       unsigned mod, unsigned timeout, const void *srcport);
> >                                                    ^^^^
> >
> >    uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid,
> > unsigned mod,
> > 		     unsigned timeout, const void *srcport);
> >                                    ^^^^
> > And here is one more...
> >    int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> > 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
> 
> Are you referring to how srcport is used to call either the old madrpc
> or the newer mad_rpc API and if the newer one srcport is really a
> pointer to a struct ibmad_port ?

Ok, I did not catch that srcport could be NULL to use the old interface, but
that could just be documented...

Currently mad_rpc takes a void *ibmad_port.  But ib_path_query_via takes a
void *srcport.  If you look under the covers they are the same type "struct
ibmad_port", if you need them.  mad_rpc names it ibmad_port which gives you
some clue about the type however srcport is generic altogether.

Whoa!  Then you look in rpc.c and mad_rpc takes void *port_id.

mad.h:
   void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
	      void *payload, void *rcvdata);
rpc.c:
   void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
	      void *payload, void *rcvdata)

<sigh>  I figured this all out for libibnetdisc since I am using the "mad_rpc"
interface but I could see where someone could get very confused, or at best
waste a lot of time looking at the code to figure out how to use the
interface.
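
To make the "named type" point concrete, here is a minimal sketch of an opaque
handle.  Everything prefixed demo_ is hypothetical and is not the libibmad
API; the sketch only assumes that callers never need to look inside the
struct.

```c
#include <stdlib.h>

/* Hypothetical opaque handle: callers see only the name, not the layout. */
struct demo_port;

struct demo_port *demo_open_port(int dev_port);
int demo_port_number(const struct demo_port *port);
void demo_close_port(struct demo_port *port);

/* Private definition; in a real library this would live in the .c file. */
struct demo_port {
	int dev_port;
};

struct demo_port *demo_open_port(int dev_port)
{
	struct demo_port *port = malloc(sizeof(*port));

	if (port)
		port->dev_port = dev_port;
	return port;
}

int demo_port_number(const struct demo_port *port)
{
	return port->dev_port;
}

void demo_close_port(struct demo_port *port)
{
	free(port);
}
```

With a prototype like demo_port_number(const struct demo_port *), passing some
other pointer type draws a compiler diagnostic, which a void * parameter
never does, and the prototype documents itself.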

Ira

> -- Hal
> 
> >> > Are those calls which use it not thread safe?
> >>
> >> They look OK but I'm not 100% sure yet.
> >
> > Yea, they look thread safe but I am not sure either.  :-(
> >
> > I would be in favor of making all the utils use mad_rpc_open_port but it is up
> > to Sasha if we go down this path.
> >
> > Ira
> >
> >>
> >> -- Hal
> >>
> >> > Ira
> >> >
> >> >
> >> >> -- Hal
> >> >>
> >> >>> As far as I know nothing uses those safe_*() primitives right now
> >> >>> outside
> >> >>> libibmad, so I think it is better to remove this confused functions
> >> >>> from
> >> >>> API (with changing library version, etc.).
> >> >>>
> >> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to
> >> >>> hidden static pthread mutex which is not controlled by caller
> >> >>> application. I think that it will be more robust for multithreaded
> >> >>> application to use its own synchronization methods (pthread mutex or
> >> >>> any
> >> >>> other) for better control. So let's remove madrpc_lock/unlock() too.
> >> >>>
> >> >>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> >> >>> ---
> >> >>>  libibmad/include/infiniband/mad.h |   41
> >> >>>  -------------------------------------
> >> >>>  libibmad/libibmad.ver             |    2 +-
> >> >>>  libibmad/src/libibmad.map         |    2 -
> >> >>>  libibmad/src/rpc.c                |   15 -------------
> >> >>>  libibmad/src/sa.c                 |    5 ++-
> >> >>>  5 files changed, 4 insertions(+), 61 deletions(-)
> >> >>>
> >> >>> diff --git a/libibmad/include/infiniband/mad.h
> >> >>>  b/libibmad/include/infiniband/mad.h
> >> >>> index eff6738..89b4be5 100644
> >> >>> --- a/libibmad/include/infiniband/mad.h
> >> >>> +++ b/libibmad/include/infiniband/mad.h
> >> >>> @@ -703,8 +703,6 @@ void *  madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t
> >> >>>  *dport, ib_rmpp_hdr_t *rmpp,
> >> >>>  void   madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> >> >>>                   int num_classes);
> >> >>>  void   madrpc_save_mad(void *madbuf, int len);
> >> >>> -void   madrpc_lock(void);
> >> >>> -void   madrpc_unlock(void);
> >> >>>  void   madrpc_show_errors(int set);
> >> >>>
> >> >>>  void * mad_rpc_open_port(char *dev_name, int dev_port, int
> >> >>> *mgmt_classes,
> >> >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t
> >> >>> *id,
> >> >>> unsigned attrid,
> >> >>>  uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid,
> >> >>>  unsigned mod,
> >> >>>                     unsigned timeout, const void *srcport);
> >> >>>
> >> >>> -inline static uint8_t *
> >> >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
> >> >>>  unsigned mod,
> >> >>> -              unsigned timeout)
> >> >>> -{
> >> >>> -       uint8_t *p;
> >> >>> -
> >> >>> -       madrpc_lock();
> >> >>> -       p = smp_query(rcvbuf, portid, attrid, mod, timeout);
> >> >>> -       madrpc_unlock();
> >> >>> -
> >> >>> -       return p;
> >> >>> -}
> >> >>> -
> >> >>> -inline static uint8_t *
> >> >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid,
> >> >>>  unsigned mod,
> >> >>> -            unsigned timeout)
> >> >>> -{
> >> >>> -       uint8_t *p;
> >> >>> -
> >> >>> -       madrpc_lock();
> >> >>> -       p = smp_set(rcvbuf, portid, attrid, mod, timeout);
> >> >>> -       madrpc_unlock();
> >> >>> -
> >> >>> -       return p;
> >> >>> -}
> >> >>> -
> >> >>>  /* sa.c */
> >> >>>  uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t
> >> >>> *sa,
> >> >>>                 unsigned timeout);
> >> >>> @@ -761,19 +733,6 @@ int        ib_path_query(ibmad_gid_t srcgid,
> >> >>>  ibmad_gid_t destgid, ib_portid_t *sm_id,
> >> >>>  int    ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> >> >>>                         ibmad_gid_t destgid, ib_portid_t *sm_id,  void
> >> >>> *buf);
> >> >>>
> >> >>> -inline static uint8_t *
> >> >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa,
> >> >>> -            unsigned timeout)
> >> >>> -{
> >> >>> -       uint8_t *p;
> >> >>> -
> >> >>> -       madrpc_lock();
> >> >>> -       p = sa_call(rcvbuf, portid, sa, timeout);
> >> >>> -       madrpc_unlock();
> >> >>> -
> >> >>> -       return p;
> >> >>> -}
> >> >>> -
> >> >>>  /* resolve.c */
> >> >>>  int    ib_resolve_smlid(ib_portid_t *sm_id, int timeout);
> >> >>>  int    ib_resolve_guid(ib_portid_t *portid, uint64_t *guid,
> >> >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver
> >> >>> index 7e93c16..23d2dc2 100644
> >> >>> --- a/libibmad/libibmad.ver
> >> >>> +++ b/libibmad/libibmad.ver
> >> >>> @@ -6,4 +6,4 @@
> >> >>>  # API_REV - advance on any added API
> >> >>>  # RUNNING_REV - advance any change to the vendor files
> >> >>>  # AGE - number of backward versions the API still supports
> >> >>> -LIBVERSION=5:0:4
> >> >>> +LIBVERSION=2:0:0
> >> >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> >> >>> index 927e51c..f944d86 100644
> >> >>> --- a/libibmad/src/libibmad.map
> >> >>> +++ b/libibmad/src/libibmad.map
> >> >>> @@ -72,14 +72,12 @@ IBMAD_1.3 {
> >> >>>               madrpc;
> >> >>>               madrpc_def_timeout;
> >> >>>               madrpc_init;
> >> >>> -               madrpc_lock;
> >> >>>               madrpc_portid;
> >> >>>               madrpc_rmpp;
> >> >>>               madrpc_save_mad;
> >> >>>               madrpc_set_retries;
> >> >>>               madrpc_set_timeout;
> >> >>>               madrpc_show_errors;
> >> >>> -               madrpc_unlock;
> >> >>>               ib_path_query;
> >> >>>               sa_call;
> >> >>>               sa_rpc_call;
> >> >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> >> >>> index 5226540..670a936 100644
> >> >>> --- a/libibmad/src/rpc.c
> >> >>> +++ b/libibmad/src/rpc.c
> >> >>> @@ -38,7 +38,6 @@
> >> >>>  #include <stdio.h>
> >> >>>  #include <stdlib.h>
> >> >>>  #include <unistd.h>
> >> >>> -#include <pthread.h>
> >> >>>  #include <string.h>
> >> >>>  #include <errno.h>
> >> >>>
> >> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport,
> >> >>>  ib_rmpp_hdr_t *rmpp, void *data)
> >> >>>       return mad_rpc_rmpp(&port, rpc, dport, rmpp, data);
> >> >>>  }
> >> >>>
> >> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER;
> >> >>> -
> >> >>> -void
> >> >>> -madrpc_lock(void)
> >> >>> -{
> >> >>> -       pthread_mutex_lock(&rpclock);
> >> >>> -}
> >> >>> -
> >> >>> -void
> >> >>> -madrpc_unlock(void)
> >> >>> -{
> >> >>> -       pthread_mutex_unlock(&rpclock);
> >> >>> -}
> >> >>> -
> >> >>>  void
> >> >>>  madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int
> >> >>>  num_classes)
> >> >>>  {
> >> >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> >> >>> index 27b9d52..c601254 100644
> >> >>> --- a/libibmad/src/sa.c
> >> >>> +++ b/libibmad/src/sa.c
> >> >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport,
> >> >>> ibmad_gid_t
> >> >>> srcgid, ibmad_gid_t destgid,
> >> >>>       if (srcport) {
> >> >>>               p = sa_rpc_call (srcport, buf, sm_id, &sa, 0);
> >> >>>       } else {
> >> >>> -               p = safe_sa_call(buf, sm_id, &sa, 0);
> >> >>> +               p = sa_call(buf, sm_id, &sa, 0);
> >> >>>       }
> >> >>>       if (!p) {
> >> >>>               IBWARN("sa call path_query failed");
> >> >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport,
> >> >>> ibmad_gid_t
> >> >>> srcgid, ibmad_gid_t destgid,
> >> >>>       mad_decode_field(p, IB_SA_PR_DLID_F, &dlid);
> >> >>>       return dlid;
> >> >>>  }
> >> >>> +
> >> >>>  int
> >> >>>  ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t
> >> >>>  *sm_id, void *buf)
> >> >>>  {
> >> >>> -       return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf);
> >> >>> +       return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf);
> >> >>>  }
> >> >>> --
> >> >>> 1.6.0.4.766.g6fc4a
> >> >>>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> > --
> > Ira Weiny <weiny2 at llnl.gov>
> >
> 


-- 
Ira Weiny <weiny2 at llnl.gov>


From sashak at voltaire.com  Tue Feb 17 17:03:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 03:03:03 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
	for the newly discovered port of the known node
In-Reply-To: <499AB068.2020205@dev.mellanox.co.il>
References: <499AB068.2020205@dev.mellanox.co.il>
Message-ID: <20090218010303.GZ7189@sashak.voltaire.com>

Hi Yevgeny,

On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
> 
> This patch fixes bugzilla issue #1515:
> 
> Topology:
>                  |---------------|
>                  |      SW2      |
>                  |---------------|
>                    |x |y    |z |v
>               |----|  |     |  |----|
>               |       |     |       |
>               |  |----|     |----|  |
>               |  |               |  |
>              a| b|              c| d|
>       |---------------|     |---------------|
>       |       SW1     |     |     SW3       |
>       |---------------|     |---------------|
>           |                             |
>           |                             |
>        HCA with SM                      HCA
> 
> During the discovery:
> 
> SM sends NodeInfo request to SW1
> SM sends NodeInfo request to SW2 through link a->x
> SM discovers new node SW2:
>   - updates DR to SW2 to go through link a->x
>   - creates physp x

And requests SwitchInfo from SW2, and on response sends PortInfo to all
switch ports. The PortInfo receiver will initialize all switch ports,
won't it?

Sasha

> SM sends NodeInfo request to SW2 through link b->y
> SM discovers a known node SW2
>   - DOES NOT create physp y
>   - updates DR to SW2 to go through link b->y
> 
> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
> port y any more, leaving it uninitialized (no physp object for this port).
> 
> The fix is to create physp for the newly discovered port of the known
> switch node, same way as it is done for HCAs.
> I also added one log message for the case that showed the problem - when
> one of the link sides is uninitialized (no valid ports check). Perhaps
> this log message should be an error message instead?
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>  1 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
> index c52c0d5..7da3103 100644
> --- a/opensm/opensm/osm_node_info_rcv.c
> +++ b/opensm/opensm/osm_node_info_rcv.c
> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>  	 */
>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>  					   p_neighbor_node,
> -					   p_ni_context->port_num))
> +					   p_ni_context->port_num)) {
> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>  		goto _exit;
> +	}
> 
>  	if (osm_node_link_exists(p_node, port_num,
>  				 p_neighbor_node, p_ni_context->port_num)) {
> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
>  				     IN osm_node_t * const p_node,
>  				     IN const osm_madw_t * const p_madw)
>  {
> +
> +	ib_smp_t *p_smp;
> +	ib_node_info_t *p_ni;
> +	uint8_t port_num;
> +
>  	OSM_LOG_ENTER(sm->p_log);
> 
> +	p_smp = osm_madw_get_smp_ptr(p_madw);
> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
> +	port_num = ib_node_info_get_local_port_num(p_ni);
> +
> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> +			"Creating physp for node GUID:0x%"
> +			PRIx64 ", port %u\n",
> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
> +			port_num);
> +		osm_node_init_physp(p_node, p_madw);
> +	}
> +
>  	/*
>  	   If this switch has already been probed during this sweep,
>  	   then don't bother reprobing it.
> -- 
> 1.5.1.4
> 


From sashak at voltaire.com  Tue Feb 17 17:15:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 03:15:05 +0200
Subject: [ofa-general] ***SPAM*** opensm/osm_inform.c:__match_inf_rec
	question
In-Reply-To: <f0e08f230902171456l6732e2c6tabf6803013c0a9b3@mail.gmail.com>
References: <f0e08f230902171456l6732e2c6tabf6803013c0a9b3@mail.gmail.com>
Message-ID: <20090218011457.GA7189@sashak.voltaire.com>

On 17:56 Tue 17 Feb     , Hal Rosenstock wrote:
> In opensm/osm_inform.c:__match_inf_rec, around line 123, there is:
> 
>         /* if inform_info.gid is not zero, ignore lid range */
>         if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
>                     sizeof(p_infr_rec->inform_record.inform_info.gid))) {
> 
> Shouldn't this be if (memcmp) rather than if (!memcmp) ?

Yes, it seems it should be without the '!'. I can trace it back to:

commit ce7f839355b9674c8d806747169d404066194235
Author: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
Date:   Mon Nov 27 16:08:42 2006 +0000

    r10169: OpenSM: Comparing InformInfo records

, where this code was introduced.

Yevgeny! Do you remember whether it was just a typo?
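
The polarity is easy to get backwards because memcmp() returns 0 on a match.
A small sketch of the test in isolation, with a made-up 16-byte struct
demo_gid standing in for the real GID type:

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical 16-byte GID, standing in for the real type. */
struct demo_gid {
	uint8_t raw[16];
};

/* memcmp() returns 0 when the buffers are equal, so "== 0" here means
 * "the gid is all zero", and "!memcmp(...)" is true for a zero gid. */
static int demo_gid_is_zero(const struct demo_gid *gid)
{
	static const struct demo_gid all_zero_gid;	/* zero-initialized */

	return memcmp(gid, &all_zero_gid, sizeof(*gid)) == 0;
}
```

So a branch that should run when the gid is not zero needs the plain
memcmp() result (or !demo_gid_is_zero() in the terms of this sketch).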

Sasha


From weiny2 at llnl.gov  Tue Feb 17 17:23:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 17:23:02 -0800
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions
	which use pthread
In-Reply-To: <20090218002839.GW7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<20090218002839.GW7189@sashak.voltaire.com>
Message-ID: <20090217172302.49a11f17.weiny2@llnl.gov>

On Wed, 18 Feb 2009 02:28:39 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 14:28 Tue 17 Feb     , Ira Weiny wrote:
> > > 
> > > > Are those calls which use it not thread safe?
> > > 
> > > They look OK but I'm not 100% sure yet.
> > 
> > Yea, they look thread safe but I am not sure either.  :-(
> 
> Could you, Guys, be more explicit? Really... :)

Neither interface is thread safe without the user implementing some
sort of locking around the calls.
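
By "locking around the calls" something like the following is meant.  It is
only a sketch: demo_unsafe_call is a hypothetical stand-in for any library
call that touches shared state, and the mutex belongs to the application
rather than to the library.

```c
#include <pthread.h>

/* Hypothetical stand-in for a library call that touches shared state. */
static int demo_shared_counter;

static int demo_unsafe_call(void)
{
	return ++demo_shared_counter;
}

/* The application owns the mutex, so it decides what the lock covers. */
static pthread_mutex_t demo_app_lock = PTHREAD_MUTEX_INITIALIZER;

static int demo_locked_call(void)
{
	int ret;

	pthread_mutex_lock(&demo_app_lock);
	ret = demo_unsafe_call();
	pthread_mutex_unlock(&demo_app_lock);
	return ret;
}
```

Every thread has to go through demo_locked_call(); a single direct call to
the unprotected function defeats the scheme.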

> 
> > I would be in favor of making all the utils use mad_rpc_open_port but it is up
> > to Sasha if we go down this path.
> 
> The idea looks fine to me, let's review the patch.

Working on a patch series now...  This is mainly to clean up the
interface.  Thread safety is a 2nd consideration...

Ira


From sashak at voltaire.com  Tue Feb 17 17:49:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 03:49:55 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090217172302.49a11f17.weiny2@llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<20090218002839.GW7189@sashak.voltaire.com>
	<20090217172302.49a11f17.weiny2@llnl.gov>
Message-ID: <20090218014955.GB7189@sashak.voltaire.com>

On 17:23 Tue 17 Feb     , Ira Weiny wrote:
> 
> Neither interface is thread safe without the user implementing some
> sort of locking around the calls.

Really? What about this:

int plus_three(int a)
{
	return a + 3;
}

We could extrapolate of course.

Sasha


From hal.rosenstock at gmail.com  Tue Feb 17 18:18:32 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 21:18:32 -0500
Subject: [ofa-general] [PATCH] libibmad: remove functions which use 
	pthread
In-Reply-To: <20090218003957.GY7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090218003957.GY7189@sashak.voltaire.com>
Message-ID: <f0e08f230902171818t70459c6egd6f494fad77867c1@mail.gmail.com>

On Tue, Feb 17, 2009 at 7:39 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 09:52 Mon 16 Feb     , Hal Rosenstock wrote:
>>
>> A first step would be removing the portid as static. If so, portid
>> would need to be a supplied parameter to various mad routines and the
>> existing ones relying on madrpc_portid would be deprecated. Does this
>> make sense to do ?
>
> A first step would be converting all clients and internal usage in
> libibmad (if any) to use the newer interface. If this goes smoothly
> and things do not become overcomplicated, we could move forward and
> deprecate the old interface, etc. Nothing new.

Why nothing new ? I think there are higher level support functions
which need to support the newer API.

-- Hal

> Sasha
>


From weiny2 at llnl.gov  Tue Feb 17 20:38:58 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 20:38:58 -0800
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions
	which use pthread
In-Reply-To: <20090218014955.GB7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<20090218002839.GW7189@sashak.voltaire.com>
	<20090217172302.49a11f17.weiny2@llnl.gov>
	<20090218014955.GB7189@sashak.voltaire.com>
Message-ID: <20090217203858.46abf45a.weiny2@llnl.gov>

On Wed, 18 Feb 2009 03:49:55 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 17:23 Tue 17 Feb     , Ira Weiny wrote:
> > 
> > Neither interface is thread safe without the user implementing some
> > sort of locking around the calls.
> 
> Really? What about this:
> 
> int plus_three(int a)
> {
> 	return a + 3;
> }
> 
> We could extrapolate of course.
> 

I don't get it?

Having static data like:

static int mad_portid = -1;
static int class_agent[MAX_CLASS];

Makes some functions dangerous.

Then the other interface does not provide any locking...  Oh, I guess you are
saying that "a" is the ibmad_port thingy...  Then yes, if you don't have threads
modifying it at the same time you will be OK.
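
To make the contrast concrete, here is a minimal sketch of hidden static
state versus a handle supplied by the caller.  All of the sketch_* names are
made up for illustration; none of them are libibmad functions, and the struct
is only a stand-in for the ibmad_port idea.

```c
#include <assert.h>
#include <stddef.h>

/* Old style: one hidden, process-wide port id, like the static above. */
static int sketch_static_portid = -1;

void sketch_init(int portid)
{
	/* Every caller (and every thread) overwrites the same variable. */
	sketch_static_portid = portid;
}

int sketch_portid(void)
{
	return sketch_static_portid;
}

/* New style: the state lives in a struct owned by the caller. */
struct sketch_port {
	int port_id;
};

int sketch_port_portid(const struct sketch_port *port)
{
	assert(port != NULL);
	/* Reads only the caller's handle; no shared data is touched. */
	return port->port_id;
}
```

With the static version, a second sketch_init() silently changes what every
other caller sees.  With the handle version, two callers holding two
different structs cannot interfere, although sharing one struct between
threads that modify it would still need locking.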

Ira


From weiny2 at llnl.gov  Tue Feb 17 21:06:39 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:39 -0800
Subject: [ofa-general] [PATCH 0/8] libibmad/infiniband-diags -- begin
 converting to "new" interface.
Message-ID: <20090217210639.9ef74a75.weiny2@llnl.gov>

Here are 8 patches which move a long way toward using just the new interface.

Converting ibping required some new functions to be created.

As I said before, this has less to do with thread safety than it does with
creating a common, clean interface.  If nothing else, it moves toward a more
complete "new" interface.
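
Roughly, the shape each converted tool ends up with is: open a port handle,
check it, pass it to the "_via" style calls, and close it at the end.  The
following is only a sketch with hypothetical demo_* stand-ins so that it
builds on its own; it is not the libibmad code, and the return value of
demo_query_via is invented purely so there is something to check.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-port handle; not the real struct. */
struct demo_port {
	int port_id;
};

/* Returns a heap-allocated handle, or NULL on a bad id or out of memory. */
struct demo_port *demo_open_port(int port_id)
{
	struct demo_port *p;

	if (port_id < 0)
		return NULL;
	p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	p->port_id = port_id;
	return p;
}

/* A "_via" style call: the handle is an explicit argument. */
int demo_query_via(int attr, const struct demo_port *srcport)
{
	assert(srcport != NULL);
	return srcport->port_id * 100 + attr;
}

/* Releases the handle; the caller must not use it afterwards. */
void demo_close_port(struct demo_port *p)
{
	free(p);
}
```

A caller would open the handle once near the top of main(), bail out if the
result is NULL, hand the handle to every query, and close it before exiting.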

Let me know what you think,
Ira

-- 
Ira Weiny <weiny2 at llnl.gov>


From weiny2 at llnl.gov  Tue Feb 17 21:06:42 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:42 -0800
Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface
Message-ID: <20090217210642.41c64624.weiny2@llnl.gov>


>From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 17:32:15 -0800
Subject: [PATCH] Clean up "new" interface

   Type all "void *ibmad_port" and "void *srcport" parameters as struct ibmad_port *.
   Create a new mad_rpc_portid(struct ibmad_port *srcport) function,
      which mirrors madrpc_portid(void).

Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 libibmad/include/infiniband/mad.h |   58 ++++++++++++++++++++++--------------
 libibmad/src/gs.c                 |   19 ++++++------
 libibmad/src/libibmad.map         |    1 +
 libibmad/src/resolve.c            |   10 ++++--
 libibmad/src/rpc.c                |   29 +++++++++---------
 libibmad/src/sa.c                 |    4 +-
 libibmad/src/smp.c                |    4 +-
 7 files changed, 71 insertions(+), 54 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 1aaaa1b..56b87e6 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt)
 }
 
 /* rpc.c */
+/* Deprecated interface */
 MAD_EXPORT int madrpc_portid(void);
-MAD_EXPORT int madrpc_set_retries(int retries);
-MAD_EXPORT int madrpc_set_timeout(int timeout);
 void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
 void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
 		  void *data);
 MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
 			    int num_classes);
 void madrpc_save_mad(void *madbuf, int len);
-MAD_EXPORT void madrpc_show_errors(int set);
 
-void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
+/* New interface */
+MAD_EXPORT void madrpc_show_errors(int set);
+MAD_EXPORT int madrpc_set_retries(int retries);
+MAD_EXPORT int madrpc_set_timeout(int timeout);
+struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
 			int num_classes);
-void mad_rpc_close_port(void *ibmad_port);
-void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
+void mad_rpc_close_port(struct ibmad_port *srcport);
+void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
 	      void *payload, void *rcvdata);
-void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
+void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
 		   ib_rmpp_hdr_t * rmpp, void *data);
+MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
 
 /* smp.c */
 MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
 			      unsigned mod, unsigned timeout);
 MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
 			    unsigned mod, unsigned timeout);
+
+/* smp.c new interface */
 MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
-		       unsigned mod, unsigned timeout, const void *srcport);
+		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
 uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
-		     unsigned timeout, const void *srcport);
+		     unsigned timeout, const struct ibmad_port *srcport);
 
 /* sa.c */
 uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
 		 unsigned timeout);
-uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
-		     ib_sa_call_t * sa, unsigned timeout);
 MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);	/* returns lid */
-int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
+
+/* sa.c new interface */
+uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
+		     ib_sa_call_t * sa, unsigned timeout);
+int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
 
 /* resolve.c */
@@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
 MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
 			       ibmad_gid_t * gid);
 
-int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
+/* resolve.c new interface */
+int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport);
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
-			ib_portid_t * sm_id, int timeout, const void *srcport);
+			ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport);
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 			      enum MAD_DEST dest, ib_portid_t * sm_id,
-			      const void *srcport);
+			      const struct ibmad_port *srcport);
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
-			const void *srcport);
+			const struct ibmad_port *srcport);
 
 /* gs.c */
 MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
@@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
 MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
 					      int port, unsigned timeout);
 
+/* gs.c new interface */
 uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned timeout,
-				      const void *srcport);
+				      const struct ibmad_port *srcport);
 uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
-				    unsigned timeout, const void *srcport);
+				    unsigned timeout, const struct ibmad_port *srcport);
 uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
 				    unsigned mask, unsigned timeout,
-				    const void *srcport);
+				    const struct ibmad_port *srcport);
 uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport);
+					const struct ibmad_port *srcport);
 uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned mask,
-					unsigned timeout, const void *srcport);
+					unsigned timeout,
+					const struct ibmad_port *srcport);
 uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport);
+					const struct ibmad_port *srcport);
 uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
 				       int port, unsigned timeout,
-				       const void *srcport);
+				       const struct ibmad_port *srcport);
 /* dump.c */
 MAD_EXPORT ib_mad_dump_fn
     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
index d2c4574..e302caf 100644
--- a/libibmad/src/gs.c
+++ b/libibmad/src/gs.c
@@ -47,7 +47,7 @@
 
 static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
 			      unsigned timeout, unsigned id,
-			      const void *srcport)
+			      const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 	int lid = dest->lid;
@@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
 
 uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned timeout,
-				      const void *srcport)
+				      const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
 			     srcport);
@@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
 }
 
 uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
-				    unsigned timeout, const void *srcport)
+				    unsigned timeout, const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_COUNTERS, srcport);
@@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned mask, unsigned timeout,
-				      unsigned id, const void *srcport)
+				      unsigned id, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 	int lid = dest->lid;
@@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
 				    unsigned mask, unsigned timeout,
-				    const void *srcport)
+				    const struct ibmad_port *srcport)
 {
 	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
 				     IB_GSI_PORT_COUNTERS, srcport);
@@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport)
+					const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_COUNTERS_EXT, srcport);
@@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned mask,
-					unsigned timeout, const void *srcport)
+					unsigned timeout,
+					const struct ibmad_port *srcport)
 {
 	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
 				     IB_GSI_PORT_COUNTERS_EXT, srcport);
@@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport)
+					const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_SAMPLES_CONTROL, srcport);
@@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
 				       int port, unsigned timeout,
-				       const void *srcport)
+				       const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_SAMPLES_RESULT, srcport);
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index f944d86..94d7762 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -69,6 +69,7 @@ IBMAD_1.3 {
 		mad_rpc_close_port;
 		mad_rpc;
 		mad_rpc_rmpp;
+		mad_rpc_portid;
 		madrpc;
 		madrpc_def_timeout;
 		madrpc_init;
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index 553949d..3291f43 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -45,7 +45,8 @@
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
 
-int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
+int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t self = { 0 };
 	uint8_t portinfo[64];
@@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
 }
 
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
-			ib_portid_t * sm_id, int timeout, const void *srcport)
+			ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t sm_portid;
 	char buf[IB_SA_DATA_SIZE] = { 0 };
@@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 			      enum MAD_DEST dest_type, ib_portid_t * sm_id,
-			      const void *srcport)
+			      const struct ibmad_port *srcport)
 {
 	uint64_t guid;
 	int lid;
@@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
 }
 
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
-			const void *srcport)
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t self = { 0 };
 	uint8_t portinfo[64];
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index e811526..d47873b 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -100,6 +100,11 @@ int madrpc_portid(void)
 	return mad_portid;
 }
 
+int mad_rpc_portid(struct ibmad_port *srcport)
+{
+	return (srcport->port_id);
+}
+
 static int
 _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 	   int timeout)
@@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 	return -1;
 }
 
-void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
+void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
 	      void *payload, void *rcvdata)
 {
-	const struct ibmad_port *p = port_id;
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
 
@@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 		return 0;
 
-	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
-			      p->class_agents[rpc->mgtclass],
+	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
+			      port->class_agents[rpc->mgtclass],
 			      len, rpc->timeout)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
@@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	return rcvdata;
 }
 
-void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
+void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
 		   ib_rmpp_hdr_t * rmpp, void *data)
 {
-	const struct ibmad_port *p = port_id;
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
 
@@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
 		return 0;
 
-	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
-			      p->class_agents[rpc->mgtclass],
+	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
+			      port->class_agents[rpc->mgtclass],
 			      len, rpc->timeout)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
@@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
 	}
 }
 
-void *mad_rpc_open_port(char *dev_name, int dev_port,
+struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
 			int *mgmt_classes, int num_classes)
 {
 	struct ibmad_port *p;
@@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
 	return p;
 }
 
-void mad_rpc_close_port(void *port_id)
+void mad_rpc_close_port(struct ibmad_port *port)
 {
-	struct ibmad_port *p = port_id;
-
-	umad_close_port(p->port_id);
-	free(p);
+	umad_close_port(port->port_id);
+	free(port);
 }
 
 uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
index 7403d4f..ddeb152 100644
--- a/libibmad/src/sa.c
+++ b/libibmad/src/sa.c
@@ -44,7 +44,7 @@
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
 
-uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
+uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
 		     ib_sa_call_t * sa, unsigned timeout)
 {
 	ib_rpc_t rpc = { 0 };
@@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
 			IB_PR_COMPMASK_SGID |\
 			IB_PR_COMPMASK_NUMBPATH)
 
-int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
+int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
 {
 	int npath;
diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
index fad263c..e5489b3 100644
--- a/libibmad/src/smp.c
+++ b/libibmad/src/smp.c
@@ -45,7 +45,7 @@
 #define DEBUG 	if (ibdebug)	IBWARN
 
 uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
-		     unsigned mod, unsigned timeout, const void *srcport)
+		     unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 
@@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
 }
 
 uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
-		       unsigned mod, unsigned timeout, const void *srcport)
+		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:45 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:45 -0800
Subject: [ofa-general] [PATCH 2/8] Remove unused function madrpc_save_mad
Message-ID: <20090217210645.e4762c94.weiny2@llnl.gov>

>From 17ff2ea4947b64453e00b93ab1dfd639a69a7c35 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 17:40:01 -0800
Subject: [PATCH] Remove unused function madrpc_save_mad

   including the internal data used by it.

Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 libibmad/include/infiniband/mad.h |    1 -
 libibmad/src/libibmad.map         |    1 -
 libibmad/src/rpc.c                |   14 --------------
 3 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 56b87e6..5806e70 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -731,7 +731,6 @@ void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
 		  void *data);
 MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
 			    int num_classes);
-void madrpc_save_mad(void *madbuf, int len);
 
 /* New interface */
 MAD_EXPORT void madrpc_show_errors(int set);
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 94d7762..6f0c0b5 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -75,7 +75,6 @@ IBMAD_1.3 {
 		madrpc_init;
 		madrpc_portid;
 		madrpc_rmpp;
-		madrpc_save_mad;
 		madrpc_set_retries;
 		madrpc_set_timeout;
 		madrpc_show_errors;
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index d47873b..20eeb89 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -57,8 +57,6 @@ static int iberrs;
 
 static int madrpc_retries = MAD_DEF_RETRIES;
 static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS;
-static void *save_mad;
-static int save_mad_len = 256;
 
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
@@ -71,12 +69,6 @@ void madrpc_show_errors(int set)
 	iberrs = set;
 }
 
-void madrpc_save_mad(void *madbuf, int len)
-{
-	save_mad = madbuf;
-	save_mad_len = len;
-}
-
 int madrpc_set_retries(int retries)
 {
 	if (retries > 0)
@@ -121,12 +113,6 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 		xdump(stderr, "send buf\n", sndbuf, umad_size() + len);
 	}
 
-	if (save_mad) {
-		memcpy(save_mad, umad_get_mad(sndbuf),
-		       save_mad_len < len ? save_mad_len : len);
-		save_mad = 0;
-	}
-
 	trid =
 	    (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F);
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:46 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:46 -0800
Subject: [ofa-general] [PATCH 3/8] Convert ibaddr to "new" ibmad interface
Message-ID: <20090217210646.5e74b9ed.weiny2@llnl.gov>


>From 5bdf4bdf8ccba45f1a9a56b1c617fc711e73300d Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 17:56:12 -0800
Subject: [PATCH] Convert ibaddr to "new" ibmad interface


Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibaddr.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
index 88ad904..fa62dbc 100644
--- a/infiniband-diags/src/ibaddr.c
+++ b/infiniband-diags/src/ibaddr.c
@@ -45,6 +45,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int
 ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid)
 {
@@ -55,10 +57,10 @@ ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid)
 	ibmad_gid_t gid;
 	int lmc;
 
-	if (!smp_query(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return -1;
 
-	if (!smp_query(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid);
@@ -137,17 +139,22 @@ int main(int argc, char **argv)
 	if (!show_lid && !show_gid)
 		show_lid = show_gid = 1;
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+						ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
-		if (ib_resolve_self(&portid, &port, 0) < 0)
+		if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0)
 			IBERROR("can't resolve self port %s", argv[0]);
 	}
 
 	if (ib_resolve_addr(&portid, port, show_lid, show_gid) < 0)
 		IBERROR("can't resolve requested address");
+
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:48 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:48 -0800
Subject: [ofa-general] [PATCH 4/8] convert ibping to "new" ibmad interface
Message-ID: <20090217210648.403309e0.weiny2@llnl.gov>

>From d109788f46b5839698f3f4a1f75bcbfe22a3b46d Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 20:08:53 -0800
Subject: [PATCH] convert ibping to "new" ibmad interface

   To do this I needed the following additional functions
   mad_register_client_via
   mad_register_server_via
   mad_send_via
   mad_receive_via
   mad_respond_via
   ib_vendor_call_via

Also mark some further functions as deprecated and clean up the interface a
bit more.

Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibping.c     |   21 +++++++++----
 libibmad/include/infiniband/mad.h |   29 ++++++++++++++++++
 libibmad/src/libibmad.map         |    5 +++
 libibmad/src/mad_internal.h       |   44 ++++++++++++++++++++++++++++
 libibmad/src/register.c           |   58 ++++++++++++++++++++++++++++++-------
 libibmad/src/rpc.c                |    8 +----
 libibmad/src/serv.c               |   39 +++++++++++++++++++++++--
 libibmad/src/vendor.c             |   15 ++++++++-
 8 files changed, 190 insertions(+), 29 deletions(-)
 create mode 100644 libibmad/src/mad_internal.h

diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c
index 29c98c2..7d458bf 100644
--- a/infiniband-diags/src/ibping.c
+++ b/infiniband-diags/src/ibping.c
@@ -48,6 +48,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE];
 static char last_host[IB_VENDOR_RANGE2_DATA_SIZE];
 
@@ -90,7 +92,7 @@ ibping_serv(void)
 
 	DEBUG("starting to serve...");
 
-	while ((umad = mad_receive(0, -1))) {
+	while ((umad = mad_receive_via(0, -1, srcport))) {
 
 		mad = umad_get_mad(umad);
 		data = (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS;
@@ -99,7 +101,7 @@ ibping_serv(void)
 
 		DEBUG("Pong: %s", data);
 
-		if (mad_respond(umad, 0, 0) < 0)
+		if (mad_respond_via(umad, 0, 0, srcport) < 0)
 			DEBUG("respond failed");
 
 		mad_free(umad);
@@ -128,7 +130,7 @@ ibping(ib_portid_t *portid, int quiet)
 	call.timeout = 0;
 	memset(&call.rmpp, 0, sizeof call.rmpp);
 
-	if (!ib_vendor_call(data, portid, &call))
+	if (!ib_vendor_call_via(data, portid, &call, srcport))
 		return ~0llu;
 
 	rtt = getcurrenttime() - start;
@@ -216,10 +218,12 @@ int main(int argc, char **argv)
 	if (!argc && !server)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (server) {
-		if (mad_register_server(ping_class, 0, 0, oui) < 0)
+		if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0)
 			IBERROR("can't serve class %d on this port", ping_class);
 
 		get_host_and_domain(host_and_domain, sizeof host_and_domain);
@@ -229,10 +233,11 @@ int main(int argc, char **argv)
 		exit(0);
 	}
 
-	if (mad_register_client(ping_class, 0) < 0)
+	if (mad_register_client_via(ping_class, 0, srcport) < 0)
 		IBERROR("can't register ping class %d on this port", ping_class);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+					ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	signal(SIGINT, report);
@@ -260,5 +265,7 @@ int main(int argc, char **argv)
 
 	report(0);
 
+	mad_rpc_close_port(srcport);
+
 	exit(-1);
 }
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 5806e70..8e61395 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -650,6 +650,7 @@ enum MAD_NODE_TYPE {
 };
 
 /******************************************************************************/
+struct ibmad_port;
 
 /* portid.c */
 MAD_EXPORT char *portid2str(ib_portid_t * portid);
@@ -692,26 +693,50 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport,
 			     ib_rmpp_hdr_t * rmpp, void *data);
 
 /* register.c */
+/* deprecated */
 MAD_EXPORT int mad_register_port_client(int port_id, int mgmt,
 					uint8_t rmpp_version);
 MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version);
 MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version,
 				   long method_mask[16 / sizeof(long)],
 				   uint32_t class_oui);
+
+/* register.c new interface */
+MAD_EXPORT int mad_register_client_via(int mgmt, uint8_t rmpp_version,
+				struct ibmad_port *srcport);
+MAD_EXPORT int mad_register_server_via(int mgmt, uint8_t rmpp_version,
+				long method_mask[16 / sizeof(long)],
+				uint32_t class_oui,
+				struct ibmad_port *srcport);
 MAD_EXPORT int mad_class_agent(int mgmt);
 MAD_EXPORT int mad_agent_class(int agent);
 
 /* serv.c */
+/* deprecated */
 MAD_EXPORT int mad_send(ib_rpc_t * rpc, ib_portid_t * dport,
 			ib_rmpp_hdr_t * rmpp, void *data);
 MAD_EXPORT void *mad_receive(void *umad, int timeout);
 MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus);
+
+/* serv.c new interface */
+MAD_EXPORT int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport,
+			ib_rmpp_hdr_t * rmpp, void *data,
+			struct ibmad_port *srcport);
+MAD_EXPORT void *mad_receive_via(void *umad, int timeout,
+			struct ibmad_port *srcport);
+MAD_EXPORT int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus,
+			struct ibmad_port *srcport);
 MAD_EXPORT void *mad_alloc(void);
 MAD_EXPORT void mad_free(void *umad);
 
 /* vendor.c */
+/* deprecated */
 MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
 				   ib_vendor_call_t * call);
+/* vendor.c new interface */
+MAD_EXPORT uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid,
+				   ib_vendor_call_t * call,
+				   struct ibmad_port *srcport);
 
 static inline int mad_is_vendor_range1(int mgmt)
 {
@@ -746,6 +771,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t
 MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
 
 /* smp.c */
+/* deprecated */
 MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
 			      unsigned mod, unsigned timeout);
 MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
@@ -758,6 +784,7 @@ uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
 		     unsigned timeout, const struct ibmad_port *srcport);
 
 /* sa.c */
+/* deprecated */
 uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
 		 unsigned timeout);
 MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);	/* returns lid */
@@ -769,6 +796,7 @@ int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
 
 /* resolve.c */
+/* deprecated */
 MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
 MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
 			       ib_portid_t * sm_id, int timeout);
@@ -790,6 +818,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 			const struct ibmad_port *srcport);
 
 /* gs.c */
+/* deprecated */
 MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
 					     int port, unsigned timeout);
 MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest,
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 6f0c0b5..ee1804a 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -60,6 +60,8 @@ IBMAD_1.3 {
 		mad_class_agent;
 		mad_register_client;
 		mad_register_server;
+		mad_register_client_via;
+		mad_register_server_via;
 		ib_resolve_guid;
 		ib_resolve_portid_str;
 		ib_resolve_self;
@@ -85,10 +87,13 @@ IBMAD_1.3 {
 		mad_free;
 		mad_receive;
 		mad_respond;
+		mad_receive_via;
+		mad_respond_via;
 		mad_send;
 		smp_query;
 		smp_set;
 		ib_vendor_call;
+		ib_vendor_call_via;
 		smp_query_via;
 		smp_set_via;
 		ib_path_query_via;
diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
new file mode 100644
index 0000000..9afe7a9
--- /dev/null
+++ b/libibmad/src/mad_internal.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _MAD_INTERNAL_H_
+#define _MAD_INTERNAL_H_
+
+#define MAX_CLASS 256
+
+struct ibmad_port {
+	int port_id;		/* file descriptor returned by umad_open() */
+	int class_agents[MAX_CLASS];	/* class2agent mapper */
+};
+
+#endif /* _MAD_INTERNAL_H_ */
diff --git a/libibmad/src/register.c b/libibmad/src/register.c
index 4d91ff8..4aabd7c 100644
--- a/libibmad/src/register.c
+++ b/libibmad/src/register.c
@@ -43,10 +43,11 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
+#include "mad_internal.h"
+
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
 
-#define MAX_CLASS	256
 #define MAX_AGENTS	256
 
 static int class_agent[MAX_CLASS];
@@ -136,22 +137,57 @@ int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version)
 
 int mad_register_client(int mgmt, uint8_t rmpp_version)
 {
+	int rc = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	rc = mad_register_client_via(mgmt, rmpp_version, &port);
+	if (rc < 0)
+		return rc;
+	return register_agent(port.class_agents[mgmt], mgmt);
+}
+
+int mad_register_client_via(int mgmt, uint8_t rmpp_version,
+			struct ibmad_port *srcport)
+{
 	int agent;
 
-	agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version);
+	if (!srcport)
+		return -1;
+
+	agent = mad_register_port_client(mad_rpc_portid(srcport), mgmt, rmpp_version);
 	if (agent < 0)
 		return agent;
 
-	return register_agent(agent, mgmt);
+	srcport->class_agents[mgmt] = agent;
+	return 0;
 }
 
 int
 mad_register_server(int mgmt, uint8_t rmpp_version,
 		    long method_mask[], uint32_t class_oui)
 {
+	int rc = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	port.class_agents[mgmt] = class_agent[mgmt];
+	rc = mad_register_server_via(mgmt, rmpp_version,
+				method_mask, class_oui,
+				&port);
+	if (rc < 0)
+		return rc;
+	return register_agent(port.class_agents[mgmt], mgmt);
+}
+
+int
+mad_register_server_via(int mgmt, uint8_t rmpp_version,
+		    long method_mask[], uint32_t class_oui,
+		    struct ibmad_port *srcport)
+{
 	long class_method_mask[16 / sizeof(long)];
 	uint8_t oui[3];
-	int agent, vers, mad_portid;
+	int agent, vers;
 
 	if (method_mask)
 		memcpy(class_method_mask, method_mask,
@@ -159,11 +195,12 @@ mad_register_server(int mgmt, uint8_t rmpp_version,
 	else
 		memset(class_method_mask, 0xff, sizeof(class_method_mask));
 
-	if ((mad_portid = madrpc_portid()) < 0)
+	if (!srcport)
 		return -1;
 
-	if (class_agent[mgmt] >= 0) {
-		DEBUG("Class 0x%x already registered", mgmt);
+	if (srcport->class_agents[mgmt] >= 0) {
+		DEBUG("Class 0x%x already registered %d",
+			mgmt, srcport->class_agents[mgmt]);
 		return -1;
 	}
 	if ((vers = mgmt_class_vers(mgmt)) <= 0) {
@@ -175,19 +212,18 @@ mad_register_server(int mgmt, uint8_t rmpp_version,
 		oui[0] = (class_oui >> 16) & 0xff;
 		oui[1] = (class_oui >> 8) & 0xff;
 		oui[2] = class_oui & 0xff;
-		if ((agent = umad_register_oui(mad_portid, mgmt, rmpp_version,
+		if ((agent = umad_register_oui(srcport->port_id, mgmt, rmpp_version,
 					       oui, class_method_mask)) < 0) {
 			DEBUG("Can't register agent for class %d", mgmt);
 			return -1;
 		}
-	} else if ((agent = umad_register(mad_portid, mgmt, vers, rmpp_version,
+	} else if ((agent = umad_register(srcport->port_id, mgmt, vers, rmpp_version,
 					  class_method_mask)) < 0) {
 		DEBUG("Can't register agent for class %d", mgmt);
 		return -1;
 	}
 
-	if (register_agent(agent, mgmt) < 0)
-		return -1;
+	srcport->class_agents[mgmt] = agent;
 
 	return agent;
 }
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index 20eeb89..bcb0a75 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -43,12 +43,7 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
-#define MAX_CLASS 256
-
-struct ibmad_port {
-	int port_id;		/* file descriptor returned by umad_open() */
-	int class_agents[MAX_CLASS];	/* class2agent mapper */
-};
+#include "mad_internal.h"
 
 int ibdebug;
 
@@ -325,6 +320,7 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
 		return NULL;
 	}
 
+	memset(p->class_agents, 0xff, sizeof p->class_agents);
 	while (num_classes--) {
 		uint8_t rmpp_version = 0;
 		int mgmt = *mgmt_classes++;
diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c
index c7631bb..0ce1660 100644
--- a/libibmad/src/serv.c
+++ b/libibmad/src/serv.c
@@ -42,12 +42,25 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
+#include "mad_internal.h"
+
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
 
 int
 mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass);
+	return mad_send_via(rpc, dport, rmpp, data, &port);
+}
+
+int
+mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data,
+		struct ibmad_port *srcport)
+{
 	uint8_t pktbuf[1024];
 	void *umad = pktbuf;
 
@@ -64,7 +77,7 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 		      (char *)umad_get_mad(umad) + rpc->dataoffs, rpc->datasz);
 	}
 
-	if (umad_send(madrpc_portid(), mad_class_agent(rpc->mgtclass),
+	if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass],
 		      umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) {
 		IBWARN("send failed; %m");
 		return -1;
@@ -75,6 +88,18 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 
 int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 {
+	int i = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	for (i = 1; i < MAX_CLASS; i++)
+		port.class_agents[i] = mad_class_agent(i);
+	return mad_respond_via(umad, portid, rstatus, &port);
+}
+
+int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus,
+		struct ibmad_port *srcport)
+{
 	uint8_t *mad = umad_get_mad(umad);
 	ib_mad_addr_t *mad_addr;
 	ib_rpc_t rpc = { 0 };
@@ -138,7 +163,7 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 	if (ibdebug > 1)
 		xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE);
 
-	if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad,
+	if (umad_send(srcport->port_id, srcport->class_agents[rpc.mgtclass], umad,
 		      IB_MAD_SIZE, rpc.timeout, 0) < 0) {
 		DEBUG("send failed; %m");
 		return -1;
@@ -149,11 +174,19 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 
 void *mad_receive(void *umad, int timeout)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	return mad_receive_via(umad, timeout, &port);
+}
+
+void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport)
+{
 	void *mad = umad ? umad : umad_alloc(1, umad_size() + IB_MAD_SIZE);
 	int agent;
 	int length = IB_MAD_SIZE;
 
-	if ((agent = umad_recv(madrpc_portid(), mad, &length, timeout)) < 0) {
+	if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) {
 		if (!umad)
 			umad_free(mad);
 		DEBUG("recv failed: %m");
diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c
index 50a878e..1a129e5 100644
--- a/libibmad/src/vendor.c
+++ b/libibmad/src/vendor.c
@@ -40,6 +40,7 @@
 #include <string.h>
 
 #include <infiniband/mad.h>
+#include "mad_internal.h"
 
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
@@ -53,6 +54,16 @@ static inline int response_expected(int method)
 uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
 			ib_vendor_call_t * call)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	return ib_vendor_call_via(data, portid, call, &port);
+}
+
+uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid,
+			ib_vendor_call_t * call,
+			struct ibmad_port *srcport)
+{
 	ib_rpc_t rpc = { 0 };
 	int range1 = 0, resp_expected;
 
@@ -90,7 +101,7 @@ uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
 		portid->qkey = IB_DEFAULT_QP1_QKEY;
 
 	if (resp_expected)
-		return madrpc_rmpp(&rpc, portid, 0, data);	/* FIXME: no RMPP for now */
+		return mad_rpc_rmpp(srcport, &rpc, portid, 0, data);	/* FIXME: no RMPP for now */
 
-	return mad_send(&rpc, portid, 0, data) < 0 ? 0 : data;	/* FIXME: no RMPP for now */
+	return mad_send_via(&rpc, portid, 0, data, srcport) < 0 ? 0 : data;	/* FIXME: no RMPP for now */
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:50 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:50 -0800
Subject: [ofa-general] [PATCH 5/8] Convert ibportstate to "new" ibmad
	interface
Message-ID: <20090217210650.3397dd72.weiny2@llnl.gov>

>From dacabed9a22d308d9f61beb6f4906f2414a5ee29 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 20:21:14 -0800
Subject: [PATCH] Convert ibportstate to "new" ibmad interface


Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibportstate.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index d1a112b..4edafd0 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -46,6 +46,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 /*******************************************/
 
 static int
@@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data)
 {
 	int node_type;
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return -1;
 
 	node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
@@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
 	char buf[2048];
 	char val[64];
 
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	if (port_op != 4) {
@@ -223,9 +225,12 @@ int main(int argc, char **argv)
 	if (argc < 2)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	/* First, make sure it is a switch port if it is a "set" */
@@ -314,7 +319,8 @@ int main(int argc, char **argv)
 					peerportid.drpath.p[1] = portnum;
 
 					/* Set DrSLID to local lid */
-					if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
+					if (ib_resolve_self_via(&selfportid,
+							&selfport, 0, srcport) < 0)
 						IBERROR("could not resolve self");
 					peerportid.drpath.drslid = selfportid.lid;
 					peerportid.drpath.drdlid = 0xffff;
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:53 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:53 -0800
Subject: [ofa-general] [PATCH 6/8] Convert ibroute to "new" ibmad interface
Message-ID: <20090217210653.9c88786f.weiny2@llnl.gov>

>From 2edbb6ec9d7828bfd75777dbaab8918675d3bd06 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 20:28:21 -0800
Subject: [PATCH] Convert ibroute to "new" ibmad interface


Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibroute.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 144d1b2..60bfdd8 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -49,6 +49,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int brief, dump_all, multicast;
 
 /*******************************************/
@@ -61,12 +63,12 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid,
 	int type;
 
 	DEBUG("checking node type");
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, 0)) {
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) {
 		xdump(stderr, "nodeinfo\n", ni, sizeof ni);
 		return "node info failed: valid addr?";
 	}
 
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, 0))
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, 0, srcport))
 		return "node desc failed";
 
 	mad_decode_field(ni, IB_NODE_TYPE_F, &type);
@@ -77,7 +79,7 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid,
 	mad_decode_field(ni, IB_NODE_NPORTS_F, nports);
 	mad_decode_field(ni, IB_NODE_GUID_F, guid);
 
-	if (!smp_query(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0))
+	if (!smp_query_via(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0, srcport))
 		return "switch info failed: is a switch node?";
 
 	return 0;
@@ -195,7 +197,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid)
 			mod = (block - IB_MIN_MCAST_LID/IB_MLIDS_IN_BLOCK) | (j << 28);
 
 			DEBUG("reading block %x chunk %d mod %x", block, j, mod);
-			if (!smp_query(mft + j, portid, IB_ATTR_MULTICASTFORWTBL, mod, 0))
+			if (!smp_query_via(mft + j, portid,
+					IB_ATTR_MULTICASTFORWTBL, mod, 0, srcport))
 				return "multicast forwarding table get failed";
 		}
 
@@ -259,9 +262,9 @@ dump_lid(char *str, int strlen, int lid, int valid)
 	portguid = 0;
 	lidport.lid = lid;
 
-	if (!smp_query(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100) ||
-	    !smp_query(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100) ||
-	    !smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100))
+	if (!smp_query_via(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100, srcport) ||
+	    !smp_query_via(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100, srcport) ||
+	    !smp_query_via(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100, srcport))
 		return snprintf(str, strlen, ": (unknown node and type)");
 
 	mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid);
@@ -316,7 +319,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 	endblock = ALIGN(endlid, IB_SMP_DATA_SIZE) / IB_SMP_DATA_SIZE;
 	for (block = startblock; block <= endblock; block++) {
 		DEBUG("reading block %d", block);
-		if (!smp_query(lft, portid, IB_ATTR_LINEARFORWTBL, block, 0))
+		if (!smp_query_via(lft, portid, IB_ATTR_LINEARFORWTBL, block,
+				0, srcport))
 			return "linear forwarding table get failed";
 		i = block * IB_SMP_DATA_SIZE;
 		e = i + IB_SMP_DATA_SIZE;
@@ -403,12 +407,15 @@ int main(int argc, char **argv)
 	if (argc > 2)
 		endlid = strtoul(argv[2], 0, 0);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (!argc) {
-		if (ib_resolve_self(&portid, 0, 0) < 0)
+		if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0)
 			IBERROR("can't resolve self addr");
-	} else if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	} else if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
 
 	if (multicast)
@@ -419,5 +426,6 @@ int main(int argc, char **argv)
 	if (err)
 		IBERROR("dump tables: %s", err);
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:54 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:54 -0800
Subject: [ofa-general] [PATCH 7/8] Convert ibsendtrap to "new" ibmad
	interface
Message-ID: <20090217210654.b70a38d3.weiny2@llnl.gov>

>From ac3d76c8ed77ab406a3297c1ba15598ae7cc15d2 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 20:45:16 -0800
Subject: [PATCH] Convert ibsendtrap to "new" ibmad interface

   also make mad_send_via public to do the conversion

Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibsendtrap.c |   13 +++++++++----
 libibmad/src/libibmad.map         |    1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..d038dff 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -47,6 +47,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int send_144_node_desc_update(void)
 {
 	ib_portid_t sm_port;
@@ -55,10 +57,10 @@ static int send_144_node_desc_update(void)
 	ib_rpc_t trap_rpc;
 	ib_mad_notice_attr_t notice;
 
-	if (ib_resolve_self(&selfportid, &selfport, NULL))
+	if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport))
 		IBERROR("can't resolve self");
 
-	if (ib_resolve_smlid(&sm_port, 0))
+	if (ib_resolve_smlid_via(&sm_port, 0, srcport))
 		IBERROR("can't resolve SM destination port");
 
 	memset(&trap_rpc, 0, sizeof(trap_rpc));
@@ -80,7 +82,7 @@ static int send_144_node_desc_update(void)
 	notice.data_details.ntc_144.change_flgs =
 	    TRAP_144_MASK_NODE_DESCRIPTION_CHANGE;
 
-	return (mad_send(&trap_rpc, &sm_port, NULL, &notice));
+	return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport));
 }
 
 typedef struct _trap_def {
@@ -137,7 +139,10 @@ int main(int argc, char **argv)
 	}
 
 	madrpc_show_errors(1);
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	return (send_trap(trap_name));
 }
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index ee1804a..4a44f02 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -90,6 +90,7 @@ IBMAD_1.3 {
 		mad_receive_via;
 		mad_respond_via;
 		mad_send;
+		mad_send_via;
 		smp_query;
 		smp_set;
 		ib_vendor_call;
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Feb 17 21:06:56 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:56 -0800
Subject: [ofa-general] [PATCH 8/8] Convert ibtracert to "new" ibmad interface
Message-ID: <20090217210656.598be400.weiny2@llnl.gov>

>From 69db58d3e525031f5a975403574b36d6b9b3adf2 Mon Sep 17 00:00:00 2001
From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
Date: Tue, 17 Feb 2009 20:56:40 -0800
Subject: [PATCH] Convert ibtracert to "new" ibmad interface


Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
---
 infiniband-diags/src/ibtracert.c |   36 ++++++++++++++++++++++++------------
 1 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index ea5662b..1965aa0 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -50,6 +50,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 #define MAXHOPS	63
 
 static char *node_type_str[] = {
@@ -116,10 +118,10 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 	void *pi = port->portinfo, *ni = node->nodeinfo, *nd = node->nodedesc;
 	char *s, *e;
 
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout))
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport))
 		return -1;
 
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout))
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport))
 		return -1;
 
 	for (s = nd, e = s + 64; s < e; s++) {
@@ -129,7 +131,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 			*s = ' ';
 	}
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport))
 		return -1;
 
 	mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid);
@@ -151,7 +153,7 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid)
 {
 	void *si = sw->switchinfo, *fdb = sw->fdb;
 
-	if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
+	if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport))
 		return -1;
 
 	mad_decode_field(si, IB_SW_LINEAR_FDB_CAP_F, &sw->linearcap);
@@ -160,7 +162,8 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid)
 	if (lid > sw->linearcap && lid > sw->linearFDBtop)
 		return -1;
 
-	if (!smp_query(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, timeout))
+	if (!smp_query_via(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64,
+			timeout, srcport))
 		return -1;
 
 	DEBUG("portid %s: forward lid %d to port %d",
@@ -382,7 +385,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid)
 
 	port->portnum = portnum;
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout,
+			srcport))
 		return -1;
 
 	mad_decode_field(pi, IB_PORT_LID_F, &port->lid);
@@ -439,7 +443,7 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map)
 
 	memset(map, 0, 256);
 
-	if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
+	if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport))
 		return -1;
 
 	mlid -= 0xc000;
@@ -453,8 +457,8 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map)
 	maxsets = (node->numports + 15) / 16;		/* round up */
 
 	for (set = 0; set < maxsets; set++) {
-		if (!smp_query(mdb, portid, IB_ATTR_MULTICASTFORWTBL,
-		    block | (set << 28), timeout))
+		if (!smp_query_via(mdb, portid, IB_ATTR_MULTICASTFORWTBL,
+		    block | (set << 28), timeout, srcport))
 			return -1;
 
 		for (i = 0; i < 16; i++, map++) {
@@ -746,13 +750,18 @@ int main(int argc, char **argv)
 	if (ibd_timeout)
 		timeout = ibd_timeout;
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+
 	node_name_map = open_node_name_map(node_name_map_file);
 
-	if (ib_resolve_portid_str(&src_portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&src_portid, argv[0], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve source port %s", argv[0]);
 
-	if (ib_resolve_portid_str(&dest_portid, argv[1], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&dest_portid, argv[1], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
 
 	if (ibd_dest_type == IB_DEST_DRPATH) {
@@ -796,5 +805,8 @@ int main(int argc, char **argv)
 	dump_mcpath(endnode, dumplevel);
 
 	close_node_name_map(node_name_map);
+
+	mad_rpc_close_port(srcport);
+
 	exit(0);
 }
-- 
1.5.4.5


From jackm at dev.mellanox.co.il  Tue Feb 17 23:13:15 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 18 Feb 2009 09:13:15 +0200
Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp,
	do path_free only for newly-created paths
In-Reply-To: <adad4dg65eb.fsf@cisco.com>
References: <200902171701.36107.jackm@dev.mellanox.co.il>
	<adad4dg65eb.fsf@cisco.com>
Message-ID: <200902180913.16171.jackm@dev.mellanox.co.il>

On Wednesday 18 February 2009 00:54, Roland Dreier wrote:
>  > Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
>  > Signed-off-by: Moni Shua <monis at voltaire.com>
> 
> This doesn't make any sense... Moni was not involved in sending this
> patch at all, and in any case since you are sending the patch your s-o-b
> should be last.  If you want to give credit to Moni then include it in
> the description as you did for Yossi.
> 

Yossi identified the problem flow. I wrote and tested the actual patch.
Moni reviewed it, and I wrote the final version. I always thought that
the first s-o-b was for the patch writer. Next time, I'll do it right.


From monis at Voltaire.COM  Wed Feb 18 00:07:08 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 18 Feb 2009 10:07:08 +0200
Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp,	do path_free
	only for newly-created paths
In-Reply-To: <adad4dg65eb.fsf@cisco.com>
References: <200902171701.36107.jackm@dev.mellanox.co.il>
	<adad4dg65eb.fsf@cisco.com>
Message-ID: <499BC1AC.6010908@Voltaire.COM>

Roland Dreier wrote:
> thanks, applied...
> 
>  > Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
>  > Signed-off-by: Moni Shua <monis at voltaire.com>
> 
> This doesn't make any sense... Moni was not involved in sending this
> patch at all, and in any case since you are sending the patch your s-o-b
> should be last.  If you want to give credit to Moni then include it in
> the description as you did for Yossi.
> 
This is fine with me (if it's still relevant).


From kliteyn at dev.mellanox.co.il  Wed Feb 18 01:15:02 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 18 Feb 2009 11:15:02 +0200
Subject: [ofa-general] ***SPAM*** opensm/osm_inform.c:__match_inf_rec
	question
In-Reply-To: <20090218011457.GA7189@sashak.voltaire.com>
References: <f0e08f230902171456l6732e2c6tabf6803013c0a9b3@mail.gmail.com>
	<20090218011457.GA7189@sashak.voltaire.com>
Message-ID: <499BD196.8070504@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> On 17:56 Tue 17 Feb     , Hal Rosenstock wrote:
>> In opensm/osm_inform.c:__match_inf_rec, around line 123, there is:
>>
>>         /* if inform_info.gid is not zero, ignore lid range */
>>         if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
>>                     sizeof(p_infr_rec->inform_record.inform_info.gid))) {
>>
>> Shouldn't this be if (memcmp) rather than if (!memcmp) ?
> 
> Yes, seems it should be without '!'. I can track it up to:
> 
> commit ce7f839355b9674c8d806747169d404066194235
> Author: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> Date:   Mon Nov 27 16:08:42 2006 +0000
> 
>     r10169: OpenSM: Comparing InformInfo records
> 
> , where this code was introduced.
> 
> Yevgeny! Do you remember whether it was just a typo?

Can't think of any reason for the '!' to be there.
Looks like a typo.
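
A minimal sketch of why the '!' inverts the test, assuming a hypothetical
16-byte example_gid_t in place of the real ib_gid_t. memcmp() returns 0
when the buffers are equal, so !memcmp(gid, zero) is true for an all-zero
GID, which is the opposite of what the "gid is not zero" comment asks for.
The helper names below are illustrative only and are not OpenSM code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte GID, standing in for ib_gid_t. */
typedef struct {
	uint8_t raw[16];
} example_gid_t;

/* Returns 1 when the GID is all zeros, 0 otherwise. */
static int gid_is_zero(const example_gid_t *gid)
{
	static const example_gid_t all_zero_gid = { { 0 } };

	/* memcmp() returns 0 when the two buffers are equal */
	return memcmp(gid, &all_zero_gid, sizeof(*gid)) == 0;
}

/* "if inform_info.gid is not zero, ignore lid range" */
static int should_ignore_lid_range(const example_gid_t *gid)
{
	/* equivalent to: memcmp(gid, &all_zero_gid, sizeof(*gid)) != 0 */
	return !gid_is_zero(gid);
}
```

With this reading, the condition in __match_inf_rec would be written as
if (memcmp(...)) so that the branch is taken only for a non-zero GID.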

-- Yevgeny

> Sasha
> 


From kliteyn at dev.mellanox.co.il  Wed Feb 18 01:31:07 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 18 Feb 2009 11:31:07 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
 for the newly discovered port of the known node
In-Reply-To: <20090218010303.GZ7189@sashak.voltaire.com>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218010303.GZ7189@sashak.voltaire.com>
Message-ID: <499BD55B.3090606@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
>> This patch fixes bugzilla issue #1515:
>>
>> Topology:
>>                  |---------------|
>>                  |      SW2      |
>>                  |---------------|
>>                    |x |y    |z |v
>>               |----|  |     |  |----|
>>               |       |     |       |
>>               |  |----|     |----|  |
>>               |  |               |  |
>>              a| b|              c| d|
>>       |---------------|     |---------------|
>>       |       SW1     |     |     SW3       |
>>       |---------------|     |---------------|
>>           |                             |
>>           |                             |
>>        HCA with SM                      HCA
>>
>> During the discovery:
>>
>> SM sends NodeInfo request to SW1
>> SM sends NodeInfo request to SW2 through link a->x
>> SM discovers new node SW2:
>>   - updates DR to SW2 to go through link a->x
>>   - creates physp x
> 
> And requests SwitchInfo from SW2, and on response sends PortInfo to all
> switch ports. PortInfo receiver will initialize all switch ports. Isn't
> it?

Links are created only on receiving a NodeInfo response. W/o the
fix, when SW1 gets NodeInfo from SW2 through link b->y, it
doesn't initialize physp for y, hence the link can't be created.
So the only chance for the link to be created is when
SW2 sends a NodeInfo request to SW1 through link y->b.
But this never happens, because the DR for SW2 is updated
to contain this link, so the SM doesn't probe the remote side
of y, to avoid a loop.

BTW, the same thing happens with every other link that connects
the same nodes. In the example above, link v<->d will be
missing as well.
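
The flow above can be sketched with a toy model. This is not OpenSM code:
struct ex_node, ex_new_node(), ex_existing_switch() and
ex_link_has_valid_ports() are hypothetical stand-ins for osm_node_t, the
new-node path, __osm_ni_rcv_process_existing_switch() and
osm_node_link_has_valid_ports(), with one flag per port instead of a real
physp object.

```c
#include <assert.h>
#include <string.h>

#define EX_MAX_PORTS 8

/* Hypothetical, simplified node: one flag per port instead of a physp. */
struct ex_node {
	int physp_valid[EX_MAX_PORTS];
};

/* First NodeInfo response from a node: create the node and the physp
 * for the port the response arrived on. */
static void ex_new_node(struct ex_node *node, int port_num)
{
	memset(node, 0, sizeof(*node));
	node->physp_valid[port_num] = 1;
}

/* NodeInfo response from an already known switch, arriving on port_num.
 * create_missing == 0 models the behaviour before the fix. */
static void ex_existing_switch(struct ex_node *node, int port_num,
			       int create_missing)
{
	if (create_missing && !node->physp_valid[port_num])
		node->physp_valid[port_num] = 1;
}

/* A link can only be recorded when both end ports have a physp. */
static int ex_link_has_valid_ports(const struct ex_node *a, int pa,
				   const struct ex_node *b, int pb)
{
	return a->physp_valid[pa] && b->physp_valid[pb];
}
```

In this model, port y of SW2 stays without a physp after the second
NodeInfo response unless create_missing is set, so the b<->y link check
fails exactly as described.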

-- Yevgeny

> Sasha
> 
>> SM sends NodeInfo request to SW2 through link b->y
>> SM discovers a known node SW2
>>   - DOES NOT create physp y
>>   - updates DR to SW2 to go through link b->y
>>
>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
>> port y any more, leaving it uninitialized (no physp object for this port).
>>
>> The fix is to create physp for the newly discovered port of the known
>> switch node, same way as it is done for HCAs.
>> I also added one log message for the case that showed the problem - when
>> one of the link sides is uninitialized (no valid ports check). Perhaps
>> this log message should be an error message instead?
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>>  1 files changed, 23 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
>> index c52c0d5..7da3103 100644
>> --- a/opensm/opensm/osm_node_info_rcv.c
>> +++ b/opensm/opensm/osm_node_info_rcv.c
>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>>  	 */
>>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>>  					   p_neighbor_node,
>> -					   p_ni_context->port_num))
>> +					   p_ni_context->port_num)) {
>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
>> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>>  		goto _exit;
>> +	}
>>
>>  	if (osm_node_link_exists(p_node, port_num,
>>  				 p_neighbor_node, p_ni_context->port_num)) {
>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
>>  				     IN osm_node_t * const p_node,
>>  				     IN const osm_madw_t * const p_madw)
>>  {
>> +
>> +	ib_smp_t *p_smp;
>> +	ib_node_info_t *p_ni;
>> +	uint8_t port_num;
>> +
>>  	OSM_LOG_ENTER(sm->p_log);
>>
>> +	p_smp = osm_madw_get_smp_ptr(p_madw);
>> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
>> +	port_num = ib_node_info_get_local_port_num(p_ni);
>> +
>> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>> +			"Creating physp for node GUID:0x%"
>> +			PRIx64 ", port %u\n",
>> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
>> +			port_num);
>> +		osm_node_init_physp(p_node, p_madw);
>> +	}
>> +
>>  	/*
>>  	   If this switch has already been probed during this sweep,
>>  	   then don't bother reprobing it.
>> -- 
>> 1.5.1.4
>>
> 


From sashak at voltaire.com  Wed Feb 18 01:52:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 11:52:30 +0200
Subject: [ofa-general] Re: [PATCH 8/8] [ib-diags] smpquery: add support for
	WinOF
In-Reply-To: <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
Message-ID: <20090218095230.GC7189@sashak.voltaire.com>

Hi Sean,

On 14:37 Tue 17 Feb     , Sean Hefty wrote:
> Allow smpquery to build and run on both Linux and Windows.  Windows
> build files are maintained in the WinOF repository.  These changes
> allow dropping the infiniband-diags into the WinOF build environment.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
> 
>  infiniband-diags/src/smpquery.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
> index 44280e1..2d3d91b 100644
> --- a/infiniband-diags/src/smpquery.c
> +++ b/infiniband-diags/src/smpquery.c
> @@ -47,7 +47,7 @@
>  
>  #include <infiniband/umad.h>
>  #include <infiniband/mad.h>
> -#include <infiniband/complib/cl_nodenamemap.h>
> +#include <complib/cl_nodenamemap.h>

Is this needed? The rest of the tools use a similar path with a leading 'infiniband'.

>  
>  #include "ibdiag_common.h"
>  
> @@ -191,7 +191,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc)
>  	} else
>  		mad_decode_field(data, IB_NODE_PARTITION_CAP_F, &n);
>  
> -	for (i = 0; i < (n + 31) / 32; i++) {
> +	for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) {

Wouldn't it be better to declare i, j, k as int? A fixed width of 32 bits
doesn't make any sense here.
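
The bound in that loop is a round-up division: the number of 32-entry
blocks needed to cover n entries. A minimal sketch of the int-indexed
variant, assuming n is a plain signed int (the helper names here are
illustrative and are not part of smpquery.c):

```c
/* Number of fixed-size blocks needed to cover n entries (rounds up). */
static int blocks_needed(int n, int per_block)
{
	return (n + per_block - 1) / per_block;
}

/* Walk the blocks with a plain int index; the comparison against the
 * int bound then needs no cast. Returns how many blocks were visited. */
static int count_blocks_by_loop(int n, int per_block)
{
	int i, visited = 0;

	for (i = 0; i < blocks_needed(n, per_block); i++)
		visited++;
	return visited;
}
```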

>  		mod =  i | (portnum << 16);
>  		if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0))
>  			return "pkey table query failed";
> @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc)
>  		return "port info failed";
>  	mad_decode_field(data, IB_PORT_GUID_CAP_F, &n);
>  
> -	for (i = 0; i < (n + 7) / 8; i++) {
> +	for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) {

Ditto.

Sasha

>  		mod =  i;
>  		if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0))
>  			return "guid info query failed";
> @@ -412,7 +412,7 @@ int main(int argc, char **argv)
>  	const struct ibdiag_opt opts[] = {
>  		{ "combined", 'c', 0, NULL, "use Combined route address argument"},
>  		{ "node-name-map", 1, 1, "<file>", "node name map file"},
> -		{}
> +		{ 0 }
>  	};
>  	const char *usage_examples[] = {
>  		"portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier",
> 
> 
> 


From sashak at voltaire.com  Wed Feb 18 01:54:15 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 11:54:15 +0200
Subject: [ofa-general] Re: [PATCH 7/8] [ib-diags] smpdump: add support for
	WinOF
In-Reply-To: <B54048123DBA4D1F8FDB835F9A6FFA74@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<B54048123DBA4D1F8FDB835F9A6FFA74@amr.corp.intel.com>
Message-ID: <20090218095415.GD7189@sashak.voltaire.com>

On 14:36 Tue 17 Feb     , Sean Hefty wrote:
> Allow smpdump to build and run on both Linux and Windows.  Windows
> build files are maintained in the WinOF repository.  These changes
> allow dropping the infiniband-diags into the WinOF build environment.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Applied (patches 1-7). Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 18 02:00:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 12:00:08 +0200
Subject: [ofa-general] Re: [PATCH 9/8] [ib-diag] ibping: add support for
	WinOF
In-Reply-To: <BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
Message-ID: <20090218100008.GE7189@sashak.voltaire.com>

On 16:05 Tue 17 Feb     , Sean Hefty wrote:
> Allow ibping to build and run on both Linux and Windows.  Windows
> build files are maintained in the WinOF repository.  These changes
> allow dropping the infiniband-diags into the WinOF build environment.
> 
> For portability, use complib to obtain time stamps.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Applied. Thanks.

> ---
> Converted another diag this afternoon.  I was able to build and execute this,
> but apparently I don't have anything on my fabric that responds to the pings.

You need to run ibping server ('ibping -S') on one side and then run
ibping <ibping-server-lid>.

Sasha


From sashak at voltaire.com  Wed Feb 18 02:17:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 12:17:21 +0200
Subject: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <f0e08f230902171818t70459c6egd6f494fad77867c1@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090218003957.GY7189@sashak.voltaire.com>
	<f0e08f230902171818t70459c6egd6f494fad77867c1@mail.gmail.com>
Message-ID: <20090218101721.GF7189@sashak.voltaire.com>

On 21:18 Tue 17 Feb     , Hal Rosenstock wrote:
> On Tue, Feb 17, 2009 at 7:39 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 09:52 Mon 16 Feb     , Hal Rosenstock wrote:
> >>
> >> A first step would be removing the portid as static. If so, portid
> >> would need to be a supplied parameter to various mad routines and the
> >> existing ones relying on madrpc_portid would be deprecated. Does this
> >> make sense to do ?
> >
> > A first step would be converting all clients and internal usage in
> > libibmad (if any) to use a newer interface. If this goes smoothly
> > and things do not become overcomplicated, we could move forward
> > and deprecate the old interface, etc. Nothing new.
> 
> Why nothing new ? I think there are higher level support functions
> which need to support the newer API.

I meant "nothing new" in the API replace/upgrade procedure.

Sasha


From sashak at voltaire.com  Wed Feb 18 02:30:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 12:30:18 +0200
Subject: [ofa-general] Re: [PATCH 9/8] [ib-diag] ibping: add support for
	WinOF
In-Reply-To: <AFDFB3EC988E4CB3B253DE94DD57D6FB@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
	<AFDFB3EC988E4CB3B253DE94DD57D6FB@amr.corp.intel.com>
Message-ID: <20090218103018.GG7189@sashak.voltaire.com>

On 16:36 Tue 17 Feb     , Sean Hefty wrote:
> > 	signal(SIGINT, report);
> > 	signal(SIGTERM, report);
> 
> Btw - I worked around adding cdecl before main by disabling the warning.  Since
> main must be cdecl by default, the compiler fixes it, but spits out a warning.
> For some reason unknown to me, the warning only occurs when building 32-bit
> apps. 
> 
> However, signal() requires that the function be cdecl as well.

I guess this is about the report() function. Why not make everything cdecl
(by using a compiler/linker flag, or some super-#pragma in config.h or so)?
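
For comparison, the per-function alternative (annotating only the
handler) could look roughly like the sketch below. HANDLER_CDECL and
install_report_handlers() are made-up names for illustration, and the
handler body is a stand-in for whatever ibping's report() really does:

```c
#include <signal.h>

/* Hypothetical macro: expands to the MSVC calling-convention keyword on
 * Windows and to nothing elsewhere. */
#if defined(_WIN32)
#define HANDLER_CDECL __cdecl
#else
#define HANDLER_CDECL
#endif

static volatile sig_atomic_t last_signal;

/* Stand-in handler: only records which signal arrived. */
static void HANDLER_CDECL report(int sig)
{
	last_signal = sig;
}

/* Register the handler for both signals; returns 0 on success. */
static int install_report_handlers(void)
{
	if (signal(SIGINT, report) == SIG_ERR)
		return -1;
	if (signal(SIGTERM, report) == SIG_ERR)
		return -1;
	return 0;
}
```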

> The above two
> calls fail to compile on 32-bit Windows platforms, so I'm still working on this.
> The simple approach of changing the compiler options doesn't work as easily as
> it looks like it should.  The WDK build environment is 'special'.

Ugh, I really fail to understand why WinOF cannot evaluate the option of
using less "special" build tools for WDK-insensitive code (such as
user-space programs ported from Linux) - it would solve all those issues
just magically. And we have not yet entered the more complicated porting
areas such as pthreads...

Sasha


From volker.jaenisch at inqbus.de  Wed Feb 18 02:47:04 2009
From: volker.jaenisch at inqbus.de (Dr. Volker Jaenisch)
Date: Wed, 18 Feb 2009 11:47:04 +0100
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26
 under Debian Lenny 
Message-ID: <499BE728.8080002@inqbus.de>

Hello Ofa-List!

Compiling the ofa-kernel modules from OFED-1.4 on Debian Lenny Kernel 
2.6.26 (on amd64) gives
me the following trace:

[..]
/usr/bin/make -f scripts/Makefile.build 
obj=/usr/src/modules/ofa-kernel-source/drivers/scsi
gcc-4.1 
-Wp,-MD,/usr/src/modules/ofa-kernel-source/drivers/scsi/.scsi_transport_iscsi.o.d 
-nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.1.3/include 
-D__KERNEL__ \
-include include/linux/autoconf.h \
-include /usr/src/modules/ofa-kernel-source/include/linux/autoconf.h \
-I/usr/src/modules/ofa-kernel-source/kernel_addons/backport/2.6.26/include/ 
\
\
\
-I/usr/src/modules/ofa-kernel-source/include \
-I/usr/src/modules/ofa-kernel-source/drivers/infiniband/debug \
-I/usr/local/include/scst \
-I/usr/src/modules/ofa-kernel-source/drivers/infiniband/ulp/srpt \
-I/usr/src/modules/ofa-kernel-source/drivers/net/cxgb3 \
-Iinclude \
\
-I/usr/src/linux-headers-2.6.26-1-amd64/arch/x86_64/include \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing 
-fno-common -Werror-implicit-function-declaration -Os 
-fno-stack-protector -m64 -mtune=generic -mno-red-zone -mcmodel=kernel 
-funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 
-DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare 
-fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow 
-Iinclude/asm-x86/mach-default -fomit-frame-pointer -g 
-Wdeclaration-after-statement -Wno-pointer-sign -DMODULE 
-D"KBUILD_STR(s)=#s" 
-D"KBUILD_BASENAME=KBUILD_STR(scsi_transport_iscsi)" 
-D"KBUILD_MODNAME=KBUILD_STR(scsi_transport_iscsi)" -c -o 
/usr/src/modules/ofa-kernel-source/drivers/scsi/.tmp_scsi_transport_iscsi.o 
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c: 
In function ‘iscsi_create_endpoint’:
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:174: 
warning: passing argument 3 of ‘class_find_device’ from incompatible 
pointer type
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:174: 
error: too many arguments to function ‘class_find_device’
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c: 
In function ‘iscsi_lookup_endpoint’:
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:226: 
warning: passing argument 3 of ‘class_find_device’ from incompatible 
pointer type
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:226: 
error: too many arguments to function ‘class_find_device’
make[5]: *** 
[/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.o] 
Error 1
make[4]: *** [/usr/src/modules/ofa-kernel-source/drivers/scsi] Error 2
make[3]: *** [_module_/usr/src/modules/ofa-kernel-source] Error 2
make[3]: Leaving directory `/usr/src/linux-headers-2.6.26-1-amd64'
make[2]: *** [kernel] Error 2
make[2]: Leaving directory `/usr/src/modules/ofa-kernel-source'
make[1]: *** [binary-modules] Error 2
make[1]: Leaving directory `/usr/src/modules/ofa-kernel-source'
make: *** [kdist_build] Error 2

The code is backported correctly to 2.6.26

[..]
for templ in `ls debian/*.modules.in` ; do \
test -e ${templ%.modules.in}.backup || cp ${templ%.modules.in} 
${templ%.modules.in}.backup 2>/dev/null || true; \
sed -e 's/##KVERS##/2.6.26-1-amd64/g ;s/#KVERS#/2.6.26-1-amd64/g ; 
s/_KVERS_/2.6.26-1-amd64/g ; s/##KDREV##/2.6.26-13/g ; s/#KDREV#/2.6.2
6-13/g ; s/_KDREV_/2.6.26-13/g ' < $templ > ${templ%.modules.in}; \
done
./ofed_scripts/ofed_patch.sh --kernel-version=2.6.26
mkdir -p /usr/src/modules/ofa-kernel-source/patches
[..]

On Google I found this thread 
http://groups.google.com/group/open-iscsi/browse_thread/thread/9bdb0cf059c1b3d3
that describes a similar problem. But in that case there are too few 
parameters, not too many.

The complete trace is available at 
http://www.inqbus-hosting.de/ofa-kernel-source.buildlog.2.6.26-1-amd64.1234948099

Any help welcome

Volker Jaenisch

-- 
====================================================
   inqbus it-consulting      +49 ( 341 )  5643800
   Dr.  Volker Jaenisch      http://www.inqbus.de
   Herloßsohnstr.    12      0 4 1 5 5    Leipzig
   N  O  T -  F Ä L L E      +49 ( 170 )  3113748
====================================================


From vlad at lists.openfabrics.org  Wed Feb 18 03:17:29 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 18 Feb 2009 03:17:29 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090218-0200 daily build status
Message-ID: <20090218111729.A7559E60E43@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From ogerlitz at voltaire.com  Wed Feb 18 05:35:41 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Feb 2009 15:35:41 +0200
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499BE728.8080002@inqbus.de>
References: <499BE728.8080002@inqbus.de>
Message-ID: <499C0EAD.7040604@voltaire.com>

Dr. Volker Jaenisch wrote:
> Hello Ofa-List!  Compiling the ofa-kernel modules from OFED-1.4 on 
> Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace:
First, this list is for the development of the Linux RDMA stack, not 
OFED; please send OFED issues to ewg at lists.openfabrics.org. Second, 
what makes you want to replace the IB stack that comes with Debian 
rather than update the distro?


Or.


From kovlensky at interia.pl  Wed Feb 18 06:22:03 2009
From: kovlensky at interia.pl (kovlensky at interia.pl)
Date: 18 Feb 2009 15:22:03 +0100
Subject: [ofa-general] ***SPAM*** ofed 1.2.5.5 for SLES10 SP2?
Message-ID: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl>

Hi all,

Are there any plans to make ofed 1.2.5.5 compile on SLES10 SP2? In the backport directory I can see only 2.6.16_sles10 and 2.6.16_sles10_sp1. When compiling the ib kernel modules from ofed 1.2.5.5 on SP2, the build uses the 2.6.16_sles10_sp1 directory, which disagrees with the SP2 kernel on a few typedefs, so compilation fails. The problem lies in the kernel version change: 2.6.16.46-0.12-smp in SP1 became 2.6.16.60-0.21-smp in SP2, and the latter has a few typedefs changed.

Regards,

Kovlensky Vladimir



From tziporet at dev.mellanox.co.il  Wed Feb 18 06:34:08 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Feb 2009 16:34:08 +0200
Subject: [ofa-general] ***SPAM*** ofed 1.2.5.5 for SLES10 SP2?
In-Reply-To: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl>
References: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl>
Message-ID: <499C1C60.3090501@mellanox.co.il>

kovlensky at interia.pl wrote:
> Hi all,
>
> Are there any plans to make ofed 1.2.5.5 compile on SLES10 SP2? In the backport directory I can see only 2.6.16_sles10 and 2.6.16_sles10_sp1. When compiling the ib kernel modules from ofed 1.2.5.5 on SP2, the build uses the 2.6.16_sles10_sp1 directory, which disagrees with the SP2 kernel on a few typedefs, so compilation fails. The problem lies in the kernel version change: 2.6.16.46-0.12-smp in SP1 became 2.6.16.60-0.21-smp in SP2, and the latter has a few typedefs changed.
>
>   

There is no such plan.
Please use 1.3.1 or 1.4 for SLES 10 SP2.
Of course, you can add the backports yourself.

Tziporet


From tziporet at dev.mellanox.co.il  Wed Feb 18 06:40:26 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Feb 2009 16:40:26 +0200
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <1234893143.21802.96.camel@pc.interlinx.bc.ca>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
	<49994BB2.3010206@mellanox.co.il>
	<7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>
	<499A8A20.1090507@mellanox.co.il>
	<1234893143.21802.96.camel@pc.interlinx.bc.ca>
Message-ID: <499C1DDA.3060601@mellanox.co.il>

Brian J. Murrell wrote:
> Ahhh.  But should he just include <ofed-prefix>/src/openib/include/ or
> also
> <ofed-prefix>/src/openib/kernel_addons/backport/<kernel_ver>/include/
> (as described in <ofed-prefix>/src/openib/ofed_patch.mk as well?
>
> And in what order should these be specified in?
>
>   
You need both.
The order is not important.

Tziporet


From hnrose at comcast.net  Wed Feb 18 07:10:15 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:10:15 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_inform.c: Fix sense of
	zero GID compare in __match_inf_rec
Message-ID: <20090218151015.GA6482@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---

diff --git a/opensm/opensm/osm_inform.c b/opensm/opensm/osm_inform.c
index 4c773f6..6763a2a 100644
--- a/opensm/opensm/osm_inform.c
+++ b/opensm/opensm/osm_inform.c
@@ -121,7 +121,7 @@ __match_inf_rec(IN const cl_list_item_t * const p_list_item, IN void *context)
 	memset(&all_zero_gid, 0, sizeof(ib_gid_t));
 
 	/* if inform_info.gid is not zero, ignore lid range */
-	if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
+	if (memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
 		    sizeof(p_infr_rec->inform_record.inform_info.gid))) {
 		if (memcmp(&p_infr->inform_record.inform_info.gid,
 			   &p_infr_rec->inform_record.inform_info.gid,
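
The sense of the test hinges on memcmp() returning 0 for equal buffers,
so "!memcmp(gid, zero)" reads as "gid is all zero", the opposite of what
the comment above the changed line describes. A minimal standalone
sketch of the zero-GID check, using a hypothetical 16-byte stand-in
type rather than the real ib_gid_t:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit GID; not the OpenSM ib_gid_t. */
struct gid_sketch {
	uint8_t raw[16];
};

/* Returns 1 when every byte of the GID is zero, 0 otherwise. */
static int gid_is_zero(const struct gid_sketch *gid)
{
	/* Static storage duration, so this object is zero-initialized. */
	static const struct gid_sketch all_zero_gid;

	/* memcmp() yields 0 on equality, hence the explicit "== 0". */
	return memcmp(gid, &all_zero_gid, sizeof(*gid)) == 0;
}
```

With a helper like this, "ignore the lid range when the GID is not zero"
would be written as a test on !gid_is_zero(...), which matches the
corrected condition in the patch.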


From swise at opengridcomputing.com  Wed Feb 18 07:12:46 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 18 Feb 2009 09:12:46 -0600
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: logical-/bit-or confusion?
In-Reply-To: <499BD470.4080705@gmail.com>
References: <499BD470.4080705@gmail.com>
Message-ID: <499C256E.7050004@opengridcomputing.com>

Roel Kluin wrote:
> Please review.
> --------------------------->8-------------8<------------------------------
> Logical-/bit-or typo
>
> Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
> ---
> diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
> index 44e936e..61889e6 100644
> --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
> +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
> @@ -890,7 +890,7 @@ static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
>  	 */
>  	state_set(&ep->com, FPDU_MODE);
>  	ep->mpa_attr.initiator = 1;
> -	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
> +	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0;
>  	ep->mpa_attr.recv_marker_enabled = markers_enabled;
>  	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
>  	ep->mpa_attr.version = mpa_rev;
>   
This is a typo, but the logic behaves the same either way, which is why 
it wasn't detected I guess. 

But it should really be ||.
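
A quick way to see why both spellings give the same result: both | and
|| bind tighter than ?:, and (a | b) is nonzero exactly when a or b is
nonzero. A small sketch, with an illustrative flag value that is not
taken from the driver:

```c
/* Illustrative flag bit only; the real MPA_CRC value lives in the driver. */
#define MPA_CRC_SKETCH 0x40u

/* Bitwise-or spelling. The extra parentheses only make explicit how the
 * unparenthesized expression already groups. */
static int crc_with_bit_or(unsigned flags, int crc_enabled)
{
	return ((flags & MPA_CRC_SKETCH) | crc_enabled) ? 1 : 0;
}

/* Logical-or spelling, as in the patch. */
static int crc_with_logical_or(unsigned flags, int crc_enabled)
{
	return ((flags & MPA_CRC_SKETCH) || crc_enabled) ? 1 : 0;
}
```

The two differ only in that || short-circuits and states the intent
directly.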

Reviewed-by: Steve Wise <swise at opengridcomputing.com>


From hal.rosenstock at gmail.com  Wed Feb 18 07:20:13 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:20:13 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090218003355.GX7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
	<20090218003355.GX7189@sashak.voltaire.com>
Message-ID: <f0e08f230902180720w25f74a8cs8c659757f331d425@mail.gmail.com>

On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 18:21 Tue 17 Feb     , Hal Rosenstock wrote:
>> >
>> > For utilities which run once through I think the old functions work just
>> > fine.
>>
>> Well, sort of... Aren't mad_portid "collisions" possible when multiple
>> programs are run concurrently ?
>
> No.

With the old API, mad_portid can be overwritten by another process or
thread. Another thread is not an expected use case but it is possible.

>> > However, it is pretty confusing which interface to use...  [or even that
>> > there
>> > are 2 interfaces, but I digress] (see below)
>>
>> I don't think the newer improved interfaces were ever documented.
>
> The old interfaces were not documented either. So it is at least consistent
> :).

There are no man pages but there is a doc (libibmad.txt) which is
somewhat out of date as it was never updated for the new interfaces.

-- Hal

> Sasha


From hnrose at comcast.net  Wed Feb 18 07:29:13 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:29:13 -0500
Subject: [ofa-general] [PATCH] opensm/man/opensm.8.in: Indicate ROUTER_EXP
	deprecated
Message-ID: <20090218152913.GC8489@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/man/opensm.8.in |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 7690980..6a5d833 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -569,8 +569,8 @@ opensm will return the path to the first available matching router.
 A configuration file with a single line where both prefix and GUID
 are wild-carded means that a path record query specifying any
 off-subnet DGID should return a path to the first available router.
-This configuration yields the same behaviour formerly achieved by
-compiling opensm with -DROUTER_EXP.
+This configuration yields the same behavior formerly achieved by
+compiling opensm with -DROUTER_EXP which has been deprecated.
 
 .SH ROUTING
 .PP
-- 
1.5.6.4


From hnrose at comcast.net  Wed Feb 18 07:32:27 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:32:27 -0500
Subject: [ofa-general] ***SPAM*** opensm/osm_console.c: Improve perfmgr
	print_counters error message
Message-ID: <20090218153227.GF8489@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/opensm/osm_console.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 00e2a94..da66ee5 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1158,7 +1158,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 							   p_cmd, out);
 			} else {
 				fprintf(out,
-					"print_counters requires a node name to be specified\n");
+					"print_counters requires a node name or node GUID to be specified\n");
 			}
 		} else if (strcmp(p_cmd, "sweep_time") == 0) {
 			p_cmd = next_token(p_last);
-- 
1.5.6.4


From hnrose at comcast.net  Wed Feb 18 07:30:16 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:30:16 -0500
Subject: [ofa-general] [PATCH] Add pkey table support to
	osm_get_all_port_attrs
Message-ID: <20090218153016.GD8489@comcast.net>


Only supported in osm_vendor_ibumad.c (separate patch for other
vendor layers)
Also, update applications using this (osmtest, opensm)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++-----
 opensm/opensm/main.c                 |    6 ++++++
 opensm/osmtest/main.c                |   11 +++++++++++
 opensm/osmtest/osmtest.c             |    7 +++++++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
index 734a860..861bfbe 100644
--- a/opensm/libvendor/osm_vendor_ibumad.c
+++ b/opensm/libvendor/osm_vendor_ibumad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	umad_ca_t ca;
 	ib_port_attr_t *attr = p_attr_array;
 	unsigned done = 0;
-	int r, i, j;
+	int r, i, j, k;
 
 	OSM_LOG_ENTER(p_vend->p_log);
 
 	CL_ASSERT(p_vend && p_num_ports);
 
+	r = 0;
 	if (!*p_num_ports) {
 		r = IB_INVALID_PARAMETER;
 		OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: "
@@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	}
 
 	for (i = 0; i < p_vend->ca_count && !done; i++) {
-		/*
-		 * For each CA, retrieve the port guids
-		 */
+		/* For each CA, retrieve the port attributes */
 		if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) {
 			if (ca.node_type < 1 || ca.node_type > 3)
 				continue;
@@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 				attr->port_num = ca.ports[j]->portnum;
 				attr->sm_lid = ca.ports[j]->sm_lid;
 				attr->link_state = ca.ports[j]->state;
+				attr->num_pkeys = ca.ports[j]->pkeys_size;
+				if (attr->num_pkeys && attr->p_pkey_table) {
+					if (attr->num_pkeys < ca.ports[j]->pkeys_size) {
+						r = IB_INSUFFICIENT_MEMORY;
+						OSM_LOG(p_vend->p_log,
+							OSM_LOG_ERROR,
+							"ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
+							j,
+							ca.ports[j]->pkeys_size);
+						goto Exit;
+					}
+					for (k = 0; k < attr->num_pkeys; k++)
+						attr->p_pkey_table[k] =
+							cl_hton16(ca.ports[j]->pkeys[k]);
+				}
 				attr++;
 				if (attr - p_attr_array > *p_num_ports) {
 					done = 1;
@@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	}
 
 	*p_num_ports = attr - p_attr_array;
-	r = 0;
 
 Exit:
 	OSM_LOG_EXIT(p_vend->p_log);
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 73a6274..503d7fa 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid)
 	uint32_t i, choice = 0;
 	ib_api_status_t status;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/* Call the transport layer for a list of local port GUID values */
 	status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array,
 					      &num_ports);
diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
index b360af6..83c1e13 100644
--- a/opensm/osmtest/main.c
+++ b/opensm/osmtest/main.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt)
 	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
 	int i;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	   Call the transport layer for a list of local port
 	   GUID values.
@@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid)
 	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
 	int i;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	   Call the transport layer for a list of local port
 	   GUID values.
diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
index a7b343f..986a8d2 100644
--- a/opensm/osmtest/osmtest.c
+++ b/opensm/osmtest/osmtest.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt,
 	ib_api_status_t status;
 	uint32_t num_ports = MAX_LOCAL_IBPORTS;
 	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
+	int i;
 
 	OSM_LOG_ENTER(&p_osmt->log);
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	 * Call the transport layer for a list of local port
 	 * GUID values.
-- 
1.5.6.4
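
The caller-supplied-buffer contract described above (the caller sets
num_pkeys to the table capacity and points p_pkey_table at its buffer,
or leaves both zero/NULL) can be sketched in isolation as follows. This
is only an illustration with a hypothetical struct and helper, not the
vendor-layer code; it omits the byte-order conversion and reads the
caller's capacity before overwriting num_pkeys, so the size check has
something to compare against:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the port attribute record. */
struct port_attr_sketch {
	uint16_t num_pkeys;     /* in: table capacity; out: pkeys on the port */
	uint16_t *p_pkey_table; /* caller buffer, or NULL if not wanted */
};

/* Returns 0 on success and -1 if the caller's table is too small.
 * In both cases num_pkeys is set to the number of pkeys on the port. */
static int copy_pkeys(struct port_attr_sketch *attr,
		      const uint16_t *src, uint16_t src_count)
{
	uint16_t capacity = attr->num_pkeys; /* save before overwriting */
	uint16_t k;

	attr->num_pkeys = src_count;
	if (!attr->p_pkey_table || !capacity)
		return 0; /* caller did not ask for the table itself */
	if (capacity < src_count)
		return -1;
	for (k = 0; k < src_count; k++)
		attr->p_pkey_table[k] = src[k];
	return 0;
}
```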


From hnrose at comcast.net  Wed Feb 18 07:31:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:31:32 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/libvendor: Add pkey table
	request handling in osm_get_all_port_attrs
Message-ID: <20090218153132.GE8489@comcast.net>


in all other (than osm_vendor_ibumad) OpenSM vendor layers
Done by code inspection; not even compile tested

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 opensm/libvendor/osm_vendor_al.c            |    4 ++++
 opensm/libvendor/osm_vendor_mlx_hca.c       |    4 ++++
 opensm/libvendor/osm_vendor_mlx_hca_anafa.c |    5 ++++-
 opensm/libvendor/osm_vendor_mlx_hca_pfs.c   |    4 ++++
 opensm/libvendor/osm_vendor_mlx_hca_sim.c   |    4 ++++
 opensm/libvendor/osm_vendor_mlx_sa.c        |    7 +++++++
 opensm/libvendor/osm_vendor_mtl_hca_guid.c  |    9 +++++++++
 7 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/opensm/libvendor/osm_vendor_al.c b/opensm/libvendor/osm_vendor_al.c
index d5d78c9..2bcbf9f 100644
--- a/opensm/libvendor/osm_vendor_al.c
+++ b/opensm/libvendor/osm_vendor_al.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -670,6 +671,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 			num_ports = osm_ca_info_get_num_ports(p_ca_info);
 
 			for (port_num = 0; port_num < num_ports; port_num++) {
+				if (p_attr_array[port_count].num_pkeys &&
+				    p_attr_array[port_count].p_pkey_table)
+					status = IB_UNSUPPORTED;
 				p_attr_array[port_count] =
 				    *__osm_ca_info_get_port_attr_ptr(p_ca_info,
 								     port_num);
diff --git a/opensm/libvendor/osm_vendor_mlx_hca.c b/opensm/libvendor/osm_vendor_mlx_hca.c
index e98e272..554fd87 100644
--- a/opensm/libvendor/osm_vendor_mlx_hca.c
+++ b/opensm/libvendor/osm_vendor_mlx_hca.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -367,6 +368,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 			num_ports = p_ca_infos[ca].p_attr->num_ports;
 
 			for (port_num = 0; port_num < num_ports; port_num++) {
+				if (p_attr_array[port_count].num_pkeys &&
+				    p_attr_array[port_count].p_pkey_table)
+					status = IB_UNSUPPORTED;
 				p_attr_array[port_count] =
 				    *__osm_ca_info_get_port_attr_ptr(&p_ca_infos
 								     [ca],
diff --git a/opensm/libvendor/osm_vendor_mlx_hca_anafa.c b/opensm/libvendor/osm_vendor_mlx_hca_anafa.c
index 81506e4..d1b11e5 100644
--- a/opensm/libvendor/osm_vendor_mlx_hca_anafa.c
+++ b/opensm/libvendor/osm_vendor_mlx_hca_anafa.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -182,8 +183,10 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 
 	*p_num_ports = 1;
 
-	p_attr_array[0] = ca_info.attr.p_port_attr[0];	/* anafa has only one port */
 	status = IB_SUCCESS;
+	if (p_attr_array[0].num_pkeys && p_attr_array[0].p_pkey_table)
+		status = IB_UNSUPPORTED;
+	p_attr_array[0] = ca_info.attr.p_port_attr[0];	/* anafa has only one port */
 
 Exit:
 
diff --git a/opensm/libvendor/osm_vendor_mlx_hca_pfs.c b/opensm/libvendor/osm_vendor_mlx_hca_pfs.c
index 512b7bf..8c879a9 100644
--- a/opensm/libvendor/osm_vendor_mlx_hca_pfs.c
+++ b/opensm/libvendor/osm_vendor_mlx_hca_pfs.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -649,6 +650,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 			num_ports = p_ca_infos[caIdx - 1].p_attr->num_ports;
 
 			for (port_num = 0; port_num < num_ports; port_num++) {
+				if (p_attr_array[port_count].num_pkeys &&
+				    p_attr_array[port_count].p_pkey_table)
+					status = IB_UNSUPPORTED;
 				p_attr_array[port_count] =
 				    *__osm_ca_info_get_port_attr_ptr(&p_ca_infos
 								     [caIdx -
diff --git a/opensm/libvendor/osm_vendor_mlx_hca_sim.c b/opensm/libvendor/osm_vendor_mlx_hca_sim.c
index b6c0193..d46b869 100644
--- a/opensm/libvendor/osm_vendor_mlx_hca_sim.c
+++ b/opensm/libvendor/osm_vendor_mlx_hca_sim.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -695,6 +696,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 			num_ports = p_ca_infos[caIdx - 1].p_attr->num_ports;
 
 			for (port_num = 0; port_num < num_ports; port_num++) {
+				if (p_attr_array[port_count].num_pkeys &&
+				    p_attr_array[port_count].p_pkey_table)
+					status = IB_UNSUPPORTED;
 				p_attr_array[port_count] =
 				    *__osm_ca_info_get_port_attr_ptr(&p_ca_infos
 								     [caIdx -
diff --git a/opensm/libvendor/osm_vendor_mlx_sa.c b/opensm/libvendor/osm_vendor_mlx_sa.c
index 7bd5aea..a76c330 100644
--- a/opensm/libvendor/osm_vendor_mlx_sa.c
+++ b/opensm/libvendor/osm_vendor_mlx_sa.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -242,6 +243,7 @@ __osmv_get_lid_and_sm_lid_by_port_guid(IN osm_vendor_t * const p_vend,
 	ib_port_attr_t *p_attr_array;
 	uint32_t num_ports;
 	uint32_t port_num;
+	int i;
 
 	OSM_LOG_ENTER(p_vend->p_log);
 
@@ -278,6 +280,11 @@ __osmv_get_lid_and_sm_lid_by_port_guid(IN osm_vendor_t * const p_vend,
 	p_attr_array =
 	    (ib_port_attr_t *) malloc(sizeof(ib_port_attr_t) * num_ports);
 
+	for (i = 0; i < num_ports; i++) {
+		p_attr_array[i].num_pkeys = 0;
+		p_attr_array[i].p_pkey_table = NULL;
+	}
+
 	/* obtain the attributes */
 	status = osm_vendor_get_all_port_attr(p_vend, p_attr_array, &num_ports);
 	if (status != IB_SUCCESS) {
diff --git a/opensm/libvendor/osm_vendor_mtl_hca_guid.c b/opensm/libvendor/osm_vendor_mtl_hca_guid.c
index 58d961a..c48d9db 100644
--- a/opensm/libvendor/osm_vendor_mtl_hca_guid.c
+++ b/opensm/libvendor/osm_vendor_mtl_hca_guid.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -389,6 +390,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 			num_ports = osm_ca_info_get_num_ports(p_ca_info);
 
 			for (port_num = 0; port_num < num_ports; port_num++) {
+				if (p_attr_array[port_count].num_pkeys &&
+				    p_attr_array[port_count].p_pkey_table)
+					status = IB_UNSUPPORTED;
 				p_attr_array[port_count] =
 				    *__osm_ca_info_get_port_attr_ptr(p_ca_info,
 								     port_num);
@@ -571,6 +575,11 @@ ib_net64_t get_port_guid()
 	p_vend = &vend;
 	p_vend->p_log = p_osm_log;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	 * Call the transport layer for a list of local port
 	 * GUID values.
-- 
1.5.6.4
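
The convention the patch above relies on can be sketched roughly as
follows: a caller that does not want pkey tables clears num_pkeys and
p_pkey_table before the call, and a vendor layer that cannot supply
pkey tables reports an "unsupported" status when both are set. The
types and names below (port_attr_t, status_t, clear_pkey_requests,
get_all_port_attr) are simplified stand-ins invented for this sketch;
they are not the real OpenSM definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ib_api_status_t and ib_port_attr_t. */
typedef enum { ST_SUCCESS, ST_UNSUPPORTED } status_t;

typedef struct {
	uint64_t port_guid;
	uint16_t num_pkeys;	/* in: size of p_pkey_table, 0 = no request */
	uint16_t *p_pkey_table;	/* in: caller buffer, NULL = no request */
} port_attr_t;

/* Caller side: mark every entry as "no pkey table requested". */
static void clear_pkey_requests(port_attr_t *attrs, uint32_t num_ports)
{
	uint32_t i;	/* same type as num_ports, no signed/unsigned mix */

	assert(attrs != NULL || num_ports == 0);
	for (i = 0; i < num_ports; i++) {
		attrs[i].num_pkeys = 0;
		attrs[i].p_pkey_table = NULL;
	}
}

/* Vendor side: a layer without pkey support still fills in the port
 * data, but flags any entry that asked for a pkey table. */
static status_t get_all_port_attr(port_attr_t *attrs, uint32_t num_ports)
{
	status_t status = ST_SUCCESS;
	uint32_t i;

	for (i = 0; i < num_ports; i++) {
		if (attrs[i].num_pkeys && attrs[i].p_pkey_table)
			status = ST_UNSUPPORTED;
		attrs[i].port_guid = 0x1000 + i;	/* dummy value */
	}
	return status;
}
```

With this shape, a caller that only needs GUIDs or LIDs clears the
request fields first and can keep treating any non-success status as
an error.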


From hnrose at comcast.net  Wed Feb 18 07:28:16 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:28:16 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free allocated
	umad prior to exit
Message-ID: <20090218152816.GB8489@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 infiniband-diags/src/smpdump.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index 35fcb81..6731546 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -289,7 +289,7 @@ int main(int argc, char *argv[])
 		xdump(stdout, 0, smp->data, 64);
 		if (smp->status)
 			fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
-		return 0;
+		goto Exit;
 	}
 
 	desc = smp->data;
@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
 	putchar('\n');
 	if (smp->status)
 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
+
+Exit:
+	umad_free(umad);
 	return 0;
 }
-- 
1.5.6.4
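
The single-exit cleanup pattern used in this patch can be sketched in
isolation as below. buf_alloc and buf_free are hypothetical wrappers
around malloc/free that stand in for the umad allocation; the counter
exists only so the sketch can show that both paths release the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int live_buffers;	/* outstanding allocations, demo only */

static void *buf_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_buffers++;
	return p;
}

static void buf_free(void *p)
{
	if (!p)
		return;
	assert(live_buffers > 0);
	free(p);
	live_buffers--;
}

/* Every path taken after the allocation leaves through "out". */
static int dump(int short_form)
{
	char *buf = buf_alloc(64);

	if (!buf)
		return -1;

	if (short_form) {
		puts("short dump");
		goto out;	/* early exit still releases buf */
	}
	puts("full dump");
out:
	buf_free(buf);
	return 0;
}
```

Routing the early return through the label keeps the release in one
place, so a later change to the function cannot forget it on one of
the paths.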


From roel.kluin at gmail.com  Wed Feb 18 01:27:12 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Wed, 18 Feb 2009 10:27:12 +0100
Subject: [ofa-general] [PATCH] RDMA/cxgb3: logical-/bit-or
	confusion?
Message-ID: <499BD470.4080705@gmail.com>

Please review.
--------------------------->8-------------8<------------------------------
Logical-/bit-or typo

Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
---
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 44e936e..61889e6 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -890,7 +890,7 @@ static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
 	 */
 	state_set(&ep->com, FPDU_MODE);
 	ep->mpa_attr.initiator = 1;
-	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0;
 	ep->mpa_attr.recv_marker_enabled = markers_enabled;
 	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
 	ep->mpa_attr.version = mpa_rev;
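
For reference, both | and || bind more tightly than ?:, so the old
and the new expression each reduce to "1 if either operand is
non-zero". The sketch below compares the two forms; the MPA_CRC value
is an illustrative assumption, and the function names are made up for
the comparison.

```c
#include <assert.h>

#define MPA_CRC 0x40	/* illustrative flag value, assumed here */

static int crc_bitwise(unsigned flags, int crc_enabled)
{
	/* parses as ((flags & MPA_CRC) | crc_enabled) ? 1 : 0 */
	return (flags & MPA_CRC) | crc_enabled ? 1 : 0;
}

static int crc_logical(unsigned flags, int crc_enabled)
{
	/* parses as ((flags & MPA_CRC) || crc_enabled) ? 1 : 0 */
	return (flags & MPA_CRC) || crc_enabled ? 1 : 0;
}
```

The two functions agree for every input, so the change reads as a
statement of intent rather than a behaviour fix; the logical form
also short-circuits when the flag is already set.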


From leonid at mellanox.co.il  Wed Feb 18 03:24:22 2009
From: leonid at mellanox.co.il (Leonid Keller)
Date: Wed, 18 Feb 2009 13:24:22 +0200
Subject: [ofa-general] [ofw][patch][WinVerbs tests] fix IPv6 related
	connection problem 
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01CDE206@mtlexch01.mtl.com>

All WinVerbs tests (client part) require the host IP address as a parameter.
We usually pass an IPv4 address because it is more convenient.
But if the IPv6 protocol is installed, which is the default on Win2008, the
connection code in the tests doesn't work correctly.
This patch suggests a fix that limits the host addresses used to
IPv4 only.
The same limitation already exists today in tools\perftests.
 
Index: tests/perftest/rdma_bw/rdma_bw.c
===================================================================
--- tests/perftest/rdma_bw/rdma_bw.c (revision 1976)
+++ tests/perftest/rdma_bw/rdma_bw.c (working copy)
@@ -215,6 +215,8 @@
   rdma_ack_cm_event(event);
  } else {
   for (t = res; t; t = t->ai_next) {
+   if (t->ai_family != AF_INET)
+    continue;
    sockfd = socket(t->ai_family, t->ai_socktype,
        t->ai_protocol);
    if (sockfd != INVALID_SOCKET) {
@@ -382,6 +384,8 @@
   rdma_ack_cm_event(event); 
  } else {
   for (t = res; t; t = t->ai_next) {
+   if (t->ai_family != AF_INET)
+    continue;
    sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
    if (sockfd != INVALID_SOCKET) {
     n = 1;
Index: tests/perftest/rdma_lat/rdma_lat.c
===================================================================
--- tests/perftest/rdma_lat/rdma_lat.c (revision 1976)
+++ tests/perftest/rdma_lat/rdma_lat.c (working copy)
@@ -294,6 +294,8 @@
   rdma_ack_cm_event(event);
  } else {
   for (t = res; t; t = t->ai_next) {
+   if (t->ai_family != AF_INET)
+    continue;
    sockfd = socket(t->ai_family, t->ai_socktype,
        t->ai_protocol);
    if (sockfd != INVALID_SOCKET) {
@@ -437,6 +439,8 @@
   rdma_ack_cm_event(event); 
  } else {
   for (t = res; t; t = t->ai_next) {
+   if (t->ai_family != AF_INET)
+    continue;
    sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
    if (sockfd != INVALID_SOCKET) {
     n = 1;
Index: tests/perftest/read_bw/read_bw.c
===================================================================
--- tests/perftest/read_bw/read_bw.c (revision 1976)
+++ tests/perftest/read_bw/read_bw.c (working copy)
@@ -126,6 +126,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -206,6 +208,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: tests/perftest/read_lat/read_lat.c
===================================================================
--- tests/perftest/read_lat/read_lat.c (revision 1976)
+++ tests/perftest/read_lat/read_lat.c (working copy)
@@ -201,6 +201,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -250,6 +252,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: tests/perftest/send_bw/send_bw.c
===================================================================
--- tests/perftest/send_bw/send_bw.c (revision 1976)
+++ tests/perftest/send_bw/send_bw.c (working copy)
@@ -142,6 +142,8 @@
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
  sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
     break;
@@ -221,6 +223,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: tests/perftest/send_lat/send_lat.c
===================================================================
--- tests/perftest/send_lat/send_lat.c (revision 1976)
+++ tests/perftest/send_lat/send_lat.c (working copy)
@@ -212,6 +212,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -261,6 +263,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: tests/perftest/write_bw/write_bw.c
===================================================================
--- tests/perftest/write_bw/write_bw.c (revision 1976)
+++ tests/perftest/write_bw/write_bw.c (working copy)
@@ -135,6 +135,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -215,6 +217,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: tests/perftest/write_bw_postlist/write_bw_postlist.c
===================================================================
--- tests/perftest/write_bw_postlist/write_bw_postlist.c (revision 1976)
+++ tests/perftest/write_bw_postlist/write_bw_postlist.c (working copy)
@@ -138,6 +138,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd >= 0) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -218,6 +220,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd >= 0) {
    n = 1;
Index: tests/perftest/write_lat/write_lat.c
===================================================================
--- tests/perftest/write_lat/write_lat.c (revision 1976)
+++ tests/perftest/write_lat/write_lat.c (working copy)
@@ -198,6 +198,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -247,6 +249,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c
===================================================================
--- ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c (revision 1976)
+++ ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c (working copy)
@@ -137,6 +137,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -205,6 +207,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c
===================================================================
--- ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c (revision 1976)
+++ ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c (working copy)
@@ -162,6 +162,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd >= 0) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -246,6 +248,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd >= 0) {
    n = 1;
Index: ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c
===================================================================
--- ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c (revision 1976)
+++ ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c (working copy)
@@ -124,6 +124,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -192,6 +194,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;
Index: ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c
===================================================================
--- ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c (revision 1976)
+++ ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c (working copy)
@@ -126,6 +126,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -193,6 +195,8 @@
  }
 
  for (t = res; t; t = t->ai_next) {
+  if (t->ai_family != AF_INET)
+   continue;
   sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
   if (sockfd != INVALID_SOCKET) {
    n = 1;

-------------- next part --------------
A non-text attachment was scrubbed...
Name: wv_tests.patch
Type: application/octet-stream
Size: 9663 bytes
Desc: wv_tests.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090218/9685a493/attachment.obj>

From hnrose at comcast.net  Wed Feb 18 07:27:28 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:27:28 -0500
Subject: [ofa-general] [PATCH] management/libibmad.txt: Remove
	madrpc_lock/unlock
Message-ID: <20090218152728.GA8489@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 doc/libibmad.txt |   20 --------------------
 1 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/doc/libibmad.txt b/doc/libibmad.txt
index 42a61d4..9fb74c3 100644
--- a/doc/libibmad.txt
+++ b/doc/libibmad.txt
@@ -143,26 +143,6 @@ packets, this function has to be called repeatedly after each RPC operation.
 Bugs:
 	Not applicable to mad_receive
 
-madrpc_lock:
-
-Synopsis:
-	void	madrpc_lock(void);
-
-Description: Locks the mad RPC mechanism until madrpc_unlock() is called. Calls
-to this function while the RPC mechanism is already locked cause the calling
-process to be blocked until madrpc_unlock(). This function should be used
-only by multiple-threaded applications.
-
-See also:
-	madrpc_unlock
-
-madrpc_unlock:
-
-Synopsis:
-	void	madrpc_unlock(void);
-
-Description: Unlock the mad RPC mechanism. See madrpc_lock() for details.
-
 madrpc_show_errors:
 
 Synopsis:
-- 
1.5.6.4


From hal.rosenstock at gmail.com  Wed Feb 18 07:40:34 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:40:34 -0500
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090214152533.GG14416@sashak.voltaire.com>
References: <f0e08f230902061112v599ee5e7r1189ecb6e994de82@mail.gmail.com>
	<20090207123355.GP17713@sashak.voltaire.com>
	<f0e08f230902120441q3af66510n5c6c4fbb0dd1e13f@mail.gmail.com>
	<20090212200025.GC14416@sashak.voltaire.com>
	<f0e08f230902121641q3357b511s615be93b3e8c8050@mail.gmail.com>
	<20090214152533.GG14416@sashak.voltaire.com>
Message-ID: <f0e08f230902180740j4f590747pfe647075b75ecbcd@mail.gmail.com>

Sasha,

On 2/14/09, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 19:41 Thu 12 Feb     , Hal Rosenstock wrote:
>> >
>> > It is already supplied by libibumad - by umad_get_ca()
>> > (ca.ports[i]->pkeys). I think you just need to copy this to
>> > ib_port_attr_t structure.
>>
>> Yes but rather than using supplied pointers (as inputs for the per
>> port pkey/guid tables), the other vendor layers require a large enough
>> buffer for these tables and set the port pointers appropriately (on
>> output) rather than supplying these pointers as input parameters. So
>> if we use these as input, then we definitely break the other vendor
>> layers.
>
> OK, if you already have a usage example, this is even simpler - just
> allocate memory and copy the pkey table.

I ended up going with the original approach.

-- Hal

> Sasha
>


From hnrose at comcast.net  Wed Feb 18 07:55:37 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:55:37 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Fix
	usage examples
Message-ID: <20090218155537.GA8762@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index e224909..7a8119f 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -226,11 +226,11 @@ int main(int argc, char *argv[])
 	char usage_args[] = "<dlid|dr_path> <attr> [mod]";
 	const char *usage_examples[] = {
 		" -- DR routed examples:",
-		"%s -D 0,1,2,3,5 16	# NODE DESC",
-		"%s -D 0,1,2 0x15 2	# PORT INFO, port 2",
+		"-D 0,1,2,3,5 16	# NODE DESC",
+		"-D 0,1,2 0x15 2	# PORT INFO, port 2",
 		" -- LID routed examples:",
-		"%s 3 0x15 2	# PORT INFO, lid 3 port 2",
-		"%s 0xa0 0x11	# NODE INFO, lid 0xa0",
+		"3 0x15 2	# PORT INFO, lid 3 port 2",
+		"0xa0 0x11	# NODE INFO, lid 0xa0",
 		NULL
 	};
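
A possible reading of this change, stated here as an assumption
rather than taken from the patch: the shared usage printer already
prepends the program name to each example, so a "%s" left inside the
string would be printed literally. format_example below is a
hypothetical helper that mimics such a printer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical printer: prefixes one example with the program name
 * and writes the resulting line into out. */
static int format_example(char *out, size_t len, const char *prog,
			  const char *example)
{
	assert(out && prog && example);
	return snprintf(out, len, "  %s %s\n", prog, example);
}
```

Because the example is passed as an argument and not as the format
string, any "%s" it contains is copied through unchanged.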
 

From sean.hefty at intel.com  Wed Feb 18 08:50:33 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:50:33 -0800
Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface
In-Reply-To: <20090217210642.41c64624.weiny2@llnl.gov>
References: <20090217210642.41c64624.weiny2@llnl.gov>
Message-ID: <65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com>

>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
>      which mirrors madrpc_portid(void)

If you're planning on having someone use the new functions, they need to have
MAD_EXPORT added in front of them.  (Where MAD_EXPORT doesn't exist in mad.h
probably means that there isn't a user of that call, or we just haven't ported
the user that does use it to Windows yet.)

Do you have a published git tree with these patches?

- Sean
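
As a rough sketch of what such an export marker usually looks like:
DEMO_EXPORT and demo_portid below are hypothetical names modelled on
the idea, not copied from mad.h.

```c
#include <assert.h>

/* Hypothetical export macro: on Windows the symbol must be marked
 * for export from the DLL, elsewhere a plain extern is enough. */
#if defined(_WIN32)
#define DEMO_EXPORT __declspec(dllexport)
#else
#define DEMO_EXPORT extern
#endif

/* The declaration in the public header carries the marker ... */
DEMO_EXPORT int demo_portid(int fd);

/* ... and the definition lives in the library source. */
int demo_portid(int fd)
{
	return fd;
}
```

Without the marker the function still builds into the library, but a
Windows client linking against the DLL would not be able to resolve
it.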


From sean.hefty at intel.com  Wed Feb 18 08:51:58 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:51:58 -0800
Subject: [ofa-general] [PATCH 3/8] Convert ibaddr to "new" ibmad interface
In-Reply-To: <20090217210646.5e74b9ed.weiny2@llnl.gov>
References: <20090217210646.5e74b9ed.weiny2@llnl.gov>
Message-ID: <A0EAE2284AC141B3860205E3D2DD3225@amr.corp.intel.com>

>+       srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>+       if (!srcport)
>+               IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
>

>+
>+       mad_rpc_close_port(srcport);
>        exit(0);

need MAD_EXPORT 


From sean.hefty at intel.com  Wed Feb 18 08:57:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:57:09 -0800
Subject: [ofa-general] [PATCH 5/8] Convert ibportstate to "new"
	ibmad	interface
In-Reply-To: <20090217210650.3397dd72.weiny2@llnl.gov>
References: <20090217210650.3397dd72.weiny2@llnl.gov>
Message-ID: <6A0B953C20FB428691700974E2C86B0C@amr.corp.intel.com>

>-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
>+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
>+				ibd_sm_id, srcport) < 0)

needs MAD_EXPORT

>-					if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
>+					if (ib_resolve_self_via(&selfportid,
>+							&selfport, 0, srcport) < 0)

ditto


From sean.hefty at intel.com  Wed Feb 18 09:06:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 09:06:10 -0800
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free
	allocated	umad prior to exit
In-Reply-To: <20090218152816.GB8489@comcast.net>
References: <20090218152816.GB8489@comcast.net>
Message-ID: <0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>

>-		return 0;
>+		goto Exit;
> 	}
>
> 	desc = smp->data;
>@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
> 	putchar('\n');
> 	if (smp->status)
> 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
>+
>+Exit:

nit: can we use an all-lowercase label?


From hal.rosenstock at gmail.com  Wed Feb 18 09:07:15 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:07:15 -0500
Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface
In-Reply-To: <20090217210642.41c64624.weiny2@llnl.gov>
References: <20090217210642.41c64624.weiny2@llnl.gov>
Message-ID: <f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>

On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
>
> From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001
> From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> Date: Tue, 17 Feb 2009 17:32:15 -0800
> Subject: [PATCH] Clean up "new" interface
>
>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
>      which mirrors madrpc_portid(void)
>
> Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> ---
>  libibmad/include/infiniband/mad.h |   58 ++++++++++++++++++++++--------------
>  libibmad/src/gs.c                 |   19 ++++++------
>  libibmad/src/libibmad.map         |    1 +
>  libibmad/src/resolve.c            |   10 ++++--
>  libibmad/src/rpc.c                |   29 +++++++++---------
>  libibmad/src/sa.c                 |    4 +-
>  libibmad/src/smp.c                |    4 +-
>  7 files changed, 71 insertions(+), 54 deletions(-)
>
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 1aaaa1b..56b87e6 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt)
>  }
>
>  /* rpc.c */
> +/* Depricated interface */

typo - Deprecated

>  MAD_EXPORT int madrpc_portid(void);
> -MAD_EXPORT int madrpc_set_retries(int retries);
> -MAD_EXPORT int madrpc_set_timeout(int timeout);

I thought the initial plan was not to remove APIs but to move over to
the new ones. A subsequent step would be to deprecate the old APIs,
and only then eventually remove them.

-- Hal
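
One way to stage such a migration, sketched with hypothetical names
(demo_port, demo_rpc_portid, demo_portid) rather than the real
libibmad symbols: keep the old entry point as a thin wrapper around
the new one, so existing callers keep working while new code passes
the port explicitly.

```c
#include <assert.h>
#include <stddef.h>

struct demo_port {
	int port_id;
};

/* Global state that the old, implicit-port API operates on. */
static struct demo_port *default_port;

/* New interface: the caller passes the source port explicitly. */
int demo_rpc_portid(struct demo_port *srcport)
{
	return srcport ? srcport->port_id : -1;
}

/* Old interface, kept for existing callers. A later step could mark
 * it deprecated, and a final step could remove it. */
int demo_portid(void)
{
	return demo_rpc_portid(default_port);
}
```

Both entry points share one implementation, so the old API cannot
drift from the new one during the transition.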

>  void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
>  void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
>                  void *data);
>  MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
>                            int num_classes);
>  void madrpc_save_mad(void *madbuf, int len);
> -MAD_EXPORT void madrpc_show_errors(int set);
>
> -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> +/* New interface */
> +MAD_EXPORT void madrpc_show_errors(int set);
> +MAD_EXPORT int madrpc_set_retries(int retries);
> +MAD_EXPORT int madrpc_set_timeout(int timeout);
> +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
>                        int num_classes);
> -void mad_rpc_close_port(void *ibmad_port);
> -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> +void mad_rpc_close_port(struct ibmad_port *srcport);
> +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
>              void *payload, void *rcvdata);
> -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
>                   ib_rmpp_hdr_t * rmpp, void *data);
> +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
>
>  /* smp.c */
>  MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
>                              unsigned mod, unsigned timeout);
>  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
>                            unsigned mod, unsigned timeout);
> +
> +/* smp.c new interface */
>  MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> -                      unsigned mod, unsigned timeout, const void *srcport);
> +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
>  uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> -                    unsigned timeout, const void *srcport);
> +                    unsigned timeout, const struct ibmad_port *srcport);
>
>  /* sa.c */
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
>                 unsigned timeout);
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> -                    ib_sa_call_t * sa, unsigned timeout);
>  MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +
> +/* sa.c new interface */
> +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
> +                    ib_sa_call_t * sa, unsigned timeout);
> +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
>
>  /* resolve.c */
> @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
>  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
>                               ibmad_gid_t * gid);
>
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> +/* resolve.c new interface */
> +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport);
>  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -                       ib_portid_t * sm_id, int timeout, const void *srcport);
> +                       ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport);
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>                              enum MAD_DEST dest, ib_portid_t * sm_id,
> -                             const void *srcport);
> +                             const struct ibmad_port *srcport);
>  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -                       const void *srcport);
> +                       const struct ibmad_port *srcport);
>
>  /* gs.c */
>  MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
> @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
>  MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
>                                              int port, unsigned timeout);
>
> +/* gs.c new interface */
>  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned timeout,
> -                                     const void *srcport);
> +                                     const struct ibmad_port *srcport);
>  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -                                   unsigned timeout, const void *srcport);
> +                                   unsigned timeout, const struct ibmad_port *srcport);
>  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>                                    unsigned mask, unsigned timeout,
> -                                   const void *srcport);
> +                                   const struct ibmad_port *srcport);
>  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport);
> +                                       const struct ibmad_port *srcport);
>  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned mask,
> -                                       unsigned timeout, const void *srcport);
> +                                       unsigned timeout,
> +                                       const struct ibmad_port *srcport);
>  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport);
> +                                       const struct ibmad_port *srcport);
>  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>                                       int port, unsigned timeout,
> -                                      const void *srcport);
> +                                      const struct ibmad_port *srcport);
>  /* dump.c */
>  MAD_EXPORT ib_mad_dump_fn
>     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
> index d2c4574..e302caf 100644
> --- a/libibmad/src/gs.c
> +++ b/libibmad/src/gs.c
> @@ -47,7 +47,7 @@
>
>  static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
>                              unsigned timeout, unsigned id,
> -                             const void *srcport)
> +                             const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>        int lid = dest->lid;
> @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
>
>  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned timeout,
> -                                     const void *srcport)
> +                                     const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
>                             srcport);
> @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
>  }
>
>  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -                                   unsigned timeout, const void *srcport)
> +                                   unsigned timeout, const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_COUNTERS, srcport);
> @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned mask, unsigned timeout,
> -                                     unsigned id, const void *srcport)
> +                                     unsigned id, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>        int lid = dest->lid;
> @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>                                    unsigned mask, unsigned timeout,
> -                                   const void *srcport)
> +                                   const struct ibmad_port *srcport)
>  {
>        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>                                     IB_GSI_PORT_COUNTERS, srcport);
> @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport)
> +                                       const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned mask,
> -                                       unsigned timeout, const void *srcport)
> +                                       unsigned timeout,
> +                                       const struct ibmad_port *srcport)
>  {
>        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>                                     IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport)
> +                                       const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_SAMPLES_CONTROL, srcport);
> @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>                                       int port, unsigned timeout,
> -                                      const void *srcport)
> +                                      const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_SAMPLES_RESULT, srcport);
> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> index f944d86..94d7762 100644
> --- a/libibmad/src/libibmad.map
> +++ b/libibmad/src/libibmad.map
> @@ -69,6 +69,7 @@ IBMAD_1.3 {
>                mad_rpc_close_port;
>                mad_rpc;
>                mad_rpc_rmpp;
> +               mad_rpc_portid;
>                madrpc;
>                madrpc_def_timeout;
>                madrpc_init;
> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> index 553949d..3291f43 100644
> --- a/libibmad/src/resolve.c
> +++ b/libibmad/src/resolve.c
> @@ -45,7 +45,8 @@
>  #undef DEBUG
>  #define DEBUG  if (ibdebug)    IBWARN
>
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
> +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t self = { 0 };
>        uint8_t portinfo[64];
> @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
>  }
>
>  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -                       ib_portid_t * sm_id, int timeout, const void *srcport)
> +                       ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t sm_portid;
>        char buf[IB_SA_DATA_SIZE] = { 0 };
> @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>                              enum MAD_DEST dest_type, ib_portid_t * sm_id,
> -                             const void *srcport)
> +                             const struct ibmad_port *srcport)
>  {
>        uint64_t guid;
>        int lid;
> @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
>  }
>
>  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -                       const void *srcport)
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t self = { 0 };
>        uint8_t portinfo[64];
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index e811526..d47873b 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -100,6 +100,11 @@ int madrpc_portid(void)
>        return mad_portid;
>  }
>
> +int mad_rpc_portid(struct ibmad_port *srcport)
> +{
> +       return (srcport->port_id);
> +}
> +
>  static int
>  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>           int timeout)
> @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>        return -1;
>  }
>
> -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>              void *payload, void *rcvdata)
>  {
> -       const struct ibmad_port *p = port_id;
>        int status, len;
>        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>
> @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
>                return 0;
>
> -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -                             p->class_agents[rpc->mgtclass],
> +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +                             port->class_agents[rpc->mgtclass],
>                              len, rpc->timeout)) < 0) {
>                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>                return 0;
> @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        return rcvdata;
>  }
>
> -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>                   ib_rmpp_hdr_t * rmpp, void *data)
>  {
> -       const struct ibmad_port *p = port_id;
>        int status, len;
>        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>
> @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
>                return 0;
>
> -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -                             p->class_agents[rpc->mgtclass],
> +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +                             port->class_agents[rpc->mgtclass],
>                              len, rpc->timeout)) < 0) {
>                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>                return 0;
> @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
>        }
>  }
>
> -void *mad_rpc_open_port(char *dev_name, int dev_port,
> +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
>                        int *mgmt_classes, int num_classes)
>  {
>        struct ibmad_port *p;
> @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
>        return p;
>  }
>
> -void mad_rpc_close_port(void *port_id)
> +void mad_rpc_close_port(struct ibmad_port *port)
>  {
> -       struct ibmad_port *p = port_id;
> -
> -       umad_close_port(p->port_id);
> -       free(p);
> +       umad_close_port(port->port_id);
> +       free(port);
>  }
>
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> index 7403d4f..ddeb152 100644
> --- a/libibmad/src/sa.c
> +++ b/libibmad/src/sa.c
> @@ -44,7 +44,7 @@
>  #undef DEBUG
>  #define DEBUG  if (ibdebug)    IBWARN
>
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>                     ib_sa_call_t * sa, unsigned timeout)
>  {
>        ib_rpc_t rpc = { 0 };
> @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>                        IB_PR_COMPMASK_SGID |\
>                        IB_PR_COMPMASK_NUMBPATH)
>
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
>  {
>        int npath;
> diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
> index fad263c..e5489b3 100644
> --- a/libibmad/src/smp.c
> +++ b/libibmad/src/smp.c
> @@ -45,7 +45,7 @@
>  #define DEBUG  if (ibdebug)    IBWARN
>
>  uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
> -                    unsigned mod, unsigned timeout, const void *srcport)
> +                    unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>
> @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
>  }
>
>  uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
> -                      unsigned mod, unsigned timeout, const void *srcport)
> +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>
> --
> 1.5.4.5
>
>
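
The core of the quoted patch is the move from "const void *srcport" to
"const struct ibmad_port *srcport", which lets the compiler reject a pointer
of the wrong type. A minimal sketch of that idea follows. The names demo_port,
demo_open, demo_portid and demo_close are hypothetical stand-ins, not the
libibmad definitions, and the struct layout is assumed for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ibmad_port; the field is illustrative. */
struct demo_port {
	int port_id;
};

/*
 * Returns a typed handle instead of void *, so passing an unrelated
 * pointer to the functions below draws a compiler diagnostic.
 */
struct demo_port *demo_open(int port_id)
{
	struct demo_port *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->port_id = port_id;
	return p;
}

/* Accessor in the spirit of mad_rpc_portid(): reads the id from the handle. */
int demo_portid(const struct demo_port *port)
{
	assert(port != NULL);
	return port->port_id;
}

/* Releases the handle; no cast from void * is needed any more. */
void demo_close(struct demo_port *port)
{
	free(port);
}
```

With typed parameters the local "const struct ibmad_port *p = port_id;"
conversions seen in the removed lines of rpc.c become unnecessary, which is
the same simplification the sketch shows in demo_close().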


From sean.hefty at intel.com  Wed Feb 18 09:17:31 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 09:17:31 -0800
Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface
In-Reply-To: <f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
References: <20090217210642.41c64624.weiny2@llnl.gov>
	<f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
Message-ID: <400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com>

>>  MAD_EXPORT int madrpc_portid(void);
>> -MAD_EXPORT int madrpc_set_retries(int retries);
>> -MAD_EXPORT int madrpc_set_timeout(int timeout);
>
>I thought initially we weren't going to remove APIs but move over to
>the new ones ? A subsequent step would be to deprecate the old APIs
>and then eventually remove the old APIs.

He moved these down in the code

>> +MAD_EXPORT int madrpc_set_retries(int retries);
>> +MAD_EXPORT int madrpc_set_timeout(int timeout);

probably so that they aren't listed under a 'deprecated' section.

- Sean


From hal.rosenstock at gmail.com  Wed Feb 18 09:22:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:22:19 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free allocated 
	umad prior to exit
In-Reply-To: <0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>
References: <20090218152816.GB8489@comcast.net>
	<0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>
Message-ID: <f0e08f230902180922n1e47cc6at40530e58d1e38638@mail.gmail.com>

On Wed, Feb 18, 2009 at 12:06 PM, Sean Hefty <sean.hefty at intel.com> wrote:
>>-              return 0;
>>+              goto Exit;
>>       }
>>
>>       desc = smp->data;
>>@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
>>       putchar('\n');
>>       if (smp->status)
>>               fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
>>+
>>+Exit:
>
> nit: can we use all lowercase

Sure; v2 patch shortly.

-- Hal



From hnrose at comcast.net  Wed Feb 18 09:19:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:19:32 -0500
Subject: [ofa-general] [PATCHv2] infiniband-diags/smpdump.c:
	Release umad resources on exit
Message-ID: <20090218171932.GA15139@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index 6c7f84c..414975c 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -289,7 +289,7 @@ int main(int argc, char *argv[])
 		xdump(stdout, 0, smp->data, 64);
 		if (smp->status)
 			fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
-		return 0;
+		goto exit;
 	}
 
 	desc = smp->data;
@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
 	putchar('\n');
 	if (smp->status)
 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
+
+exit:
+	umad_free(umad);
 	return 0;
 }
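
The patch above routes the early return through a single label so that the
cleanup call runs on every path. A minimal sketch of the same pattern,
assuming a hypothetical helper dump_bytes() that uses calloc()/free() in
place of the umad allocation (nothing here is smpdump code):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: every path after the allocation leaves through
 * the "out" label, so the buffer is freed exactly once.
 * Returns 0 on success, -1 on failure.
 */
int dump_bytes(FILE *f, size_t len, int short_form)
{
	unsigned char *buf;
	size_t i;
	int ret = 0;

	buf = calloc(1, len ? len : 1);
	if (!buf)
		return -1;	/* nothing allocated yet, a plain return is fine */

	if (short_form) {
		if (fprintf(f, "%u bytes\n", (unsigned)len) < 0)
			ret = -1;
		goto out;	/* early exit still runs the cleanup below */
	}

	for (i = 0; i < len; i++)
		fprintf(f, "%02x ", buf[i]);
	fputc('\n', f);

out:
	free(buf);
	return ret;
}
```

A caller might use dump_bytes(stdout, 64, 0) for the full dump and pass a
non-zero third argument for the short form; both paths release the buffer.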


From weiny2 at llnl.gov  Wed Feb 18 09:27:34 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 18 Feb 2009 09:27:34 -0800
Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface
In-Reply-To: <f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
References: <20090217210642.41c64624.weiny2@llnl.gov>
	<f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
Message-ID: <20090218092734.31ca1062.weiny2@llnl.gov>

On Wed, 18 Feb 2009 12:07:15 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
> >
> > From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001
> > From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> > Date: Tue, 17 Feb 2009 17:32:15 -0800
> > Subject: [PATCH] Clean up "new" interface
> >
> >   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> >   Create new mad_rpc_portid(struct ibmad_port *srcport) function
> >      which mirrors madrpc_portid(void)
> >
> > Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> > ---
> >  libibmad/include/infiniband/mad.h |   58 ++++++++++++++++++++++--------------
> >  libibmad/src/gs.c                 |   19 ++++++------
> >  libibmad/src/libibmad.map         |    1 +
> >  libibmad/src/resolve.c            |   10 ++++--
> >  libibmad/src/rpc.c                |   29 +++++++++---------
> >  libibmad/src/sa.c                 |    4 +-
> >  libibmad/src/smp.c                |    4 +-
> >  7 files changed, 71 insertions(+), 54 deletions(-)
> >
> > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> > index 1aaaa1b..56b87e6 100644
> > --- a/libibmad/include/infiniband/mad.h
> > +++ b/libibmad/include/infiniband/mad.h
> > @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt)
> >  }
> >
> >  /* rpc.c */
> > +/* Depricated interface */
> 
> typo - Deprecated

Some day I will learn to spell this...  :-(

> 
> >  MAD_EXPORT int madrpc_portid(void);
> > -MAD_EXPORT int madrpc_set_retries(int retries);
> > -MAD_EXPORT int madrpc_set_timeout(int timeout);
> 
> I thought initially we weren't going to remove APIs but move over to
> the new ones ? A subsequent step would be to deprecate the old APIs
> and then eventually remove the old APIs.

They were not removed... [see below]

> 
> -- Hal
> 
> >  void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
> >  void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
> >                  void *data);
> >  MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> >                            int num_classes);
> >  void madrpc_save_mad(void *madbuf, int len);
> > -MAD_EXPORT void madrpc_show_errors(int set);
> >
> > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> > +/* New interface */
> > +MAD_EXPORT void madrpc_show_errors(int set);
> > +MAD_EXPORT int madrpc_set_retries(int retries);
> > +MAD_EXPORT int madrpc_set_timeout(int timeout);

... but moved down here to indicate that they are _not_ deprecated.  We could
deprecate them and associate 'retries' and 'timeout' with each ibmad_port,
but I thought those settings were effectively global to the library instance.

Ira

> > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> >                        int num_classes);
> > -void mad_rpc_close_port(void *ibmad_port);
> > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void mad_rpc_close_port(struct ibmad_port *srcport);
> > +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> >              void *payload, void *rcvdata);
> > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> >                   ib_rmpp_hdr_t * rmpp, void *data);
> > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
> >
> >  /* smp.c */
> >  MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
> >                              unsigned mod, unsigned timeout);
> >  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
> >                            unsigned mod, unsigned timeout);
> > +
> > +/* smp.c new interface */
> >  MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> > -                      unsigned mod, unsigned timeout, const void *srcport);
> > +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
> >  uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> > -                    unsigned timeout, const void *srcport);
> > +                    unsigned timeout, const struct ibmad_port *srcport);
> >
> >  /* sa.c */
> >  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> >                 unsigned timeout);
> > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> > -                    ib_sa_call_t * sa, unsigned timeout);
> >  MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */
> > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> > +
> > +/* sa.c new interface */
> > +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
> > +                    ib_sa_call_t * sa, unsigned timeout);
> > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
> >                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
> >
> >  /* resolve.c */
> > @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> >  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
> >                               ibmad_gid_t * gid);
> >
> > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> > +/* resolve.c new interface */
> > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport);
> >  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> > -                       ib_portid_t * sm_id, int timeout, const void *srcport);
> > +                       ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport);
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >                              enum MAD_DEST dest, ib_portid_t * sm_id,
> > -                             const void *srcport);
> > +                             const struct ibmad_port *srcport);
> >  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> > -                       const void *srcport);
> > +                       const struct ibmad_port *srcport);
> >
> >  /* gs.c */
> >  MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
> > @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
> >  MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
> >                                              int port, unsigned timeout);
> >
> > +/* gs.c new interface */
> >  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned timeout,
> > -                                     const void *srcport);
> > +                                     const struct ibmad_port *srcport);
> >  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> > -                                   unsigned timeout, const void *srcport);
> > +                                   unsigned timeout, const struct ibmad_port *srcport);
> >  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                                    unsigned mask, unsigned timeout,
> > -                                   const void *srcport);
> > +                                   const struct ibmad_port *srcport);
> >  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport);
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned mask,
> > -                                       unsigned timeout, const void *srcport);
> > +                                       unsigned timeout,
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport);
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                       int port, unsigned timeout,
> > -                                      const void *srcport);
> > +                                      const struct ibmad_port *srcport);
> >  /* dump.c */
> >  MAD_EXPORT ib_mad_dump_fn
> >     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
> > index d2c4574..e302caf 100644
> > --- a/libibmad/src/gs.c
> > +++ b/libibmad/src/gs.c
> > @@ -47,7 +47,7 @@
> >
> >  static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                              unsigned timeout, unsigned id,
> > -                             const void *srcport)
> > +                             const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >        int lid = dest->lid;
> > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
> >
> >  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned timeout,
> > -                                     const void *srcport)
> > +                                     const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
> >                             srcport);
> > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
> >  }
> >
> >  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> > -                                   unsigned timeout, const void *srcport)
> > +                                   unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_COUNTERS, srcport);
> > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned mask, unsigned timeout,
> > -                                     unsigned id, const void *srcport)
> > +                                     unsigned id, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >        int lid = dest->lid;
> > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                                    unsigned mask, unsigned timeout,
> > -                                   const void *srcport)
> > +                                   const struct ibmad_port *srcport)
> >  {
> >        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
> >                                     IB_GSI_PORT_COUNTERS, srcport);
> > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport)
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_COUNTERS_EXT, srcport);
> > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned mask,
> > -                                       unsigned timeout, const void *srcport)
> > +                                       unsigned timeout,
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
> >                                     IB_GSI_PORT_COUNTERS_EXT, srcport);
> > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport)
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_SAMPLES_CONTROL, srcport);
> > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                       int port, unsigned timeout,
> > -                                      const void *srcport)
> > +                                      const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_SAMPLES_RESULT, srcport);
> > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> > index f944d86..94d7762 100644
> > --- a/libibmad/src/libibmad.map
> > +++ b/libibmad/src/libibmad.map
> > @@ -69,6 +69,7 @@ IBMAD_1.3 {
> >                mad_rpc_close_port;
> >                mad_rpc;
> >                mad_rpc_rmpp;
> > +               mad_rpc_portid;
> >                madrpc;
> >                madrpc_def_timeout;
> >                madrpc_init;
> > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> > index 553949d..3291f43 100644
> > --- a/libibmad/src/resolve.c
> > +++ b/libibmad/src/resolve.c
> > @@ -45,7 +45,8 @@
> >  #undef DEBUG
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
> > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t self = { 0 };
> >        uint8_t portinfo[64];
> > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
> >  }
> >
> >  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> > -                       ib_portid_t * sm_id, int timeout, const void *srcport)
> > +                       ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t sm_portid;
> >        char buf[IB_SA_DATA_SIZE] = { 0 };
> > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> >
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >                              enum MAD_DEST dest_type, ib_portid_t * sm_id,
> > -                             const void *srcport)
> > +                             const struct ibmad_port *srcport)
> >  {
> >        uint64_t guid;
> >        int lid;
> > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> >  }
> >
> >  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> > -                       const void *srcport)
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t self = { 0 };
> >        uint8_t portinfo[64];
> > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> > index e811526..d47873b 100644
> > --- a/libibmad/src/rpc.c
> > +++ b/libibmad/src/rpc.c
> > @@ -100,6 +100,11 @@ int madrpc_portid(void)
> >        return mad_portid;
> >  }
> >
> > +int mad_rpc_portid(struct ibmad_port *srcport)
> > +{
> > +       return (srcport->port_id);
> > +}
> > +
> >  static int
> >  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
> >           int timeout)
> > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
> >        return -1;
> >  }
> >
> > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
> >              void *payload, void *rcvdata)
> >  {
> > -       const struct ibmad_port *p = port_id;
> >        int status, len;
> >        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
> >
> > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
> >                return 0;
> >
> > -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> > -                             p->class_agents[rpc->mgtclass],
> > +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> > +                             port->class_agents[rpc->mgtclass],
> >                              len, rpc->timeout)) < 0) {
> >                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
> >                return 0;
> > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        return rcvdata;
> >  }
> >
> > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
> >                   ib_rmpp_hdr_t * rmpp, void *data)
> >  {
> > -       const struct ibmad_port *p = port_id;
> >        int status, len;
> >        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
> >
> > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
> >                return 0;
> >
> > -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> > -                             p->class_agents[rpc->mgtclass],
> > +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> > +                             port->class_agents[rpc->mgtclass],
> >                              len, rpc->timeout)) < 0) {
> >                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
> >                return 0;
> > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
> >        }
> >  }
> >
> > -void *mad_rpc_open_port(char *dev_name, int dev_port,
> > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
> >                        int *mgmt_classes, int num_classes)
> >  {
> >        struct ibmad_port *p;
> > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
> >        return p;
> >  }
> >
> > -void mad_rpc_close_port(void *port_id)
> > +void mad_rpc_close_port(struct ibmad_port *port)
> >  {
> > -       struct ibmad_port *p = port_id;
> > -
> > -       umad_close_port(p->port_id);
> > -       free(p);
> > +       umad_close_port(port->port_id);
> > +       free(port);
> >  }
> >
> >  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> > index 7403d4f..ddeb152 100644
> > --- a/libibmad/src/sa.c
> > +++ b/libibmad/src/sa.c
> > @@ -44,7 +44,7 @@
> >  #undef DEBUG
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> >                     ib_sa_call_t * sa, unsigned timeout)
> >  {
> >        ib_rpc_t rpc = { 0 };
> > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> >                        IB_PR_COMPMASK_SGID |\
> >                        IB_PR_COMPMASK_NUMBPATH)
> >
> > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
> >                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
> >  {
> >        int npath;
> > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
> > index fad263c..e5489b3 100644
> > --- a/libibmad/src/smp.c
> > +++ b/libibmad/src/smp.c
> > @@ -45,7 +45,7 @@
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> >  uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
> > -                    unsigned mod, unsigned timeout, const void *srcport)
> > +                    unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >
> > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
> >  }
> >
> >  uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
> > -                      unsigned mod, unsigned timeout, const void *srcport)
> > +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >
> > --
> > 1.5.4.5
> >
> >


-- 
Ira Weiny <weiny2 at llnl.gov>


From weiny2 at llnl.gov  Wed Feb 18 09:28:18 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 18 Feb 2009 09:28:18 -0800
Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface
In-Reply-To: <65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com>
References: <20090217210642.41c64624.weiny2@llnl.gov>
	<65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com>
Message-ID: <20090218092818.3e931fe3.weiny2@llnl.gov>

On Wed, 18 Feb 2009 08:50:33 -0800
"Sean Hefty" <sean.hefty at intel.com> wrote:

> >   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> >   Create new mad_rpc_portid(struct ibmad_port *srcport) function
> >      which mirrors madrpc_portid(void)
> 
> If you're planning on having someone use the new functions, they need to have
> MAD_EXPORT added in front of them.  (Where MAD_EXPORT doesn't exist in mad.h
> probably means that there isn't a user of that call, or we just haven't ported
> the user that does use it to Windows yet.)
> 
> Do you have a published git tree with these patches?

Not published, no.  I will add MAD_EXPORT to all the new
functions and fix the spelling errors Hal pointed out.
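
As a rough illustration of the MAD_EXPORT point, here is a minimal sketch of
how an export macro is usually applied to a new accessor such as
mad_rpc_portid(). The macro definition and the trimmed-down struct ibmad_port
below are assumptions made for this example only; the real definitions in
mad.h may well differ.

```c
/* Assumed definition for illustration; libibmad's own macro may differ. */
#if defined(_WIN32)
#define MAD_EXPORT __declspec(dllexport)
#else
#define MAD_EXPORT
#endif

/* Trimmed-down stand-in for the library's struct ibmad_port. */
struct ibmad_port {
	int port_id;		/* id handed to the umad calls */
	int class_agents[256];	/* agent id per management class */
};

/* Exported accessor, mirroring the function added by the patch. */
MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport)
{
	return srcport->port_id;
}
```

On Linux the macro expands to nothing in this sketch, so the declaration is
unchanged there; only a Windows DLL build would see the export attribute.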

Ira


> 
> - Sean
> 


-- 
Ira Weiny <weiny2 at llnl.gov>


From hal.rosenstock at gmail.com  Wed Feb 18 09:31:06 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:31:06 -0500
Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface
In-Reply-To: <400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com>
References: <20090217210642.41c64624.weiny2@llnl.gov>
	<f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
	<400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com>
Message-ID: <f0e08f230902180931k22d0a25apd45e819a9e4b34dc@mail.gmail.com>

On Wed, Feb 18, 2009 at 12:17 PM, Sean Hefty <sean.hefty at intel.com> wrote:
>>>  MAD_EXPORT int madrpc_portid(void);
>>> -MAD_EXPORT int madrpc_set_retries(int retries);
>>> -MAD_EXPORT int madrpc_set_timeout(int timeout);
>>
>>I thought initially we weren't going to remove APIs but move over to
>>the new ones ? A subsequent step would be to deprecate the old APIs
>>and then eventually remove the old APIs.
>
> He moved these down in the code

Missed that. It was a general comment. I think there are many (old)
routines which end up in the 'to be deprecated' category.

-- Hal

>>> +MAD_EXPORT int madrpc_set_retries(int retries);
>>> +MAD_EXPORT int madrpc_set_timeout(int timeout);
>
> probably so that they aren't listed under a 'deprecated' section.
>
> - Sean
>
>


From sean.hefty at intel.com  Wed Feb 18 09:32:30 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 09:32:30 -0800
Subject: [ofa-general] RE: [PATCH 8/8] [ib-diags] smpquery: add support for
	WinOF
In-Reply-To: <20090218095230.GC7189@sashak.voltaire.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
	<20090218095230.GC7189@sashak.voltaire.com>
Message-ID: <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com>

>>  #include <infiniband/umad.h>
>>  #include <infiniband/mad.h>
>> -#include <infiniband/complib/cl_nodenamemap.h>
>> +#include <complib/cl_nodenamemap.h>
>
>Is it needed? The rest of the tools use a similar path with a leading 'infiniband'.

That directory path doesn't exist in Windows.  I think this makes sense.
Complib is a separate library, independent of infiniband.

>> -	for (i = 0; i < (n + 31) / 32; i++) {
>> +	for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) {
>
>Wouldn't it be better to declare i, j, k as int? A fixed width of 32 doesn't
>make any sense here.
>
>>  		mod =  i | (portnum << 16);
>>  		if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0))
>>  			return "pkey table query failed";
>> @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc)
>>  		return "port info failed";
>>  	mad_decode_field(data, IB_PORT_GUID_CAP_F, &n);
>>
>> -	for (i = 0; i < (n + 7) / 8; i++) {
>> +	for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) {

fixed
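
For readers following the int versus uint32_t question, the sketch below
compares the two forms of the loop bound. It is only an illustration: the
helper names are made up for this example, and the negative value of n is a
hypothetical input chosen to show where the signed and cast forms diverge.

```c
#include <stdint.h>

/* Number of 32-entry blocks needed to cover n entries (signed arithmetic). */
int blocks_signed(int n)
{
	return (n + 31) / 32;
}

/* The same bound after a (uint32_t) cast, as in the first patch version. */
uint32_t blocks_cast(int n)
{
	return (uint32_t) ((n + 31) / 32);
}

/* Count iterations of the loop when the index is a plain int. */
int count_iterations(int n)
{
	int i, count = 0;

	for (i = 0; i < (n + 31) / 32; i++)
		count++;
	return count;
}
```

For ordinary non-negative n both forms agree. If n were ever negative enough
to make the quotient negative, the int loop would simply not run, while the
cast bound would wrap to a very large unsigned value.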


From sashak at voltaire.com  Wed Feb 18 09:42:18 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 19:42:18 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
	for the newly discovered port of the known node
In-Reply-To: <499BD55B.3090606@dev.mellanox.co.il>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218010303.GZ7189@sashak.voltaire.com>
	<499BD55B.3090606@dev.mellanox.co.il>
Message-ID: <20090218174218.GT5910@sashak.voltaire.com>

Hi Yevgeny,

On 11:31 Wed 18 Feb     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>> On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
>>> This patch fixes bugzilla issue #1515:
>>>
>>> Topology:
>>>                  |---------------|
>>>                  |      SW2      |
>>>                  |---------------|
>>>                    |x |y    |z |v
>>>               |----|  |     |  |----|
>>>               |       |     |       |
>>>               |  |----|     |----|  |
>>>               |  |               |  |
>>>              a| b|              c| d|
>>>       |---------------|     |---------------|
>>>       |       SW1     |     |     SW3       |
>>>       |---------------|     |---------------|
>>>           |                             |
>>>           |                             |
>>>        HCA with SM                      HCA
>>>
>>> During the discovery:
>>>
>>> SM sends NodeInfo request to SW1
>>> SM sends NodeInfo request to SW2 through link a->x
>>> SM discovers new node SW2:
>>>   - updates DR to SW2 to go through link a->x
>>>   - creates physp x
>> And requests SwitchInfo from SW2, and on response sends PortInfo to all
>> switch ports. PortInfo receiver will initialize all switch ports. Isn't
>> it?
>
> Links are created only by getting NodeInfo response. W/o the
> fix, when SW1 gets NodeInfo from SW2 through link b->y, it
> doesn't initialize physp for y, hence the link can't be created.
> So the only chance for the link to be created is when
> SW2 will send NodeInfo request to SW1 through link y->b.
> But this isn't happening, because DR for SW2 is updated
> to contain this link, so SM doesn't probe the remote side
> of y to avoid loop.

OK, so the whole story should be caused by a race between receiving the SW2
SwitchInfo (using a->x) and the SW2 NodeInfo (using b->y). As far as I can
see, only in this case will the SW2 port 0 path be altered (and PortInfo will
be requested using the new path). Right?

> BTW, the same thing happens with every other link that connects
> the same nodes. In the example above, link v<->d will be
> missing as well.

Hmm, I was not able to reproduce this using a two-switch setup. But if it
is caused by a race, it should not be 100% reproducible anyway.

Basically I'm not against the proposed physp initialization, but I want to
understand the problem better.
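
To make the discussed flow easier to follow, here is a toy model of it. This
is not OpenSM code: the struct, the function names and the port numbering are
all invented for this sketch, and it only mimics the rule "create a physp for
the arrival port of an already known switch" that the patch proposes.

```c
#include <stdbool.h>

#define TOY_MAX_PORTS 8

/* Toy stand-in for a discovered switch node. */
struct toy_node {
	bool known;			/* node already seen in this sweep */
	bool physp[TOY_MAX_PORTS];	/* true once a physp exists for the port */
};

/*
 * Handle a NodeInfo response that arrived through local port port_num.
 * With create_for_known == false a known node gets no new physp, which
 * models the behaviour reported in the bug.
 */
void toy_ni_receive(struct toy_node *node, unsigned port_num,
		    bool create_for_known)
{
	if (port_num >= TOY_MAX_PORTS)
		return;

	if (!node->known) {
		node->known = true;
		node->physp[port_num] = true;
		return;
	}

	if (create_for_known && !node->physp[port_num])
		node->physp[port_num] = true;
}

/* A link can only be recorded when both ends have a physp. */
bool toy_link_has_valid_ports(const struct toy_node *a, unsigned pa,
			      const struct toy_node *b, unsigned pb)
{
	return a->physp[pa] && b->physp[pb];
}
```

In this model, receiving NodeInfo for SW2 first on port x and then on port y
leaves y without a physp unless create_for_known is set, so the b<->y link
check fails in the unpatched case.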

Sasha

>
> -- Yevgeny
>
>> Sasha
>>> SM sends NodeInfo request to SW2 through link b->y
>>> SM discovers a known node SW2
>>>   - DOES NOT create physp y
>>>   - updates DR to SW2 to go through link b->y
>>>
>>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal 
>>> with
>>> port y any more, leaving it uninitialized (no physp object for this 
>>> port).
>>>
>>> The fix is to create physp for the newly discovered port of the known
>>> switch node, same way as it is done for HCAs.
>>> I also added one log message for the case that showed the problem - when
>>> one of the link sides is uninitialized (no valid ports check). Perhaps
>>> this log message should be an error message instead?
>>>
>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>> ---
>>>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>>>  1 files changed, 23 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/opensm/opensm/osm_node_info_rcv.c 
>>> b/opensm/opensm/osm_node_info_rcv.c
>>> index c52c0d5..7da3103 100644
>>> --- a/opensm/opensm/osm_node_info_rcv.c
>>> +++ b/opensm/opensm/osm_node_info_rcv.c
>>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>>>  	 */
>>>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>>>  					   p_neighbor_node,
>>> -					   p_ni_context->port_num))
>>> +					   p_ni_context->port_num)) {
>>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>>> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
>>> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>>>  		goto _exit;
>>> +	}
>>>
>>>  	if (osm_node_link_exists(p_node, port_num,
>>>  				 p_neighbor_node, p_ni_context->port_num)) {
>>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * 
>>> sm,
>>>  				     IN osm_node_t * const p_node,
>>>  				     IN const osm_madw_t * const p_madw)
>>>  {
>>> +
>>> +	ib_smp_t *p_smp;
>>> +	ib_node_info_t *p_ni;
>>> +	uint8_t port_num;
>>> +
>>>  	OSM_LOG_ENTER(sm->p_log);
>>>
>>> +	p_smp = osm_madw_get_smp_ptr(p_madw);
>>> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
>>> +	port_num = ib_node_info_get_local_port_num(p_ni);
>>> +
>>> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
>>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>>> +			"Creating physp for node GUID:0x%"
>>> +			PRIx64 ", port %u\n",
>>> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
>>> +			port_num);
>>> +		osm_node_init_physp(p_node, p_madw);
>>> +	}
>>> +
>>>  	/*
>>>  	   If this switch has already been probed during this sweep,
>>>  	   then don't bother reprobing it.
>>> -- 
>>> 1.5.1.4
>>>
>


From sean.hefty at intel.com  Wed Feb 18 09:50:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 09:50:21 -0800
Subject: [ofa-general] RE: [PATCH 9/8] [ib-diag] ibping: add support for
	WinOF
In-Reply-To: <20090218103018.GG7189@sashak.voltaire.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<BCD910E957F1447DB4829A7E8B0757FA@amr.corp.intel.com>
	<AFDFB3EC988E4CB3B253DE94DD57D6FB@amr.corp.intel.com>
	<20090218103018.GG7189@sashak.voltaire.com>
Message-ID: <8302DC6B01C6408D8EE72B0D10AFEDB4@amr.corp.intel.com>

>I guess it is about the report() function. Why not make everything cdecl
>(by using a compiler/linker flag or some super-#pragma in config.h or so)?

The WDK build environment uses stdcall by default.  Visual Studio uses cdecl.  I
have not yet figured out how to override the WDK using stdcall.  Simply adding a
switch (/Gd or whatever it is) doesn't work, nor did the other 50 things that I
tried.

Top personnel are working on the issue.  Please stand by.  Thank you for your
continued patience.  We apologize for any inconvenience.  *cue hold music*

>Ugh, I really fail to understand why WinOF cannot evaluate the option of
>using less "special" build tools for WDK-insensitive code (such as
>user-space programs ported from Linux) - it would solve all those issues
>just magically. And we have not yet entered the more complicated porting
>areas such as pthreads...

I have no problem with it.  But it does require two build environments.  The
current WinOF setup uses a single build environment to build the drivers and
related userspace libraries and applications.  This is a fairly common practice.
I don't know that this is all that different than how OFED packages everything
together.

My plan for more complicated porting areas is to use complib, and fix any issues
that arise.  That's what it was designed for.

- Sean


From sashak at voltaire.com  Wed Feb 18 09:57:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 19:57:03 +0200
Subject: [ofa-general] Re: [PATCH 8/8] [ib-diags] smpquery: add support for
	WinOF
In-Reply-To: <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
	<20090218095230.GC7189@sashak.voltaire.com>
	<3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com>
Message-ID: <20090218175703.GV5910@sashak.voltaire.com>

On 09:32 Wed 18 Feb     , Sean Hefty wrote:
> >>  #include <infiniband/umad.h>
> >>  #include <infiniband/mad.h>
> >> -#include <infiniband/complib/cl_nodenamemap.h>
> >> +#include <complib/cl_nodenamemap.h>
> >
> >Is it needed? The rest of the tools use a similar path with a leading 'infiniband'.
> 
> That directory path doesn't exist in Windows.  I think this makes sense.
> Complib is a separate library, independent of infiniband.

This is not so in Linux. The complib headers are installed under infiniband
(I don't know why, but historically it is so).

Hmm, actually it does not really matter, since the complib headers themselves
use things like #include <complib/cl_something.h>. So OK, I think we
can change it in the diag tools too.

> 
> >> -	for (i = 0; i < (n + 31) / 32; i++) {
> >> +	for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) {
> >
> >Wouldn't it be better to declare i, j, k as int? A fixed width of 32 doesn't
> >make any sense here.
> >
> >>  		mod =  i | (portnum << 16);
> >>  		if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0))
> >>  			return "pkey table query failed";
> >> @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc)
> >>  		return "port info failed";
> >>  	mad_decode_field(data, IB_PORT_GUID_CAP_F, &n);
> >>
> >> -	for (i = 0; i < (n + 7) / 8; i++) {
> >> +	for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) {
> 
> fixed

Thanks. Just repost the patch. I will apply.

Sasha


From sean.hefty at intel.com  Wed Feb 18 10:00:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 10:00:21 -0800
Subject: [ofa-general] [PATCH v2] [ib-diags] smpquery: add support for WinOF
In-Reply-To: <20090218175703.GV5910@sashak.voltaire.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
	<20090218095230.GC7189@sashak.voltaire.com>
	<3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com>
	<20090218175703.GV5910@sashak.voltaire.com>
Message-ID: <905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com>

Allow smpquery to build and run on both Linux and Windows.  Windows
build files are maintained in the WinOF repository.  These changes
allow dropping the infiniband-diags into the WinOF build environment.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
changes from v1: declared variables as int, versus casting expressions
to (uint32_t)

 infiniband-diags/src/smpquery.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
index 44280e1..bf1626d 100644
--- a/infiniband-diags/src/smpquery.c
+++ b/infiniband-diags/src/smpquery.c
@@ -47,7 +47,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -166,7 +166,7 @@ static char *
 pkey_table(ib_portid_t *dest, char **argv, int argc)
 {
 	uint8_t data[IB_SMP_DATA_SIZE];
-	uint32_t i, j, k;
+	int i, j, k;
 	uint16_t *p;
 	unsigned mod;
 	int n, t, phy_ports;
@@ -343,7 +343,7 @@ static char *
 guid_info(ib_portid_t *dest, char **argv, int argc)
 {
 	uint8_t data[IB_SMP_DATA_SIZE];
-	uint32_t i, j, k;
+	int i, j, k;
 	uint64_t *p;
 	unsigned mod;
 	int n;
@@ -412,7 +412,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "combined", 'c', 0, NULL, "use Combined route address argument"},
 		{ "node-name-map", 1, 1, "<file>", "node name map file"},
-		{}
+		{ 0 }
 	};
 	const char *usage_examples[] = {
 		"portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier",


From sashak at voltaire.com  Wed Feb 18 10:15:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 20:15:56 +0200
Subject: [ofa-general] Re: [PATCH v2] [ib-diags] smpquery: add support for
	WinOF
In-Reply-To: <905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com>
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
	<8B21199DAF6B4010B109838D36505522@amr.corp.intel.com>
	<20090218095230.GC7189@sashak.voltaire.com>
	<3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com>
	<20090218175703.GV5910@sashak.voltaire.com>
	<905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com>
Message-ID: <20090218181556.GW5910@sashak.voltaire.com>

On 10:00 Wed 18 Feb     , Sean Hefty wrote:
> Allow smpquery to build and run on both Linux and Windows.  Windows
> build files are maintained in the WinOF repository.  These changes
> allow dropping the infiniband-diags into the WinOF build environment.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Applied. Thanks.
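
One hunk of the applied patch replaces the empty `{}` terminator of the
option table with `{ 0 }`. Empty braces are not valid in ISO C before C23
(GCC accepts them as an extension), whereas `{ 0 }` is portable. The sketch
below shows the idea with a simplified stand-in for the option struct; the
struct layout and names here are assumptions for this example, not the
actual ibdiag_opt definition.

```c
#include <stddef.h>

/* Simplified stand-in for an option table entry. */
struct toy_opt {
	const char *name;
	int letter;
	int has_arg;
};

const struct toy_opt toy_opts[] = {
	{ "combined", 'c', 0 },
	{ "node-name-map", 1, 1 },
	{ 0 }			/* portable all-zero terminator */
};

/* Count entries up to the terminator, whose name is a null pointer. */
size_t toy_count_opts(const struct toy_opt *opts)
{
	size_t n = 0;

	while (opts[n].name)
		n++;
	return n;
}
```

Because `{ 0 }` zero-initializes every member, the terminator's name is a
null pointer and the scan stops after the two real entries.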

Sasha


From volker.jaenisch at inqbus.de  Wed Feb 18 10:13:37 2009
From: volker.jaenisch at inqbus.de (Dr. Volker Jaenisch)
Date: Wed, 18 Feb 2009 19:13:37 +0100
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499C0EAD.7040604@voltaire.com>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
Message-ID: <499C4FD1.7040200@inqbus.de>

Dear Or!

Or Gerlitz schrieb:
>> Hello Ofa-List!  Compiling the ofa-kernel modules from OFED-1.4 on 
>> Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace:
> First, this list is for development of the Linux RDMA stack, not OFED;
> please direct OFED issues to ewg at lists.openfabrics.org
Sorry for that. But the description of the ofa-List, "OpenFabrics General
Mailing List", does not mark this list as an explicit developer
forum. And there are lots of postings quite similar to mine on this list.

The description of the ewg-List, "OpenFabrics Enterprise Working Group
Mailing List", where I find working group
announcements like "Agenda for the OFED meeting today (Jan 5, 09)
<http://lists.openfabrics.org/pipermail/ewg/2009-January/012553.html>",
did not look like a promising place to post my message.

Maybe a dedicated OFED-Users list can be set up where I can post my
stupid questions. :-)
> Second, what makes you want to replace the IB stack that comes with 
> Debian and not update the distro?
I never said anything about replacing. But before I can bring some
improvement to the Debian IB stack,
I would first like to have a running IB stack on Debian at all.

ISER from the Debian IB Stack does not work for me. Remember our 
discussion on the STGT-list?
http://lists.wpkg.org/pipermail/stgt/2009-February/002649.html

So I looked for a working alternative to double check my findings on the 
iSER read problems before posting a bug report.
Therefore I tried to install OFED 1.4 under debian. So what's wrong with 
that?

There are several parts of OFED (for instance opensm and other user
space tools) that are not available in Debian yet.
The idea is to bring more consistent InfiniBand support to Debian. But
this is not my project, so I do not want to discuss it over
someone else's head. Here is the wishlist entry for OFED Debian support
filed by Guy Coates.

http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b42e830ce29c641a

Best regards,

Volker

-- 
====================================================
   inqbus it-consulting      +49 ( 341 )  5643800
   Dr.  Volker Jaenisch      http://www.inqbus.de
   Herloßsohnstr.    12      0 4 1 5 5    Leipzig
   N  O  T -  F Ä L L E      +49 ( 170 )  3113748
====================================================


From sashak at voltaire.com  Wed Feb 18 10:19:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 20:19:55 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
	for the newly discovered port of the known node
In-Reply-To: <499AB068.2020205@dev.mellanox.co.il>
References: <499AB068.2020205@dev.mellanox.co.il>
Message-ID: <20090218181955.GX5910@sashak.voltaire.com>

On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> This patch fixes bugzilla issue #1515:
> 
> Topology:
>                  |---------------|
>                  |      SW2      |
>                  |---------------|
>                    |x |y    |z |v
>               |----|  |     |  |----|
>               |       |     |       |
>               |  |----|     |----|  |
>               |  |               |  |
>              a| b|              c| d|
>       |---------------|     |---------------|
>       |       SW1     |     |     SW3       |
>       |---------------|     |---------------|
>           |                             |
>           |                             |
>        HCA with SM                      HCA
> 
> During the discovery:
> 
> SM sends NodeInfo request to SW1
> SM sends NodeInfo request to SW2 through link a->x
> SM discovers new node SW2:
>   - updates DR to SW2 to go through link a->x
>   - creates physp x
> SM sends NodeInfo request to SW2 through link b->y
> SM discovers a known node SW2
>   - DOES NOT create physp y
>   - updates DR to SW2 to go through link b->y
> 
> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
> port y any more, leaving it uninitialized (no physp object for this port).
> 
> The fix is to create physp for the newly discovered port of the known
> switch node, same way as it is done for HCAs.
> I also added one log message for the case that showed the problem - when
> one of the link sides is uninitialized (no valid ports check). Perhaps
> this log message should be an error message instead?
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>  1 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
> index c52c0d5..7da3103 100644
> --- a/opensm/opensm/osm_node_info_rcv.c
> +++ b/opensm/opensm/osm_node_info_rcv.c
> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>  	 */
>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>  					   p_neighbor_node,
> -					   p_ni_context->port_num))
> +					   p_ni_context->port_num)) {

Actually, if the port is initialized unconditionally when NodeInfo is
received, this case becomes impossible. No?

If yes, we should probably put CL_ASSERT() there instead of a run-time
check.

Sasha

> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>  		goto _exit;
> +	}
> 
>  	if (osm_node_link_exists(p_node, port_num,
>  				 p_neighbor_node, p_ni_context->port_num)) {
> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
>  				     IN osm_node_t * const p_node,
>  				     IN const osm_madw_t * const p_madw)
>  {
> +
> +	ib_smp_t *p_smp;
> +	ib_node_info_t *p_ni;
> +	uint8_t port_num;
> +
>  	OSM_LOG_ENTER(sm->p_log);
> 
> +	p_smp = osm_madw_get_smp_ptr(p_madw);
> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
> +	port_num = ib_node_info_get_local_port_num(p_ni);
> +
> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
> +			"Creating physp for node GUID:0x%"
> +			PRIx64 ", port %u\n",
> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
> +			port_num);
> +		osm_node_init_physp(p_node, p_madw);
> +	}
> +
>  	/*
>  	   If this switch has already been probed during this sweep,
>  	   then don't bother reprobing it.
> -- 
> 1.5.1.4
> 


From rdreier at cisco.com  Wed Feb 18 10:38:38 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Feb 2009 10:38:38 -0800
Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp,
	do path_free only for newly-created paths
In-Reply-To: <200902180913.16171.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Wed, 18 Feb 2009 09:13:15 +0200")
References: <200902171701.36107.jackm@dev.mellanox.co.il>
	<adad4dg65eb.fsf@cisco.com>
	<200902180913.16171.jackm@dev.mellanox.co.il>
Message-ID: <adaeixv380h.fsf@cisco.com>

 > Yossi identified the problem flow. I wrote and tested the actual patch.
 > Moni reviewed it, and I wrote the final version. I always thought that
 > the first s-o-b was for the patch writer. Next time, I'll do it right.

Yes, first s-o-b should be for the patch writer.  But since Moni wasn't
involved in sending the patch out, there's no reason for his s-o-b and
in fact it doesn't make sense.  If he reviewed it, then "Reviewed-by:"
is probably the right thing to include.

 - R.


From rdreier at cisco.com  Wed Feb 18 10:41:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Feb 2009 10:41:48 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: logical-/bit-or confusion?
In-Reply-To: <499C256E.7050004@opengridcomputing.com> (Steve Wise's message of
	"Wed, 18 Feb 2009 09:12:46 -0600")
References: <499BD470.4080705@gmail.com>
	<499C256E.7050004@opengridcomputing.com>
Message-ID: <adaab8j37v7.fsf@cisco.com>

 > > -	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
 > > +	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0;

I don't seem to have the original email for some reason.

Has anyone looked at which way generates better/smaller code?  Since ||
requires short-circuit evaluation it might be better to leave it as |.
But maybe it's not worth being so tricky.
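
For reference, a small sketch showing that the two spellings yield the
same value for plain integer operands. MPA_CRC_SKETCH is a made-up flag
value for illustration, not the driver's real MPA_CRC, and the function
names are hypothetical. In both forms ?: binds more loosely than | and
||, so the whole OR expression is the condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value; the real MPA_CRC lives in the driver headers. */
#define MPA_CRC_SKETCH 0x40u

/* Original form: bitwise OR, both operands are always evaluated. */
static int crc_enabled_bitor(unsigned flags, int crc_enabled)
{
	return (flags & MPA_CRC_SKETCH) | crc_enabled ? 1 : 0;
}

/* Patched form: logical OR, right operand skipped if the left is nonzero. */
static int crc_enabled_logor(unsigned flags, int crc_enabled)
{
	return (flags & MPA_CRC_SKETCH) || crc_enabled ? 1 : 0;
}

/* Compares both forms over a few representative inputs. */
static void check_equivalence(void)
{
	static const unsigned flags[] = { 0u, MPA_CRC_SKETCH, 0xffu, 0x3fu };
	static const int params[] = { -1, 0, 1, 2 };
	size_t i, j;

	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
		for (j = 0; j < sizeof(params) / sizeof(params[0]); j++)
			assert(crc_enabled_bitor(flags[i], params[j]) ==
			       crc_enabled_logor(flags[i], params[j]));
}
```

Since neither operand has side effects, the only observable difference
is the generated code, which depends on the compiler and options used.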

If someone can resend the patch to me I'm happy to apply it.

 - R.


From or.gerlitz at gmail.com  Wed Feb 18 11:58:20 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 18 Feb 2009 21:58:20 +0200
Subject: Re: [ofa-general] [PATCH 2 of 2 for 2.6.28] mlx4: Add Raw
	Ethertype QP support
In-Reply-To: <ada7i43i5l3.fsf@cisco.com>
References: <200812151312.53603.jackm@dev.mellanox.co.il>
	<ada7i43i5l3.fsf@cisco.com>
Message-ID: <15ddcffd0902181158o54477d62kbb3798e3b3310fc9@mail.gmail.com>

On Sat, Feb 7, 2009 at 12:05 AM, Roland Dreier <rdreier at cisco.com> wrote:

> Seems we're at the point where mlx4 could use a "is_special_qpt()"
> helper maybe?

Jack, Igor

Can you address Roland's comments? The 2.6.30 merge window is getting
closer, and I'd like to see this patch set merged so it can be used in
a possible sniffer implementation.

Or.


From kliteyn at dev.mellanox.co.il  Wed Feb 18 13:26:52 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 18 Feb 2009 23:26:52 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
 for the newly discovered port of the known node
In-Reply-To: <20090218174218.GT5910@sashak.voltaire.com>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218010303.GZ7189@sashak.voltaire.com>
	<499BD55B.3090606@dev.mellanox.co.il>
	<20090218174218.GT5910@sashak.voltaire.com>
Message-ID: <499C7D1C.8070800@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
>>> On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
>>>> This patch fixes bugzilla issue #1515:
>>>>
>>>> Topology:
>>>>                  |---------------|
>>>>                  |      SW2      |
>>>>                  |---------------|
>>>>                    |x |y    |z |v
>>>>               |----|  |     |  |----|
>>>>               |       |     |       |
>>>>               |  |----|     |----|  |
>>>>               |  |               |  |
>>>>              a| b|              c| d|
>>>>       |---------------|     |---------------|
>>>>       |       SW1     |     |     SW3       |
>>>>       |---------------|     |---------------|
>>>>           |                             |
>>>>           |                             |
>>>>        HCA with SM                      HCA
>>>>
>>>> During the discovery:
>>>>
>>>> SM sends NodeInfo request to SW1
>>>> SM sends NodeInfo request to SW2 through link a->x
>>>> SM discovers new node SW2:
>>>>   - updates DR to SW2 to go through link a->x
>>>>   - creates physp x
>>> And requests SwitchInfo from SW2, and on response sends PortInfo to all
>>> switch ports. PortInfo receiver will initialize all switch ports. Isn't
>>> it?
>> Links are created only by getting NodeInfo response. W/o the
>> fix, when SW1 gets NodeInfo from SW2 through link b->y, it
>> doesn't initialize physp for y, hence the link can't be created.
>> So the only chance for the link to be created is when
>> SW2 will send NodeInfo request to SW1 through link y->b.
>> But this isn't happening, because DR for SW2 is updated
>> to contain this link, so SM doesn't probe the remote side
>> of y to avoid loop.
> 
> Ok, so whole story should be caused by race between SW2 SwitchInfo
> receiving (using a->x) and SW2 NodeInfo (using b->y). As far as I can
> see only in this case SW2 port 0 path will be altered (and PortInfo will
> be requested using new path). Right?

Right.

>> BTW, thing happens with every other link that connects
>> same nodes. In the example above, link v<->d will be
>> missing as well.
> 
> Hmm, I was not able to reproduce this using two switch setup. But if it
> is resulted by race it also should not be 100% reproducible.

Right again. Discovery shouldn't rely on the order in which packets
arrive. I guess that on real hardware the packets are handled
serially, so a more complex topology would be needed to hit this race
with any real probability.
I see the problem on this simple example when using the simulator
(ibmgtsim), which handles packets in several threads, so the chance
of out-of-order (OOO) packets is much higher.
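
The failing step can be modelled with a toy sketch like the one below.
This is not OpenSM code: the toy_* names, the port numbering (a = 1,
b = 2, x = 1, y = 2) and the boolean "port object exists" flags are all
assumptions made for illustration. It only replays the moment when
NodeInfo for the already-known SW2 arrives via b->y before port y has
been set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_PORTS 8

/* Toy stand-in for a node: tracks which port objects and links exist. */
struct toy_node {
	bool physp[TOY_MAX_PORTS];
	bool linked[TOY_MAX_PORTS];
};

/*
 * Toy NodeInfo handler for an already-known switch `node`, reached on
 * its local port `port` from port `nbr_port` of `nbr`.
 * With `fix` set, a missing port object is created first (the idea of
 * the patch); without it, the link is silently dropped.
 */
static bool toy_ni_known_switch(struct toy_node *node, unsigned port,
				struct toy_node *nbr, unsigned nbr_port,
				bool fix)
{
	if (fix && !node->physp[port])
		node->physp[port] = true;

	/* Mirrors the "valid ports" check: both link ends must exist. */
	if (!node->physp[port] || !nbr->physp[nbr_port])
		return false;

	node->linked[port] = true;
	nbr->linked[nbr_port] = true;
	return true;
}

/* Replays the b->y step; returns whether the b<->y link was created. */
static bool toy_replay(bool fix)
{
	struct toy_node sw1, sw2;

	memset(&sw1, 0, sizeof(sw1));
	memset(&sw2, 0, sizeof(sw2));

	sw1.physp[1] = sw1.physp[2] = true;	/* a and b exist on SW1 */
	sw2.physp[1] = true;			/* x created on first discovery */

	/* NodeInfo from SW2 arrives via b->y before y was initialized. */
	return toy_ni_known_switch(&sw2, 2, &sw1, 2, fix);
}
```

In this model toy_replay(false) loses the link and toy_replay(true)
creates it, which is the behaviour difference described above.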

-- Yevgeny

> Basically I'm not against proposed physp initialization, but want to
> understand the problem better.
> 
> Sasha
> 
>> -- Yevgeny
>>
>>> Sasha
>>>> SM sends NodeInfo request to SW2 through link b->y
>>>> SM discovers a known node SW2
>>>>   - DOES NOT create physp y
>>>>   - updates DR to SW2 to go through link b->y
>>>>
>>>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
>>>> port y any more, leaving it uninitialized (no physp object for this port).
>>>>
>>>> The fix is to create physp for the newly discovered port of the known
>>>> switch node, same way as it is done for HCAs.
>>>> I also added one log message for the case that showed the problem - when
>>>> one of the link sides is uninitialized (no valid ports check). Perhaps
>>>> this log message should be an error message instead?
>>>>
>>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>>> ---
>>>>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>>>>  1 files changed, 23 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
>>>> index c52c0d5..7da3103 100644
>>>> --- a/opensm/opensm/osm_node_info_rcv.c
>>>> +++ b/opensm/opensm/osm_node_info_rcv.c
>>>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>>>>  	 */
>>>>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>>>>  					   p_neighbor_node,
>>>> -					   p_ni_context->port_num))
>>>> +					   p_ni_context->port_num)) {
>>>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>>>> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
>>>> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>>>>  		goto _exit;
>>>> +	}
>>>>
>>>>  	if (osm_node_link_exists(p_node, port_num,
>>>>  				 p_neighbor_node, p_ni_context->port_num)) {
>>>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
>>>>  				     IN osm_node_t * const p_node,
>>>>  				     IN const osm_madw_t * const p_madw)
>>>>  {
>>>> +
>>>> +	ib_smp_t *p_smp;
>>>> +	ib_node_info_t *p_ni;
>>>> +	uint8_t port_num;
>>>> +
>>>>  	OSM_LOG_ENTER(sm->p_log);
>>>>
>>>> +	p_smp = osm_madw_get_smp_ptr(p_madw);
>>>> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
>>>> +	port_num = ib_node_info_get_local_port_num(p_ni);
>>>> +
>>>> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
>>>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>>>> +			"Creating physp for node GUID:0x%"
>>>> +			PRIx64 ", port %u\n",
>>>> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
>>>> +			port_num);
>>>> +		osm_node_init_physp(p_node, p_madw);
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	   If this switch has already been probed during this sweep,
>>>>  	   then don't bother reprobing it.
>>>> -- 
>>>> 1.5.1.4
>>>>
> 


From kliteyn at dev.mellanox.co.il  Wed Feb 18 13:31:25 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 18 Feb 2009 23:31:25 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
 for the newly discovered port of the known node
In-Reply-To: <20090218181955.GX5910@sashak.voltaire.com>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218181955.GX5910@sashak.voltaire.com>
Message-ID: <499C7E2D.8050301@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 14:41 Tue 17 Feb     , Yevgeny Kliteynik wrote:
>> Hi Sasha,
>>
>> This patch fixes bugzilla issue #1515:
>>
>> Topology:
>>                  |---------------|
>>                  |      SW2      |
>>                  |---------------|
>>                    |x |y    |z |v
>>               |----|  |     |  |----|
>>               |       |     |       |
>>               |  |----|     |----|  |
>>               |  |               |  |
>>              a| b|              c| d|
>>       |---------------|     |---------------|
>>       |       SW1     |     |     SW3       |
>>       |---------------|     |---------------|
>>           |                             |
>>           |                             |
>>        HCA with SM                      HCA
>>
>> During the discovery:
>>
>> SM sends NodeInfo request to SW1
>> SM sends NodeInfo request to SW2 through link a->x
>> SM discovers new node SW2:
>>   - updates DR to SW2 to go through link a->x
>>   - creates physp x
>> SM sends NodeInfo request to SW2 through link b->y
>> SM discovers a known node SW2
>>   - DOES NOT create physp y
>>   - updates DR to SW2 to go through link b->y
>>
>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
>> port y any more, leaving it uninitialized (no physp object for this port).
>>
>> The fix is to create physp for the newly discovered port of the known
>> switch node, same way as it is done for HCAs.
>> I also added one log message for the case that showed the problem - when
>> one of the link sides is uninitialized (no valid ports check). Perhaps
>> this log message should be an error message instead?
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
>>  1 files changed, 23 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
>> index c52c0d5..7da3103 100644
>> --- a/opensm/opensm/osm_node_info_rcv.c
>> +++ b/opensm/opensm/osm_node_info_rcv.c
>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
>>  	 */
>>  	if (!osm_node_link_has_valid_ports(p_node, port_num,
>>  					   p_neighbor_node,
>> -					   p_ni_context->port_num))
>> +					   p_ni_context->port_num)) {
> 
> Actually if port is initialized unconditionally on NodeInfo receiving
> this case becomes impossible. No?
> 
> If yes, we probably need to put CL_ASSERT() there instead of run-time
> check.

Good point.
I'll repost the patch when we finish discussing it.

-- Yevgeny

> Sasha
> 
>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>> +			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
>> +			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
>>  		goto _exit;
>> +	}
>>
>>  	if (osm_node_link_exists(p_node, port_num,
>>  				 p_neighbor_node, p_ni_context->port_num)) {
>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
>>  				     IN osm_node_t * const p_node,
>>  				     IN const osm_madw_t * const p_madw)
>>  {
>> +
>> +	ib_smp_t *p_smp;
>> +	ib_node_info_t *p_ni;
>> +	uint8_t port_num;
>> +
>>  	OSM_LOG_ENTER(sm->p_log);
>>
>> +	p_smp = osm_madw_get_smp_ptr(p_madw);
>> +	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
>> +	port_num = ib_node_info_get_local_port_num(p_ni);
>> +
>> +	if (!osm_node_get_physp_ptr(p_node, port_num)) {
>> +		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
>> +			"Creating physp for node GUID:0x%"
>> +			PRIx64 ", port %u\n",
>> +			cl_ntoh64(osm_node_get_node_guid(p_node)),
>> +			port_num);
>> +		osm_node_init_physp(p_node, p_madw);
>> +	}
>> +
>>  	/*
>>  	   If this switch has already been probed during this sweep,
>>  	   then don't bother reprobing it.
>> -- 
>> 1.5.1.4
>>
> 


From weiny2 at llnl.gov  Wed Feb 18 14:38:32 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 18 Feb 2009 14:38:32 -0800
Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface
In-Reply-To: <f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
References: <20090217210642.41c64624.weiny2@llnl.gov>
	<f0e08f230902180907u6b5074at6fad8dffbcdada4@mail.gmail.com>
Message-ID: <20090218143832.c1a809ce.weiny2@llnl.gov>

I will resend this whole series.  Al also informed me that my signature/from
is messed up.

   From: weiny2 at llnl.gov <weiny2 at wopri.(none)>

It looks like my .gitconfig is not right.

Sorry,
Ira


On Wed, 18 Feb 2009 12:07:15 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
> >
> > From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001
> > From: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> > Date: Tue, 17 Feb 2009 17:32:15 -0800
> > Subject: [PATCH] Clean up "new" interface
> >
> >   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> >   Create new mad_rpc_portid(struct ibmad_port *srcport) function
> >      which mirrors madrpc_portid(void)
> >
> > Signed-off-by: weiny2 at llnl.gov <weiny2 at wopri.(none)>
> > ---
> >  libibmad/include/infiniband/mad.h |   58 ++++++++++++++++++++++--------------
> >  libibmad/src/gs.c                 |   19 ++++++------
> >  libibmad/src/libibmad.map         |    1 +
> >  libibmad/src/resolve.c            |   10 ++++--
> >  libibmad/src/rpc.c                |   29 +++++++++---------
> >  libibmad/src/sa.c                 |    4 +-
> >  libibmad/src/smp.c                |    4 +-
> >  7 files changed, 71 insertions(+), 54 deletions(-)
> >
> > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> > index 1aaaa1b..56b87e6 100644
> > --- a/libibmad/include/infiniband/mad.h
> > +++ b/libibmad/include/infiniband/mad.h
> > @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt)
> >  }
> >
> >  /* rpc.c */
> > +/* Depricated interface */
> 
> typo - Deprecated
> 
> >  MAD_EXPORT int madrpc_portid(void);
> > -MAD_EXPORT int madrpc_set_retries(int retries);
> > -MAD_EXPORT int madrpc_set_timeout(int timeout);
> 
> I thought initially we weren't going to remove APIs but move over to
> the new ones ? A subsequent step would be to deprecate the old APIs
> and then eventually remove the old APIs.
> 
> -- Hal
> 
> >  void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
> >  void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
> >                  void *data);
> >  MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> >                            int num_classes);
> >  void madrpc_save_mad(void *madbuf, int len);
> > -MAD_EXPORT void madrpc_show_errors(int set);
> >
> > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> > +/* New interface */
> > +MAD_EXPORT void madrpc_show_errors(int set);
> > +MAD_EXPORT int madrpc_set_retries(int retries);
> > +MAD_EXPORT int madrpc_set_timeout(int timeout);
> > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> >                        int num_classes);
> > -void mad_rpc_close_port(void *ibmad_port);
> > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void mad_rpc_close_port(struct ibmad_port *srcport);
> > +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> >              void *payload, void *rcvdata);
> > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> >                   ib_rmpp_hdr_t * rmpp, void *data);
> > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
> >
> >  /* smp.c */
> >  MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
> >                              unsigned mod, unsigned timeout);
> >  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
> >                            unsigned mod, unsigned timeout);
> > +
> > +/* smp.c new interface */
> >  MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> > -                      unsigned mod, unsigned timeout, const void *srcport);
> > +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
> >  uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> > -                    unsigned timeout, const void *srcport);
> > +                    unsigned timeout, const struct ibmad_port *srcport);
> >
> >  /* sa.c */
> >  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> >                 unsigned timeout);
> > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> > -                    ib_sa_call_t * sa, unsigned timeout);
> >  MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */
> > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> > +
> > +/* sa.c new interface */
> > +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
> > +                    ib_sa_call_t * sa, unsigned timeout);
> > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
> >                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
> >
> >  /* resolve.c */
> > @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> >  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
> >                               ibmad_gid_t * gid);
> >
> > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> > +/* resolve.c new interface */
> > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport);
> >  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> > -                       ib_portid_t * sm_id, int timeout, const void *srcport);
> > +                       ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport);
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >                              enum MAD_DEST dest, ib_portid_t * sm_id,
> > -                             const void *srcport);
> > +                             const struct ibmad_port *srcport);
> >  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> > -                       const void *srcport);
> > +                       const struct ibmad_port *srcport);
> >
> >  /* gs.c */
> >  MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
> > @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
> >  MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
> >                                              int port, unsigned timeout);
> >
> > +/* gs.c new interface */
> >  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned timeout,
> > -                                     const void *srcport);
> > +                                     const struct ibmad_port *srcport);
> >  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> > -                                   unsigned timeout, const void *srcport);
> > +                                   unsigned timeout, const struct ibmad_port *srcport);
> >  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                                    unsigned mask, unsigned timeout,
> > -                                   const void *srcport);
> > +                                   const struct ibmad_port *srcport);
> >  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport);
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned mask,
> > -                                       unsigned timeout, const void *srcport);
> > +                                       unsigned timeout,
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport);
> > +                                       const struct ibmad_port *srcport);
> >  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                       int port, unsigned timeout,
> > -                                      const void *srcport);
> > +                                      const struct ibmad_port *srcport);
> >  /* dump.c */
> >  MAD_EXPORT ib_mad_dump_fn
> >     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
> > index d2c4574..e302caf 100644
> > --- a/libibmad/src/gs.c
> > +++ b/libibmad/src/gs.c
> > @@ -47,7 +47,7 @@
> >
> >  static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                              unsigned timeout, unsigned id,
> > -                             const void *srcport)
> > +                             const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >        int lid = dest->lid;
> > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
> >
> >  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned timeout,
> > -                                     const void *srcport)
> > +                                     const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
> >                             srcport);
> > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
> >  }
> >
> >  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> > -                                   unsigned timeout, const void *srcport)
> > +                                   unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_COUNTERS, srcport);
> > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                      int port, unsigned mask, unsigned timeout,
> > -                                     unsigned id, const void *srcport)
> > +                                     unsigned id, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >        int lid = dest->lid;
> > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> >                                    unsigned mask, unsigned timeout,
> > -                                   const void *srcport)
> > +                                   const struct ibmad_port *srcport)
> >  {
> >        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
> >                                     IB_GSI_PORT_COUNTERS, srcport);
> > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport)
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_COUNTERS_EXT, srcport);
> > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned mask,
> > -                                       unsigned timeout, const void *srcport)
> > +                                       unsigned timeout,
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
> >                                     IB_GSI_PORT_COUNTERS_EXT, srcport);
> > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                        int port, unsigned timeout,
> > -                                       const void *srcport)
> > +                                       const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_SAMPLES_CONTROL, srcport);
> > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
> >
> >  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> >                                       int port, unsigned timeout,
> > -                                      const void *srcport)
> > +                                      const struct ibmad_port *srcport)
> >  {
> >        return pma_query_via(rcvbuf, dest, port, timeout,
> >                             IB_GSI_PORT_SAMPLES_RESULT, srcport);
> > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> > index f944d86..94d7762 100644
> > --- a/libibmad/src/libibmad.map
> > +++ b/libibmad/src/libibmad.map
> > @@ -69,6 +69,7 @@ IBMAD_1.3 {
> >                mad_rpc_close_port;
> >                mad_rpc;
> >                mad_rpc_rmpp;
> > +               mad_rpc_portid;
> >                madrpc;
> >                madrpc_def_timeout;
> >                madrpc_init;
> > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> > index 553949d..3291f43 100644
> > --- a/libibmad/src/resolve.c
> > +++ b/libibmad/src/resolve.c
> > @@ -45,7 +45,8 @@
> >  #undef DEBUG
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
> > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t self = { 0 };
> >        uint8_t portinfo[64];
> > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
> >  }
> >
> >  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> > -                       ib_portid_t * sm_id, int timeout, const void *srcport)
> > +                       ib_portid_t * sm_id, int timeout,
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t sm_portid;
> >        char buf[IB_SA_DATA_SIZE] = { 0 };
> > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> >
> >  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> >                              enum MAD_DEST dest_type, ib_portid_t * sm_id,
> > -                             const void *srcport)
> > +                             const struct ibmad_port *srcport)
> >  {
> >        uint64_t guid;
> >        int lid;
> > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> >  }
> >
> >  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> > -                       const void *srcport)
> > +                       const struct ibmad_port *srcport)
> >  {
> >        ib_portid_t self = { 0 };
> >        uint8_t portinfo[64];
> > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> > index e811526..d47873b 100644
> > --- a/libibmad/src/rpc.c
> > +++ b/libibmad/src/rpc.c
> > @@ -100,6 +100,11 @@ int madrpc_portid(void)
> >        return mad_portid;
> >  }
> >
> > +int mad_rpc_portid(struct ibmad_port *srcport)
> > +{
> > +       return (srcport->port_id);
> > +}
> > +
> >  static int
> >  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
> >           int timeout)
> > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
> >        return -1;
> >  }
> >
> > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
> >              void *payload, void *rcvdata)
> >  {
> > -       const struct ibmad_port *p = port_id;
> >        int status, len;
> >        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
> >
> > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
> >                return 0;
> >
> > -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> > -                             p->class_agents[rpc->mgtclass],
> > +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> > +                             port->class_agents[rpc->mgtclass],
> >                              len, rpc->timeout)) < 0) {
> >                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
> >                return 0;
> > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        return rcvdata;
> >  }
> >
> > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
> >                   ib_rmpp_hdr_t * rmpp, void *data)
> >  {
> > -       const struct ibmad_port *p = port_id;
> >        int status, len;
> >        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
> >
> > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> >        if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
> >                return 0;
> >
> > -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> > -                             p->class_agents[rpc->mgtclass],
> > +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> > +                             port->class_agents[rpc->mgtclass],
> >                              len, rpc->timeout)) < 0) {
> >                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
> >                return 0;
> > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
> >        }
> >  }
> >
> > -void *mad_rpc_open_port(char *dev_name, int dev_port,
> > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
> >                        int *mgmt_classes, int num_classes)
> >  {
> >        struct ibmad_port *p;
> > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
> >        return p;
> >  }
> >
> > -void mad_rpc_close_port(void *port_id)
> > +void mad_rpc_close_port(struct ibmad_port *port)
> >  {
> > -       struct ibmad_port *p = port_id;
> > -
> > -       umad_close_port(p->port_id);
> > -       free(p);
> > +       umad_close_port(port->port_id);
> > +       free(port);
> >  }
> >
> >  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> > index 7403d4f..ddeb152 100644
> > --- a/libibmad/src/sa.c
> > +++ b/libibmad/src/sa.c
> > @@ -44,7 +44,7 @@
> >  #undef DEBUG
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> >                     ib_sa_call_t * sa, unsigned timeout)
> >  {
> >        ib_rpc_t rpc = { 0 };
> > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> >                        IB_PR_COMPMASK_SGID |\
> >                        IB_PR_COMPMASK_NUMBPATH)
> >
> > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
> >                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
> >  {
> >        int npath;
> > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
> > index fad263c..e5489b3 100644
> > --- a/libibmad/src/smp.c
> > +++ b/libibmad/src/smp.c
> > @@ -45,7 +45,7 @@
> >  #define DEBUG  if (ibdebug)    IBWARN
> >
> >  uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
> > -                    unsigned mod, unsigned timeout, const void *srcport)
> > +                    unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >
> > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
> >  }
> >
> >  uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
> > -                      unsigned mod, unsigned timeout, const void *srcport)
> > +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
> >  {
> >        ib_rpc_t rpc = { 0 };
> >
> > --
> > 1.5.4.5
> >
> >


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From rdreier at cisco.com  Wed Feb 18 16:40:37 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Feb 2009 16:40:37 -0800
Subject: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	(neutron's message of "Tue, 17 Feb 2009 09:50:21 -0500")
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
Message-ID: <adavdr7z2be.fsf@cisco.com>

 > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
 > are valid.  But the system always crashes immediately after entering
 > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!

What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
you get an oops message?  If so that would be very important info for
debugging this.

- R.


From sean.hefty at intel.com  Wed Feb 18 17:43:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:43:28 -0800
Subject: [ofa-general] [PATCH 0/6] [ib-diag] add support to more diags for
	WinOF
Message-ID: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>

This series adds support to all remaining IB diagnostics utilities,
except saquery.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Wed Feb 18 17:46:05 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:46:05 -0800
Subject: [ofa-general] [PATCH 1/6] [ib-diag] ibnetdiscover: add support for
	WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com>

Mainly fixing datatypes to avoid type mismatches.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Also attaching patch in case my mailer wraps the lines.

 infiniband-diags/src/grouping.c      |   28 ++++++++++++++--------------
 infiniband-diags/src/ibnetdiscover.c |    8 ++++----
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
index 0ea139f..0266af4 100644
--- a/infiniband-diags/src/grouping.c
+++ b/infiniband-diags/src/grouping.c
@@ -265,20 +265,20 @@ int is_chassis_switch(Node *node)
 }
 
 /* these structs help find Line (Anafa) slot number while using spine portnum */
-int line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
-int anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
-int line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
-int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
+char line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
+char anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
+char line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
+char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
 
 /* IPR FCR modules connectivity while using sFB4 port as reference */
-int ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
+char ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
 
 /* these structs help find Spine (Anafa) slot number while using spine portnum */
-int spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-/*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 };
*/
+char spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+/* reference                       { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */
 
 static void get_sfb_slot(Node *node, Port *lineport)
 {
@@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport)
 static void get_router_slot(Node *node, Port *spineport)
 {
 	ChassisRecord *ch = node->chrecord;
-	int guessnum = 0;
+	uint64_t guessnum = 0;
 
 	if (!ch) {
 		if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
@@ -460,7 +460,7 @@ static void insert_line_router(Node *node, ChassisList *chassislist)
 		return;		/* already filled slot */
 
 	chassislist->linenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
+	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;
 }
 
 static void insert_spine(Node *node, ChassisList *chassislist)
@@ -471,7 +471,7 @@ static void insert_spine(Node *node, ChassisList *chassislist)
 		return;		/* already filled slot */
 
 	chassislist->spinenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
+	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;
 }
 
 static void pass_on_lines_catch_spines(ChassisList *chassislist)
@@ -770,7 +770,7 @@ ChassisList *group_nodes()
 					if (!node->chrecord) {
 						if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
 							IBPANIC("out of mem");
-						node->chrecord->chassisnum = chassis->chassisnum;
+						node->chrecord->chassisnum = (unsigned char) chassis->chassisnum;
 					}
 				}
 			}
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 466d522..27afd6a 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -47,7 +47,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibnetdiscover.h"
 #include "grouping.h"
@@ -212,7 +212,7 @@ extend_dpath(ib_dr_path_t *path, int nextport)
 	++path->cnt;
 	if (path->cnt > maxhops_discovered)
 		maxhops_discovered = path->cnt;
-	path->p[path->cnt] = nextport;
+	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
 }
 
@@ -517,7 +517,7 @@ out_chassis(int chassisnum)
 	uint64_t guid;
 
 	fprintf(f, "\nChassis %d", chassisnum);
-	guid = get_chassis_guid(chassisnum);
+	guid = get_chassis_guid((unsigned char) chassisnum);
 	if (guid)
 		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
 	fprintf(f, "\n");
@@ -964,7 +964,7 @@ int main(int argc, char **argv)
 		{ "Router_list", 'R', 0, NULL, "list of connected routers" },
 		{ "node-name-map", 1, 1, "<file>", "node name map file" },
 		{ "ports", 'p', 0, NULL, "obtain a ports report" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "[topology-file]";
 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01-win-ibnet
Type: application/octet-stream
Size: 5992 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090218/72896822/attachment.obj>

From sean.hefty at intel.com  Wed Feb 18 17:46:38 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:46:38 -0800
Subject: [ofa-general] [PATCH 2/6] [ib-diag] ibroute: add support for WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibroute.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 144d1b2..d1049ad 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -45,7 +45,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -327,7 +327,7 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 
 		for (;i < e; i++) {
 			unsigned outport = lft[i % IB_SMP_DATA_SIZE];
-			unsigned valid = (outport <= nports);
+			unsigned valid = (outport <= (unsigned) nports);
 
 			if (!valid && !dump_all)
 				continue;
@@ -370,7 +370,7 @@ int main(int argc, char **argv)
 		{ "all", 'a', 0, NULL, "show all lids, even invalid entries" },
 		{ "no_dests", 'n', 0, NULL, "do not try to resolve destinations" },
 		{ "Multicast", 'M', 0, NULL, "show multicast forwarding tables" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "[<dest dr_path|lid|guid> [<startlid> [<endlid>]]]";
 	const char *usage_examples[] = {


From sean.hefty at intel.com  Wed Feb 18 17:47:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:47:10 -0800
Subject: [ofa-general] [PATCH 3/6] [ib-diag] ibtracert: add support for WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <05EDF7233B20414B821BCFF5B9938F44@amr.corp.intel.com>

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibtracert.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index ea5662b..db3b906 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -46,7 +46,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -180,7 +180,7 @@ extend_dpath(ib_dr_path_t *path, int nextport)
 	if (path->cnt+2 >= sizeof(path->p))
 		return -1;
 	++path->cnt;
-	path->p[path->cnt] = nextport;
+	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
 }
 
@@ -718,7 +718,7 @@ int main(int argc, char **argv)
 		{ "no_info", 'n', 0, NULL, "simple format" },
 		{ "mlid", 'm', 1, "<mlid>", "multicast trace of the mlid" },
 		{ "node-name-map", 1, 1, "<file>", "node name map file" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<src-addr> <dest-addr>";
 	const char *usage_examples[] = {


From sean.hefty at intel.com  Wed Feb 18 17:48:44 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:48:44 -0800
Subject: [ofa-general] [PATCH 4/6] [ib-diag] ibsysstat: add support for WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <A5117BD0CF114C6C8EAB505CDAE7030D@amr.corp.intel.com>

Use char* pointers to obtain offsets, in place of void*.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibsysstat.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index cc1418d..b9f2f85 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -183,7 +183,7 @@ static char *ibsystat_serv(void)
 
 		DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod);
 
-		size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS,
+		size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS,
 				sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS);
 
 		if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0)
@@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr)
 {
 	ib_rpc_t rpc = { 0 };
 	int fd, agent, timeout, len;
-	void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
+	void *data = (char *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
 
 	DEBUG("Sysstat ping..");
 
@@ -318,7 +318,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "oui", 'o', 1, NULL, "use specified OUI number" },
 		{ "Server", 'S', 0, NULL, "start in server mode" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<dest lid|guid> [<op>]";
 

From sean.hefty at intel.com  Wed Feb 18 17:49:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:49:09 -0800
Subject: [ofa-general] [PATCH 5/6] [ib-diag] ibsendtrap: add support for
	WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com>

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibsendtrap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..ba6f86a 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -43,7 +43,7 @@
 #include <getopt.h>
 
 #include <infiniband/mad.h>
-#include <infiniband/iba/ib_types.h>
+#include <iba/ib_types.h>
 
 #include "ibdiag_common.h"
 

From sean.hefty at intel.com  Wed Feb 18 17:50:15 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:50:15 -0800
Subject: [ofa-general] [PATCH 6/6] [ib-diag] mcm_rereg_test: add support for
	WinOF
In-Reply-To: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
Message-ID: <46D52E76EEAC43519FC7D536301C69CE@amr.corp.intel.com>

Fix some typecasts and variable argument function macro definitions

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/mcm_rereg_test.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/infiniband-diags/src/mcm_rereg_test.c b/infiniband-diags/src/mcm_rereg_test.c
index 9285b95..5252459 100644
--- a/infiniband-diags/src/mcm_rereg_test.c
+++ b/infiniband-diags/src/mcm_rereg_test.c
@@ -31,6 +31,10 @@
  *
  */
 
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
 #include <stdio.h>
 #include <string.h>
 #include <errno.h>
@@ -39,12 +43,12 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
-#define info(fmt, arg...) fprintf(stderr, "INFO: " fmt, ##arg )
-#define err(fmt, arg...) fprintf(stderr, "ERR: " fmt, ##arg )
+#define info(fmt, ...) fprintf(stderr, "INFO: " fmt, ## __VA_ARGS__ )
+#define err(fmt, ...) fprintf(stderr, "ERR: " fmt, ## __VA_ARGS__ )
 #ifdef NOISY_DEBUG
-#define dbg(fmt, arg...) fprintf(stderr, "DBG: " fmt, ##arg )
+#define dbg(fmt, ...) fprintf(stderr, "DBG: " fmt, ## __VA_ARGS__ )
 #else
-#define dbg(fmt, arg...)
+#define dbg(fmt, ...)
 #endif
 
 #define TMO 100
@@ -161,7 +165,8 @@ static int rereg_send_all(int port, int agent, ib_portid_t *dport,
 {
 	uint8_t *umad;
 	int len = umad_size() + 256;
-	int i, ret;
+	unsigned i;
+	int ret;
 
 	info("rereg_send_all... cnt = %u\n", cnt);
 
@@ -247,7 +252,7 @@ static int rereg_recv_all(int port, int agent, ib_portid_t *dport,
 	int len = umad_size() + 256;
 	uint64_t trid;
 	unsigned n, method, status;
-	int i;
+	unsigned i;
 
 	info("rereg_recv_all...\n");
 
@@ -301,7 +306,8 @@ static int rereg_query_all(int port, int agent, ib_portid_t *dport,
 	uint8_t *umad, *mad;
 	int len = umad_size() + 256;
 	unsigned method, status;
-	int i, ret;
+	unsigned i;
+	int ret;
 
 	info("rereg_query_all...\n");
 
@@ -384,8 +390,8 @@ static int rereg_and_test_port(char *guid_file, int port, int agent, ib_portid_t
 	char line[256];
 	FILE *f;
 	ibmad_gid_t port_gid;
-	uint64_t prefix = htonll(0xfe80000000000000llu);
-	uint64_t guid = htonll(0x0002c90200223825llu);
+	uint64_t prefix = htonll(0xfe80000000000000ull);
+	uint64_t guid = htonll(0x0002c90200223825ull);
 	struct guid_trid *list;
 	int i = 0;
 

From Jie.Cai at cs.anu.edu.au  Wed Feb 18 18:07:46 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Thu, 19 Feb 2009 13:07:46 +1100
Subject: [ofa-general] RDMA write with immediate data.
Message-ID: <499CBEF2.2010909@cs.anu.edu.au>

I am currently facing a problem when an initiator does an RDMA write with
immediate data to the remote side.

if (initiator) {
     ret = dat_ib_post_rdma_write_immed(h_ep,        // ep_handle
                                        1,           // num_segments
                                        &l_iov,      // LMR
                                        cookie,      // user_cookie
                                        &r_iov,      // RMR
                                        immed_data,
                                        DAT_COMPLETION_DEFAULT_FLAG);

     ret = dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event, &nmore);
} else {
     ret = dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore);
}

However, on the remote side I get the following error messages, which
indicate that no event is coming through.

5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED

The return value of dat_evd_wait is DAT_TIMEOUT_EXPIRED.

Could anyone help with this?

-- 
Mr. Jie Cai
Department of Computer Science
Faculty of Engineering and Information Technology
College of Engineering & Computer Science
CSIT Building (108), North Road
The Australian National University
Canberra ACT 0200 Australia
Email: Jie.Cai at cs.anu.edu.au
Tel: +61-2-61251451
Fax: +61-2-61250010
Web: http://cs.anu.edu.au/~Jie.Cai
Mobile: 0433992958 


From arlin.r.davis at intel.com  Thu Feb 19 00:19:29 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 19 Feb 2009 00:19:29 -0800
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: <499CBEF2.2010909@cs.anu.edu.au>
References: <499CBEF2.2010909@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A7C6173@orsmsx506.amr.corp.intel.com>

 
>
>if (initiator) {
>     ret = dat_ib_post_rdma_write_immed(   h_ep,        // 
>
>However, at remote side I got the following error message 
>indicates that 
>no event coming through.
>
>5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
>5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED
>
>The return of dat_evd_wait is DAT_TIMEOUT_EXPIRED.
>

Does the initiator side complete successfully?
Do you have receives posted at the remote side for the immediate data?

You can look at dtestx source for an immed data example.

-arlin


From vlad at lists.openfabrics.org  Thu Feb 19 03:21:00 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 19 Feb 2009 03:21:00 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090219-0200 daily build status
Message-ID: <20090219112101.10361E28155@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at dev.mellanox.co.il  Thu Feb 19 03:32:21 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 19 Feb 2009 13:32:21 +0200
Subject: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <adavdr7z2be.fsf@cisco.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
Message-ID: <499D4345.1010007@mellanox.co.il>

Roland Dreier wrote:
>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>  > are valid.  But the system always crashes immediately after entering
>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>
> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
> you get an oops message?  If so that would be very important info for
> debugging this.
>
>   
Also, the HCA used and other system info would help us.

Tziporet


From Zhen.Liang at Sun.COM  Thu Feb 19 03:39:13 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Thu, 19 Feb 2009 19:39:13 +0800
Subject: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <adavdr7z2be.fsf@cisco.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
Message-ID: <499D44E1.3010809@sun.com>

Roland Dreier :
>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>  > are valid.  But the system always crashes immediately after entering
>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>
> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
> you get an oops message?  If so that would be very important info for
> debugging this.
>   

Also, what kind of address did you pass into ib_reg_phys_mr? A little
context about how you call it would be helpful.

Regards
Liang


From Line.Holen at Sun.COM  Thu Feb 19 03:42:00 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Thu, 19 Feb 2009 12:42:00 +0100
Subject: [ofa-general] opensm logoutput
In-Reply-To: <F9BD5A2A5CEEEE4FB738EC67475D7BEF0242DB2B@sfrexbe01.acds.t-systems-sfr.com>
References: <F9BD5A2A5CEEEE4FB738EC67475D7BEF0242DB2B@sfrexbe01.acds.t-systems-sfr.com>
Message-ID: <499D4588.1030702@Sun.COM>

Hi Bert,

Most of these messages indicate that you have unstable links in your system.
But there is one message that may indicate that you've hit a newly
discovered SM bug:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
po

If you do have NEM switches in your system, then you are exposed to this bug.
I hit it quite easily.

Yevgeny Kliteynik posted a patch for this bug just a few minutes after you
sent your email. (If you are interested, look for the email thread "create
physp for the newly discovered port of the known node".)

Line

On 02/17/09 01:23 PM, Wiegers, Bert wrote:
> Hi,
>
> we are using the ofed 1.4 /w OpenSM 3.2.5_20081207 with a Switch from
> SUN.
> As we are debugging our System I'm trying to understand the
> opensm.log's.
> (Where can I find any documentation to that?)
>
>
> We see frequent messages as follows:
>
> Feb 17 10:25:34 134964 [41802940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
> (Link state change) Producer:2 (Switch) from LID:111
> TID:0x000000000000006e
> Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 (Link state change) from LID:111
> GID:fe80::14:4fa4:cff8:50
> Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:65 (GID out of service) from LID:336
> GID:fe80::3:ba00:100:3341
> Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
> Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
> node:MT25408 ConnectX Mellanox Technologies
> Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
> Feb 17 10:25:46 662611 [41802940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
> (Link state change) Producer:2 (Switch) from LID:111
> TID:0x000000000000006f
> Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 (Link state change) from LID:111
> GID:fe80::14:4fa4:cff8:50
> Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Feb 17 10:25:52 476653 [44007940] 0x01 ->
> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
> Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x81
> (SubnGetResp)
>                                 D bit...................0x1
>                                 status..................0x1C00
>                                 hop_ptr.................0x0
>                                 hop_count...............0x4
>                                 trans_id................0x18c08de
>                                 attr_id.................0x15 (PortInfo)
>                                 resv....................0x0
>                                 attr_mod................0x6
>  
> m_key...................0x0000000000000000
>                                 dr_slid.................65535
>                                 dr_dlid.................65535
>
>                                 Initial path: 0,1,10,15,23
>                                 Return path:  0,23,20,12,17
>                                 Reserved:     [0][0][0][0][0][0][0]
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00
> 00 00 00
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 11
> 03 03 02
>
>                                 34 52 00 23 40 40 00 08   08 04 F0 4C 00
> 00 00 00
>
>                                 00 00 00 00 00 88 00 00   00 00 00 00 00
> 00 00 00
>
>
>
>
> Other issues I see with messages similar to the following ones:
>
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
> po
>
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
>
> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
> (Invalid argument)
>
>
> I'm still googleing, but hopefully someone can give me some answers.
>
>
>
> Thanks and best regards
> Bert
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From subbukl at gmail.com  Thu Feb 19 04:55:55 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 19 Feb 2009 18:25:55 +0530
Subject: [ofa-general] ***SPAM*** INT-X fallback in mthca driver
Message-ID: <f3b32c250902190455o784624dfw4718d3e30058a014@mail.gmail.com>

I am trying PCI passthrough of Mellanox InfiniHost III Lx chip based
InfiniBand and TG3 Ethernet PCIe cards on a CentOS 5.2 fully virtualized guest
with Xen 3.3.0.

ib_mthca driver fails with
QUERY_FW failed
probe failed with errror -11

But interestingly tg3 driver says "Could not get MSI interrupts falling back
to INTx" and works fine

So,
1) why Xen could not get the MSI interrupts working ?
2) Should we have INT-x falling back method for Mellanox driver also if its
needed ?

~subbu

From kliteyn at dev.mellanox.co.il  Thu Feb 19 05:28:06 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 19 Feb 2009 15:28:06 +0200
Subject: [ofa-general] opensm logoutput
In-Reply-To: <499D4588.1030702@Sun.COM>
References: <F9BD5A2A5CEEEE4FB738EC67475D7BEF0242DB2B@sfrexbe01.acds.t-systems-sfr.com>
	<499D4588.1030702@Sun.COM>
Message-ID: <499D5E66.3010600@dev.mellanox.co.il>

Bert,

Line.Holen at Sun.COM wrote:
> Hi Bert,
> 
> most of these messages indicates that you do have unstable links in your 
> system.
> But there is one message that can indicate that you've hit a newly 
> discovered SM bug:
> 
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)

This message is probably also related to the unstable links (or nodes).
Some port didn't answer a query from the SM (see below), so SM warns
that there is a port that is physically not down, but the other side
of the link couldn't be probed.

> If you do have NEM switches in your system, then you are exposed to this 
> bug.
> I hit it quite easily.
> 
> Yevgeny Kliteynik posted a patch for this bug just a few minutes after 
> you sent
> your email. (If you are interested look for the email thread "create 
> physp for the
> newly discovered port of the known node").

Of course, using the patch wouldn't hurt :)

> Line
> 
> On 02/17/09 01:23 PM, Wiegers, Bert wrote:
>> Hi,
>>
>> we are using the ofed 1.4 /w OpenSM 3.2.5_20081207 with a Switch from
>> SUN.
>> As we are debugging our System I'm trying to understand the
>> opensm.log's.
>> (Where can I find any documentation to that?)
>>
>>
>> We see frequent messages as follows:
>>
>> Feb 17 10:25:34 134964 [41802940] 0x01 ->
>> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
>> (Link state change) Producer:2 (Switch) from LID:111
>> TID:0x000000000000006e
>> Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:1 num:128 (Link state change) from LID:111
>> GID:fe80::14:4fa4:cff8:50

Generic notice num. 128 (trap 128) is issued by the switch (LID 111) because
it detected a port state change on one of its ports. This could be because of
an unstable link, or it could be something else. The SM logs that it got this
trap from the switch.


>> Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:65 (GID out of service) from LID:336
>> GID:fe80::3:ba00:100:3341

SM can't find some port any more, so it informs the fabric that
this GID is "out of service" by sending notice num. 65.

>> Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
>> Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
>> node:MT25408 ConnectX Mellanox Technologies

LID 1047 is no longer reachable and removed from the SM's DB.
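
As a rough illustration of reading these lines, the notice number can be
pulled out of a log line and mapped to the meanings seen above. This is only
a sketch: osm_notice_num() and osm_notice_name() are hypothetical helpers,
not opensm functions, and the table covers just the two numbers that appear
in this log (65 and 128).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number after "num:" in an opensm log
 * line, or -1 if the line carries no notice number. */
static int osm_notice_num(const char *line)
{
	const char *p = strstr(line, "num:");

	if (p == NULL)
		return -1;
	return (int)strtol(p + 4, NULL, 10);
}

/* Hypothetical helper: describe the two notice numbers seen in this log. */
static const char *osm_notice_name(int num)
{
	switch (num) {
	case 65:
		return "GID out of service";
	case 128:
		return "Link state change";
	default:
		return "unknown";
	}
}
```

The sketch assumes each log entry is on a single line, as in the original
opensm.log (the wrapping above comes from the mail client).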

>> Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
>> tables configured on all switches
>> Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
>> Feb 17 10:25:46 662611 [41802940] 0x01 ->
>> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
>> (Link state change) Producer:2 (Switch) from LID:111
>> TID:0x000000000000006f
>> Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:1 num:128 (Link state change) from LID:111
>> GID:fe80::14:4fa4:cff8:50
>> Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
>> tables configured on all switches
>> Feb 17 10:25:52 476653 [44007940] 0x01 ->
>> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
>> Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
>>                                 base_ver................0x1
>>                                 mgmt_class..............0x81
>>                                 class_ver...............0x1
>>                                 method..................0x81 (SubnGetResp)
>>                                 D bit...................0x1
>>                                 status..................0x1C00
>>                                 hop_ptr.................0x0
>>                                 hop_count...............0x4
>>                                 trans_id................0x18c08de
>>                                 attr_id.................0x15 (PortInfo)
>>                                 resv....................0x0
>>                                 attr_mod................0x6
>>                                 m_key...................0x0000000000000000
>>                                 dr_slid.................65535
>>                                 dr_dlid.................65535
>>
>>                                 Initial path: 0,1,10,15,23
>>                                 Return path:  0,23,20,12,17
>>                                 Reserved:     [0][0][0][0][0][0][0]
>>
>>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                                 00 00 00 00 00 00 00 00   00 00 00 00 11 03 03 02
>>
>>                                 34 52 00 23 40 40 00 08   08 04 F0 4C 00 00 00 00
>>
>>                                 00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00
>>
>>
>>
>>
>> Other issues I see with messages similar to the following ones:
>>
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
>> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
>> po
>>
>> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
>> (IB_TIMEOUT)

The above two messages are related. The IB_TIMEOUT says that some MAD
was sent, but no response was received. This, in turn, would cause the
"unknown remote side" message.

Bottom line - there might be unstable ports/links in the fabric.
Check all the links that are reported by the SM as having an unknown
remote side.
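
One way to find the affected links is to scan opensm.log for the two related
errors. A minimal sketch, assuming the "ERR <code>:" format shown in this
thread; osm_log_err_code() and osm_log_link_suspect() are made-up names and
not part of opensm:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the token that follows "ERR " (up to the
 * colon) into code. Returns 1 if a code was found and fits, else 0. */
static int osm_log_err_code(const char *line, char *code, size_t len)
{
	const char *p = strstr(line, "ERR ");
	size_t n;

	if (p == NULL)
		return 0;
	p += 4;
	n = strcspn(p, ": \t\r\n");
	if (n == 0 || n >= len)
		return 0;
	memcpy(code, p, n);
	code[n] = '\0';
	return 1;
}

/* Hypothetical helper: 1 if the line is one of the two related messages
 * discussed above (3113 = MAD timeout, 3315 = unknown remote side). */
static int osm_log_link_suspect(const char *line)
{
	char code[16];

	if (!osm_log_err_code(line, code, sizeof(code)))
		return 0;
	return strcmp(code, "3113") == 0 || strcmp(code, "3315") == 0;
}
```

Feeding each log line through osm_log_link_suspect() and printing the matches
would give a list of nodes and ports worth checking first.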

-- Yevgeny

>> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
>> (Invalid argument)
>>
>> I'm still googleing, but hopefully someone can give me some answers.
>>
>>
>>
>> Thanks and best regards
>> Bert
>>


From hnrose at comcast.net  Thu Feb 19 05:06:53 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 19 Feb 2009 08:06:53 -0500
Subject: [ofa-general] [PATCH] opensm/console: Enhance perfmgr print_counters
	for better nodenames
Message-ID: <20090219130653.GA29318@comcast.net>


nodenames can have spaces in them
Also, no need for next_token being inlined

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 00e2a94..9cad594 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -73,11 +73,16 @@ on: 0, delay_s: 2, loop_function:NULL};
 
 static const struct command console_cmds[];
 
-static inline char *next_token(char **p_last)
+static char *next_token(char **p_last)
 {
 	return strtok_r(NULL, " \t\n\r", p_last);
 }
 
+static char *name_token(char **p_last)
+{
+	return strtok_r(NULL, "\t\n\r", p_last);
+}
+
 static void help_command(FILE * out, int detail)
 {
 	int i;
@@ -1152,7 +1157,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 							  PERFMGR_EVENT_DB_DUMP_HR);
 			}
 		} else if (strcmp(p_cmd, "print_counters") == 0) {
-			p_cmd = next_token(p_last);
+			p_cmd = name_token(p_last);
 			if (p_cmd) {
 				osm_perfmgr_print_counters(&p_osm->perfmgr,
 							   p_cmd, out);
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 3babe3a..8766f93 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -1304,9 +1304,9 @@ void
 osm_perfmgr_print_counters(osm_perfmgr_t *pm, char *nodename, FILE *fp)
 {
 	uint64_t guid = strtoull(nodename, NULL, 0);
-	if (guid == 0 && errno == EINVAL)
+	if (guid == 0 && errno)	// name
 		perfmgr_db_print_by_name(pm->db, nodename, fp);
-	else
+	else		// guid
 		perfmgr_db_print_by_guid(pm->db, guid, fp);
 }
 

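The name-versus-GUID decision in osm_perfmgr_print_counters() can also be
sketched in isolation. The C standard does not require strtoull() to set
errno when no digits are consumed, so this sketch clears errno first and
checks the end pointer as well. parse_guid() is a hypothetical helper for
illustration only; it is not part of this patch or of opensm.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 and stores the value in *guid if the
 * whole string parses as a number (decimal, octal or 0x hex), or 0 if
 * the string should be treated as a node name instead. */
static int parse_guid(const char *s, uint64_t *guid)
{
	char *end;

	errno = 0;		/* do not be fooled by a stale errno */
	*guid = strtoull(s, &end, 0);
	if (errno != 0)		/* e.g. ERANGE on overflow */
		return 0;
	if (end == s)		/* no digits consumed at all */
		return 0;
	if (*end != '\0')	/* trailing text, e.g. "12 my switch" */
		return 0;
	return 1;
}
```

A caller would then use the by-guid lookup when parse_guid() returns 1 and
the by-name lookup otherwise, so a name that merely starts with digits is
still looked up as a name.
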
From ogerlitz at Voltaire.com  Thu Feb 19 06:52:41 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Thu, 19 Feb 2009 16:52:41 +0200
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499C4FD1.7040200@inqbus.de>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
	<499C4FD1.7040200@inqbus.de>
Message-ID: <499D7239.5060502@Voltaire.com>

Dr. Volker Jaenisch wrote:
> Or Gerlitz schrieb:

>>> Hello Ofa-List!  Compiling the ofa-kernel modules from OFED-1.4 on
>>> Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace:

>> Second, what makes you want to replace the IB stack that comes with
>> Debian and not update the distro?

> I never said nothing about replacing. But before I can bring in some improvement 
> to the Debian IB stack firstly I like to have a running IB Stack on Debian at all.

Hi,

The Linux kernel Infiniband maintainer Roland Dreier made the following comment
@ http://lists.openfabrics.org/pipermail/general/2008-July/052824.html

> I use Debian for pretty much all my development.  However I haven't
> tried to use OFED -- rather, I have just gotten all the support that I
> use into the main Debian archive.  I'm not sure how much is in Etch but
> Lenny should be pretty good: there are libibverbs, librdmacm, libmthca,
> libmlx4, libcxgb3, and libipathverbs packages in the main archive, along
> with Open MPI 1.2.6 built with IB support.  And the 2.6.25 kernel in the
> archive should have all the kernel drivers you need.

So IB comes with Debian out of the box, and if it's broken, please
report it and I'm sure Roland will act to fix things.

> There are several parts of the OFED (for instance opensm and other user
> space tools) that are not avaible in debian, yet. The idea is to bring a more 
> consistent Infiniband support to Debian. But this is not my project, so I do 
> not like to discuss over the head of
> someone other. Here the wishlist entry for OFED Debian support issued by Guy Coates.
> http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b42e830ce29c641a

The OFED packages by no means provide "more consistent Infiniband support
to Debian". If some packages are missing, I would recommend acting to add
them in the native Debian form, as Roland suggested in his postings on the
other thread, and not to go and "port ofed to debian" - it's useless and
end-in-mind will get you nothing.

If you need help pushing the management libraries (libibmad, libibumad,
opensm, diags) into Debian, there are people on this list who might be able
to help you with that.

Or.


From gmpc at sanger.ac.uk  Thu Feb 19 07:31:58 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 19 Feb 2009 15:31:58 +0000
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499D7239.5060502@Voltaire.com>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
	<499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com>
Message-ID: <499D7B6E.3050206@sanger.ac.uk>


> The OFED packages by no means provide "more consistent Infiniband support to Debian". 

>If some packages are missing, I would recommend to act and add them in the native Debian forms  
>as Roland suggested in his postings over the other thread and not to go and "port ofed to debian" 
>- its useless and end-in-mind will get you nothing. 

> If you need help with the management libraries (libibmad, libibumad, opensm, diags) push info Debian, 
> there are people on this list who might be able to help you with that.

Hi all,

A bit of historical background;

I started packaging the missing bits of OFED 1.3 for debian etch for my own
private use, as I needed some bits that were not present. (openSM, srp-tools,
and a set of OFED 1.3 kernel modules+headers that I could build lustre against,
which was the ultimate aim of the exercise).

Now that OFED 1.4 + lenny has been released, my aim is to build on that work and
push the remaining unpackaged bits of OFED 1.4 upstream.


I am not repackaging any of the bits that Roland has already done. (in fact, my
packages depend on them).

(for the record, the unpackaged bits of OFED which I now have packages for are
below)

dapl_2.0.15-1
ibutils_1.2-1
infiniband-diags_1.4.4-1
libibcm_1.0.4
libibcommon_1.1.2
libibmad_1.2.3-1
libibumad_1.2.3-1
libnes_0.5
libsdp_1.1.99
mstflint_1.4
ofa-kernel_1.4
ofed_1.4
ofed-docs
opensm_3.2.5
perftest_1.2
qlvnictools_0.0.1
rds-tools-1.4
sdpnetstat_1.60
srptools_0.0.4
tvflash_0.9.0


Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From brian at sun.com  Thu Feb 19 07:41:00 2009
From: brian at sun.com (Brian J. Murrell)
Date: Thu, 19 Feb 2009 10:41:00 -0500
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499D7B6E.3050206@sanger.ac.uk>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
	<499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com>
	<499D7B6E.3050206@sanger.ac.uk>
Message-ID: <1235058060.28114.307.camel@pc.interlinx.bc.ca>

On Thu, 2009-02-19 at 15:31 +0000, Guy Coates wrote:
>  

Hi Guy,

> I started packaging the missing bits of OFED 1.3 for debian etch for my own
> private use, as I needed some bits that were not present. (openSM, srp-tools,
> and a set of OFED 1.3 kernel modules+headers that I could build lustre against,

/me waves.

> Now that OFED 1.4 + lenny has been released, my aim is to build on that work and
> push the remaining unpackaged bits of OFED 1.4  bits upstream.

Out of interest, what kernel and OFED backports is that using?

b.


From eli at dev.mellanox.co.il  Thu Feb 19 08:55:05 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 19 Feb 2009 18:55:05 +0200
Subject: [ofa-general] iscsi initiator ipoib+lro crash on upstream kernel
Message-ID: <20090219165505.GA13617@mtls03>

Hi,

I have encountered a kernel crash when running an iSCSI initiator on
IPoIB configured with LRO (if LRO is off it does not happen). This
was seen first on Sles10sp2 but then I verified it happens on 2.6.28.2
too. Below is a dump of the crash info from 2.6.28.2:

sd 2:0:0:1: Attached scsi generic sg3 type 0
BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [<ffffffff803c50a4>] skb_seq_read+0xfb/0x1a1
PGD 227115067 PUD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/platform/host2/session2/target2:0:0/2:0:0:1/type
CPU 2 
Modules linked in: ib_uverbs ib_umad mlx4_ib nfs lockd nfs_acl mlx4_core sunrpc ib_mthca ib_ipoib ib_cm ib_sa ib_mad ib_core inet_lro ipv6 button battery a]
Pid: 0, comm: swapper Not tainted 2.6.28.2-debug #3
RIP: 0010:[<ffffffff803c50a4>]  [<ffffffff803c50a4>] skb_seq_read+0xfb/0x1a1
RSP: 0018:ffff88022f0e3b00  EFLAGS: 00010246
RAX: ffff88022dd44f38 RBX: ffff88022f0e3b30 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88022f0e3b88 RDI: 00000000000007d4
RBP: 00000000000007d4 R08: ffff880220476d30 R09: 000000000000085c
R10: 00000000000b0038 R11: ffffffffa0126115 R12: ffff88022f0e3b88
R13: ffff88022d974d38 R14: 00000000000007d4 R15: 00000000000007d4
FS:  0000000000000000(0000) GS:ffff88022f07bb50(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 00000002271c2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88022f0da000, task ffff88022f0a4050)
Stack:
 ffff88022d974fa0 ffff88022f0e3b30 00000000000007d4 ffffffffa01261fe
 ffff88022d974f80 ffff880220418068 0000085c00000000 000007d400000000
 ffff880220476d30 ffff88022dd44f38 0000000000000000 ffff88022dd44e58
Call Trace:
 <IRQ> <0> [<ffffffffa01261fe>] ? iscsi_tcp_recv+0x64/0x39b [iscsi_tcp]
 [<ffffffff803f0d0f>] ? ip_queue_xmit+0x2aa/0x2fd
 [<ffffffff803f60fd>] ? tcp_read_sock+0x97/0x212
 [<ffffffffa012619a>] ? iscsi_tcp_recv+0x0/0x39b [iscsi_tcp]
 [<ffffffffa012615d>] ? iscsi_tcp_data_ready+0x48/0x85 [iscsi_tcp]
 [<ffffffff803ff119>] ? tcp_rcv_established+0x4c0/0x567
 [<ffffffff804042f8>] ? tcp_v4_do_rcv+0x2c/0x1c8
 [<ffffffff80405fb9>] ? tcp_v4_rcv+0x630/0x683
 [<ffffffff803c6552>] ? skb_release_head_state+0x60/0x8f
 [<ffffffff803ecb9f>] ? ip_local_deliver_finish+0xda/0x197
 [<ffffffff803ecaab>] ? ip_rcv_finish+0x32f/0x349
 [<ffffffffa024e42d>] ? lro_flush+0x159/0x17e [inet_lro]
 [<ffffffffa024eb2e>] ? __lro_proc_skb+0x1ca/0x1ed [inet_lro]
 [<ffffffff80221e28>] ? swiotlb_map_single_phys+0x0/0x12
 [<ffffffffa024eb69>] ? lro_receive_skb+0x18/0x3e [inet_lro]
 [<ffffffffa0299582>] ? ipoib_ib_handle_rx_wc+0x1ed/0x22b [ib_ipoib]
 [<ffffffffa0299e97>] ? ipoib_poll+0x9c/0x173 [ib_ipoib]
 [<ffffffff803ce1d0>] ? net_rx_action+0x9d/0x175
 [<ffffffff80239ffb>] ? __do_softirq+0x7a/0x13d
 [<ffffffff8020cf4c>] ? call_softirq+0x1c/0x28
 [<ffffffff8020df5d>] ? do_softirq+0x2c/0x68
 [<ffffffff8020e05b>] ? do_IRQ+0xc2/0xdf
 [<ffffffff8020c206>] ? ret_from_intr+0x0/0xa
 <EOI> <0> [<ffffffff80212464>] ? mwait_idle+0x41/0x44
 [<ffffffff8020abca>] ? cpu_idle+0x40/0x5e
Code: ff 88 48 e0 ff ff 48 c7 43 20 00 00 00 00 ff 43 08 8b 46 0c 01 43 0c 48 8b 43 18 8b 4b 08 8b 90 b4 00 00 00 48 03 90 b8 00 00 00 <0f> b7 42 04 39 c1  
RIP  [<ffffffff803c50a4>] skb_seq_read+0xfb/0x1a1
 RSP <ffff88022f0e3b00>
CR2: 0000000000000004
Kernel panic - not syncing: Fatal exception in interrupt

When I looked at this on sles10 I was able to verify that the problem is that
st->cur_skb->next equals 0xffffffff (see below for where this comes from):

 if (st->cur_skb->next) {
                st->cur_skb = st->cur_skb->next;  <<<=== this is where I see the problem
                st->frag_idx = 0;
                goto next_skb;
        } else if (st->root_skb == st->cur_skb &&


From brian at sun.com  Thu Feb 19 09:42:54 2009
From: brian at sun.com (Brian J. Murrell)
Date: Thu, 19 Feb 2009 12:42:54 -0500
Subject: [ofa-general] IB function calls in kernel module fail
In-Reply-To: <499C1DDA.3060601@mellanox.co.il>
References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com>
	<49994BB2.3010206@mellanox.co.il>
	<7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com>
	<499A8A20.1090507@mellanox.co.il>
	<1234893143.21802.96.camel@pc.interlinx.bc.ca>
	<499C1DDA.3060601@mellanox.co.il>
Message-ID: <1235065374.28114.463.camel@pc.interlinx.bc.ca>

On Wed, 2009-02-18 at 16:40 +0200, Tziporet Koren wrote:
> Brian J. Murrell wrote:
> > Ahhh.  But should he just include <ofed-prefix>/src/openib/include/ or
> > also
> > <ofed-prefix>/src/openib/kernel_addons/backport/<kernel_ver>/include/
> > (as described in <ofed-prefix>/src/openib/ofed_patch.mk as well?
> >
> > And in what order should these be specified in?
> >
> >   
> You need both
> Order not important

Are you sure about this?  I have been, in the past, unsure about this
ordering too, but have been seeing evidence that order is important.

Take for example in the current ~vlad/ofed_kernel-1.4 tree, there is an
exportfs.h in both <ofed-prefix>/src/openib/include/linux and
<ofed-prefix>/src/openib/kernel_addons/backport/2.6.16_sles10_sp2/include/linux.

Having discussed the presence of this (newish) header in the SLES10 SP2
backports tree with Jeff it's clear that it should be used in preference
to the one in the general include tree.

Therefore, if one is not careful about ordering (so that backport
headers take precedence over the general ones), one would get the wrong
exportfs.h header for SLES10 SP2 builds.

But then there is the question of the kernel headers and ordering.  Many
of the backport headers use #include_next to get the next found instance of
a header.  I have always assumed that was to get the kernel's version of
a header included in the backport header.  But if the order is:

     1. backports headers
     2. ofa general headers
     3. kernel headers

then an "#include_next <foo.h>" in
<ofed-prefix>/src/openib/kernel_addons/backport/<kernel_ver>/include/linux/foo.h
could potentially pick up <ofed-prefix>/src/openib/include/linux/foo.h
rather than <kernel_source>/include/linux/foo.h, which I think is what is
intended/desired in most cases.

But if the ordering is changed to:

     1. backports headers
     2. kernel headers
     3. ofa general headers

Then the desired preference of
<ofed-prefix>/src/openib/include/{rdma,scsi}/* headers vs. the ones
included in the kernel will be lost.
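
The two orderings can be compared with a small model of the search. This is
a sketch under stated assumptions, not build-system code: it assumes that
#include <x> takes the first directory on the search path that has x, and
that #include_next continues with the directories after the one in which the
including file was found. struct inc_dir and resolve() are hypothetical
names, and the header lists are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

struct inc_dir {
	const char *name;	/* label for the search directory */
	const char *headers[4];	/* headers it provides, NULL-padded */
};

/* 1 if directory d provides header hdr. */
static int dir_has(const struct inc_dir *d, const char *hdr)
{
	size_t i;

	for (i = 0; i < 4 && d->headers[i] != NULL; i++)
		if (strcmp(d->headers[i], hdr) == 0)
			return 1;
	return 0;
}

/* Return the index of the first directory in dirs[start..n-1] that has
 * hdr, or -1. start = 0 models #include <hdr>; start = i + 1 models
 * #include_next <hdr> issued from a header found in dirs[i]. */
static int resolve(const struct inc_dir *dirs, int n, int start,
		   const char *hdr)
{
	int i;

	for (i = start; i < n; i++)
		if (dir_has(&dirs[i], hdr))
			return i;
	return -1;
}
```

With the order backport, ofa, kernel, an #include_next of linux/foo.h from
the backport copy (start = 1) lands in the ofa directory whenever that
directory also has a linux/foo.h. With the order backport, kernel, ofa, it
lands in the kernel directory, but then a plain include of an rdma header
that exists in both places (start = 0) resolves to the kernel copy too.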

How can we reconcile this?

b.


From hnrose at comcast.net  Thu Feb 19 09:44:13 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 19 Feb 2009 12:44:13 -0500
Subject: [ofa-general] [PATCH] ibsim/umad2sim.c: Eliminate
	unneeded umad2sim_dev num
Message-ID: <20090219174413.GA29805@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
index e13e30a..aaa6260 100644
--- a/umad2sim/umad2sim.c
+++ b/umad2sim/umad2sim.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This file is part of ibsim.
  *
@@ -77,7 +78,6 @@ struct ib_user_mad_reg_req {
 
 struct umad2sim_dev {
 	int fd;
-	unsigned num;
 	char name[32];
 	uint8_t port;
 	struct sim_client sim_client;
@@ -351,15 +351,13 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
 	*str = '\0';
 
 	/* /sys/class/infiniband_mad/umad0/ */
-	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
-		 dev->num);
+	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, 0);
 	make_path(path);
 	file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name);
 	file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port);
 
 	/* /sys/class/infiniband_mad/issm0/ */
-	snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir,
-		 dev->num);
+	snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, 0);
 	make_path(path);
 	file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name);
 	file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port);
@@ -546,7 +544,7 @@ static int umad2sim_ioctl(struct umad2sim_dev *dev, unsigned long request,
 	return -1;
 }
 
-static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
+static struct umad2sim_dev *umad2sim_dev_create(const char *name)
 {
 	struct umad2sim_dev *dev;
 	unsigned i;
@@ -558,7 +556,6 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
 		return NULL;
 	memset(dev, 0, sizeof(*dev));
 
-	dev->num = num;
 	strncpy(dev->name, name, sizeof(dev->name) - 1);
 
 	if (sim_client_init(&dev->sim_client) < 0)
@@ -574,9 +571,9 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
 	dev_sysfs_create(dev);
 
 	snprintf(dev->umad_path, sizeof(dev->umad_path), "%s/%s%u",
-		 umad_dev_dir, "umad", num);
+		 umad_dev_dir, "umad", 0);
 	snprintf(dev->issm_path, sizeof(dev->issm_path), "%s/%s%u",
-		 umad_dev_dir, "issm", num);
+		 umad_dev_dir, "issm", 0);
 
 	return dev;
 
@@ -646,7 +643,7 @@ static void umad2sim_init(void)
 	DEBUG("umad2sim_init...\n");
 	snprintf(umad2sim_sysfs_prefix, sizeof(umad2sim_sysfs_prefix),
 		 "./sys-%d", getpid());
-	devices[0] = umad2sim_dev_create(0, "ibsim0");
+	devices[0] = umad2sim_dev_create("ibsim0");
 	if (!devices[0]) {
 		ERROR("cannot init umad2sim. Exit.\n");
 		exit(-1);


From gmpc at sanger.ac.uk  Thu Feb 19 10:28:52 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 19 Feb 2009 18:28:52 +0000
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <1235058060.28114.307.camel@pc.interlinx.bc.ca>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
	<499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com>
	<499D7B6E.3050206@sanger.ac.uk>
	<1235058060.28114.307.camel@pc.interlinx.bc.ca>
Message-ID: <499DA4E4.5020904@sanger.ac.uk>

Brian J. Murrell wrote:
> On Thu, 2009-02-19 at 15:31 +0000, Guy Coates wrote:
>>  
> 
> Hi Guy,
> 
>> I started packaging the missing bits of OFED 1.3 for debian etch for my own
>> private use, as I needed some bits that were not present. (openSM, srp-tools,
>> and a set of OFED 1.3 kernel modules+headers that I could build lustre against,
> 
> /me waves.

Lenny comes with 2.6.26 and I've been using the 2.6.26 backport to build the
OFED 1.4 kernel modules. The modules all build except for ipath_inf-mod,
iser-mod  and ehca.

I think I have fixed the iser module build issue; I'll generate some patches.


Following your comments on the lustre mailing list, I haven't attempted to build
lustre against OFED 1.4.

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802




From hnrose at comcast.net  Thu Feb 19 10:44:15 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 19 Feb 2009 13:44:15 -0500
Subject: [ofa-general] [PATCHv2] opensm/man/opensm.8.in: Indicate
	ROUTER_EXP obsoleted
Message-ID: <20090219184415.GA29943@comcast.net>


Pointed out by Rolf

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 7690980..f9f30d6 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -569,8 +569,8 @@ opensm will return the path to the first available matching router.
 A configuration file with a single line where both prefix and GUID
 are wild-carded means that a path record query specifying any
 off-subnet DGID should return a path to the first available router.
-This configuration yields the same behaviour formerly achieved by
-compiling opensm with -DROUTER_EXP.
+This configuration yields the same behavior formerly achieved by
+compiling opensm with -DROUTER_EXP which has been obsoleted.
 
 .SH ROUTING
 .PP


From neutronsharc at gmail.com  Thu Feb 19 10:47:27 2009
From: neutronsharc at gmail.com (neutron)
Date: Thu, 19 Feb 2009 13:47:27 -0500
Subject: Re: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <adavdr7z2be.fsf@cisco.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
Message-ID: <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>

I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version:
2.6.18-53.1.14.el5,  ofed 1.3.1.

The failed function call is like:

{

ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
                &dma_addr, GFP_KERNEL);

ctx->phy_buf[0].addr = dma_addr;
ctx->phy_buf[0].size = MAX_SIZE;
ctx->iovstart = (u64) ctx->send_buf;

printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n",
       ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart );

send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1,
                        IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ
                         | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart));
}

The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf".

Below is /var/log/messages output around the crash.
----------------
Feb 19 12:50:22 wci30 kernel:  pd=ffff8101da3ddce0,
phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000

Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer
dereference at 0000000000000000
 RIP:
Feb 19 12:50:22 wci30 kernel:  [<0000000000000000>] _stext+0x7ffff000/0x1000
Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0
Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP
Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version
Feb 19 12:50:22 wci30 kernel: CPU 0
Feb 19 12:54:05 wci30 syslogd 1.4.1: restart.
Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5
(brewbuilder at hs20-bc2-3.build.redha
t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb
19 07:18:46 EST 2008
Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet

====================
It's strange that the kernel doesn't print out the function call stack
before crashing.

Any hints?  Thanks a lot!

On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier <rdreier at cisco.com> wrote:
>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>  > are valid.  But the system always crashes immediately after entering
>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>
> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
> you get an oops message?  If so that would be very important info for
> debugging this.
>
> - R.
>


From brian at sun.com  Thu Feb 19 11:00:45 2009
From: brian at sun.com (Brian J. Murrell)
Date: Thu, 19 Feb 2009 14:00:45 -0500
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on
	2.6.26 under Debian Lenny
In-Reply-To: <499DA4E4.5020904@sanger.ac.uk>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com>
	<499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com>
	<499D7B6E.3050206@sanger.ac.uk>
	<1235058060.28114.307.camel@pc.interlinx.bc.ca>
	<499DA4E4.5020904@sanger.ac.uk>
Message-ID: <1235070045.28114.472.camel@pc.interlinx.bc.ca>

On Thu, 2009-02-19 at 18:28 +0000, Guy Coates wrote:
> 
> Lenny comes with 2.6.26 and I've been using the 2.6.26 backport to build the
> OFED 1.4 kernel modules.

Without actually looking, I'd guess the backport for 2.6.26 is a lot
smaller and less intrusive than it is for the SLES10 and RHEL5 (2.6.16
and 2.6.18 respectively) kernels.

> Following your comments on the lustre mailing list, I haven't attempted to build
> lustre against OFED 1.4.

Chances are good that it will be a lot more successful than on RHEL5 or
SLES10, I'd guess, so don't let comments regarding those OSes hold you
back.

b.


From or.gerlitz at gmail.com  Thu Feb 19 11:40:49 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 19 Feb 2009 21:40:49 +0200
Subject: [ofa-general] ***SPAM*** Re: [ewg] iscsi initiator ipoib+lro crash
	on upstream kernel
In-Reply-To: <20090219165505.GA13617@mtls03>
References: <20090219165505.GA13617@mtls03>
Message-ID: <15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com>

On Thu, Feb 19, 2009 at 6:55 PM, Eli Cohen <eli at dev.mellanox.co.il> wrote:

> I have encountered a kernel crash when running a iSCSI initiator on
> IPoIB configured with LRO (if LRO is off it does not happen). This
> was seen first on Sles10sp2 but then I verified it happens on 2.6.28.2 too.

Eli,

This is a known issue
(http://bugzilla.kernel.org/show_bug.cgi?id=11804); a fix was submitted
upstream and should be included in the next kernel.

Or.


From sean.hefty at intel.com  Thu Feb 19 12:48:32 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 19 Feb 2009 12:48:32 -0800
Subject: [ofa-general] [PATCH 5/6 v2] [ib-diag] ibsendtrap: add support for
	WinOF
In-Reply-To: <0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com>
Message-ID: <C5CFD168DA074C7E9D7C629DCD12D903@amr.corp.intel.com>

Add typecasts and modify include path.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Update from v1: need casts from int to uint16.  One of the include files in
the winof tree disables certain build warnings for the caller's convenience...
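
For anyone wondering what the cast changes, here is a minimal sketch of the
pattern. It assumes a POSIX system and uses htons() as a stand-in for
cl_hton16(); struct demo_portid and demo_lid_to_wire() are made-up names for
illustration, not code from the tree. The cast only makes the narrowing
explicit so that strict compilers stop warning; it does not range-check, so
a value above 0xffff is still truncated to its low 16 bits.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical stand-in for a port ID whose LID is stored as a plain int. */
struct demo_portid {
	int lid;
};

/* Narrow explicitly to 16 bits, then convert to network byte order.
 * htons() is used here in place of cl_hton16() (an assumption made so
 * the sketch builds without complib). */
uint16_t demo_lid_to_wire(const struct demo_portid *id)
{
	return htons((uint16_t) id->lid);
}
```

Converting the result back with ntohs() should give the original LID for
any value that fits in 16 bits.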

 infiniband-diags/src/ibsendtrap.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..92b72f1 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -43,7 +43,7 @@
 #include <getopt.h>
 
 #include <infiniband/mad.h>
-#include <infiniband/iba/ib_types.h>
+#include <iba/ib_types.h>
 
 #include "ibdiag_common.h"
 
@@ -73,8 +73,8 @@ static int send_144_node_desc_update(void)
 	notice.generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
 	notice.g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA);
 	notice.g_or_v.generic.trap_num = cl_hton16(144);
-	notice.issuer_lid = cl_hton16(selfportid.lid);
-	notice.data_details.ntc_144.lid = cl_hton16(selfportid.lid);
+	notice.issuer_lid = cl_hton16((uint16_t) selfportid.lid);
+	notice.data_details.ntc_144.lid = cl_hton16((uint16_t) selfportid.lid);
 	notice.data_details.ntc_144.local_changes =
 	    TRAP_144_MASK_OTHER_LOCAL_CHANGES;
 	notice.data_details.ntc_144.change_flgs =


From weiny2 at llnl.gov  Thu Feb 19 19:05:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:20 -0800
Subject: [ofa-general] [PATCH 0/10 libibmad/infiniband-diags -- converting to
 "new" interface.
Message-ID: <20090219190520.c18280e1.weiny2@llnl.gov>

Here is v2 of the patch series.

I used __attribute__ ((deprecated)) on the functions which should aid others
in realizing that these functions will go away.  (It sure helped me to convert
all the diags.)
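
For readers unfamiliar with the attribute, here is a minimal sketch of how a
deprecated wrapper can sit next to its replacement.  The names demo_port,
demo_portid() and demo_portid_via() are invented for illustration and are
not the libibmad declarations.  With GCC or Clang, each call site of the
deprecated function should produce a warning while the code still builds.

```c
/* Hypothetical port handle; stands in for libibmad's struct ibmad_port. */
struct demo_port {
	int port_id;
};

/* New-style call: the caller passes the port explicitly. */
int demo_portid_via(const struct demo_port *port)
{
	return port->port_id;
}

/* Process-wide default port used only by the legacy entry point. */
static struct demo_port demo_default_port = { 7 };

/* Old-style call: still available, but flagged wherever it is called. */
int demo_portid(void) __attribute__ ((deprecated));

int demo_portid(void)
{
	/* Forward to the new interface so there is one implementation. */
	return demo_portid_via(&demo_default_port);
}
```

Defining the deprecated function does not warn by itself; only code that
calls demo_portid() does, which makes leftover old-style callers easy to
spot in a build log.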

Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the
new interface and I am hoping it will be accepted soon.

The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat
because they were all simple to convert and the patch series was getting
ridiculous.

Thanks,
Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From weiny2 at llnl.gov  Thu Feb 19 19:05:25 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:25 -0800
Subject: [ofa-general] [PATCH 1/10] libibmad: Clean up "new" interface
Message-ID: <20090219190525.322681b8.weiny2@llnl.gov>

>From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 18 Feb 2009 16:37:36 -0800
Subject: [PATCH] libibmad: Clean up "new" interface

   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
   Create new mad_rpc_portid(struct ibmad_port *srcport) function
      which mirrors madrpc_portid(void)
   Mark all "old" functions with __attribute__ ((deprecated))

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
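As a rough illustration of why typing the handle helps (a sketch only;
struct demo_handle and the demo_* functions are hypothetical and not part of
libibmad): a void * parameter accepts any pointer silently, while a pointer
to a named struct lets the compiler diagnose a mismatched argument.  The
struct can stay incomplete in the public declarations, so callers still
cannot look inside it.

```c
#include <stdlib.h>

/* Public view: an incomplete type.  Callers can hold and pass the
 * pointer, but passing an unrelated pointer type draws a diagnostic. */
struct demo_handle;

struct demo_handle *demo_open(int id);
int demo_id(const struct demo_handle *h);
void demo_close(struct demo_handle *h);

/* Private view: only the implementation knows the layout. */
struct demo_handle {
	int id;
};

struct demo_handle *demo_open(int id)
{
	struct demo_handle *h = malloc(sizeof(*h));

	if (!h)
		return NULL;	/* allocation failed; caller must check */
	h->id = id;
	return h;
}

int demo_id(const struct demo_handle *h)
{
	return h->id;
}

void demo_close(struct demo_handle *h)
{
	free(h);
}
```

With the typed prototype there is also no need for a local cast such as
"const struct demo_handle *p = arg;" at the top of each function.
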
 libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
 libibmad/src/gs.c                 |   19 +++---
 libibmad/src/libibmad.map         |    1 +
 libibmad/src/resolve.c            |   10 ++-
 libibmad/src/rpc.c                |   29 ++++----
 libibmad/src/sa.c                 |    4 +-
 libibmad/src/smp.c                |    4 +-
 7 files changed, 118 insertions(+), 88 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 1aaaa1b..80e38be 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
 }
 
 /* rpc.c */
-MAD_EXPORT int madrpc_portid(void);
-MAD_EXPORT int madrpc_set_retries(int retries);
-MAD_EXPORT int madrpc_set_timeout(int timeout);
-void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
-void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
-		  void *data);
+MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated));
+void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata)
+		__attribute__ ((deprecated));
+void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
+		__attribute__ ((deprecated));
 MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
-			    int num_classes);
-void madrpc_save_mad(void *madbuf, int len);
-MAD_EXPORT void madrpc_show_errors(int set);
+			    int num_classes) __attribute__ ((deprecated));
+void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated));
 
-void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
+/* New interface */
+MAD_EXPORT void madrpc_show_errors(int set);
+MAD_EXPORT int madrpc_set_retries(int retries);
+MAD_EXPORT int madrpc_set_timeout(int timeout);
+MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
 			int num_classes);
-void mad_rpc_close_port(void *ibmad_port);
-void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
-	      void *payload, void *rcvdata);
-void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
-		   ib_rmpp_hdr_t * rmpp, void *data);
+MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
+MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
+			void *payload, void *rcvdata);
+MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
+			ib_rmpp_hdr_t * rmpp, void *data);
+MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
 
 /* smp.c */
 MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
-			      unsigned mod, unsigned timeout);
+		      unsigned mod, unsigned timeout) __attribute__ ((deprecated));
 MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
-			    unsigned mod, unsigned timeout);
+		    unsigned mod, unsigned timeout) __attribute__ ((deprecated));
+
+/* smp.c new interface */
 MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
-		       unsigned mod, unsigned timeout, const void *srcport);
-uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
-		     unsigned timeout, const void *srcport);
+		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
+		     unsigned timeout, const struct ibmad_port *srcport);
 
 /* sa.c */
 uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
-		 unsigned timeout);
-uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
+		 unsigned timeout) __attribute__ ((deprecated));
+MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id,
+		void *buf) __attribute__ ((deprecated));
+
+/* sa.c new interface */
+MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
 		     ib_sa_call_t * sa, unsigned timeout);
-MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);	/* returns lid */
-int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
+MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
+	/* returns lid */
 
 /* resolve.c */
-MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
+MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
+				__attribute__ ((deprecated));
 MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
-			       ib_portid_t * sm_id, int timeout);
+			       ib_portid_t * sm_id, int timeout)
+				__attribute__ ((deprecated));
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
-				     enum MAD_DEST dest, ib_portid_t * sm_id);
+				     enum MAD_DEST dest, ib_portid_t * sm_id)
+				__attribute__ ((deprecated));
 MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
-			       ibmad_gid_t * gid);
-
-int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
-int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
-			ib_portid_t * sm_id, int timeout, const void *srcport);
-int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
+			       ibmad_gid_t * gid)
+				__attribute__ ((deprecated));
+
+/* resolve.c new interface */
+MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport);
+MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
+			ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport);
+MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 			      enum MAD_DEST dest, ib_portid_t * sm_id,
-			      const void *srcport);
-int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
-			const void *srcport);
+			      const struct ibmad_port *srcport);
+MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
+			const struct ibmad_port *srcport);
 
 /* gs.c */
 MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
-					     int port, unsigned timeout);
+					     int port, unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest,
-					   int port, unsigned timeout);
+					   int port, unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest,
 					   int port, unsigned mask,
-					   unsigned timeout);
+					   unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest,
-					       int port, unsigned timeout);
+					       int port, unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest,
 					       int port, unsigned mask,
-					       unsigned timeout);
+					       unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
-					       int port, unsigned timeout);
+					       int port, unsigned timeout)
+						__attribute__ ((deprecated));
 MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
-					      int port, unsigned timeout);
+					      int port, unsigned timeout)
+						__attribute__ ((deprecated));
 
-uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
+/* gs.c new interface */
+MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned timeout,
-				      const void *srcport);
-uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
-				    unsigned timeout, const void *srcport);
-uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
+				      const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
+				    unsigned timeout, const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
 				    unsigned mask, unsigned timeout,
-				    const void *srcport);
-uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
+				    const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport);
-uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
+					const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned mask,
-					unsigned timeout, const void *srcport);
-uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
+					unsigned timeout,
+					const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport);
-uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
+					const struct ibmad_port *srcport);
+MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
 				       int port, unsigned timeout,
-				       const void *srcport);
+				       const struct ibmad_port *srcport);
 /* dump.c */
 MAD_EXPORT ib_mad_dump_fn
     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
index d2c4574..e302caf 100644
--- a/libibmad/src/gs.c
+++ b/libibmad/src/gs.c
@@ -47,7 +47,7 @@
 
 static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
 			      unsigned timeout, unsigned id,
-			      const void *srcport)
+			      const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 	int lid = dest->lid;
@@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
 
 uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned timeout,
-				      const void *srcport)
+				      const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
 			     srcport);
@@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
 }
 
 uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
-				    unsigned timeout, const void *srcport)
+				    unsigned timeout, const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_COUNTERS, srcport);
@@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
 				      int port, unsigned mask, unsigned timeout,
-				      unsigned id, const void *srcport)
+				      unsigned id, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 	int lid = dest->lid;
@@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
 				    unsigned mask, unsigned timeout,
-				    const void *srcport)
+				    const struct ibmad_port *srcport)
 {
 	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
 				     IB_GSI_PORT_COUNTERS, srcport);
@@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport)
+					const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_COUNTERS_EXT, srcport);
@@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned mask,
-					unsigned timeout, const void *srcport)
+					unsigned timeout,
+					const struct ibmad_port *srcport)
 {
 	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
 				     IB_GSI_PORT_COUNTERS_EXT, srcport);
@@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
-					const void *srcport)
+					const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_SAMPLES_CONTROL, srcport);
@@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
 
 uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
 				       int port, unsigned timeout,
-				       const void *srcport)
+				       const struct ibmad_port *srcport)
 {
 	return pma_query_via(rcvbuf, dest, port, timeout,
 			     IB_GSI_PORT_SAMPLES_RESULT, srcport);
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index f944d86..94d7762 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -69,6 +69,7 @@ IBMAD_1.3 {
 		mad_rpc_close_port;
 		mad_rpc;
 		mad_rpc_rmpp;
+		mad_rpc_portid;
 		madrpc;
 		madrpc_def_timeout;
 		madrpc_init;
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index 553949d..3291f43 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -45,7 +45,8 @@
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
 
-int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
+int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t self = { 0 };
 	uint8_t portinfo[64];
@@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
 }
 
 int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
-			ib_portid_t * sm_id, int timeout, const void *srcport)
+			ib_portid_t * sm_id, int timeout,
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t sm_portid;
 	char buf[IB_SA_DATA_SIZE] = { 0 };
@@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 
 int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
 			      enum MAD_DEST dest_type, ib_portid_t * sm_id,
-			      const void *srcport)
+			      const struct ibmad_port *srcport)
 {
 	uint64_t guid;
 	int lid;
@@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
 }
 
 int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
-			const void *srcport)
+			const struct ibmad_port *srcport)
 {
 	ib_portid_t self = { 0 };
 	uint8_t portinfo[64];
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index e811526..d47873b 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -100,6 +100,11 @@ int madrpc_portid(void)
 	return mad_portid;
 }
 
+int mad_rpc_portid(struct ibmad_port *srcport)
+{
+	return (srcport->port_id);
+}
+
 static int
 _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 	   int timeout)
@@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 	return -1;
 }
 
-void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
+void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
 	      void *payload, void *rcvdata)
 {
-	const struct ibmad_port *p = port_id;
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
 
@@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
 		return 0;
 
-	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
-			      p->class_agents[rpc->mgtclass],
+	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
+			      port->class_agents[rpc->mgtclass],
 			      len, rpc->timeout)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
@@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	return rcvdata;
 }
 
-void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
+void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
 		   ib_rmpp_hdr_t * rmpp, void *data)
 {
-	const struct ibmad_port *p = port_id;
 	int status, len;
 	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
 
@@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
 	if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
 		return 0;
 
-	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
-			      p->class_agents[rpc->mgtclass],
+	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
+			      port->class_agents[rpc->mgtclass],
 			      len, rpc->timeout)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
@@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
 	}
 }
 
-void *mad_rpc_open_port(char *dev_name, int dev_port,
+struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
 			int *mgmt_classes, int num_classes)
 {
 	struct ibmad_port *p;
@@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
 	return p;
 }
 
-void mad_rpc_close_port(void *port_id)
+void mad_rpc_close_port(struct ibmad_port *port)
 {
-	struct ibmad_port *p = port_id;
-
-	umad_close_port(p->port_id);
-	free(p);
+	umad_close_port(port->port_id);
+	free(port);
 }
 
 uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
index 7403d4f..ddeb152 100644
--- a/libibmad/src/sa.c
+++ b/libibmad/src/sa.c
@@ -44,7 +44,7 @@
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
 
-uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
+uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
 		     ib_sa_call_t * sa, unsigned timeout)
 {
 	ib_rpc_t rpc = { 0 };
@@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
 			IB_PR_COMPMASK_SGID |\
 			IB_PR_COMPMASK_NUMBPATH)
 
-int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
+int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
 		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
 {
 	int npath;
diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
index fad263c..e5489b3 100644
--- a/libibmad/src/smp.c
+++ b/libibmad/src/smp.c
@@ -45,7 +45,7 @@
 #define DEBUG 	if (ibdebug)	IBWARN
 
 uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
-		     unsigned mod, unsigned timeout, const void *srcport)
+		     unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 
@@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
 }
 
 uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
-		       unsigned mod, unsigned timeout, const void *srcport)
+		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
 {
 	ib_rpc_t rpc = { 0 };
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:28 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:28 -0800
Subject: [ofa-general] [PATCH 2/10] infiniband-diags: Convert ibaddr to "new"
 ibmad interface
Message-ID: <20090219190528.11c080f8.weiny2@llnl.gov>

>From 1ead0cdb05b159dbd3a89d2030870fc7326ec84d Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 14:47:05 -0800
Subject: [PATCH] infiniband-diags: Convert ibaddr to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
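The conversion in each tool follows the same shape: open the port once,
check the result, hand the handle to the _via calls, and close it at the
end.  A minimal sketch of that shape is below.  The fake_* functions are
hypothetical stand-ins that only mimic the calling convention of
mad_rpc_open_port(), smp_query_via() and mad_rpc_close_port(); they do not
talk to any hardware, and the sketch returns an error code where the diags
call IBERROR().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical port object; not the real struct ibmad_port. */
struct fake_port {
	int port_num;
	int queries;
};

/* Stand-in for opening a port; a negative number simulates failure. */
struct fake_port *fake_open_port(int port_num)
{
	struct fake_port *p;

	if (port_num < 0)
		return NULL;
	p = calloc(1, sizeof(*p));
	if (p)
		p->port_num = port_num;
	return p;
}

/* Stand-in for a *_via query: counts calls made through this handle. */
int fake_query_via(struct fake_port *port)
{
	return ++port->queries;
}

/* Stand-in for closing the port and releasing its memory. */
void fake_close_port(struct fake_port *port)
{
	free(port);
}

/* Converted tool body: open once, check, pass the handle, close. */
int run_tool(int port_num)
{
	struct fake_port *srcport = fake_open_port(port_num);
	int rc;

	if (!srcport) {
		fprintf(stderr, "Failed to open port '%d'\n", port_num);
		return -1;
	}
	rc = fake_query_via(srcport) > 0 ? 0 : -1;
	fake_close_port(srcport);
	return rc;
}
```

Here run_tool() returns 0 when the port opens and the query succeeds, and
-1 when the open fails.
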
 infiniband-diags/src/ibaddr.c        |   17 ++++++++++++-----
 infiniband-diags/src/ibdiag_common.c |    3 ++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
index 9098699..bb22be9 100644
--- a/infiniband-diags/src/ibaddr.c
+++ b/infiniband-diags/src/ibaddr.c
@@ -45,6 +45,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int
 ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid)
 {
@@ -55,10 +57,10 @@ ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid)
 	ibmad_gid_t gid;
 	int lmc;
 
-	if (!smp_query(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return -1;
 
-	if (!smp_query(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid);
@@ -137,17 +139,22 @@ int main(int argc, char **argv)
 	if (!show_lid && !show_gid)
 		show_lid = show_gid = 1;
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+						ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
-		if (ib_resolve_self(&portid, &port, 0) < 0)
+		if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0)
 			IBERROR("can't resolve self port %s", argv[0]);
 	}
 
 	if (ib_resolve_addr(&portid, port, show_lid, show_gid) < 0)
 		IBERROR("can't resolve requested address");
+
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index 5f2472d..609df69 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -179,7 +179,8 @@ static int process_opt(int ch, char *optarg)
 		ibd_timeout = val;
 		break;
 	case 's':
-		if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
+		if (ib_resolve_portid_str_via(&sm_portid, optarg, IB_DEST_LID,
+				0, NULL) < 0)
 			IBERROR("cannot resolve SM destination port %s", optarg);
 		ibd_sm_id = &sm_portid;
 		break;
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:36 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:36 -0800
Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate to
 "new" ibmad interface
Message-ID: <20090219190536.f96edca7.weiny2@llnl.gov>

>From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:27:21 -0800
Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibportstate.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index c0b9b34..ca72bda 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -46,6 +46,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 /*******************************************/
 
 static int
@@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data)
 {
 	int node_type;
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return -1;
 
 	node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
@@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
 	char buf[2048];
 	char val[64];
 
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	if (port_op != 4) {
@@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
 	char buf[2048];
 	char val[64];
 
-	if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	if (port_op != 4)
@@ -223,9 +225,12 @@ int main(int argc, char **argv)
 	if (argc < 2)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	/* First, make sure it is a switch port if it is a "set" */
@@ -314,7 +319,8 @@ int main(int argc, char **argv)
 					peerportid.drpath.p[1] = (uint8_t) portnum;
 
 					/* Set DrSLID to local lid */
-					if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
+					if (ib_resolve_self_via(&selfportid,
+							&selfport, 0, srcport) < 0)
 						IBERROR("could not resolve self");
 					peerportid.drpath.drslid = (uint16_t) selfportid.lid;
 					peerportid.drpath.drdlid = 0xffff;
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:32 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:32 -0800
Subject: [ofa-general] [PATCH 3/10] infiniband-diags: convert ibping to "new"
 ibmad interface
Message-ID: <20090219190532.faf400f5.weiny2@llnl.gov>

>From 039b42d9df09598d146d47d5d2adc1a13d952999 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 16:57:55 -0800
Subject: [PATCH] infiniband-diags: convert ibping to "new" ibmad interface

   To do this I needed the following additional functions
      mad_register_client_via
      mad_register_server_via
      mad_send_via
      mad_receive_via
      mad_respond_via
      ib_vendor_call_via

   I also marked their counterparts as deprecated and cleaned up the interface
   a bit more.

   Note I moved some of the "new" interface declarations higher in mad.h

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibping.c     |   21 ++++++++----
 libibmad/include/infiniband/mad.h |   66 +++++++++++++++++++++++++------------
 libibmad/src/libibmad.map         |    5 +++
 libibmad/src/mad_internal.h       |   44 ++++++++++++++++++++++++
 libibmad/src/register.c           |   58 ++++++++++++++++++++++++++------
 libibmad/src/rpc.c                |    8 +---
 libibmad/src/serv.c               |   39 ++++++++++++++++++++--
 libibmad/src/vendor.c             |   15 +++++++-
 8 files changed, 206 insertions(+), 50 deletions(-)
 create mode 100644 libibmad/src/mad_internal.h

diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c
index 1994eba..901079f 100644
--- a/infiniband-diags/src/ibping.c
+++ b/infiniband-diags/src/ibping.c
@@ -48,6 +48,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE];
 static char last_host[IB_VENDOR_RANGE2_DATA_SIZE];
 
@@ -82,7 +84,7 @@ ibping_serv(void)
 
 	DEBUG("starting to serve...");
 
-	while ((umad = mad_receive(0, -1))) {
+	while ((umad = mad_receive_via(0, -1, srcport))) {
 
 		mad = umad_get_mad(umad);
 		data = (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS;
@@ -91,7 +93,7 @@ ibping_serv(void)
 
 		DEBUG("Pong: %s", data);
 
-		if (mad_respond(umad, 0, 0) < 0)
+		if (mad_respond_via(umad, 0, 0, srcport) < 0)
 			DEBUG("respond failed");
 
 		mad_free(umad);
@@ -120,7 +122,7 @@ ibping(ib_portid_t *portid, int quiet)
 	call.timeout = 0;
 	memset(&call.rmpp, 0, sizeof call.rmpp);
 
-	if (!ib_vendor_call(data, portid, &call))
+	if (!ib_vendor_call_via(data, portid, &call, srcport))
 		return ~0ull;
 
 	rtt = cl_get_time_stamp() - start;
@@ -208,10 +210,12 @@ int main(int argc, char **argv)
 	if (!argc && !server)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (server) {
-		if (mad_register_server(ping_class, 0, 0, oui) < 0)
+		if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0)
 			IBERROR("can't serve class %d on this port", ping_class);
 
 		get_host_and_domain(host_and_domain, sizeof host_and_domain);
@@ -221,10 +225,11 @@ int main(int argc, char **argv)
 		exit(0);
 	}
 
-	if (mad_register_client(ping_class, 0) < 0)
+	if (mad_register_client_via(ping_class, 0, srcport) < 0)
 		IBERROR("can't register ping class %d on this port", ping_class);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+					ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	signal(SIGINT, report);
@@ -252,5 +257,7 @@ int main(int argc, char **argv)
 
 	report(0);
 
+	mad_rpc_close_port(srcport);
+
 	exit(-1);
 }
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 80e38be..5cf135e 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -691,27 +691,64 @@ MAD_EXPORT uint64_t mad_trid(void);
 MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport,
 			     ib_rmpp_hdr_t * rmpp, void *data);
 
+/* New interface */
+MAD_EXPORT void madrpc_show_errors(int set);
+MAD_EXPORT int madrpc_set_retries(int retries);
+MAD_EXPORT int madrpc_set_timeout(int timeout);
+MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
+			int num_classes);
+MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
+MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
+			void *payload, void *rcvdata);
+MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
+			ib_rmpp_hdr_t * rmpp, void *data);
+MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
+
 /* register.c */
 MAD_EXPORT int mad_register_port_client(int port_id, int mgmt,
-					uint8_t rmpp_version);
-MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version);
+			uint8_t rmpp_version) __attribute__ ((deprecated));
+MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version)
+			__attribute__ ((deprecated));
 MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version,
-				   long method_mask[16 / sizeof(long)],
-				   uint32_t class_oui);
+			   long method_mask[16 / sizeof(long)],
+			   uint32_t class_oui) __attribute__ ((deprecated));
+/* register.c new interface */
+MAD_EXPORT int mad_register_client_via(int mgmt, uint8_t rmpp_version,
+				struct ibmad_port *srcport);
+MAD_EXPORT int mad_register_server_via(int mgmt, uint8_t rmpp_version,
+				long method_mask[16 / sizeof(long)],
+				uint32_t class_oui,
+				struct ibmad_port *srcport);
 MAD_EXPORT int mad_class_agent(int mgmt);
 MAD_EXPORT int mad_agent_class(int agent);
 
 /* serv.c */
 MAD_EXPORT int mad_send(ib_rpc_t * rpc, ib_portid_t * dport,
-			ib_rmpp_hdr_t * rmpp, void *data);
-MAD_EXPORT void *mad_receive(void *umad, int timeout);
-MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus);
+		ib_rmpp_hdr_t * rmpp, void *data) __attribute__ ((deprecated));
+MAD_EXPORT void *mad_receive(void *umad, int timeout)
+		__attribute__ ((deprecated));
+MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
+		__attribute__ ((deprecated));
+
+/* serv.c new interface */
+MAD_EXPORT int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport,
+			ib_rmpp_hdr_t * rmpp, void *data,
+			struct ibmad_port *srcport);
+MAD_EXPORT void *mad_receive_via(void *umad, int timeout,
+			struct ibmad_port *srcport);
+MAD_EXPORT int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus,
+			struct ibmad_port *srcport);
 MAD_EXPORT void *mad_alloc(void);
 MAD_EXPORT void mad_free(void *umad);
 
 /* vendor.c */
 MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
-				   ib_vendor_call_t * call);
+			   ib_vendor_call_t * call) __attribute__ ((deprecated));
+
+/* vendor.c new interface */
+MAD_EXPORT uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid,
+				   ib_vendor_call_t * call,
+				   struct ibmad_port *srcport);
 
 static inline int mad_is_vendor_range1(int mgmt)
 {
@@ -733,19 +770,6 @@ MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
 			    int num_classes) __attribute__ ((deprecated));
 void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated));
 
-/* New interface */
-MAD_EXPORT void madrpc_show_errors(int set);
-MAD_EXPORT int madrpc_set_retries(int retries);
-MAD_EXPORT int madrpc_set_timeout(int timeout);
-MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
-			int num_classes);
-MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
-MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
-			void *payload, void *rcvdata);
-MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
-			ib_rmpp_hdr_t * rmpp, void *data);
-MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
-
 /* smp.c */
 MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
 		      unsigned mod, unsigned timeout) __attribute__ ((deprecated));
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 94d7762..bac74a9 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -60,6 +60,8 @@ IBMAD_1.3 {
 		mad_class_agent;
 		mad_register_client;
 		mad_register_server;
+		mad_register_client_via;
+		mad_register_server_via;
 		ib_resolve_guid;
 		ib_resolve_portid_str;
 		ib_resolve_self;
@@ -86,10 +88,13 @@ IBMAD_1.3 {
 		mad_free;
 		mad_receive;
 		mad_respond;
+		mad_receive_via;
+		mad_respond_via;
 		mad_send;
 		smp_query;
 		smp_set;
 		ib_vendor_call;
+		ib_vendor_call_via;
 		smp_query_via;
 		smp_set_via;
 		ib_path_query_via;
diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
new file mode 100644
index 0000000..9afe7a9
--- /dev/null
+++ b/libibmad/src/mad_internal.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _MAD_INTERNAL_H_
+#define _MAD_INTERNAL_H_
+
+#define MAX_CLASS 256
+
+struct ibmad_port {
+	int port_id;		/* file descriptor returned by umad_open() */
+	int class_agents[MAX_CLASS];	/* class2agent mapper */
+};
+
+#endif /* _MAD_INTERNAL_H_ */
diff --git a/libibmad/src/register.c b/libibmad/src/register.c
index 4d91ff8..4aabd7c 100644
--- a/libibmad/src/register.c
+++ b/libibmad/src/register.c
@@ -43,10 +43,11 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
+#include "mad_internal.h"
+
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
 
-#define MAX_CLASS	256
 #define MAX_AGENTS	256
 
 static int class_agent[MAX_CLASS];
@@ -136,22 +137,57 @@ int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version)
 
 int mad_register_client(int mgmt, uint8_t rmpp_version)
 {
+	int rc = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	rc = mad_register_client_via(mgmt, rmpp_version, &port);
+	if (rc < 0)
+		return rc;
+	return register_agent(port.class_agents[mgmt], mgmt);
+}
+
+int mad_register_client_via(int mgmt, uint8_t rmpp_version,
+			struct ibmad_port *srcport)
+{
 	int agent;
 
-	agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version);
+	if (!srcport)
+		return -1;
+
+	agent = mad_register_port_client(mad_rpc_portid(srcport), mgmt, rmpp_version);
 	if (agent < 0)
 		return agent;
 
-	return register_agent(agent, mgmt);
+	srcport->class_agents[mgmt] = agent;
+	return 0;
 }
 
 int
 mad_register_server(int mgmt, uint8_t rmpp_version,
 		    long method_mask[], uint32_t class_oui)
 {
+	int rc = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	port.class_agents[mgmt] = class_agent[mgmt];
+	rc = mad_register_server_via(mgmt, rmpp_version,
+				method_mask, class_oui,
+				&port);
+	if (rc < 0)
+		return rc;
+	return register_agent(port.class_agents[mgmt], mgmt);
+}
+
+int
+mad_register_server_via(int mgmt, uint8_t rmpp_version,
+		    long method_mask[], uint32_t class_oui,
+		    struct ibmad_port *srcport)
+{
 	long class_method_mask[16 / sizeof(long)];
 	uint8_t oui[3];
-	int agent, vers, mad_portid;
+	int agent, vers;
 
 	if (method_mask)
 		memcpy(class_method_mask, method_mask,
@@ -159,11 +195,12 @@ mad_register_server(int mgmt, uint8_t rmpp_version,
 	else
 		memset(class_method_mask, 0xff, sizeof(class_method_mask));
 
-	if ((mad_portid = madrpc_portid()) < 0)
+	if (!srcport)
 		return -1;
 
-	if (class_agent[mgmt] >= 0) {
-		DEBUG("Class 0x%x already registered", mgmt);
+	if (srcport->class_agents[mgmt] >= 0) {
+		DEBUG("Class 0x%x already registered %d",
+			mgmt, srcport->class_agents[mgmt]);
 		return -1;
 	}
 	if ((vers = mgmt_class_vers(mgmt)) <= 0) {
@@ -175,19 +212,18 @@ mad_register_server(int mgmt, uint8_t rmpp_version,
 		oui[0] = (class_oui >> 16) & 0xff;
 		oui[1] = (class_oui >> 8) & 0xff;
 		oui[2] = class_oui & 0xff;
-		if ((agent = umad_register_oui(mad_portid, mgmt, rmpp_version,
+		if ((agent = umad_register_oui(srcport->port_id, mgmt, rmpp_version,
 					       oui, class_method_mask)) < 0) {
 			DEBUG("Can't register agent for class %d", mgmt);
 			return -1;
 		}
-	} else if ((agent = umad_register(mad_portid, mgmt, vers, rmpp_version,
+	} else if ((agent = umad_register(srcport->port_id, mgmt, vers, rmpp_version,
 					  class_method_mask)) < 0) {
 		DEBUG("Can't register agent for class %d", mgmt);
 		return -1;
 	}
 
-	if (register_agent(agent, mgmt) < 0)
-		return -1;
+	srcport->class_agents[mgmt] = agent;
 
 	return agent;
 }
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index d47873b..210f0c2 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -43,12 +43,7 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
-#define MAX_CLASS 256
-
-struct ibmad_port {
-	int port_id;		/* file descriptor returned by umad_open() */
-	int class_agents[MAX_CLASS];	/* class2agent mapper */
-};
+#include "mad_internal.h"
 
 int ibdebug;
 
@@ -339,6 +334,7 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
 		return NULL;
 	}
 
+	memset(p->class_agents, 0xff, sizeof p->class_agents);
 	while (num_classes--) {
 		uint8_t rmpp_version = 0;
 		int mgmt = *mgmt_classes++;
diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c
index c7631bb..0ce1660 100644
--- a/libibmad/src/serv.c
+++ b/libibmad/src/serv.c
@@ -42,12 +42,25 @@
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
 
+#include "mad_internal.h"
+
 #undef DEBUG
 #define DEBUG	if (ibdebug)	IBWARN
 
 int
 mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass);
+	return mad_send_via(rpc, dport, rmpp, data, &port);
+}
+
+int
+mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data,
+		struct ibmad_port *srcport)
+{
 	uint8_t pktbuf[1024];
 	void *umad = pktbuf;
 
@@ -64,7 +77,7 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 		      (char *)umad_get_mad(umad) + rpc->dataoffs, rpc->datasz);
 	}
 
-	if (umad_send(madrpc_portid(), mad_class_agent(rpc->mgtclass),
+	if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass],
 		      umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) {
 		IBWARN("send failed; %m");
 		return -1;
@@ -75,6 +88,18 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
 
 int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 {
+	int i = 0;
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	for (i = 1; i < MAX_CLASS; i++)
+		port.class_agents[i] = mad_class_agent(i);
+	return mad_respond_via(umad, portid, rstatus, &port);
+}
+
+int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus,
+		struct ibmad_port *srcport)
+{
 	uint8_t *mad = umad_get_mad(umad);
 	ib_mad_addr_t *mad_addr;
 	ib_rpc_t rpc = { 0 };
@@ -138,7 +163,7 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 	if (ibdebug > 1)
 		xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE);
 
-	if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad,
+	if (umad_send(srcport->port_id, srcport->class_agents[rpc.mgtclass], umad,
 		      IB_MAD_SIZE, rpc.timeout, 0) < 0) {
 		DEBUG("send failed; %m");
 		return -1;
@@ -149,11 +174,19 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus)
 
 void *mad_receive(void *umad, int timeout)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	return mad_receive_via(umad, timeout, &port);
+}
+
+void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport)
+{
 	void *mad = umad ? umad : umad_alloc(1, umad_size() + IB_MAD_SIZE);
 	int agent;
 	int length = IB_MAD_SIZE;
 
-	if ((agent = umad_recv(madrpc_portid(), mad, &length, timeout)) < 0) {
+	if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) {
 		if (!umad)
 			umad_free(mad);
 		DEBUG("recv failed: %m");
diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c
index 50a878e..1a129e5 100644
--- a/libibmad/src/vendor.c
+++ b/libibmad/src/vendor.c
@@ -40,6 +40,7 @@
 #include <string.h>
 
 #include <infiniband/mad.h>
+#include "mad_internal.h"
 
 #undef DEBUG
 #define DEBUG 	if (ibdebug)	IBWARN
@@ -53,6 +54,16 @@ static inline int response_expected(int method)
 uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
 			ib_vendor_call_t * call)
 {
+	struct ibmad_port port;
+
+	port.port_id = madrpc_portid();
+	return ib_vendor_call_via(data, portid, call, &port);
+}
+
+uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid,
+			ib_vendor_call_t * call,
+			struct ibmad_port *srcport)
+{
 	ib_rpc_t rpc = { 0 };
 	int range1 = 0, resp_expected;
 
@@ -90,7 +101,7 @@ uint8_t *ib_vendor_call(void *data, ib_portid_t * portid,
 		portid->qkey = IB_DEFAULT_QP1_QKEY;
 
 	if (resp_expected)
-		return madrpc_rmpp(&rpc, portid, 0, data);	/* FIXME: no RMPP for now */
+		return mad_rpc_rmpp(srcport, &rpc, portid, 0, data);	/* FIXME: no RMPP for now */
 
-	return mad_send(&rpc, portid, 0, data) < 0 ? 0 : data;	/* FIXME: no RMPP for now */
+	return mad_send_via(&rpc, portid, 0, data, srcport) < 0 ? 0 : data;	/* FIXME: no RMPP for now */
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:41 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:41 -0800
Subject: [ofa-general] [PATCH 5/10] infiniband-diags: Convert ibroute to
 "new" ibmad interface
Message-ID: <20090219190541.f4a50fdc.weiny2@llnl.gov>

>From 5b66b604de9bc43458ca4d295c5ab14cf2c6df10 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:30:14 -0800
Subject: [PATCH] infiniband-diags: Convert ibroute to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibroute.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 144d1b2..60bfdd8 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -49,6 +49,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int brief, dump_all, multicast;
 
 /*******************************************/
@@ -61,12 +63,12 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid,
 	int type;
 
 	DEBUG("checking node type");
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, 0)) {
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) {
 		xdump(stderr, "nodeinfo\n", ni, sizeof ni);
 		return "node info failed: valid addr?";
 	}
 
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, 0))
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, 0, srcport))
 		return "node desc failed";
 
 	mad_decode_field(ni, IB_NODE_TYPE_F, &type);
@@ -77,7 +79,7 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid,
 	mad_decode_field(ni, IB_NODE_NPORTS_F, nports);
 	mad_decode_field(ni, IB_NODE_GUID_F, guid);
 
-	if (!smp_query(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0))
+	if (!smp_query_via(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0, srcport))
 		return "switch info failed: is a switch node?";
 
 	return 0;
@@ -195,7 +197,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid)
 			mod = (block - IB_MIN_MCAST_LID/IB_MLIDS_IN_BLOCK) | (j << 28);
 
 			DEBUG("reading block %x chunk %d mod %x", block, j, mod);
-			if (!smp_query(mft + j, portid, IB_ATTR_MULTICASTFORWTBL, mod, 0))
+			if (!smp_query_via(mft + j, portid,
+					IB_ATTR_MULTICASTFORWTBL, mod, 0, srcport))
 				return "multicast forwarding table get failed";
 		}
 
@@ -259,9 +262,9 @@ dump_lid(char *str, int strlen, int lid, int valid)
 	portguid = 0;
 	lidport.lid = lid;
 
-	if (!smp_query(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100) ||
-	    !smp_query(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100) ||
-	    !smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100))
+	if (!smp_query_via(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100, srcport) ||
+	    !smp_query_via(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100, srcport) ||
+	    !smp_query_via(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100, srcport))
 		return snprintf(str, strlen, ": (unknown node and type)");
 
 	mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid);
@@ -316,7 +319,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 	endblock = ALIGN(endlid, IB_SMP_DATA_SIZE) / IB_SMP_DATA_SIZE;
 	for (block = startblock; block <= endblock; block++) {
 		DEBUG("reading block %d", block);
-		if (!smp_query(lft, portid, IB_ATTR_LINEARFORWTBL, block, 0))
+		if (!smp_query_via(lft, portid, IB_ATTR_LINEARFORWTBL, block,
+				0, srcport))
 			return "linear forwarding table get failed";
 		i = block * IB_SMP_DATA_SIZE;
 		e = i + IB_SMP_DATA_SIZE;
@@ -403,12 +407,15 @@ int main(int argc, char **argv)
 	if (argc > 2)
 		endlid = strtoul(argv[2], 0, 0);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (!argc) {
-		if (ib_resolve_self(&portid, 0, 0) < 0)
+		if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0)
 			IBERROR("can't resolve self addr");
-	} else if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	} else if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
 
 	if (multicast)
@@ -419,5 +426,6 @@ int main(int argc, char **argv)
 	if (err)
 		IBERROR("dump tables: %s", err);
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:46 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:46 -0800
Subject: [ofa-general] [PATCH 6/10] infiniband-diags: Convert ibsendtrap to
 "new" ibmad interface
Message-ID: <20090219190546.4fcaa158.weiny2@llnl.gov>

>From 9fcd0a9ec62fff981770e823281c660089b22d91 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:53:30 -0800
Subject: [PATCH] infiniband-diags: Convert ibsendtrap to "new" ibmad interface

   Also make mad_send_via public (export it in libibmad.map) to do the
   conversion.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibsendtrap.c |   13 +++++++++----
 libibmad/src/libibmad.map         |    1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..d038dff 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -47,6 +47,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int send_144_node_desc_update(void)
 {
 	ib_portid_t sm_port;
@@ -55,10 +57,10 @@ static int send_144_node_desc_update(void)
 	ib_rpc_t trap_rpc;
 	ib_mad_notice_attr_t notice;
 
-	if (ib_resolve_self(&selfportid, &selfport, NULL))
+	if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport))
 		IBERROR("can't resolve self");
 
-	if (ib_resolve_smlid(&sm_port, 0))
+	if (ib_resolve_smlid_via(&sm_port, 0, srcport))
 		IBERROR("can't resolve SM destination port");
 
 	memset(&trap_rpc, 0, sizeof(trap_rpc));
@@ -80,7 +82,7 @@ static int send_144_node_desc_update(void)
 	notice.data_details.ntc_144.change_flgs =
 	    TRAP_144_MASK_NODE_DESCRIPTION_CHANGE;
 
-	return (mad_send(&trap_rpc, &sm_port, NULL, &notice));
+	return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport));
 }
 
 typedef struct _trap_def {
@@ -137,7 +139,10 @@ int main(int argc, char **argv)
 	}
 
 	madrpc_show_errors(1);
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	return (send_trap(trap_name));
 }
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index bac74a9..0412027 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -91,6 +91,7 @@ IBMAD_1.3 {
 		mad_receive_via;
 		mad_respond_via;
 		mad_send;
+		mad_send_via;
 		smp_query;
 		smp_set;
 		ib_vendor_call;
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:51 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:51 -0800
Subject: [ofa-general] [PATCH 7/10] infiniband-diags: Convert ibtracert to
 "new" ibmad interface
Message-ID: <20090219190551.346fccb4.weiny2@llnl.gov>

>From 0961e0ce048950e65bb78578538cff38b2c8332d Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:58:36 -0800
Subject: [PATCH] infiniband-diags: Convert ibtracert to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibtracert.c |   36 ++++++++++++++++++++++++------------
 1 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index ea5662b..1965aa0 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -50,6 +50,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 #define MAXHOPS	63
 
 static char *node_type_str[] = {
@@ -116,10 +118,10 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 	void *pi = port->portinfo, *ni = node->nodeinfo, *nd = node->nodedesc;
 	char *s, *e;
 
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout))
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport))
 		return -1;
 
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout))
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport))
 		return -1;
 
 	for (s = nd, e = s + 64; s < e; s++) {
@@ -129,7 +131,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 			*s = ' ';
 	}
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport))
 		return -1;
 
 	mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid);
@@ -151,7 +153,7 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid)
 {
 	void *si = sw->switchinfo, *fdb = sw->fdb;
 
-	if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
+	if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport))
 		return -1;
 
 	mad_decode_field(si, IB_SW_LINEAR_FDB_CAP_F, &sw->linearcap);
@@ -160,7 +162,8 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid)
 	if (lid > sw->linearcap && lid > sw->linearFDBtop)
 		return -1;
 
-	if (!smp_query(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, timeout))
+	if (!smp_query_via(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64,
+			timeout, srcport))
 		return -1;
 
 	DEBUG("portid %s: forward lid %d to port %d",
@@ -382,7 +385,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid)
 
 	port->portnum = portnum;
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout,
+			srcport))
 		return -1;
 
 	mad_decode_field(pi, IB_PORT_LID_F, &port->lid);
@@ -439,7 +443,7 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map)
 
 	memset(map, 0, 256);
 
-	if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
+	if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport))
 		return -1;
 
 	mlid -= 0xc000;
@@ -453,8 +457,8 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map)
 	maxsets = (node->numports + 15) / 16;		/* round up */
 
 	for (set = 0; set < maxsets; set++) {
-		if (!smp_query(mdb, portid, IB_ATTR_MULTICASTFORWTBL,
-		    block | (set << 28), timeout))
+		if (!smp_query_via(mdb, portid, IB_ATTR_MULTICASTFORWTBL,
+		    block | (set << 28), timeout, srcport))
 			return -1;
 
 		for (i = 0; i < 16; i++, map++) {
@@ -746,13 +750,18 @@ int main(int argc, char **argv)
 	if (ibd_timeout)
 		timeout = ibd_timeout;
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+
 	node_name_map = open_node_name_map(node_name_map_file);
 
-	if (ib_resolve_portid_str(&src_portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&src_portid, argv[0], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve source port %s", argv[0]);
 
-	if (ib_resolve_portid_str(&dest_portid, argv[1], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&dest_portid, argv[1], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
 
 	if (ibd_dest_type == IB_DEST_DRPATH) {
@@ -796,5 +805,8 @@ int main(int argc, char **argv)
 	dump_mcpath(endnode, dumplevel);
 
 	close_node_name_map(node_name_map);
+
+	mad_rpc_close_port(srcport);
+
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:05:56 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:05:56 -0800
Subject: [ofa-general] [PATCH 8/10] infiniband-diags: Convert ibsysstat to
 "new" ibmad interface
Message-ID: <20090219190556.a831f6d3.weiny2@llnl.gov>

>From 1c19e419e04a98bcfe10b1c597856f43ea36668a Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 18:14:49 -0800
Subject: [PATCH] infiniband-diags: Convert ibsysstat to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibsysstat.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index cc1418d..d7daa37 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -48,6 +48,8 @@
 
 #define MAX_CPUS 8
 
+struct ibmad_port *srcport;
+
 enum ib_sysstat_attr_t {
 	IB_PING_ATTR = 0x10,
 	IB_HOSTINFO_ATTR = 0x11,
@@ -101,7 +103,7 @@ static int server_respond(void *umad, int size)
 	if (ibdebug > 1)
 		xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE);
 
-	if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad,
+	if (umad_send(mad_rpc_portid(srcport), mad_class_agent(rpc.mgtclass), umad,
 		      size, rpc.timeout, 0) < 0) {
 		DEBUG("send failed; %m");
 		return -1;
@@ -169,7 +171,7 @@ static char *ibsystat_serv(void)
 
 	DEBUG("starting to serve...");
 
-	while ((umad = mad_receive(buf, -1))) {
+	while ((umad = mad_receive_via(buf, -1, srcport))) {
 		if (umad_status(buf)) {
 			DEBUG("drop mad with status %x: %s", umad_status(buf),
 			      strerror(umad_status(buf)));
@@ -230,7 +232,7 @@ static char *ibsystat(ib_portid_t *portid, int attr)
 	if ((len = mad_build_pkt(buf, &rpc, portid, NULL, NULL)) < 0)
 		IBPANIC("cannot build packet.");
 
-	fd = madrpc_portid();
+	fd = mad_rpc_portid(srcport);
 	agent = mad_class_agent(rpc.mgtclass);
 	timeout = ibd_timeout ? ibd_timeout : MAD_DEF_TIMEOUT_MS;
 
@@ -334,10 +336,12 @@ int main(int argc, char **argv)
 	if (argc > 1 && (attr = match_attr(argv[1])) < 0)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (server) {
-		if (mad_register_server(sysstat_class, 1, 0, oui) < 0)
+		if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0)
 			IBERROR("can't serve class %d", sysstat_class);
 
 		host_ncpu = build_cpuinfo();
@@ -347,14 +351,16 @@ int main(int argc, char **argv)
 		exit(0);
 	}
 
-	if (mad_register_client(sysstat_class, 1) < 0)
+	if (mad_register_client_via(sysstat_class, 1, srcport) < 0)
 		IBERROR("can't register to sysstat class %d", sysstat_class);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+			ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	if ((err = ibsystat(&portid, attr)))
 		IBERROR("ibsystat to %s: %s", portid2str(&portid), err);
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:06:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:06:02 -0800
Subject: [ofa-general] [PATCH 9/10] infiniband-diags: Convert mcm_rereg_test
 to "new" ibmad interface
Message-ID: <20090219190602.2522876e.weiny2@llnl.gov>

>From 4dcd4839baaa7f3bc31d01d5e695fced36b53533 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 18:24:56 -0800
Subject: [PATCH] infiniband-diags: Convert mcm_rereg_test to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/mcm_rereg_test.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/mcm_rereg_test.c b/infiniband-diags/src/mcm_rereg_test.c
index 9285b95..b9d18a4 100644
--- a/infiniband-diags/src/mcm_rereg_test.c
+++ b/infiniband-diags/src/mcm_rereg_test.c
@@ -74,6 +74,8 @@ static ibmad_gid_t mgid_ipoib = {
 	0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
 };
 
+struct ibmad_port *srcport;
+
 uint64_t build_mcm_rec(uint8_t *data, ibmad_gid_t mgid, ibmad_gid_t port_gid)
 {
 	memset(data, 0, IB_SA_DATA_SIZE);
@@ -436,10 +438,13 @@ int main(int argc, char **argv)
 	if (argc > 1)
 		guid_file = argv[1];
 
-	madrpc_init(NULL, 0, mgmt_classes, 2);
+	srcport = mad_rpc_open_port(NULL, 0, mgmt_classes, 2);
+	if (!srcport)
+		err("Failed to open port");
+
 
 #if 1
-	ib_resolve_smlid(&dport_id, TMO);
+	ib_resolve_smlid_via(&dport_id, TMO, srcport);
 #else
 	memset(&dport_id, 0, sizeof(dport_id));
 	dport_id.lid = 1;
@@ -457,7 +462,7 @@ int main(int argc, char **argv)
 	}
 
 #if 1
-	port = madrpc_portid();
+	port = mad_rpc_portid(srcport);
 #else
 	ret = umad_init();
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Feb 19 19:06:08 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 19 Feb 2009 19:06:08 -0800
Subject: [ofa-general] [PATCH 10/10] infiniband-diags: Convert perfquery,
 saquery, sminfo, smpquery, and vendstat to "new" ibmad interface
Message-ID: <20090219190608.f8fd4a02.weiny2@llnl.gov>

>From e809dfacb08e6c2237ad2d0f197d1227654dde87 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 18:53:10 -0800
Subject: [PATCH] infiniband-diags: Convert perfquery, saquery, sminfo, smpquery, and vendstat to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/perfquery.c |   35 +++++++++++++++++--------
 infiniband-diags/src/saquery.c   |    9 ++++--
 infiniband-diags/src/sminfo.c    |   18 +++++++++---
 infiniband-diags/src/smpquery.c  |   53 +++++++++++++++++++++++--------------
 infiniband-diags/src/vendstat.c  |   19 ++++++++-----
 5 files changed, 88 insertions(+), 46 deletions(-)

diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 6292743..2f104b8 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -47,6 +47,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 struct perf_count {
 	uint32_t portselect;
 	uint32_t counterselect;
@@ -269,7 +271,7 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
 	char buf[1024];
 
 	if (extended != 1) {
-		if (!port_performance_query(pc, portid, port, timeout))
+		if (!port_performance_query_via(pc, portid, port, timeout, srcport))
 			IBERROR("perfquery");
 		if (!(cap_mask & 0x1000)) {
 			/* if PortCounters:PortXmitWait not suppported clear this counter */
@@ -284,7 +286,7 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
 		if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */
 			IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask);
 
-		if (!port_performance_ext_query(pc, portid, port, timeout))
+		if (!port_performance_ext_query_via(pc, portid, port, timeout, srcport))
 			IBERROR("perfextquery");
 		if (aggregate)
 			aggregate_perfcounters_ext();
@@ -299,10 +301,12 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
 static void reset_counters(int extended, int timeout, int mask, ib_portid_t *portid, int port)
 {
 	if (extended != 1) {
-		if (!port_performance_reset(pc, portid, port, mask, timeout))
+		if (!port_performance_reset_via(pc, portid, port, mask,
+				timeout, srcport))
 			IBERROR("perf reset");
 	} else {
-		if (!port_performance_ext_reset(pc, portid, port, mask, timeout))
+		if (!port_performance_ext_reset_via(pc, portid, port, mask,
+				timeout, srcport))
 			IBERROR("perf ext reset");
 	}
 }
@@ -382,18 +386,22 @@ int main(int argc, char **argv)
 	if (argc > 2)
 		mask = strtoul(argv[2], 0, 0);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
-		if (ib_resolve_self(&portid, &port, 0) < 0)
+		if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0)
 			IBERROR("can't resolve self port %s", argv[0]);
 	}
 
 	/* PerfMgt ClassPortInfo is a required attribute */
-	if (!perf_classportinfo_query(pc, &portid, port, ibd_timeout))
+	if (!perf_classportinfo_query_via(pc, &portid, port,
+			ibd_timeout, srcport))
 		IBERROR("classportinfo query");
 	/* ClassPortInfo should be supported as part of libibmad */
 	memcpy(&cap_mask, pc + 2, sizeof(cap_mask));	/* CapabilityMask */
@@ -406,7 +414,8 @@ int main(int argc, char **argv)
 	}
 
 	if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) {
-		if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0)
+		if (smp_query_via(data, &portid, IB_ATTR_NODE_INFO, 0, 0,
+				srcport) < 0)
 			IBERROR("smp query nodeinfo failed");
 		node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
 		mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports);
@@ -414,7 +423,8 @@ int main(int argc, char **argv)
 			IBERROR("smp query nodeinfo: num ports invalid");
 
 		if (node_type == IB_NODE_SWITCH) {
-			if (smp_query(data, &portid, IB_ATTR_SWITCH_INFO, 0, 0) < 0)
+			if (smp_query_via(data, &portid, IB_ATTR_SWITCH_INFO,
+					0, 0, srcport) < 0)
 				IBERROR("smp query nodeinfo failed");
 			enhancedport0 = mad_get_field(data, 0, IB_SW_ENHANCED_PORT0_F);
 			if (enhancedport0)
@@ -441,8 +451,10 @@ int main(int argc, char **argv)
 	else
 		dump_perfcounters(extended, ibd_timeout, cap_mask, &portid, port, 0);
 
-	if (!reset)
+	if (!reset) {
+		mad_rpc_close_port(srcport);
 		exit(0);
+	}
 
 do_reset:
 
@@ -456,5 +468,6 @@ do_reset:
 	else
 		reset_counters(extended, ibd_timeout, mask, &portid, port);
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 9726d22..e6cbe50 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1316,12 +1316,15 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h,
 
 static bind_handle_t get_bind_handle(void)
 {
+	static struct ibmad_port *srcport;
 	static struct bind_handle handle;
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
-	ib_resolve_smlid(&handle.dport, ibd_timeout);
+	ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport);
 	if (!handle.dport.lid)
 		IBPANIC("No SM found.");
 
@@ -1329,7 +1332,7 @@ static bind_handle_t get_bind_handle(void)
 	if (!handle.dport.qkey)
 		handle.dport.qkey = IB_DEFAULT_QP1_QKEY;
 
-	handle.fd = madrpc_portid();
+	handle.fd = mad_rpc_portid(srcport);
 	handle.agent = umad_register(handle.fd, IB_SA_CLASS, 2, 1, NULL);
 
 	return &handle;
diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
index 549cb81..ebf6a47 100644
--- a/infiniband-diags/src/sminfo.c
+++ b/infiniband-diags/src/sminfo.c
@@ -48,6 +48,8 @@
 
 static uint8_t sminfo[1024];
 
+struct ibmad_port *srcport;
+
 int strdata, xdata=1, bindata;
 enum {
 	SMINFO_NOTACT,
@@ -113,13 +115,16 @@ int main(int argc, char **argv)
 	if (argc > 1)
 		mod = atoi(argv[1]);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, 0) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				0, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
-		if (ib_resolve_smlid(&portid, ibd_timeout) < 0)
+		if (ib_resolve_smlid_via(&portid, ibd_timeout, srcport) < 0)
 			IBERROR("can't resolve sm port %s", argv[0]);
 	}
 
@@ -130,10 +135,12 @@ int main(int argc, char **argv)
 	mad_encode_field(sminfo, IB_SMINFO_STATE_F, &state);
 
 	if (mod) {
-		if (!(p = smp_set(sminfo, &portid, IB_ATTR_SMINFO, mod, ibd_timeout)))
+		if (!(p = smp_set_via(sminfo, &portid, IB_ATTR_SMINFO, mod,
+				ibd_timeout, srcport)))
 			IBERROR("query");
 	} else
-		if (!(p = smp_query(sminfo, &portid, IB_ATTR_SMINFO, 0, ibd_timeout)))
+		if (!(p = smp_query_via(sminfo, &portid, IB_ATTR_SMINFO, 0,
+				ibd_timeout, srcport)))
 			IBERROR("query");
 
 	mad_decode_field(sminfo, IB_SMINFO_GUID_F, &guid);
@@ -145,5 +152,6 @@ int main(int argc, char **argv)
 	printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %u priority %d state %d %s\n",
 		portid.lid, guid, act, prio, state, STATESTR(state));
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
index bf1626d..2ed1e65 100644
--- a/infiniband-diags/src/smpquery.c
+++ b/infiniband-diags/src/smpquery.c
@@ -51,6 +51,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc);
 
 typedef struct match_rec {
@@ -88,13 +90,13 @@ node_desc(ib_portid_t *dest, char **argv, int argc)
 	char      dots[128];
 	char     *nodename = NULL;
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return "node info query failed";
 
 	mad_decode_field(data, IB_NODE_TYPE_F, &node_type);
 	mad_decode_field(data, IB_NODE_GUID_F, &node_guid);
 
-	if (!smp_query(nd, dest, IB_ATTR_NODE_DESC, 0, 0))
+	if (!smp_query_via(nd, dest, IB_ATTR_NODE_DESC, 0, 0, srcport))
 		return "node desc query failed";
 
 	nodename = remap_node_name(node_name_map, node_guid, nd);
@@ -119,7 +121,7 @@ node_info(ib_portid_t *dest, char **argv, int argc)
 	char buf[2048];
 	char data[IB_SMP_DATA_SIZE];
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return "node info query failed";
 
 	mad_dump_nodeinfo(buf, sizeof buf, data, sizeof data);
@@ -138,7 +140,7 @@ port_info(ib_portid_t *dest, char **argv, int argc)
 	if (argc > 0)
 		portnum = strtol(argv[0], 0, 0);
 
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return "port info query failed";
 
 	mad_dump_portinfo(buf, sizeof buf, data, sizeof data);
@@ -153,7 +155,7 @@ switch_info(ib_portid_t *dest, char **argv, int argc)
 	char buf[2048];
 	char data[IB_SMP_DATA_SIZE];
 
-	if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0, 0, srcport))
 		return "switch info query failed";
 
 	mad_dump_switchinfo(buf, sizeof buf, data, sizeof data);
@@ -176,7 +178,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc)
 		portnum = strtol(argv[0], 0, 0);
 
 	/* Get the partition capacity */
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return "node info query failed";
 
 	mad_decode_field(data, IB_NODE_TYPE_F, &t);
@@ -185,7 +187,8 @@ pkey_table(ib_portid_t *dest, char **argv, int argc)
 		return "invalid port number";
 
 	if ((t == IB_NODE_SWITCH) && (portnum != 0)) {
-		if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0))
+		if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0, 0,
+				srcport))
 			return "switch info failed";
 		mad_decode_field(data, IB_SW_PARTITION_ENFORCE_CAP_F, &n);
 	} else
@@ -193,7 +196,8 @@ pkey_table(ib_portid_t *dest, char **argv, int argc)
 
 	for (i = 0; i < (n + 31) / 32; i++) {
 		mod =  i | (portnum << 16);
-		if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0))
+		if (!smp_query_via(data, dest, IB_ATTR_PKEY_TBL, mod, 0,
+				srcport))
 			return "pkey table query failed";
 		if (i + 1 == (n + 31) / 32)
 			k = ((n + 7 - i * 32) / 8) * 8;
@@ -220,7 +224,7 @@ static char *sl2vl_dump_table_entry(ib_portid_t *dest, int in, int out)
 	char data[IB_SMP_DATA_SIZE];
 	int portnum = (in << 8) | out;
 
-	if (!smp_query(data, dest, IB_ATTR_SLVL_TABLE, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_SLVL_TABLE, portnum, 0, srcport))
 		return "slvl query failed";
 
 	mad_dump_sltovl(buf, sizeof buf, data, sizeof data);
@@ -240,7 +244,7 @@ sl2vl_table(ib_portid_t *dest, char **argv, int argc)
 	if (argc > 0)
 		portnum = strtol(argv[0], 0, 0);
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return "node info query failed";
 
 	mad_decode_field(data, IB_NODE_TYPE_F, &type);
@@ -270,8 +274,8 @@ static char *vlarb_dump_table_entry(ib_portid_t *dest, int portnum, int offset,
 	char buf[2048];
 	char data[IB_SMP_DATA_SIZE];
 
-	if (!smp_query(data, dest, IB_ATTR_VL_ARBITRATION,
-			(offset << 16) | portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_VL_ARBITRATION,
+			(offset << 16) | portnum, 0, srcport))
 		return "vl arb query failed";
 	mad_dump_vlarbitration(buf, sizeof(buf), data, cap * 2);
 	printf("%s", buf);
@@ -305,12 +309,14 @@ vlarb_table(ib_portid_t *dest, char **argv, int argc)
 
 	/* port number of 0 could mean SP0 or port MAD arrives on */
 	if (portnum == 0) {
-		if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+		if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0,
+				srcport))
 			return "node info query failed";
 
 		mad_decode_field(data, IB_NODE_TYPE_F, &type);
 		if (type == IB_NODE_SWITCH) {
-			if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0))
+			if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0,
+					0, srcport))
 				return "switch info query failed";
 			mad_decode_field(data, IB_SW_ENHANCED_PORT0_F, &enhsp0);
 			if (!enhsp0) {
@@ -321,7 +327,7 @@ vlarb_table(ib_portid_t *dest, char **argv, int argc)
 		}
 	}
 
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return "port info query failed";
 
 	mad_decode_field(data, IB_PORT_VL_ARBITRATION_LOW_CAP_F, &lowcap);
@@ -349,13 +355,14 @@ guid_info(ib_portid_t *dest, char **argv, int argc)
 	int n;
 
 	/* Get the guid capacity */
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, 0, 0, srcport))
 		return "port info failed";
 	mad_decode_field(data, IB_PORT_GUID_CAP_F, &n);
 
 	for (i = 0; i < (n + 7) / 8; i++) {
 		mod =  i;
-		if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0))
+		if (!smp_query_via(data, dest, IB_ATTR_GUID_INFO, mod, 0,
+				srcport))
 			return "guid info query failed";
 		if (i + 1 == (n + 7) / 8)
 			k = ((n + 1 - i * 8) / 2) * 2;
@@ -445,11 +452,15 @@ int main(int argc, char **argv)
 	if (!(fn = match_op(argv[0])))
 		IBERROR("operation '%s' not supported", argv[0]);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+
 	node_name_map = open_node_name_map(node_name_map_file);
 
 	if (ibd_dest_type != IB_DEST_DRSLID) {
-		if (ib_resolve_portid_str(&portid, argv[1], ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[1], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[1]);
 		if ((err = fn(&portid, argv+2, argc-2)))
 			IBERROR("operation %s: %s", argv[0], err);
@@ -458,11 +469,13 @@ int main(int argc, char **argv)
 
 		memset(concat, 0, 64);
 		snprintf(concat, sizeof(concat), "%s %s", argv[1], argv[2]);
-		if (ib_resolve_portid_str(&portid, concat, ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, concat, ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", concat);
 		if ((err = fn(&portid, argv+3, argc-3)))
 			IBERROR("operation %s: %s", argv[0], err);
 	}
 	close_node_name_map(node_name_map);
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c
index db87e38..d001a01 100644
--- a/infiniband-diags/src/vendstat.c
+++ b/infiniband-diags/src/vendstat.c
@@ -55,6 +55,8 @@
 /* Config space addresses */
 #define IB_MLX_IS3_PORT_XMIT_WAIT	0x10013C
 
+struct ibmad_port *srcport;
+
 typedef struct {
 	uint16_t hw_revision;
 	uint16_t device_id;
@@ -152,13 +154,16 @@ int main(int argc, char **argv)
 	if (argc > 1)
 		port = strtoul(argv[1], 0, 0);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
-		if (ib_resolve_self(&portid, &port, 0) < 0)
+		if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0)
 			IBERROR("can't resolve self port %s", argv[0]);
 	}
 
@@ -180,12 +185,12 @@ int main(int argc, char **argv)
 	memset(&buf, 0, sizeof(buf));
 	/* vendor ClassPortInfo is required attribute if class supported */
 	call.attrid = CLASS_PORT_INFO;
-	if (!ib_vendor_call(&buf, &portid, &call))
+	if (!ib_vendor_call_via(&buf, &portid, &call, srcport))
 		IBERROR("classportinfo query");
 
 	memset(&buf, 0, sizeof(buf));
 	call.attrid = IB_MLX_IS3_GENERAL_INFO;
-	if (!ib_vendor_call(&buf, &portid, &call))
+	if (!ib_vendor_call_via(&buf, &portid, &call, srcport))
 		IBERROR("vendstat");
 	gi = (is3_general_info_t *)&buf;
 
@@ -217,7 +222,7 @@ int main(int argc, char **argv)
 		cs = (is3_config_space_t *)&buf;
 		for (i = 0; i < 16; i++)
 			cs->record[i].address = htonl(IB_MLX_IS3_PORT_XMIT_WAIT + ((i + 1) << 12));
-		if (!ib_vendor_call(&buf, &portid, &call))
+		if (!ib_vendor_call_via(&buf, &portid, &call, srcport))
 			IBERROR("vendstat");
 
 		for (i = 0; i < 16; i++)
@@ -232,7 +237,7 @@ int main(int argc, char **argv)
 		cs = (is3_config_space_t *)&buf;
 		for (i = 0; i < 8; i++)
 			cs->record[i].address = htonl(IB_MLX_IS3_PORT_XMIT_WAIT + ((i + 17) << 12));
-		if (!ib_vendor_call(&buf, &portid, &call))
+		if (!ib_vendor_call_via(&buf, &portid, &call, srcport))
 			IBERROR("vendstat");
 
 		for (i = 0; i < 8; i++)
-- 
1.5.4.5
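
For readers skimming the series: every tool converted in these patches follows the same shape. It opens a port, checks the returned handle for NULL, passes that handle to each *_via() call, and closes the port before exiting. The sketch below shows only that shape and does not link against libibmad. demo_port, demo_open_port(), demo_query_via() and demo_close_port() are hypothetical stand-ins for struct ibmad_port, mad_rpc_open_port(), smp_query_via() and mad_rpc_close_port(), and the stub bodies are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ibmad_port: per-port state lives in
 * the handle instead of in library globals. */
struct demo_port {
	int port_id;	/* stand-in for the umad port id */
	int is_open;
};

/* Stand-in for mad_rpc_open_port(): returns NULL on failure. */
static struct demo_port *demo_open_port(int dev_port)
{
	struct demo_port *p;

	if (dev_port < 0)
		return NULL;
	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	p->port_id = dev_port;
	p->is_open = 1;
	return p;
}

/* Stand-in for a *_via() call: the source port is an explicit argument.
 * Returns 0 on success, -1 on failure. */
static int demo_query_via(int *result, const struct demo_port *srcport)
{
	assert(result != NULL);
	if (!srcport || !srcport->is_open)
		return -1;
	*result = srcport->port_id;
	return 0;
}

/* Stand-in for mad_rpc_close_port(). */
static void demo_close_port(struct demo_port *p)
{
	free(p);
}

/* The shape each converted tool's main() takes in this series. */
static int demo_tool(int dev_port)
{
	struct demo_port *srcport;
	int value = -1;

	srcport = demo_open_port(dev_port);
	if (!srcport) {
		fprintf(stderr, "Failed to open port '%d'\n", dev_port);
		return -1;
	}
	if (demo_query_via(&value, srcport) < 0)
		value = -1;
	demo_close_port(srcport);
	return value;
}
```

Calling demo_tool() with a valid port number returns the value read through the handle; a negative port number takes the "Failed to open" path, just as the converted tools bail out when mad_rpc_open_port() returns NULL.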


From Jie.Cai at cs.anu.edu.au  Thu Feb 19 19:39:10 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 20 Feb 2009 14:39:10 +1100
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: <E3280858FA94444CA49D2BA02341C9833A7C6173@orsmsx506.amr.corp.intel.com>
References: <499CBEF2.2010909@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A7C6173@orsmsx506.amr.corp.intel.com>
Message-ID: <499E25DE.5020703@cs.anu.edu.au>


Davis, Arlin R wrote:
>  
>   
>> if (initiator) {
>>     ret = dat_ib_post_rdma_write_immed(   h_ep,        // 
>>
>> However, at remote side I got the following error message 
>> indicates that 
>> no event coming through.
>>
>> 5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
>> 5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED
>>
>> The return of dat_evd_wait is DAT_TIMEOUT_EXPIRED.
>>
>>     
>
> Does the initiator side complete successfully?
>   
Yes, the initiator completes successfully.

> Do you have receive's posted at the remote side for immed data?
>   
Nope, the remote side didn't get an event (dat_evd_wait timed out).
The way to find the immed data is to check the outgoing
parameter &event of the dat_evd_wait function.

&event.event_extension_data[0]->val.immed.data has not been set yet.
> You can look at dtestx source for an immed data example.
>   
Yes, I did check this test program. I use
dat_ib_post_rdma_write_immed
the same way dtestx does.

Thanks,
Jie

> -arlin
>
>
>
>   


From rdreier at cisco.com  Thu Feb 19 22:50:21 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 19 Feb 2009 22:50:21 -0800
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <200902171742.38223.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 17 Feb 2009 17:42:38 +0200")
References: <200902171742.38223.jackm@dev.mellanox.co.il>
Message-ID: <ada3ae9zjo2.fsf@cisco.com>

 > We have found a race condition in sysfs.c which occurs when unloading low-level modules
 > (e.g., mlx4_ib) in the driver.
 > Specifically:
 > 
 > Although the kernel takes reference counts on sysfs files, it does not take such counts
 > on modules which implement attribute reads.
 > 
 > For example, we have:
 > static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 > 			      char *buf)
 > {
 > 	struct port_table_attribute *tab_attr =
 > 		container_of(attr, struct port_table_attribute, attr);
 > 	u16 pkey;
 > 	ssize_t ret;
 > ====>race condition HERE *****
 > 	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
 > 	if (ret)
 > 		return ret;
 > 
 > 	return sprintf(buf, "0x%04x\n", pkey);
 > }

I've not been able to reproduce this on a current kernel.  I tried
adding the patch below to make the race window very big:

--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -273,6 +273,9 @@ static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 	u16 pkey;
 	ssize_t ret;
 
+	printk(KERN_ERR "enter show_port_pkey\n");
+	msleep(10000);
+	printk(KERN_ERR "call ib_query_pkey\n");
 	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
 	if (ret)
 		return ret;

so show_port_pkey() waits 10 seconds before actually calling
ib_query_pkey().  Then I do something like

    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0

in one shell and immediately (during the 10 second window before
ib_query_pkey() is called):

    modprobe -r mlx4_ib

in another shell.  And I see that the mlx4_ib module is not removed
until the read of the pkey file completes; this is as I would expect,
since the sysfs delete of the pkey file should wait until there are no
open fds for that file.
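
To make the expected ordering concrete, here is a small user-space model of it (not kernel code): a removal requested while a read is in flight only completes once the last reader has finished. All names (demo_attr, demo_read_begin(), demo_read_end(), demo_remove(), demo_replay()) are hypothetical. The real sysfs code blocks the remover rather than recording a pending flag, so treat this purely as a sketch of the ordering under that simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one sysfs attribute file. */
struct demo_attr {
	int active;		/* reads currently in flight */
	bool remove_pending;	/* removal requested while busy */
	bool removed;		/* file is gone, module may unload */
};

/* A read starts: fails if the file is gone or going away. */
static bool demo_read_begin(struct demo_attr *a)
{
	if (a->removed || a->remove_pending)
		return false;
	a->active++;
	return true;
}

/* A read finishes: the last reader lets a pending removal complete. */
static void demo_read_end(struct demo_attr *a)
{
	assert(a->active > 0);
	if (--a->active == 0 && a->remove_pending) {
		a->remove_pending = false;
		a->removed = true;
	}
}

/* Removal (think "modprobe -r"): immediate only if nobody is reading. */
static void demo_remove(struct demo_attr *a)
{
	if (a->active > 0)
		a->remove_pending = true;
	else
		a->removed = true;
}

/* Replays the test: cat starts, modprobe -r runs, cat finishes.
 * Returns true if removal waited for the read. */
static bool demo_replay(void)
{
	struct demo_attr a = { 0, false, false };
	bool waited;

	if (!demo_read_begin(&a))
		return false;
	demo_remove(&a);
	waited = !a.removed;		/* still present during the read */
	demo_read_end(&a);
	return waited && a.removed;	/* gone only after the read ends */
}
```

In this model demo_replay() mirrors the two-shell experiment: the "cat" holds the attribute active, the "modprobe -r" is deferred, and the removal finishes only when the read returns.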

What test are you using to hit this race?  Are you using a distro kernel
with OFED?

 - R.


From Zhen.Liang at Sun.COM  Fri Feb 20 00:21:58 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Fri, 20 Feb 2009 16:21:58 +0800
Subject: Re: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
	<7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
Message-ID: <499E6826.704@sun.com>

Hmm, I don't see any problem in your code. Did you install
ofa_kernel_devel (the OFED kernel headers) after installing
ofa_kernel_1_3_1?

Regards
Liang

neutron:
> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version:
> 2.6.18-53.1.14.el5,  ofed 1.3.1.
>
> The failed function call is like:
>
> {
>
> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
>                 &dma_addr, GFP_KERNEL);
>
> ctx->phy_buf[0].addr = dma_addr;
> ctx->phy_buf[0].size = MAX_SIZE;
> ctx->iovstart = (u64) ctx->send_buf;
>
> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n",
>        ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart );
>
> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1,
>                         IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ
>                          | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart));
> }
>
> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf".
>
> Below is /var/log/messages output around the crash.
> ----------------
> Feb 19 12:50:22 wci30 kernel:  pd=ffff8101da3ddce0,
> phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000
>
> Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer
> dereference at 0000000000000000
>  RIP:
> Feb 19 12:50:22 wci30 kernel:  [<0000000000000000>] _stext+0x7ffff000/0x1000
> Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0
> Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP
> Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version
> Feb 19 12:50:22 wci30 kernel: CPU 0
> Feb 19 12:54:05 wci30 syslogd 1.4.1: restart.
> Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5
> (brewbuilder at hs20-bc2-3.build.redha
> t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb
> 19 07:18:46 EST 2008
> Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet
>
> ====================
> It's strange that the kernel doesn't print out the function call stack
> before crashing.
>
> Any hints?  Thanks a lot!
>
> On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier <rdreier at cisco.com> wrote:
>   
>>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>>  > are valid.  But the system always crashes immediately after entering
>>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>>
>> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
>> you get an oops message?  If so that would be very important info for
>> debugging this.
>>
>> - R.
>>
>>     
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From vlad at lists.openfabrics.org  Fri Feb 20 03:15:07 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 20 Feb 2009 03:15:07 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090220-0200 daily build status
Message-ID: <20090220111507.69DF9E301F8@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Fri Feb 20 05:41:56 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 20 Feb 2009 08:41:56 -0500
Subject: Re: [ofa-general] [PATCH 1/10] libibmad: Clean up "new"
	interface
In-Reply-To: <20090219190525.322681b8.weiny2@llnl.gov>
References: <20090219190525.322681b8.weiny2@llnl.gov>
Message-ID: <f0e08f230902200541x5869effbv64b2f782d5f9cdec@mail.gmail.com>

On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 18 Feb 2009 16:37:36 -0800
> Subject: [PATCH] libibmad: Clean up "new" interface
>
>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
>      which mirrors madrpc_portid(void)
>   Mark all "old" functions with __attribute__ ((deprecated))
>
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
>  libibmad/src/gs.c                 |   19 +++---
>  libibmad/src/libibmad.map         |    1 +
>  libibmad/src/resolve.c            |   10 ++-
>  libibmad/src/rpc.c                |   29 ++++----
>  libibmad/src/sa.c                 |    4 +-
>  libibmad/src/smp.c                |    4 +-
>  7 files changed, 118 insertions(+), 88 deletions(-)
>
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 1aaaa1b..80e38be 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
>  }
>
>  /* rpc.c */
> -MAD_EXPORT int madrpc_portid(void);
> -MAD_EXPORT int madrpc_set_retries(int retries);
> -MAD_EXPORT int madrpc_set_timeout(int timeout);
> -void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
> -void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
> -                 void *data);
> +MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated));
> +void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata)
> +               __attribute__ ((deprecated));
> +void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
> +               __attribute__ ((deprecated));
>  MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> -                           int num_classes);
> -void madrpc_save_mad(void *madbuf, int len);
> -MAD_EXPORT void madrpc_show_errors(int set);
> +                           int num_classes) __attribute__ ((deprecated));
> +void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated));

Should there be a mad_rpc_save_mad in the new interface ? It looks
like it would only need some additional parameters as part of
ibmad_port struct.
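
A rough sketch of how such a per-port variant could look, offered only as an illustration. demo_port, its save_mad and save_mad_len fields, demo_save_mad() and demo_capture() are hypothetical names; the real struct ibmad_port layout and the place where rpc.c would consume the buffer are assumptions here and may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical port handle carrying the save-buffer state that a
 * global-state interface would keep in file-scope variables. */
struct demo_port {
	void *save_mad;		/* caller's buffer, NULL when not armed */
	size_t save_mad_len;	/* capacity of that buffer */
};

/* Hypothetical per-port counterpart of madrpc_save_mad(): arm a
 * one-shot capture of the next MAD handled on this port. */
static void demo_save_mad(struct demo_port *port, void *madbuf, size_t len)
{
	assert(port != NULL);
	port->save_mad = madbuf;
	port->save_mad_len = len;
}

/* Called where the MAD is available: copy at most save_mad_len bytes,
 * then disarm. Returns the number of bytes copied. */
static size_t demo_capture(struct demo_port *port, const void *mad, size_t len)
{
	size_t n;

	if (!port->save_mad)
		return 0;
	n = len < port->save_mad_len ? len : port->save_mad_len;
	memcpy(port->save_mad, mad, n);
	port->save_mad = NULL;
	return n;
}
```

Keeping the buffer pointer and length in the handle means two ports opened by the same process would not overwrite each other's capture state, which is the main thing the extra parameters would buy.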

> -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> +/* New interface */

Nit: /* rpc.c: new interface */

-- Hal

> +MAD_EXPORT void madrpc_show_errors(int set);
> +MAD_EXPORT int madrpc_set_retries(int retries);
> +MAD_EXPORT int madrpc_set_timeout(int timeout);
> +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
>                        int num_classes);
> -void mad_rpc_close_port(void *ibmad_port);
> -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> -             void *payload, void *rcvdata);
> -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> -                  ib_rmpp_hdr_t * rmpp, void *data);
> +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
> +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> +                       void *payload, void *rcvdata);
> +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> +                       ib_rmpp_hdr_t * rmpp, void *data);
> +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
>
>  /* smp.c */
>  MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
> -                             unsigned mod, unsigned timeout);
> +                     unsigned mod, unsigned timeout) __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
> -                           unsigned mod, unsigned timeout);
> +                   unsigned mod, unsigned timeout) __attribute__ ((deprecated));
> +
> +/* smp.c new interface */
>  MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> -                      unsigned mod, unsigned timeout, const void *srcport);
> -uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> -                    unsigned timeout, const void *srcport);
> +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> +                    unsigned timeout, const struct ibmad_port *srcport);
>
>  /* sa.c */
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> -                unsigned timeout);
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> +                unsigned timeout) __attribute__ ((deprecated));
> +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id,
> +               void *buf) __attribute__ ((deprecated));
> +
> +/* sa.c new interface */
> +MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
>                     ib_sa_call_t * sa, unsigned timeout);
> -MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
> +       /* returns lid */
>
>  /* resolve.c */
> -MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
> +MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
> +                               __attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
> -                              ib_portid_t * sm_id, int timeout);
> +                              ib_portid_t * sm_id, int timeout)
> +                               __attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> -                                    enum MAD_DEST dest, ib_portid_t * sm_id);
> +                                    enum MAD_DEST dest, ib_portid_t * sm_id)
> +                               __attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
> -                              ibmad_gid_t * gid);
> -
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> -int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -                       ib_portid_t * sm_id, int timeout, const void *srcport);
> -int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> +                              ibmad_gid_t * gid)
> +                               __attribute__ ((deprecated));
> +
> +/* resolve.c new interface */
> +MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> +                       ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>                              enum MAD_DEST dest, ib_portid_t * sm_id,
> -                             const void *srcport);
> -int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -                       const void *srcport);
> +                             const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> +                       const struct ibmad_port *srcport);
>
>  /* gs.c */
>  MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
> -                                            int port, unsigned timeout);
> +                                            int port, unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest,
> -                                          int port, unsigned timeout);
> +                                          int port, unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest,
>                                           int port, unsigned mask,
> -                                          unsigned timeout);
> +                                          unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest,
> -                                              int port, unsigned timeout);
> +                                              int port, unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest,
>                                               int port, unsigned mask,
> -                                              unsigned timeout);
> +                                              unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
> -                                              int port, unsigned timeout);
> +                                              int port, unsigned timeout)
> +                                               __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
> -                                             int port, unsigned timeout);
> +                                             int port, unsigned timeout)
> +                                               __attribute__ ((deprecated));
>
> -uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> +/* gs.c new interface */
> +MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned timeout,
> -                                     const void *srcport);
> -uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -                                   unsigned timeout, const void *srcport);
> -uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> +                                     const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> +                                   unsigned timeout, const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>                                    unsigned mask, unsigned timeout,
> -                                   const void *srcport);
> -uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> +                                   const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport);
> -uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> +                                       const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned mask,
> -                                       unsigned timeout, const void *srcport);
> -uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> +                                       unsigned timeout,
> +                                       const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport);
> -uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> +                                       const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>                                       int port, unsigned timeout,
> -                                      const void *srcport);
> +                                      const struct ibmad_port *srcport);
>  /* dump.c */
>  MAD_EXPORT ib_mad_dump_fn
>     mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
> index d2c4574..e302caf 100644
> --- a/libibmad/src/gs.c
> +++ b/libibmad/src/gs.c
> @@ -47,7 +47,7 @@
>
>  static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
>                              unsigned timeout, unsigned id,
> -                             const void *srcport)
> +                             const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>        int lid = dest->lid;
> @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
>
>  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned timeout,
> -                                     const void *srcport)
> +                                     const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
>                             srcport);
> @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
>  }
>
>  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -                                   unsigned timeout, const void *srcport)
> +                                   unsigned timeout, const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_COUNTERS, srcport);
> @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                      int port, unsigned mask, unsigned timeout,
> -                                     unsigned id, const void *srcport)
> +                                     unsigned id, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>        int lid = dest->lid;
> @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>                                    unsigned mask, unsigned timeout,
> -                                   const void *srcport)
> +                                   const struct ibmad_port *srcport)
>  {
>        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>                                     IB_GSI_PORT_COUNTERS, srcport);
> @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport)
> +                                       const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned mask,
> -                                       unsigned timeout, const void *srcport)
> +                                       unsigned timeout,
> +                                       const struct ibmad_port *srcport)
>  {
>        return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>                                     IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned timeout,
> -                                       const void *srcport)
> +                                       const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_SAMPLES_CONTROL, srcport);
> @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
>
>  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>                                       int port, unsigned timeout,
> -                                      const void *srcport)
> +                                      const struct ibmad_port *srcport)
>  {
>        return pma_query_via(rcvbuf, dest, port, timeout,
>                             IB_GSI_PORT_SAMPLES_RESULT, srcport);
> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> index f944d86..94d7762 100644
> --- a/libibmad/src/libibmad.map
> +++ b/libibmad/src/libibmad.map
> @@ -69,6 +69,7 @@ IBMAD_1.3 {
>                mad_rpc_close_port;
>                mad_rpc;
>                mad_rpc_rmpp;
> +               mad_rpc_portid;
>                madrpc;
>                madrpc_def_timeout;
>                madrpc_init;
> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> index 553949d..3291f43 100644
> --- a/libibmad/src/resolve.c
> +++ b/libibmad/src/resolve.c
> @@ -45,7 +45,8 @@
>  #undef DEBUG
>  #define DEBUG  if (ibdebug)    IBWARN
>
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
> +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t self = { 0 };
>        uint8_t portinfo[64];
> @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
>  }
>
>  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -                       ib_portid_t * sm_id, int timeout, const void *srcport)
> +                       ib_portid_t * sm_id, int timeout,
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t sm_portid;
>        char buf[IB_SA_DATA_SIZE] = { 0 };
> @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>                              enum MAD_DEST dest_type, ib_portid_t * sm_id,
> -                             const void *srcport)
> +                             const struct ibmad_port *srcport)
>  {
>        uint64_t guid;
>        int lid;
> @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
>  }
>
>  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -                       const void *srcport)
> +                       const struct ibmad_port *srcport)
>  {
>        ib_portid_t self = { 0 };
>        uint8_t portinfo[64];
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index e811526..d47873b 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -100,6 +100,11 @@ int madrpc_portid(void)
>        return mad_portid;
>  }
>
> +int mad_rpc_portid(struct ibmad_port *srcport)
> +{
> +       return (srcport->port_id);
> +}
> +
>  static int
>  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>           int timeout)
> @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>        return -1;
>  }
>
> -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>              void *payload, void *rcvdata)
>  {
> -       const struct ibmad_port *p = port_id;
>        int status, len;
>        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>
> @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
>                return 0;
>
> -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -                             p->class_agents[rpc->mgtclass],
> +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +                             port->class_agents[rpc->mgtclass],
>                              len, rpc->timeout)) < 0) {
>                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>                return 0;
> @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        return rcvdata;
>  }
>
> -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>                   ib_rmpp_hdr_t * rmpp, void *data)
>  {
> -       const struct ibmad_port *p = port_id;
>        int status, len;
>        uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>
> @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>        if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
>                return 0;
>
> -       if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -                             p->class_agents[rpc->mgtclass],
> +       if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +                             port->class_agents[rpc->mgtclass],
>                              len, rpc->timeout)) < 0) {
>                IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>                return 0;
> @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
>        }
>  }
>
> -void *mad_rpc_open_port(char *dev_name, int dev_port,
> +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
>                        int *mgmt_classes, int num_classes)
>  {
>        struct ibmad_port *p;
> @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
>        return p;
>  }
>
> -void mad_rpc_close_port(void *port_id)
> +void mad_rpc_close_port(struct ibmad_port *port)
>  {
> -       struct ibmad_port *p = port_id;
> -
> -       umad_close_port(p->port_id);
> -       free(p);
> +       umad_close_port(port->port_id);
> +       free(port);
>  }
>
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> index 7403d4f..ddeb152 100644
> --- a/libibmad/src/sa.c
> +++ b/libibmad/src/sa.c
> @@ -44,7 +44,7 @@
>  #undef DEBUG
>  #define DEBUG  if (ibdebug)    IBWARN
>
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>                     ib_sa_call_t * sa, unsigned timeout)
>  {
>        ib_rpc_t rpc = { 0 };
> @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>                        IB_PR_COMPMASK_SGID |\
>                        IB_PR_COMPMASK_NUMBPATH)
>
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>                      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
>  {
>        int npath;
> diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
> index fad263c..e5489b3 100644
> --- a/libibmad/src/smp.c
> +++ b/libibmad/src/smp.c
> @@ -45,7 +45,7 @@
>  #define DEBUG  if (ibdebug)    IBWARN
>
>  uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
> -                    unsigned mod, unsigned timeout, const void *srcport)
> +                    unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>
> @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
>  }
>
>  uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
> -                      unsigned mod, unsigned timeout, const void *srcport)
> +                      unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>        ib_rpc_t rpc = { 0 };
>
> --
> 1.5.4.5
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From hal.rosenstock at gmail.com  Fri Feb 20 05:42:31 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 20 Feb 2009 08:42:31 -0500
Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate 
	to "new" ibmad interface
In-Reply-To: <20090219190536.f96edca7.weiny2@llnl.gov>
References: <20090219190536.f96edca7.weiny2@llnl.gov>
Message-ID: <f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>

On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 19 Feb 2009 17:27:21 -0800
> Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface
>
>
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/src/ibportstate.c |   18 ++++++++++++------
>  1 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
> index c0b9b34..ca72bda 100644
> --- a/infiniband-diags/src/ibportstate.c
> +++ b/infiniband-diags/src/ibportstate.c
> @@ -46,6 +46,8 @@
>
>  #include "ibdiag_common.h"
>
> +struct ibmad_port *srcport;
> +
>  /*******************************************/
>
>  static int
> @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data)
>  {
>        int node_type;
>
> -       if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
> +       if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
>                return -1;
>
>        node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
> @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
>        char buf[2048];
>        char val[64];
>
> -       if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
> +       if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
>                return -1;
>
>        if (port_op != 4) {
> @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
>        char buf[2048];
>        char val[64];
>
> -       if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
> +       if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
>                return -1;
>
>        if (port_op != 4)
> @@ -223,9 +225,12 @@ int main(int argc, char **argv)
>        if (argc < 2)
>                ibdiag_show_usage();
>
> -       madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> +       srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> +       if (!srcport)
> +               IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);

Is this missing the corresponding mad_rpc_close_port call? The same
applies to some of the subsequent patches.

-- Hal
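
For illustration, the open/close pairing being asked about could take
roughly the following shape. The struct and the two mad_rpc_* functions
below are local stand-ins with the same signatures as the patched header,
so that the sketch compiles without libibmad; open_ports and run_tool()
are hypothetical names used only here.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the declarations in <infiniband/mad.h>. */
struct ibmad_port { int port_id; };
static int open_ports;	/* ports currently open; illustration only */

static struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
					    int *mgmt_classes, int num_classes)
{
	struct ibmad_port *p = malloc(sizeof(*p));

	(void)dev_name; (void)mgmt_classes; (void)num_classes;
	if (!p)
		return NULL;
	p->port_id = dev_port;
	open_ports++;
	return p;
}

static void mad_rpc_close_port(struct ibmad_port *port)
{
	assert(port != NULL);
	open_ports--;
	free(port);
}

/* Every path after a successful open goes through the close at "out". */
int run_tool(int fail_midway)
{
	int rc = 0;
	struct ibmad_port *srcport = mad_rpc_open_port(NULL, 1, NULL, 0);

	if (!srcport)
		return -1;
	if (fail_midway) {
		rc = -1;
		goto out;
	}
	/* ... smp_query_via(..., srcport) and friends would go here ... */
out:
	mad_rpc_close_port(srcport);
	return rc;
}
```

In a tool whose error macro terminates the process, the kernel reclaims
the port on exit anyway, so the explicit close matters most on the normal
return path and in code that may later be reused as a library.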

> -       if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
> +       if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> +                               ibd_sm_id, srcport) < 0)
>                IBERROR("can't resolve destination port %s", argv[0]);
>
>        /* First, make sure it is a switch port if it is a "set" */
> @@ -314,7 +319,8 @@ int main(int argc, char **argv)
>                                        peerportid.drpath.p[1] = (uint8_t) portnum;
>
>                                        /* Set DrSLID to local lid */
> -                                       if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
> +                                       if (ib_resolve_self_via(&selfportid,
> +                                                       &selfport, 0, srcport) < 0)
>                                                IBERROR("could not resolve self");
>                                        peerportid.drpath.drslid = (uint16_t) selfportid.lid;
>                                        peerportid.drpath.drdlid = 0xffff;
> --
> 1.5.4.5
>


From hal.rosenstock at gmail.com  Fri Feb 20 05:55:57 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 20 Feb 2009 08:55:57 -0500
Subject: [ofa-general] [PATCH 0/10] libibmad/infiniband-diags -- converting
	to "new" interface.
In-Reply-To: <20090219190520.c18280e1.weiny2@llnl.gov>
References: <20090219190520.c18280e1.weiny2@llnl.gov>
Message-ID: <f0e08f230902200555u6d4b6791s9540edc6dc25aed7@mail.gmail.com>

On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> Here is v2 of the patch series.
>
> I used __attribute__ ((deprecated)) on the functions which should aid others
> in realizing that these functions will go away.  (It sure helped me to convert
> all the diags.)
>
> Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the
> new interface and I am hoping it will be accepted soon.

A related issue is whether ibnetdiscover will support both the new
library and the old way, selected via some build option, until the
library is more proven. If it is to support both, then it should be
converted as well.

-- Hal

> The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat
> because they were all simple to convert and the patch series was getting
> ridiculous.
>
> Thanks,
> Ira
>
> --
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> weiny2 at llnl.gov


From weiny2 at llnl.gov  Fri Feb 20 09:23:50 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 09:23:50 -0800
Subject: [ofa-general] [PATCH 0/10] libibmad/infiniband-diags --
	converting  to "new" interface.
In-Reply-To: <f0e08f230902200555u6d4b6791s9540edc6dc25aed7@mail.gmail.com>
References: <20090219190520.c18280e1.weiny2@llnl.gov>
	<f0e08f230902200555u6d4b6791s9540edc6dc25aed7@mail.gmail.com>
Message-ID: <20090220092350.7ee3ddab.weiny2@llnl.gov>

On Fri, 20 Feb 2009 08:55:57 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > Here is v2 of the patch series.
> >
> > I used __attribute__ ((deprecated)) on the functions which should aid others
> > in realizing that these functions will go away.  (It sure helped me to convert
> > all the diags.)
> >
> > Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the
> > new interface and I am hoping it will be accepted soon.
> 
> A related issue is whether ibnetdiscover will support both the new
> library and the old way until the library is more proven via some
> build option. If it is to support both, then converting it should be
> done.

The conversion is easy.  I will do it for now to remove the build warnings.
And now that I think about it more, leaving in both the old and new code, to
be chosen via configure, is probably not a bad idea.  I don't know what is
going to happen once we standardize on the mad library for decoding strings.
There are some incompatibilities there (e.g. 1x vs 1X, 2.5Gbps vs SDR, etc.).

I will say, however, that I tested the library extensively and the first
version's output was identical to the old version with the sole exception of
the order ports were printed in.  :-D  So my confidence is high it will be
accepted sooner rather than later.

Ira
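
A minimal sketch of the configure-time choice discussed above, assuming a
configure option ends up defining a preprocessor macro. USE_LIBIBNETDISC,
DISCOVER_BACKEND and discover_backend() are hypothetical names; nothing in
this thread says what the actual option would be called.

```c
#include <assert.h>
#include <string.h>

/*
 * USE_LIBIBNETDISC is a hypothetical macro that a configure option
 * (for example via -DUSE_LIBIBNETDISC in CFLAGS) might define.
 */
#ifdef USE_LIBIBNETDISC
#define DISCOVER_BACKEND "libibnetdisc"
#else
#define DISCOVER_BACKEND "legacy"
#endif

/* Reports which discovery code path was compiled into the binary. */
const char *discover_backend(void)
{
	assert(strlen(DISCOVER_BACKEND) > 0);
	return DISCOVER_BACKEND;
}
```

With both paths selectable this way, the two builds can be run against the
same fabric and their output compared while the new library settles.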

> 
> -- Hal
> 
> > The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat
> > because they were all simple to convert and the patch series was getting
> > ridiculous.
> >
> > Thanks,
> > Ira
> >
> > --
> > Ira Weiny
> > Math Programmer/Computer Scientist
> > Lawrence Livermore National Lab
> > weiny2 at llnl.gov
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From hnrose at comcast.net  Fri Feb 20 09:37:11 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 20 Feb 2009 12:37:11 -0500
Subject: [ofa-general] [PATCH] ibsim/sim.h: Better portinfo
	alignment in Port struct
Message-ID: <20090220173711.GA3024@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 ibsim/sim.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ibsim/sim.h b/ibsim/sim.h
index ec76dac..48b4536 100644
--- a/ibsim/sim.h
+++ b/ibsim/sim.h
@@ -197,8 +197,8 @@ struct Port {
 	int physstate;
 	int lmc;
 	int hoqlife;
-	uint8_t op_vls;
 	uint8_t portinfo[64];
+	uint8_t op_vls;
 
 	char remotenodeid[NODEIDLEN];
 	char remotealias[ALIASLEN + 1];
-- 
1.5.6.4
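
The effect of the reordering can be checked with offsetof(). The two
structs below are reduced stand-ins that keep only the fields visible in
the hunk above; the real struct Port has more members before and after,
so the absolute offsets differ, but portinfo still follows an int in the
same way. The names port_before, port_after, starts_on_int_boundary and
check_layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch: a single byte precedes portinfo. */
struct port_before {
	int physstate;
	int lmc;
	int hoqlife;
	uint8_t op_vls;
	uint8_t portinfo[64];
};

/* Layout after the patch: portinfo directly follows the ints. */
struct port_after {
	int physstate;
	int lmc;
	int hoqlife;
	uint8_t portinfo[64];
	uint8_t op_vls;
};

/* Returns 1 if a member at this offset starts on an int boundary. */
int starts_on_int_boundary(size_t offset)
{
	return offset % sizeof(int) == 0;
}

/* Aborts if portinfo is not int-aligned in the post-patch layout. */
void check_layout(void)
{
	assert(starts_on_int_boundary(offsetof(struct port_after, portinfo)));
}
```

In the old layout portinfo starts one byte past op_vls, which lands it on
an odd offset; in the new layout it starts where the last int ends.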


From hal.rosenstock at gmail.com  Fri Feb 20 10:24:35 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 20 Feb 2009 13:24:35 -0500
Subject: [ofa-general] [PATCH 1/10] libibmad:
	Clean up "new" interface
In-Reply-To: <f0e08f230902200541x5869effbv64b2f782d5f9cdec@mail.gmail.com>
References: <20090219190525.322681b8.weiny2@llnl.gov>
	<f0e08f230902200541x5869effbv64b2f782d5f9cdec@mail.gmail.com>
Message-ID: <f0e08f230902201024t671ad122t2072c519b6d8f772@mail.gmail.com>

On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
>> From: Ira Weiny <weiny2 at llnl.gov>
>> Date: Wed, 18 Feb 2009 16:37:36 -0800
>> Subject: [PATCH] libibmad: Clean up "new" interface
>>
>>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
>>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
>>      which mirrors madrpc_portid(void)
>>   Mark all "old" functions with __attribute__ ((deprecated))
>>
>> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
>> ---
>>  libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
>>  libibmad/src/gs.c                 |   19 +++---
>>  libibmad/src/libibmad.map         |    1 +
>>  libibmad/src/resolve.c            |   10 ++-
>>  libibmad/src/rpc.c                |   29 ++++----
>>  libibmad/src/sa.c                 |    4 +-
>>  libibmad/src/smp.c                |    4 +-
>>  7 files changed, 118 insertions(+), 88 deletions(-)
>>
>> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
>> index 1aaaa1b..80e38be 100644
>> --- a/libibmad/include/infiniband/mad.h
>> +++ b/libibmad/include/infiniband/mad.h
>> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
>>  }
>>
>>  /* rpc.c */
>> -MAD_EXPORT int madrpc_portid(void);
>> -MAD_EXPORT int madrpc_set_retries(int retries);
>> -MAD_EXPORT int madrpc_set_timeout(int timeout);

Retries and timeouts could also be kept per ibmad_port struct rather
than as one global setting for all clients. Those two APIs would then be
deprecated in favor of new ones (mad_rpc_set_retries/timeout).
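
A rough sketch of what per-port settings might look like. Everything
here is an assumption for illustration: struct sketch_port stands in for
struct ibmad_port (whose layout is private to libibmad), and the
sketch_* names and default values are made up rather than taken from the
library.

```c
#define SKETCH_DEF_RETRIES 3	/* assumed default, illustration only */
#define SKETCH_DEF_TIMEOUT 1000	/* assumed default in ms, illustration only */

/* Hypothetical stand-in for struct ibmad_port. */
struct sketch_port {
	int retries;
	int timeout;
};

/* Give a port its own copy of the defaults. */
static void sketch_port_init(struct sketch_port *p)
{
	p->retries = SKETCH_DEF_RETRIES;
	p->timeout = SKETCH_DEF_TIMEOUT;
}

/*
 * Set retries for this port only. Non-positive values leave the
 * setting unchanged. Returns the value now in effect.
 */
static int sketch_set_retries(struct sketch_port *p, int retries)
{
	if (retries > 0)
		p->retries = retries;
	return p->retries;
}

/* Same idea for the timeout. */
static int sketch_set_timeout(struct sketch_port *p, int timeout)
{
	if (timeout > 0)
		p->timeout = timeout;
	return p->timeout;
}
```

With the state held in the port, two clients in one process can use
different retry counts without stepping on each other.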

-- Hal

<snip...>


From sean.hefty at intel.com  Fri Feb 20 10:27:33 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 20 Feb 2009 10:27:33 -0800
Subject: [ofa-general] ib-diag: use of getpass()
Message-ID: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com>

saquery calls getpass, and according to the man page:

'This function is obsolete.  Do not use it.'

Can we remove this call?  What is your preference for replacing it?  (Use scanf?
Take the SM Key as a command line argument?)
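
For reference, one common replacement on POSIX systems is to switch off
terminal echo with termios and read the line with fgets. A minimal
sketch, assuming a POSIX terminal; read_secret() is a hypothetical
helper name, not an existing infiniband-diags function:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: print prompt on stderr, then read one line from
 * 'in' into buf. Echo is disabled only when 'in' is a terminal.
 * Returns 0 on success, -1 on error or EOF.
 */
static int read_secret(const char *prompt, FILE *in, char *buf, size_t len)
{
	struct termios saved, noecho;
	int fd = fileno(in);
	int is_tty = isatty(fd);
	int rc = 0;

	if (len == 0)
		return -1;

	if (is_tty) {
		if (tcgetattr(fd, &saved) < 0)
			return -1;
		noecho = saved;
		noecho.c_lflag &= ~ECHO;
		if (tcsetattr(fd, TCSAFLUSH, &noecho) < 0)
			return -1;
	}

	fputs(prompt, stderr);
	fflush(stderr);

	if (!fgets(buf, (int)len, in))
		rc = -1;
	else
		buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */

	if (is_tty) {
		tcsetattr(fd, TCSAFLUSH, &saved);	/* restore echo */
		fputc('\n', stderr);
	}
	return rc;
}
```

The sketch does not handle signals, so an interrupt while reading would
leave echo turned off; a real replacement would need to cover that.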


From weiny2 at llnl.gov  Fri Feb 20 10:28:13 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 10:28:13 -0800
Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert
	ibportstate  to "new" ibmad interface
In-Reply-To: <f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>
References: <20090219190536.f96edca7.weiny2@llnl.gov>
	<f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>
Message-ID: <20090220102813.9b0bd107.weiny2@llnl.gov>

On Fri, 20 Feb 2009 08:42:31 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Thu, 19 Feb 2009 17:27:21 -0800
> > Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface
> >
> >
> > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> > ---
> >  infiniband-diags/src/ibportstate.c |   18 ++++++++++++------
> >  1 files changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
> > index c0b9b34..ca72bda 100644
> > --- a/infiniband-diags/src/ibportstate.c
> > +++ b/infiniband-diags/src/ibportstate.c
> > @@ -46,6 +46,8 @@
> >
> >  #include "ibdiag_common.h"
> >
> > +struct ibmad_port *srcport;
> > +
> >  /*******************************************/
> >
> >  static int
> > @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data)
> >  {
> >        int node_type;
> >
> > -       if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
> > +       if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
> >                return -1;
> >
> >        node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
> > @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
> >        char buf[2048];
> >        char val[64];
> >
> > -       if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
> > +       if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
> >                return -1;
> >
> >        if (port_op != 4) {
> > @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
> >        char buf[2048];
> >        char val[64];
> >
> > -       if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
> > +       if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
> >                return -1;
> >
> >        if (port_op != 4)
> > @@ -223,9 +225,12 @@ int main(int argc, char **argv)
> >        if (argc < 2)
> >                ibdiag_show_usage();
> >
> > -       madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> > +       srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> > +       if (!srcport)
> > +               IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> 
> Is this missing the corresponding close_port ? Same for some of the
> subsequent patches.

Yep I missed a couple of them.  4/10, 6/10, and 9/10.  New patches to follow.

Ira

> 
> -- Hal
> 
> > -       if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
> > +       if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> > +                               ibd_sm_id, srcport) < 0)
> >                IBERROR("can't resolve destination port %s", argv[0]);
> >
> >        /* First, make sure it is a switch port if it is a "set" */
> > @@ -314,7 +319,8 @@ int main(int argc, char **argv)
> >                                        peerportid.drpath.p[1] = (uint8_t) portnum;
> >
> >                                        /* Set DrSLID to local lid */
> > -                                       if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
> > +                                       if (ib_resolve_self_via(&selfportid,
> > +                                                       &selfport, 0, srcport) < 0)
> >                                                IBERROR("could not resolve self");
> >                                        peerportid.drpath.drslid = (uint16_t) selfportid.lid;
> >                                        peerportid.drpath.drdlid = 0xffff;
> > --
> > 1.5.4.5
> >
> >
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From rdreier at cisco.com  Fri Feb 20 10:32:20 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Feb 2009 10:32:20 -0800
Subject: [ofa-general] ib-diag: use of getpass()
In-Reply-To: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com> (Sean
	Hefty's message of "Fri, 20 Feb 2009 10:27:33 -0800")
References: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com>
Message-ID: <adak57lx8ln.fsf@cisco.com>

 > saquery calls getpass, and according to the man page:
 > 
 > 'This function is obsolete.  Do not use it.'

I believe that information may not be totally accurate.  The modern
glibc implementation doesn't seem to have any problems.

 - R.


From rdreier at cisco.com  Fri Feb 20 10:39:25 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Feb 2009 10:39:25 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for
	active connections.
In-Reply-To: <20090217215959.16117.17150.stgit@NTAC> (Steve Wise's message of
	"Tue, 17 Feb 2009 16:00:00 -0600")
References: <20090217215959.16117.17150.stgit@NTAC>
Message-ID: <adafxi9x89u.fsf@cisco.com>

 > -	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
 > +	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));

minor but the parens around the function call are totally unnecessary.
If we're touching the line anyway may as well leave them off.

 > +static int iwch_post_qp_fatal(int id, void *p, void *data)
 > +{
 > +	struct ib_event event;
 > +	struct iwch_qp *qhp = p;
 > +
 > +	event.event = IB_EVENT_DEVICE_FATAL;
 > +	event.device = qhp->ibqp.device;
 > +	event.element.qp = &qhp->ibqp;
 > +	BUG_ON(qhp->rhp != data);
 > +	BUG_ON(qhp->wq.qpid != id);
 > +	if (qhp->ibqp.event_handler) {
 > +		PDBG("%s posting DEVICE_FATAL for qpid %u\n",
 > +			__func__, qhp->wq.qpid);
 > +		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);

This doesn't match the IB driver behavior (or the IB spec) -- the
DEVICE_FATAL event is unaffiliated and delivered for the adapter as a
whole.  QP events are supposed to be for events connected to a single
QP, not the whole adapter failing.

BTW I don't think you need the * here, do you?  Would be easier to read
to just call it like

	qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context)

 > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
 > +{
 > +	int	error=0;
 > +	struct cxio_rdev *rdev;
 > +
 > +	rdev = (struct cxio_rdev *)tdev->ulp;
 > +	if (rdev->flags) {

Might be nice to wrap this rdev->flags test up in a trivial inline
function (eg iwch_eeh_set() or something like that) in case other things
get put into those flags later.
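
Something along these lines, as a userspace sketch only. The struct,
the flag names and the bit positions are all hypothetical; the real
cxio_rdev flags may be laid out differently.

```c
/* Hypothetical flag bits, for illustration only. */
#define SKETCH_RDEV_EEH		(1UL << 0)
#define SKETCH_RDEV_OTHER	(1UL << 1)

/* Hypothetical stand-in for struct cxio_rdev. */
struct sketch_rdev {
	unsigned long flags;
};

/* Return 1 only if the EEH bit is set, whatever else is in flags. */
static inline int sketch_eeh_set(const struct sketch_rdev *rdev)
{
	return (rdev->flags & SKETCH_RDEV_EEH) != 0;
}
```

A plain "if (rdev->flags)" test would start firing as soon as any
second flag is introduced; the helper keeps the callers unchanged.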

 > +		kfree_skb(skb);
 > +		return -EIO;
 > +	}
 > +	error = l2t_send(tdev, skb, l2e);
 > +	if (error)
 > +		kfree_skb(skb);
 > +	return error;
 > +}

The kfree_skb() calls here change behavior -- eg you have the change:

 > -	l2t_send(ep->com.tdev, skb, ep->l2t);
 > -	return 0;
 > +	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);

and now if l2t_send() returns an error the skb is freed, where before it
wasn't.

Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it
not make more sense for the cxgb3 l2t_send() to check the eeh state and
always behave appropriately?  Or is it more complicated than that?

 - R.


From hal.rosenstock at gmail.com  Fri Feb 20 10:42:59 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 20 Feb 2009 13:42:59 -0500
Subject: [ofa-general] ib-diag: use of getpass()
In-Reply-To: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com>
References: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com>
Message-ID: <f0e08f230902201042j3069a82at9a3185144b8f742c@mail.gmail.com>

On Fri, Feb 20, 2009 at 1:27 PM, Sean Hefty <sean.hefty at intel.com> wrote:
> saquery calls getpass, and according to the man page:
>
> 'This function is obsolete.  Do not use it.'
>
> Can we remove this call?  What is your preference for replacing it?  (Use scanf?
> take the SM Key as a command line argument?)

There was a thread on this back in June 2008:
http://lists.openfabrics.org/pipermail/general/2008-June/051057.html

Sasha wrote:
glibc info page doesn't indicate this. Also I did some
googling and looked at glibc code itself - found nothing suspicious yet.
Finally it is how password handled in 'su'.

-- Hal



From sean.hefty at intel.com  Fri Feb 20 10:59:38 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 20 Feb 2009 10:59:38 -0800
Subject: [ofa-general] ib-diag: use of getpass()
In-Reply-To: <f0e08f230902201042j3069a82at9a3185144b8f742c@mail.gmail.com>
References: <CAA2EDA371D24A3A8C9C56C8066693A8@amr.corp.intel.com>
	<f0e08f230902201042j3069a82at9a3185144b8f742c@mail.gmail.com>
Message-ID: <6F2C9CB988674B1B9478AFBF3D1DE11B@amr.corp.intel.com>

>There was a thread on this back in June 2008:
>http://lists.openfabrics.org/pipermail/general/2008-June/051057.html
>
>Sasha wrote:
>glibc info page doesn't indicate this. Also I did some
>googling and looked at glibc code itself - found nothing suspicious yet.
>Finally it is how password handled in 'su'.

I'll add an implementation for it on Windows then...


From hnrose at comcast.net  Fri Feb 20 12:33:36 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 20 Feb 2009 15:33:36 -0500
Subject: [ofa-general] [PATCH] ibsim: Handle sim_init_net errors better
Message-ID: <20090220203336.GA3874@comcast.net>


Report the sim_init_net status when initialization fails
Use IB_SMP_DATA_SIZE define rather than the constant 64
Also, cosmetic formatting and some typo fixes

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 ibsim/ibsim.c   |    7 ++++---
 ibsim/sim_cmd.c |    6 ++++--
 ibsim/sim_mad.c |    8 ++++----
 ibsim/sim_net.c |    2 +-
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
index 7cea9de..bfc58f5 100644
--- a/ibsim/ibsim.c
+++ b/ibsim/ibsim.c
@@ -379,7 +379,6 @@ static int sim_ctl_get_pkeys(Client * cl, struct sim_ctl * ctl)
 	memcpy(ctl->data, port->pkey_tbl, size);
 	if (size < sizeof(ctl->data))
 		memset(ctl->data + size, 0, sizeof(ctl->data) - size);
-
 	return 0;
 }
 
@@ -730,6 +729,7 @@ int main(int argc, char **argv)
 	extern void free_core(void);
 	char *outfname = 0, *netfile;
 	FILE *infile, *outfile;
+	int status;
 
 	static char const str_opts[] = "rf:dpvIsN:S:P:L:M:l:Vhu";
 	static const struct option long_opts[] = {
@@ -818,8 +818,9 @@ int main(int argc, char **argv)
 		IBPANIC("not enough memory for core structure");
 
 	DEBUG("initializing net \"%s\"", netfile);
-	if (sim_init_net(netfile, outfile) < 0)
-		IBPANIC("sim_init failed");
+	status = sim_init_net(netfile, outfile);
+	if (status < 0)
+		IBPANIC("sim_init failed, status %d", status);
 
 	sim_init_console(outfile);
 
diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c
index c683224..94e0a14 100644
--- a/ibsim/sim_cmd.c
+++ b/ibsim/sim_cmd.c
@@ -203,7 +203,8 @@ static int do_relink(FILE * f, char *line)
 			return -1;
 		}
 
-		rport = node_get_port(lport->previous_remotenode, lport->previous_remoteport);
+		rport = node_get_port(lport->previous_remotenode,
+				      lport->previous_remoteport);
 
 		if (link_ports(lport, rport) < 0)
 			return -fprintf(f,
@@ -220,7 +221,8 @@ static int do_relink(FILE * f, char *line)
 		if (!lport->previous_remotenode)
 			continue; 
 
-		rport = node_get_port(lport->previous_remotenode, lport->previous_remoteport);
+		rport = node_get_port(lport->previous_remotenode,
+				      lport->previous_remoteport);
 
 		if (link_ports(lport, rport) < 0)
 			continue;
diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c
index 6e08031..2fbf96f 100644
--- a/ibsim/sim_mad.c
+++ b/ibsim/sim_mad.c
@@ -379,7 +379,7 @@ static int do_vlarb(Port * port, unsigned op, uint32_t mod, uint8_t * data)
 	if (op == IB_MAD_METHOD_SET) {
 		memcpy(vlarb, data, size);
 	} else {
-		memset(data, 0, 64);
+		memset(data, 0, IB_SMP_DATA_SIZE);
 		memcpy(data, vlarb, size);
 	}
 
@@ -395,7 +395,7 @@ static int do_guidinfo(Port * port, unsigned op, uint32_t mod, uint8_t * data)
 	if (op != IB_MAD_METHOD_GET)    // only get currently supported (non compliant)
 		status = ERR_METHOD_UNSUPPORTED;
 
-	memset(data, 0, 64);
+	memset(data, 0, IB_SMP_DATA_SIZE);
 	if (mod == 0) {
 		if (node->type == SWITCH_NODE)
 			mad_encode_field(data, IB_GUID_GUID0_F, &node->nodeguid);
@@ -613,7 +613,7 @@ static int pc_updated(Port ** srcport, Port * destport)
 	uint32_t madsize_div_4 = 72;	//real data divided by 4
 
 	if (*srcport != destport) {
-		//PKT get out of port ..
+		//PKT got out of port ..
 		srcpc->flow_xmt_pkts =
 		    addval(srcpc->flow_xmt_pkts, 1, GS_PERF_XMT_PKTS_LIMIT);
 		srcpc->flow_xmt_bytes =
@@ -629,7 +629,7 @@ static int pc_updated(Port ** srcport, Port * destport)
 			VERB("drop pkt due error rate %d", destport->errrate);
 			return 0;
 		}
-		//PKT get in to the port ..
+		//PKT got into the port ..
 		destpc->flow_rcv_pkts =
 		    addval(destpc->flow_rcv_pkts, 1, GS_PERF_RCV_PKTS_LIMIT);
 		destpc->flow_rcv_bytes =
diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c
index f0628ec..fa05c35 100644
--- a/ibsim/sim_net.c
+++ b/ibsim/sim_net.c
@@ -1116,7 +1116,7 @@ int link_ports(Port * lport, Port * rport)
 	rport->remoteport = lport->portnum;
 	set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport);
 	memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid));
-	lport->state = rport->state = 2;	// Initialilze
+	lport->state = rport->state = 2;	// Initialize
 	lport->physstate = rport->physstate = 5;	// LinkUP
 	if (lnode->sw)
 		lnode->sw->portchange = 1;
-- 
1.5.6.4


From neutronsharc at gmail.com  Fri Feb 20 12:44:12 2009
From: neutronsharc at gmail.com (neutron)
Date: Fri, 20 Feb 2009 15:44:12 -0500
Subject: Re: [ofa-general] ib_reg_phys_mr( ) results
	in crash
In-Reply-To: <499E6826.704@sun.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
	<7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
	<499E6826.704@sun.com>
Message-ID: <7d5928b30902201244i24aff45ct2cabcb99e68ce469@mail.gmail.com>

When we installed OFED, we used:  "<OFED_1.3.1_dir>/install.pl --all".
So we expected it to have installed everything.

"ofed_info" shows "ofa_kernel-1.3.1" is installed, but
"ofa_kernel_devel" is not.  What is that package for, and where can we
get it?  It does not seem to be in " <OFED_1.3.1_dir>/SRPMS ".    Thanks.

Below is the output given by "ofed_info".
-----------------
OFED-1.3.1
libibverbs:
git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
libmthca:
git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
commit 9501e698d257949acfab2edc90812602966dbcc9
libmlx4:
git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
commit 3869d6dab7e12fe452270ca641f7dd7082b42482
libehca:
git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3
commit fd898180cfa3b737f893f432a80b91bac3396325
libipathverbs:
git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3
commit 82be4d81859d1fd2edf830220fe65a9923b80a46
libcxgb3:
git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3
commit 6f7485feb244d8571fcab2292ef92c97bea48df0
libnes:
git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3
commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3
libibcm:
git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3
commit 53ec35f544bbc1838bbadc2210909c25a954a5e2
librdmacm:
git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3
commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1
dapl1:
git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3
commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779
dapl2:
git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3
commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177
libsdp:
git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3
commit c8102dccc502930442b23de658674d386456b350
sdpnetstat:
git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3
commit 3341620a7259c4f7bdd4180864b98e260c3dc223
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d
perftest:
git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3
commit 6321b5468f7293088cc003809049c02b176130d8
qlvnictools:
git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3
commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd
tvflash:
git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3
commit f5e7407a7f2058448df5e5320d9843f944427429
mstflint:
git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3
commit 78bbd3d521a9078553a991111ffb6f76665b9ee9

qperf:
git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3
commit 6221aabd038df0b7033e035378ca190641ed2295
management:
git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3
commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf
ibutils:
git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3
commit 7daf94fab6eaf307316326f3f49704e6080a1508
ibsim:
git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3
commit 55113d9f919709c7c97ea41d29991941b9c8be70

ofa_kernel-1.3.1:
Git:
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c

# MPI
mvapich-1.0.1-2533.src.rpm
mvapich2-1.0.3-1.src.rpm
openmpi-1.2.6-1.src.rpm
mpitests-3.0-773.src.rpm


-----------------

On Fri, Feb 20, 2009 at 3:21 AM, Liang Zhen <Zhen.Liang at sun.com> wrote:
> Hmm, I didn't see any problem in your code. Have you installed
> ofa_kernel_devel (kernel headers of  OFED) after installation of
> ofa_kernel_1_3_1?
>
> Regards
> Liang
>
> neutron:
>>
>> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version:
>> 2.6.18-53.1.14.el5,  ofed 1.3.1.
>>
>> The failed function call is like:
>>
>> {
>>
>> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
>>                &dma_addr, GFP_KERNEL);
>>
>> ctx->phy_buf[0].addr = dma_addr;
>> ctx->phy_buf[0].size = MAX_SIZE;
>> ctx->iovstart = (u64) ctx->send_buf;
>>
>> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n",
>>       ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart
>> );
>>
>> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1,
>>                        IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ
>>                         | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart));
>> }
>>
>> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf".
>>
>> Below is /var/log/messages output around the crash.
>> ----------------
>> Feb 19 12:50:22 wci30 kernel:  pd=ffff8101da3ddce0, phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000
>>
>> Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000
>>  RIP:
>> Feb 19 12:50:22 wci30 kernel:  [<0000000000000000>] _stext+0x7ffff000/0x1000
>> Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0
>> Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP
>> Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version
>> Feb 19 12:50:22 wci30 kernel: CPU 0
>> Feb 19 12:54:05 wci30 syslogd 1.4.1: restart.
>> Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>> Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5 (brewbuilder at hs20-bc2-3.build.redhat.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb 19 07:18:46 EST 2008
>> Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet
>>
>> ====================
>> It's strange that the kernel doesn't print out the function call stack
>> before crashing.
>>
>> Any hints?  Thanks a lot!
>>
>> On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier <rdreier at cisco.com> wrote:
>>
>>>
>>>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>>>  > are valid.  But the system always crashes immediately after entering
>>>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>>>
>>> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
>>> you get an oops message?  If so that would be very important info for
>>> debugging this.
>>>
>>> - R.
>>>
>>>
>>
>
>


From divy at chelsio.com  Fri Feb 20 13:27:28 2009
From: divy at chelsio.com (Divy Le Ray)
Date: Fri, 20 Feb 2009 13:27:28 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for
	active connections.
In-Reply-To: <adafxi9x89u.fsf@cisco.com>
References: <20090217215959.16117.17150.stgit@NTAC> <adafxi9x89u.fsf@cisco.com>
Message-ID: <499F2040.50008@chelsio.com>

Roland Dreier wrote:
>  > -	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
>  > +	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
>
> minor but the parens around the function call are totally unnecessary.
> If we're touching the line anyway may as well leave them off.
>
>  > +static int iwch_post_qp_fatal(int id, void *p, void *data)
>  > +{
>  > +	struct ib_event event;
>  > +	struct iwch_qp *qhp = p;
>  > +
>  > +	event.event = IB_EVENT_DEVICE_FATAL;
>  > +	event.device = qhp->ibqp.device;
>  > +	event.element.qp = &qhp->ibqp;
>  > +	BUG_ON(qhp->rhp != data);
>  > +	BUG_ON(qhp->wq.qpid != id);
>  > +	if (qhp->ibqp.event_handler) {
>  > +		PDBG("%s posting DEVICE_FATAL for qpid %u\n",
>  > +			__func__, qhp->wq.qpid);
>  > +		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
>
> This doesn't match the IB driver behavior (or the IB spec) -- the
> DEVICE_FATAL event is unaffiliated and delivered for the adapter as a
> whole.  QP events are supposed to be for events connected to a single
> QP, not the whole adapter failing.
>
> BTW I don't think you need the * here, do you?  Would be easier to read
> to just call it like
>
> 	qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context)
>
>  > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
>  > +{
>  > +	int	error=0;
>  > +	struct cxio_rdev *rdev;
>  > +
>  > +	rdev = (struct cxio_rdev *)tdev->ulp;
>  > +	if (rdev->flags) {
>
> Might be nice to wrap this rdev->flags test up in a trivial inline
> function (eg iwch_eeh_set() or something like that) in case other things
> get put into those flags later.
>
>  > +		kfree_skb(skb);
>  > +		return -EIO;
>  > +	}
>  > +	error = l2t_send(tdev, skb, l2e);
>  > +	if (error)
>  > +		kfree_skb(skb);
>  > +	return error;
>  > +}
>
> The kfree_skb() calls here change behavior -- eg you have the change:
>
>  > -	l2t_send(ep->com.tdev, skb, ep->l2t);
>  > -	return 0;
>  > +	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
>
> and now if l2t_send() returns an error the skb is freed, where before it
> wasn't.
>
> Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it
> not make more sense for the cxgb3 l2t_send() to check the eeh state and
> always behave appropriately?  Or is it more complicated than that?
>   

Hi Roland,

l2t_send() is used on the connection setup/teardown path for iWARP, but
it is on the data path of the iSCSI offload module.

Cheers,
Divy


From weiny2 at llnl.gov  Fri Feb 20 13:51:44 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 13:51:44 -0800
Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert
	ibportstate  to "new" ibmad interface
In-Reply-To: <20090220102813.9b0bd107.weiny2@llnl.gov>
References: <20090219190536.f96edca7.weiny2@llnl.gov>
	<f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>
	<20090220102813.9b0bd107.weiny2@llnl.gov>
Message-ID: <20090220135144.9e3cc6db.weiny2@llnl.gov>

On Fri, 20 Feb 2009 10:28:13 -0800
Ira Weiny <weiny2 at llnl.gov> wrote:

> On Fri, 20 Feb 2009 08:42:31 -0500
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> 
> > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > > >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001
> > > From: Ira Weiny <weiny2 at llnl.gov>
> > > Date: Thu, 19 Feb 2009 17:27:21 -0800
> > > Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface
> > >
> > >
> > > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> > > ---
> > >  infiniband-diags/src/ibportstate.c |   18 ++++++++++++------

<snip>

> > >
> > > -       madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> > > +       srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
> > > +       if (!srcport)
> > > +               IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> > 
> > Is this missing the corresponding close_port ? Same for some of the
> > subsequent patches.
> 
> Yep I missed a couple of them.  4/10, 6/10, and 9/10.  New patches to follow.
> 
> Ira
> 

Nope, 9/10 does not require this, as it uses umad to close the port.  The two
new patches for 4/10 and 6/10 follow.

Ira


From weiny2 at llnl.gov  Fri Feb 20 13:51:50 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 13:51:50 -0800
Subject: [ofa-general] [PATCHv2 4/10] infiniband-diags: Convert
	ibportstate  to "new" ibmad interface
In-Reply-To: <f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>
References: <20090219190536.f96edca7.weiny2@llnl.gov>
	<f0e08f230902200542k5c63174aoe0de4f3cbb8b85b4@mail.gmail.com>
Message-ID: <20090220135150.cd171cc2.weiny2@llnl.gov>

>From 5630f01688b7ea755b02d183d73edc86339f2e8b Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:27:21 -0800
Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibportstate.c |   19 +++++++++++++------
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index c0b9b34..65c9ca1 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -46,6 +46,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 /*******************************************/
 
 static int
@@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data)
 {
 	int node_type;
 
-	if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
 		return -1;
 
 	node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
@@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
 	char buf[2048];
 	char val[64];
 
-	if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	if (port_op != 4) {
@@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op)
 	char buf[2048];
 	char val[64];
 
-	if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0))
+	if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport))
 		return -1;
 
 	if (port_op != 4)
@@ -223,9 +225,12 @@ int main(int argc, char **argv)
 	if (argc < 2)
 		ibdiag_show_usage();
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
 
-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
+				ibd_sm_id, srcport) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 
 	/* First, make sure it is a switch port if it is a "set" */
@@ -314,7 +319,8 @@ int main(int argc, char **argv)
 					peerportid.drpath.p[1] = (uint8_t) portnum;
 
 					/* Set DrSLID to local lid */
-					if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
+					if (ib_resolve_self_via(&selfportid,
+							&selfport, 0, srcport) < 0)
 						IBERROR("could not resolve self");
 					peerportid.drpath.drslid = (uint16_t) selfportid.lid;
 					peerportid.drpath.drdlid = 0xffff;
@@ -354,5 +360,6 @@ int main(int argc, char **argv)
 		}
 	}
 
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Fri Feb 20 13:51:55 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 13:51:55 -0800
Subject: [ofa-general] [PATCH v2 6/10] infiniband-diags: Convert
	ibsendtrap to "new" ibmad interface
In-Reply-To: <20090219190546.4fcaa158.weiny2@llnl.gov>
References: <20090219190546.4fcaa158.weiny2@llnl.gov>
Message-ID: <20090220135155.39cbe4e6.weiny2@llnl.gov>

>From f70635f4d62fb57221a4239a2013e602f6449548 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 19 Feb 2009 17:53:30 -0800
Subject: [PATCH] infiniband-diags: Convert ibsendtrap to "new" ibmad interface

   also make mad_send_via public to do the conversion

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibsendtrap.c |   21 ++++++++++++++-------
 libibmad/src/libibmad.map         |    1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..75120f0 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -47,6 +47,8 @@
 
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static int send_144_node_desc_update(void)
 {
 	ib_portid_t sm_port;
@@ -55,10 +57,10 @@ static int send_144_node_desc_update(void)
 	ib_rpc_t trap_rpc;
 	ib_mad_notice_attr_t notice;
 
-	if (ib_resolve_self(&selfportid, &selfport, NULL))
+	if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport))
 		IBERROR("can't resolve self");
 
-	if (ib_resolve_smlid(&sm_port, 0))
+	if (ib_resolve_smlid_via(&sm_port, 0, srcport))
 		IBERROR("can't resolve SM destination port");
 
 	memset(&trap_rpc, 0, sizeof(trap_rpc));
@@ -80,7 +82,7 @@ static int send_144_node_desc_update(void)
 	notice.data_details.ntc_144.change_flgs =
 	    TRAP_144_MASK_NODE_DESCRIPTION_CHANGE;
 
-	return (mad_send(&trap_rpc, &sm_port, NULL, &notice));
+	return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport));
 }
 
 typedef struct _trap_def {
@@ -103,7 +105,7 @@ int send_trap(char *trap_name)
 		}
 	}
 	ibdiag_show_usage();
-	exit(1);
+	return(1);
 }
 
 int main(int argc, char **argv)
@@ -111,7 +113,7 @@ int main(int argc, char **argv)
 	char usage_args[1024];
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
 	char *trap_name = NULL;
-	int i, n;
+	int i, n, rc;
 
 	n = sprintf(usage_args, "[<trap_name>]\n"
 		    "\nArgument <trap_name> can be one of the following:\n");
@@ -137,7 +139,12 @@ int main(int argc, char **argv)
 	}
 
 	madrpc_show_errors(1);
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 
-	return (send_trap(trap_name));
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+
+	rc = send_trap(trap_name);
+	mad_rpc_close_port(srcport);
+	return (rc);
 }
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index bac74a9..0412027 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -91,6 +91,7 @@ IBMAD_1.3 {
 		mad_receive_via;
 		mad_respond_via;
 		mad_send;
+		mad_send_via;
 		smp_query;
 		smp_set;
 		ib_vendor_call;
-- 
1.5.4.5


From hnrose at comcast.net  Fri Feb 20 13:59:38 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 20 Feb 2009 16:59:38 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/saquery.c: Convert
	more LID prints to unsigned decimal
Message-ID: <20090220215938.GB7360@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 infiniband-diags/src/saquery.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 9726d22..bcd1f61 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -332,13 +332,13 @@ static void dump_class_port_info(void *data)
 	       "\t\tResponse time value......0x%02X\n"
 	       "\t\tRedirect GID.............%s\n"
 	       "\t\tRedirect TC/SL/FL........0x%08X\n"
-	       "\t\tRedirect LID.............0x%04X\n"
+	       "\t\tRedirect LID.............%u\n"
 	       "\t\tRedirect PKey............0x%04X\n"
 	       "\t\tRedirect QP..............0x%08X\n"
 	       "\t\tRedirect QKey............0x%08X\n"
 	       "\t\tTrap GID.................%s\n"
 	       "\t\tTrap TC/SL/FL............0x%08X\n"
-	       "\t\tTrap LID.................0x%04X\n"
+	       "\t\tTrap LID.................%u\n"
 	       "\t\tTrap PKey................0x%04X\n"
 	       "\t\tTrap HL/QP...............0x%08X\n"
 	       "\t\tTrap QKey................0x%08X\n",
@@ -360,7 +360,7 @@ static void dump_portinfo_record(void *data)
 	const ib_port_info_t *const p_pi = &p_pir->port_info;
 
 	printf("PortInfoRecord dump:\n"
-	       "\t\tEndPortLid..............0x%X\n"
+	       "\t\tEndPortLid..............%u\n"
 	       "\t\tPortNum.................0x%X\n"
 	       "\t\tbase_lid................0x%X\n"
 	       "\t\tmaster_sm_base_lid......0x%X\n"
-- 
1.5.6.4
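
For anyone comparing old and new saquery output, here is a minimal sketch of
the formatting change made above. The helper names format_lid_hex and
format_lid_dec are hypothetical and exist only for this illustration; saquery
itself passes the format strings straight to printf.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Old style: zero-padded hexadecimal, as in "0x%04X". */
static int format_lid_hex(char *buf, size_t len, uint16_t lid)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "0x%04X", (unsigned int)lid);
}

/* New style: plain unsigned decimal, as in "%u". */
static int format_lid_dec(char *buf, size_t len, uint16_t lid)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u", (unsigned int)lid);
}
```

A LID is a 16-bit value, so the decimal form never needs more than five
digits.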


From hnrose at comcast.net  Fri Feb 20 13:58:45 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 20 Feb 2009 16:58:45 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] libibmad/fields.c: Dump LIDs as
	unsigned decimal
Message-ID: <20090220215845.GA7360@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
 libibmad/src/fields.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index d6742f9..e14dbb5 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -123,8 +123,8 @@ static const ib_field_t ib_mad_f[] = {
 	 */
 	{0, 64, "Mkey", mad_dump_hex},
 	{64, 64, "GidPrefix", mad_dump_hex},
-	{BITSOFFS(128, 16), "Lid", mad_dump_hex},
-	{BITSOFFS(144, 16), "SMLid", mad_dump_hex},
+	{BITSOFFS(128, 16), "Lid", mad_dump_uint},
+	{BITSOFFS(144, 16), "SMLid", mad_dump_uint},
 	{160, 32, "CapMask", mad_dump_portcapmask},
 	{BITSOFFS(192, 16), "DiagCode", mad_dump_hex},
 	{BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint},
-- 
1.5.6.4


From weiny2 at llnl.gov  Fri Feb 20 14:34:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 14:34:02 -0800
Subject: [ofa-general] [PATCH 11/10] libibmad:infiniband-diags: deprecate
 madrpc_set_[retries|timeout]  WAS: [PATCH 1/10] libibmad: Clean up  "new"
 interface
In-Reply-To: <f0e08f230902201024t671ad122t2072c519b6d8f772@mail.gmail.com>
References: <20090219190525.322681b8.weiny2@llnl.gov>
	<f0e08f230902200541x5869effbv64b2f782d5f9cdec@mail.gmail.com>
	<f0e08f230902201024t671ad122t2072c519b6d8f772@mail.gmail.com>
Message-ID: <20090220143402.c3b23b0a.weiny2@llnl.gov>

On Fri, 20 Feb 2009 13:24:35 -0500
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock
> <hal.rosenstock at gmail.com> wrote:
> > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> >> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
> >> From: Ira Weiny <weiny2 at llnl.gov>
> >> Date: Wed, 18 Feb 2009 16:37:36 -0800
> >> Subject: [PATCH] libibmad: Clean up "new" interface
> >>
> >>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> >>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
> >>      which mirrors madrpc_portid(void)
> >>   Mark all "old" functions with __attribute__ ((deprecated))
> >>
> >> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> >> ---
> >>  libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
> >>  libibmad/src/gs.c                 |   19 +++---
> >>  libibmad/src/libibmad.map         |    1 +
> >>  libibmad/src/resolve.c            |   10 ++-
> >>  libibmad/src/rpc.c                |   29 ++++----
> >>  libibmad/src/sa.c                 |    4 +-
> >>  libibmad/src/smp.c                |    4 +-
> >>  7 files changed, 118 insertions(+), 88 deletions(-)
> >>
> >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> >> index 1aaaa1b..80e38be 100644
> >> --- a/libibmad/include/infiniband/mad.h
> >> +++ b/libibmad/include/infiniband/mad.h
> >> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
> >>  }
> >>
> >>  /* rpc.c */
> >> -MAD_EXPORT int madrpc_portid(void);
> >> -MAD_EXPORT int madrpc_set_retries(int retries);
> >> -MAD_EXPORT int madrpc_set_timeout(int timeout);
> 
> retries and timeouts could also be made per ibmad_port struct basis
> rather than one for all clients. Those two APIs would be deprecated in
> favor of new ones (mad_rpc_set_retries/timeout).
> 

Patch below.  (To be applied after the others.)


>From d12b291041bdfe0d3bddecb7a71ee769a601fd83 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 20 Feb 2009 14:30:52 -0800
Subject: [PATCH] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout]

	replace with mad_rpc_set_[retries|timeout] which are per ibmad_port
	object
	Update all diags with new functions

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibaddr.c        |    1 +
 infiniband-diags/src/ibdiag_common.c |    1 -
 infiniband-diags/src/ibping.c        |    1 +
 infiniband-diags/src/ibportstate.c   |    1 +
 infiniband-diags/src/ibroute.c       |    1 +
 infiniband-diags/src/ibsendtrap.c    |    1 +
 infiniband-diags/src/ibsysstat.c     |    1 +
 infiniband-diags/src/ibtracert.c     |    1 +
 infiniband-diags/src/perfquery.c     |    1 +
 infiniband-diags/src/saquery.c       |    1 +
 infiniband-diags/src/sminfo.c        |    1 +
 infiniband-diags/src/smpquery.c      |    1 +
 infiniband-diags/src/vendstat.c      |    1 +
 libibmad/include/infiniband/mad.h    |    6 ++++--
 libibmad/src/libibmad.map            |    2 ++
 libibmad/src/mad_internal.h          |    2 ++
 libibmad/src/rpc.c                   |   29 ++++++++++++++++++++---------
 17 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
index bb22be9..e782b36 100644
--- a/infiniband-diags/src/ibaddr.c
+++ b/infiniband-diags/src/ibaddr.c
@@ -142,6 +142,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (argc) {
 		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index 609df69..38d6cd3 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -175,7 +175,6 @@ static int process_opt(int ch, char *optarg)
 		break;
 	case 't':
 		val = strtoul(optarg, 0, 0);
-		madrpc_set_timeout(val);
 		ibd_timeout = val;
 		break;
 	case 's':
diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c
index 901079f..28e3a64 100644
--- a/infiniband-diags/src/ibping.c
+++ b/infiniband-diags/src/ibping.c
@@ -213,6 +213,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (server) {
 		if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0)
diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index 65c9ca1..deaad51 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -228,6 +228,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
 				ibd_sm_id, srcport) < 0)
diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 60bfdd8..07eddc4 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -410,6 +410,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (!argc) {
 		if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0)
diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 75120f0..916b537 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -143,6 +143,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	rc = send_trap(trap_name);
 	mad_rpc_close_port(srcport);
diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index d7daa37..7e668e8 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -339,6 +339,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (server) {
 		if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0)
diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index 1965aa0..87b5b17 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -753,6 +753,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 2f104b8..3d89cc7 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -389,6 +389,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (argc) {
 		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index e6cbe50..43eff85 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1323,6 +1323,7 @@ static bind_handle_t get_bind_handle(void)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport);
 	if (!handle.dport.lid)
diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
index ebf6a47..0caa3f3 100644
--- a/infiniband-diags/src/sminfo.c
+++ b/infiniband-diags/src/sminfo.c
@@ -118,6 +118,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (argc) {
 		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
index 2ed1e65..dc6b685 100644
--- a/infiniband-diags/src/smpquery.c
+++ b/infiniband-diags/src/smpquery.c
@@ -455,6 +455,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c
index d001a01..1c1c08f 100644
--- a/infiniband-diags/src/vendstat.c
+++ b/infiniband-diags/src/vendstat.c
@@ -157,6 +157,7 @@ int main(int argc, char **argv)
 	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
 	if (!srcport)
 		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
 
 	if (argc) {
 		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 5cf135e..cbd3049 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -693,8 +693,6 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport,
 
 /* New interface */
 MAD_EXPORT void madrpc_show_errors(int set);
-MAD_EXPORT int madrpc_set_retries(int retries);
-MAD_EXPORT int madrpc_set_timeout(int timeout);
 MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
 			int num_classes);
 MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
@@ -703,6 +701,8 @@ MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_po
 MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
 			ib_rmpp_hdr_t * rmpp, void *data);
 MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
+MAD_EXPORT int mad_rpc_set_retries(int retries, struct ibmad_port *srcport);
+MAD_EXPORT int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport);
 
 /* register.c */
 MAD_EXPORT int mad_register_port_client(int port_id, int mgmt,
@@ -761,6 +761,8 @@ static inline int mad_is_vendor_range2(int mgmt)
 }
 
 /* rpc.c */
+MAD_EXPORT int madrpc_set_retries(int retries) __attribute__ ((deprecated));
+MAD_EXPORT int madrpc_set_timeout(int timeout) __attribute__ ((deprecated));
 MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated));
 void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata)
 		__attribute__ ((deprecated));
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 0412027..f231485 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -80,6 +80,8 @@ IBMAD_1.3 {
 		madrpc_save_mad;
 		madrpc_set_retries;
 		madrpc_set_timeout;
+		mad_rpc_set_retries;
+		mad_rpc_set_timeout;
 		madrpc_show_errors;
 		ib_path_query;
 		sa_call;
diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
index 9afe7a9..3991cc3 100644
--- a/libibmad/src/mad_internal.h
+++ b/libibmad/src/mad_internal.h
@@ -39,6 +39,8 @@
 struct ibmad_port {
 	int port_id;		/* file descriptor returned by umad_open() */
 	int class_agents[MAX_CLASS];	/* class2agent mapper */
+	int retries;
+	int timeout_ms;
 };
 
 #endif /* _MAD_INTERNAL_H_ */
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index 210f0c2..229020d 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -49,7 +49,7 @@ int ibdebug;
 
 static int mad_portid = -1;
 static int iberrs;
-
+	int timeout;
 static int madrpc_retries = MAD_DEF_RETRIES;
 static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS;
 static void *save_mad;
@@ -85,9 +85,17 @@ int madrpc_set_timeout(int timeout)
 	return 0;
 }
 
-int madrpc_def_timeout(void)
+int mad_rpc_set_retries(int retries, struct ibmad_port *srcport)
+{
+	if (retries > 0)
+		srcport->retries = retries;
+	return srcport->retries;
+}
+
+int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport)
 {
-	return def_madrpc_timeout;
+	srcport->timeout_ms = timeout_ms;
+	return 0;
 }
 
 int madrpc_portid(void)
@@ -102,14 +110,14 @@ int mad_rpc_portid(struct ibmad_port *srcport)
 
 static int
 _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
-	   int timeout)
+	   int timeout, const struct ibmad_port *srcport)
 {
 	uint32_t trid;		/* only low 32 bits */
-	int retries;
+	int retries, max_retries;
 	int length, status;
 
 	if (!timeout)
-		timeout = def_madrpc_timeout;
+		timeout = srcport ? srcport->timeout_ms : def_madrpc_timeout;
 
 	if (ibdebug > 1) {
 		IBWARN(">>> sending: len %d pktsz %zu", len, umad_size() + len);
@@ -125,7 +133,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
 	trid =
 	    (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F);
 
-	for (retries = 0; retries < madrpc_retries; retries++) {
+	max_retries = srcport ? srcport->retries : madrpc_retries;
+	for (retries = 0; retries < max_retries; retries++) {
 		if (retries) {
 			ERRS("retry %d (timeout %d ms)", retries, timeout);
 		}
@@ -178,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport
 
 	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 			      port->class_agents[rpc->mgtclass],
-			      len, rpc->timeout)) < 0) {
+			      len, rpc->timeout, port)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
 	}
@@ -217,7 +226,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t *
 
 	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
 			      port->class_agents[rpc->mgtclass],
-			      len, rpc->timeout)) < 0) {
+			      len, rpc->timeout, port)) < 0) {
 		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
 		return 0;
 	}
@@ -356,6 +365,8 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
 	}
 
 	p->port_id = port_id;
+	p->retries = MAD_DEF_RETRIES;
+	p->timeout_ms = MAD_DEF_TIMEOUT_MS;
 	return p;
 }
 
-- 
1.5.4.5
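
The precedence the rpc.c hunks above introduce (per-call timeout first, then
the per-port value, then the process-wide default) can be sketched in
isolation as follows. This is only an illustration: struct demo_port, the
demo_* functions, and the DEMO_* default values are hypothetical stand-ins,
not libibmad symbols, and the numeric defaults are chosen arbitrarily.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DEF_RETRIES    3		/* illustrative stand-in value */
#define DEMO_DEF_TIMEOUT_MS 1000	/* illustrative stand-in value */

/* Hypothetical port object holding only the two new per-port fields. */
struct demo_port {
	int retries;
	int timeout_ms;
};

/* Process-wide defaults, consulted only when no port is supplied. */
static int demo_global_retries = DEMO_DEF_RETRIES;
static int demo_global_timeout_ms = DEMO_DEF_TIMEOUT_MS;

/* Give a freshly opened port the default settings. */
static void demo_port_init(struct demo_port *p)
{
	assert(p != NULL);
	p->retries = DEMO_DEF_RETRIES;
	p->timeout_ms = DEMO_DEF_TIMEOUT_MS;
}

/* Only positive retry counts are accepted; returns the value in effect. */
static int demo_set_retries(int retries, struct demo_port *p)
{
	if (retries > 0)
		p->retries = retries;
	return p->retries;
}

/* Stores the timeout unconditionally, as the patch does. */
static int demo_set_timeout(int timeout_ms, struct demo_port *p)
{
	p->timeout_ms = timeout_ms;
	return 0;
}

/* Per-call value wins; otherwise the port value, else the global one. */
static int demo_effective_timeout(int rpc_timeout, const struct demo_port *p)
{
	if (rpc_timeout)
		return rpc_timeout;
	return p ? p->timeout_ms : demo_global_timeout_ms;
}

/* Port value if a port is given, else the global retry count. */
static int demo_effective_retries(const struct demo_port *p)
{
	return p ? p->retries : demo_global_retries;
}
```

In this sketch a port timeout of 0 is passed through unchanged when the
per-call timeout is also 0, so a caller that stores an unset timeout on the
port would end up with 0 ms rather than the default.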


From weiny2 at llnl.gov  Fri Feb 20 14:45:29 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 20 Feb 2009 14:45:29 -0800
Subject: [ofa-general] [PATCH 12/10] infiniband-diags: convert ibnetdiscover
 to "new"
 ibmad interface WAS: [PATCH 0/10] libibmad/infiniband-diags -- converting 
 to "new" interface.
In-Reply-To: <20090220092350.7ee3ddab.weiny2@llnl.gov>
References: <20090219190520.c18280e1.weiny2@llnl.gov>
	<f0e08f230902200555u6d4b6791s9540edc6dc25aed7@mail.gmail.com>
	<20090220092350.7ee3ddab.weiny2@llnl.gov>
Message-ID: <20090220144529.018d8675.weiny2@llnl.gov>

On Fri, 20 Feb 2009 09:23:50 -0800
Ira Weiny <weiny2 at llnl.gov> wrote:

> On Fri, 20 Feb 2009 08:55:57 -0500
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> 
> > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > > Here is v2 of the patch series.
> > >
> > > I used __attribute__ ((deprecated)) on the functions which should aid others
> > > in realizing that these functions will go away.  (It sure helped me to convert
> > > all the diags.)
> > >
> > > Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the
> > > new interface and I am hoping it will be accepted soon.
> > 
> > A related issue is whether ibnetdiscover will support both the new
> > library and the old way until the library is more proven via some
> > build option. If it is to support both, then converting it should be
> > done.
> 
> The conversion is easy.  I will do it for now to remove the build warnings.
> And now that I think about it more leaving in the old and new code to be
> chosen via configure is probably not a bad idea.  I don't know what is going
> to happen once we standardize on the mad library for decoding strings.  There
> are some incompatibilities there (i.e. 1x vs 1X and 2.5Gbps vs SDR etc.)
> 
> I will say, however, that I tested the library extensively and the first
> version's output was identical to the old version with the sole exception of
> the order ports were printed in.  :-D  So my confidence is high it will be
> accepted sooner rather than later.
> 
> Ira


Patch below:

>From ad8cbf227a803d64c02872f74d7d542b815c6092 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 20 Feb 2009 14:43:48 -0800
Subject: [PATCH] infiniband-diags: convert ibnetdiscover to "new" ibmad interface


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibnetdiscover.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 466d522..8a840be 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -53,6 +53,8 @@
 #include "grouping.h"
 #include "ibdiag_common.h"
 
+struct ibmad_port *srcport;
+
 static char *node_type_str[] = {
 	"???",
 	"ca",
@@ -143,7 +145,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid)
 
 	port->portnum = portnum;
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout,
+			srcport))
 		return -1;
 	decode_port_info(pi, port);
 
@@ -162,7 +165,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 	void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc;
 	void *si = switchinfo;
 
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout))
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport))
 		return -1;
 
 	mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid);
@@ -176,10 +179,10 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 	port->portnum = node->localport;
 	port->portguid = node->portguid;
 
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout))
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport))
 		return -1;
 
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport))
 		return -1;
 	decode_port_info(pi, port);
 
@@ -190,11 +193,12 @@ get_node(Node *node, Port *port, ib_portid_t *portid)
 	node->smalmc = port->lmc;
 
 	/* after we have the sma information find out the real PortInfo for this port */
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout))
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->localport,
+			timeout, srcport))
 	        return -1;
 	decode_port_info(pi, port);
 
-        if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
+        if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport))
                 node->smaenhsp0 = 0;	/* assume base SP0 */
 	else
         	mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0);
@@ -985,7 +989,11 @@ int main(int argc, char **argv)
 	if (argc && !(f = fopen(argv[0], "w")))
 		IBERROR("can't open file %s for writing", argv[0]);
 
-	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
+	if (!srcport)
+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
+	mad_rpc_set_timeout(ibd_timeout, srcport);
+
 	node_name_map = open_node_name_map(node_name_map_file);
 
 	if (discover(&my_portid) < 0)
@@ -1000,5 +1008,6 @@ int main(int argc, char **argv)
 		dump_topology(list, group);
 
 	close_node_name_map(node_name_map);
+	mad_rpc_close_port(srcport);
 	exit(0);
 }
-- 
1.5.4.5


From arlin.r.davis at intel.com  Fri Feb 20 18:59:37 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Fri, 20 Feb 2009 18:59:37 -0800
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: <499E25DE.5020703@cs.anu.edu.au>
References: <499CBEF2.2010909@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A7C6173@orsmsx506.amr.corp.intel.com>
	<499E25DE.5020703@cs.anu.edu.au>
Message-ID: <E3280858FA94444CA49D2BA02341C9833A83B978@orsmsx506.amr.corp.intel.com>

 
>
>> Do you have receives posted at the remote side for immed data?
>>
>Nope, the remote side didn't get an event (dat_evd_wait timed out).
>The way to find out the immed data is to check the outgoing
>parameter &event of the dat_evd_wait function.

I don't understand your answer. Do you have a receive buffer pre-posted
on the EP to receive the inbound immediate data? Just waiting on the
event is not enough. For immediate data you don't need a buffer associated
with the work request, but you do need a work request posted for each
inbound RDMA write with immediate data that is expected.
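
As a rough illustration of that rule, here is a toy model in C. It is not
uDAPL or verbs code: struct toy_ep and the toy_* functions are made up for
this sketch, and it only captures the bookkeeping (each inbound write with
immediate data consumes one posted receive and produces one event). What
the hardware does when no receive is posted is simplified to an error return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy endpoint: counts posted receives and delivered events. */
struct toy_ep {
	int posted_recvs;	/* receive work requests currently posted */
	int events;		/* completions delivered to the waiter */
	uint32_t last_immed;	/* immediate data of the latest completion */
};

/* Post one receive work request; no data buffer is modeled. */
static void toy_post_recv(struct toy_ep *ep)
{
	assert(ep != NULL);
	ep->posted_recvs++;
}

/*
 * Model an inbound RDMA write with immediate data.
 * Returns 0 if an event was generated, -1 if no receive was posted.
 */
static int toy_inbound_write_immed(struct toy_ep *ep, uint32_t immed)
{
	assert(ep != NULL);
	if (ep->posted_recvs == 0)
		return -1;	/* nothing to consume, so no event */
	ep->posted_recvs--;
	ep->events++;
	ep->last_immed = immed;
	return 0;
}
```

With no receive posted, the waiter in this model never sees an event, which
matches the timeout described above.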

-arlin


From vuhuong at mellanox.com  Sat Feb 21 01:33:19 2009
From: vuhuong at mellanox.com (Vu Pham)
Date: Sat, 21 Feb 2009 01:33:19 -0800
Subject: [ofa-general] NFSRDMA connectathon prelim. testing status,
Message-ID: <499FCA5F.5070604@mellanox.com>

Hi Tom,

I have both the nfsrdma client and server on a 2.6.29-rc5 kernel with 
nfs-utils-1.1.4. I'm using both InfiniHost III (ib_mthca) and ConnectX 
(mlx4_ib) HCAs.
I have seen several problems during my testing at NFS Connectathon 2009:

1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
client cannot mount. After talking to Tom Talpey and scanning the code, I 
saw that the xprtrdma module uses ib_reg_phys_mr() and that the mlx4_ib 
verbs provider does not implement this verb.
If I have the client on mlx4_ib and the server on ib_mthca, I hit the 
following crash because of bad error handling in xprtrdma (see the attached 
file mlx4_mount_problem.log).

Because of this problem, I used InfiniHost III (ib_mthca) for all of my 
tests at Connectathon.

2. Testing the Linux nfsrdma client against both Linux and OpenSolaris 
nfsrdma servers, I hit a hung-process problem during Connectathon's 
lock test (see the attached files sync_page_1.log and sync_page_2.log). 
I can only reproduce it when I run Connectathon for more than 500 
iterations (-N 1000).
I can NOT reproduce the problem with the NFS client/server over IPoIB.

3. Testing the OpenSolaris nfsrdma client against the Linux nfsrdma server, I 
hit the following BUG_ON() right away (see the attached file svcrdma_send.log).

thanks,
-vu
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mlx4_mount_problem.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090221/c3e9d764/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sync_page_1.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090221/c3e9d764/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sync_page_2.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090221/c3e9d764/attachment-0002.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: svcrdma_send.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090221/c3e9d764/attachment-0003.ksh>

From vlad at lists.openfabrics.org  Sat Feb 21 03:18:08 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 21 Feb 2009 03:18:08 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090221-0200 daily build status
Message-ID: <20090221111808.325D4E61072@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From jackm at dev.mellanox.co.il  Sat Feb 21 23:09:11 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 09:09:11 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <ada3ae9zjo2.fsf@cisco.com>
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<ada3ae9zjo2.fsf@cisco.com>
Message-ID: <200902220909.11784.jackm@dev.mellanox.co.il>

On Friday 20 February 2009 08:50, Roland Dreier wrote:
> What test are you using to hit this race?  Are you using a distro kernel
> with OFED?
> 
I ran on RHEL5.2, with a ConnectX card, using the following test (source given at the end of this post):

1. Start the driver.
2. In one console window, compile (just gcc) and run the app below which prints out pkeys
   in a tight loop via libsysfs.
3. In another console window, run the bash script below (which loads/unloads the driver, with some
   time randomization added).

After a few hours of this test, I got a kernel panic, and adding a mutex to make the low-level driver
access atomic (wrt ib_core) for showing pkeys fixed the problem entirely.

When I added printouts to the low-level driver and to sysfs.c (printout in procedure show_port_pkey
just before call to ib_query_pkey), I noticed that the crash occurred as follows
(note that mlx4_ib is not in the list of loaded modules, and that the paging request address failure
is in virtual function "query_pkey"):

ENTERING mlx4_ib_remove: ibdev = ffff81010dfdf800
show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=127
show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=126
show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=125
...
show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=79
show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=78
ib_device_unregister_sysfs: ibd=ffff81010dfdf800, portnum=1
ib_device_unregister_sysfs: ibd=ffff81010dfdf800, portnum=2
LEAVING mlx4_ib_remove: ibdev = ffff81010dfdf800
Unable to handle kernel paging request at ffffffff88422f53 RIP:
 [<ffffffff88422f53>]
PGD 203067 PUD 205063 PMD 11658b067 PTE 0
Oops: 0010 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/ports/1/pkeys/78
CPU 0
Modules linked in: ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_core(U) ib_mad(U)
ib_core(U) hfsplus netconsole nfsd exportfs auth_rpcgss autofs4 hidp nfs lockd fscache nfs_acl
rfcomm l2cap bluetooth sunrpc ipoib_helper(U) ipv6 xfrm_nalgo crypto_api dm_mirror
dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac
parport_pc lp parport i2c_piix4 ide_cd k8_edac cdrom edac_mc i2c_core k8temp hwmon sg bnx2
serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache sata_svw libata
shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 26829, comm: opensm Tainted: G      2.6.18-128.el5 #1
RIP: 0010:[<ffffffff88422f53>]  [<ffffffff88422f53>]
RSP: 0018:ffff810212a27e58  EFLAGS: 00010246
RAX: ffff81010ccec180 RBX: ffff81012194bc80 RCX: 0000000000000000
RDX: ffff81010ccec180 RSI: 0000000000000202 RDI: ffff81010ccec280
RBP: ffff81010da7d000 R08: ffff810212a26000 R09: 000000000000003c
R10: ffff810123f88800 R11: 0000000000000001 R12: ffff810115354701
R13: 000000000000004e R14: ffff81010dfdf800 R15: ffff810212a27ea6
FS:  00002ad1a47afc00(0000) GS:ffffffff803ac000(0000) knlGS:00000000f75fdb90
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff88422f53 CR3: 0000000121354000 CR4: 00000000000006e0
Process opensm (pid: 26829, threadinfo ffff810212a26000, task ffff81021c4dc820)
Stack:  00000010000280d0 ffff81012194bc80 ffff81010da7d000 ffff810115354740
 ffff810212a27f50 ffffffff882665e0 ffff81012194bc80 ffffffff88256e71
 ffff810212a27f50 ffffffff882665e0 ffff810115bcbc90 ffff81010f8ef140
Call Trace:
 [<ffffffff88256e71>] :ib_core:show_port_pkey+0x59/0x7d
 [<ffffffff80107068>] sysfs_read_file+0xa5/0x13f
 [<ffffffff8000b3f3>] vfs_read+0xcb/0x171
 [<ffffffff800117d4>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code:  Bad RIP value.
RIP  [<ffffffff88422f53>]
 RSP <ffff810212a27e58>
CR2: ffffffff88422f53
 <0>Kernel panic - not syncing: Fatal exception

- Jack
=================================
1. Pkeys print app:

/*
 * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer in the documentation and/or other materials
 *        provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 */

#define _GNU_SOURCE

#if HAVE_CONFIG_H
#  include <config.h>
#endif /* HAVE_CONFIG_H */

#include <inttypes.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdarg.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <endian.h>
#include <byteswap.h>
#include <sys/poll.h>
#include <syslog.h>
#include <netinet/in.h>
#include <errno.h>

static int
ret_code(void)
{
	int e = errno;

	if (e > 0)
		return -e;
	return e;
}

int
sys_read_string(char *dir_name, char *file_name, char *str, int max_len)
{
	char path[256], *s;
	int fd, r;

	snprintf(path, sizeof(path), "%s/%s", dir_name, file_name);

	if ((fd = open(path, O_RDONLY)) < 0)
		return ret_code();

	if ((r = read(fd, str, max_len)) < 0) {
		int e = errno;
		close(fd);
		errno = e;
		return ret_code();
	}

	str[(r < max_len) ? r : max_len - 1] = 0;

	if ((s = strrchr(str, '\n')))
		*s = 0;

	close(fd);
	return 0;
}

int
sys_read_uint(char *dir_name, char *file_name, unsigned *u)
{
	char buf[32];
	int r;

	if ((r = sys_read_string(dir_name, file_name, buf, sizeof(buf))) < 0)
		return r;

	*u = strtoul(buf, 0, 0);

	return 0;
}

int main()
{
	int i;
	char *path = "/sys/class/infiniband/mlx4_0/ports/1/pkeys";
	char pkey_is[20];
	unsigned u;

	while (1) 
		for (i = 127; i >= 0; --i) {
		   sprintf(pkey_is, "%d",i);
		   if (sys_read_uint(path, pkey_is, &u)) {
				sleep(1);
				break;
		   }
		   printf("%d: %u\n",i, u);
		}
	return 0;	
}
========================================================
Bash driver up-down script:

#!/bin/bash -x
i=0
while true; do
        echo iteration number $i; date
        /etc/init.d/openibd start
        opensm &
        sleep 10.$RANDOM
        pkill -9 opensm
        wait
        /etc/init.d/openibd stop
        let i=$i+1
done


From rdreier at cisco.com  Sat Feb 21 23:15:24 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 21 Feb 2009 23:15:24 -0800
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <200902220909.11784.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sun, 22 Feb 2009 09:09:11 +0200")
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<ada3ae9zjo2.fsf@cisco.com>
	<200902220909.11784.jackm@dev.mellanox.co.il>
Message-ID: <adahc2nvt6b.fsf@cisco.com>

 > I ran on RHEL5.2 ...

I suspect that at some point in the 2+ years since 2.6.18 more locking
was added to sysfs so that this race no longer exists.  You could try
and see if my test (add a sleep to the show method and make sure you
remove the low-level driver during that window) results in an instant
crash with the RHEL 5.2 kernel.

 - R.


From eli at dev.mellanox.co.il  Sat Feb 21 23:39:54 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 22 Feb 2009 09:39:54 +0200
Subject: [ofa-general] Re: [ewg] iscsi initiator ipoib+lro crash on upstream
	kernel
In-Reply-To: <15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com>
References: <20090219165505.GA13617@mtls03>
	<15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com>
Message-ID: <4e6a6b3c0902212339t603b13ccs17160a893b0892e4@mail.gmail.com>

Thanks!

On Thu, Feb 19, 2009 at 9:40 PM, Or Gerlitz <or.gerlitz at gmail.com> wrote:
> On Thu, Feb 19, 2009 at 6:55 PM, Eli Cohen <eli at dev.mellanox.co.il> wrote:
>
>> I have encountered a kernel crash when running a iSCSI initiator on
>> IPoIB configured with LRO (if LRO is off it does not happen). This
>> was seen first on Sles10sp2 but then I verified it happens on 2.6.28.2 too.
>
> Eli,
>
> This is a known issue
> (http://bugzilla.kernel.org/show_bug.cgi?id=11804) a fix was submitted
> upstream and would be included in the next kernel.
>
> Or.
>


From jackm at dev.mellanox.co.il  Sun Feb 22 00:15:45 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 10:15:45 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <adahc2nvt6b.fsf@cisco.com>
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902220909.11784.jackm@dev.mellanox.co.il>
	<adahc2nvt6b.fsf@cisco.com>
Message-ID: <200902221015.46090.jackm@dev.mellanox.co.il>

On Sunday 22 February 2009 09:15, Roland Dreier wrote:
>  > I ran on RHEL5.2 ...
> 
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists.  You could try
> and see if my test (add a sleep to the show method and make sure you
> remove the low-level driver during that window) results in an instant
> crash with the RHEL 5.2 kernel.
> 
>  - R.

You're right -- your test does crash the RHEL5.2 kernel, with the appropriate
stack dump (page fault for query_pkey low-level driver function).

I'll try to determine in which kernel this was fixed.

- Jack


From vlad at lists.openfabrics.org  Sun Feb 22 03:15:03 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 22 Feb 2009 03:15:03 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090222-0200 daily build status
Message-ID: <20090222111503.68A54E6104C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From jackm at dev.mellanox.co.il  Sun Feb 22 03:37:00 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 13:37:00 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <adahc2nvt6b.fsf@cisco.com>
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902220909.11784.jackm@dev.mellanox.co.il>
	<adahc2nvt6b.fsf@cisco.com>
Message-ID: <200902221337.01172.jackm@dev.mellanox.co.il>

On Sunday 22 February 2009 09:15, Roland Dreier wrote:
>  > I ran on RHEL5.2 ...
> 
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists.  You could try
> and see if my test (add a sleep to the show method and make sure you
> remove the low-level driver during that window) results in an instant
> crash with the RHEL 5.2 kernel.
> 
>  - R.
> 
There is still a problem, which we do not see with ConnectX (because of the separation between
mlx4_ib and mlx4_core -- and we are unloading only mlx4_ib, leaving all the mlx4_core infrastructure intact).

I tried your test with a Sinai card (mthca), and got the following kernel Oops (on kernel 2.6.27.4).
(Note that ib_mthca is still loaded, but with "(-)" following).

- Jack
======================

enter show_port_pkey
call ib_query_pkey
BUG: unable to handle kernel paging request at ffffc20000648698
IP: [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
PGD 7fc59067 PUD 11fc30067 PMD 11ff34067 PTE 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: rdma_ucm rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs
ib_umad iw_nes mlx4_en inet_lro mlx4_ib mlx4_core ib_mthca(-) ib_mad ib_core memtrack mst_pciconf
mst_pci nfsd auth_rpcgss exportfs autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6
dm_mirror dm_log dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport rtc_cmos
ide_cd_mod floppy sg button rtc_core cdrom serio_raw i2c_nforce2 rtc_lib k8temp shpchp forcedeth i2c_core
hwmon pcspkr sata_nv libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: inet_lro]
Pid: 23114, comm: cat Not tainted 2.6.27.4 #1
RIP: 0010:[<ffffffffa0217278>]  [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
RSP: 0018:ffff88011695bcd8  EFLAGS: 00010246
RAX: ffffc20000648680 RBX: ffff8800734b6000 RCX: 0000000000000001
RDX: ffffffff8021072e RSI: 000000005102d000 RDI: ffff8800734b66c8
RBP: 0000000000000001 R08: 0000000000000003 R09: 0000000000000024
R10: ffff88011695be5f R11: 000000000000ea60 R12: 000000000000ffff
R13: 000000005102d000 R14: 000000007e57b000 R15: 000000005102d003
FS:  00007fb5ad5c86f0(0000) GS:ffffffff806fca80(0000) knlGS:00000000f735fb90
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffc20000648698 CR3: 000000006f897000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 23114, threadinfo ffff88011695a000, task ffff88011f4ec240)
Stack:  000000005fe51850 002488006b0c5e80 ffff88005fe5ffff ffff88006b0c5e80
 ffff88006b0c5e80 002488005fe51840 0000000000000296 ffff8800734b6000
 0000000000000024 000000005102d003 ffff88011695bdb8 0000000000000001
Call Trace:
 [<ffffffffa021758c>] ? mthca_cmd_poll+0x61/0x118 [ib_mthca]
 [<ffffffffa021775f>] ? mthca_cmd_box+0x5d/0x62 [ib_mthca]
 [<ffffffffa0219c21>] ? mthca_MAD_IFC+0x171/0x1bc [ib_mthca]
 [<ffffffffa0225254>] ? mthca_query_pkey+0x103/0x18a [ib_mthca]
 [<ffffffff8023da38>] ? process_timeout+0x0/0x5
 [<ffffffffa01e9e7b>] ? show_port_pkey+0x4f/0x74 [ib_core]
 [<ffffffff802d7a1a>] ? sysfs_read_file+0xa8/0x12f
 [<ffffffff80291560>] ? vfs_read+0xaa/0x133
 [<ffffffff80291847>] ? sys_read+0x45/0x6e
 [<ffffffff8020be0b>] ? system_call_fastpath+0x16/0x1b


Code: c0 48 87 02 e8 73 89 26 e0 48 8b 83 98 06 00 00 8b 40 18 66 85 c0 79
0c 48 8b 05 14 66 53 e0 4c 39 e0 78 d2 48 8b 83 98 06 00 00 <8b> 40 18 66 85
c0 41 bc f5 ff ff ff 0f 88 b4 00 00 00 4c 89 e8
RIP  [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
 RSP <ffff88011695bcd8>
CR2: ffffc20000648698
---[ end trace 7cb234a047e4a788 ]---


From jackm at dev.mellanox.co.il  Sun Feb 22 08:04:21 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 18:04:21 +0200
Subject: [ofa-general] [PATCH] ib_core: avoid race condition between sysfs
	access and low-level module unload
Message-ID: <200902221804.21627.jackm@dev.mellanox.co.il>

In newer kernels, a low-level module will not be unloaded
while its sysfs interface is being accessed, so its code pages will be available
for the sysfs access. However, nothing prevents the low-level module from freeing
its memory resources during such access.  This can cause a kernel Oops.

To avoid this, we protect the device reg_state with a mutex, and perform
all sysfs operations (show, store) atomically within this mutex by locking the
mutex, testing whether the device is still "alive", and only if it is, invoking
low-level module functions -- and finally, releasing the mutex.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

---

Roland,
I think this patch is a reasonable solution to the sysfs problem of a low-level
driver module being unloaded while sysfs is being accessed for the device.

ib_unregister_device() is always called before the device driver frees up its
resources.  Since this patch makes sysfs accesses atomic wrt the device registration
state, it solves the problem of the race between freeing device resources and
accessing the low-level to retrieve device data.
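
As a rough user-space illustration of the locking pattern (not the kernel
code itself), the sketch below uses pthreads. All names in it (struct
fake_dev, fake_query, fake_dev_init, fake_show, fake_unregister) are
hypothetical and invented for this example; "registered" stands in for
reg_state and "query" for a low-level driver method.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct ib_device. */
struct fake_dev {
	pthread_mutex_t sysfs_mutex;
	int registered;			/* stand-in for reg_state */
	int (*query)(int *val);		/* stand-in for a low-level method */
};

/* Hypothetical low-level driver method: returns a fixed value. */
static int fake_query(int *val)
{
	*val = 0xffff;
	return 0;
}

static void fake_dev_init(struct fake_dev *dev)
{
	pthread_mutex_init(&dev->sysfs_mutex, NULL);
	dev->registered = 1;
	dev->query = fake_query;
}

/* Reader side: lock, test liveness, call the driver only if alive. */
static int fake_show(struct fake_dev *dev, int *val)
{
	int ret = -ENODEV;

	pthread_mutex_lock(&dev->sysfs_mutex);
	if (dev->registered)
		ret = dev->query(val);
	pthread_mutex_unlock(&dev->sysfs_mutex);
	return ret;
}

/* Unregister side: change the state under the same mutex. */
static void fake_unregister(struct fake_dev *dev)
{
	pthread_mutex_lock(&dev->sysfs_mutex);
	dev->registered = 0;
	pthread_mutex_unlock(&dev->sysfs_mutex);

	/* Only after this point may the driver's resources go away. */
	dev->query = NULL;
}
```

Once fake_unregister() has returned, any later fake_show() sees
registered == 0 and returns -ENODEV without touching the (now invalid)
method pointer; a fake_show() already inside the mutex finishes before
the state can change.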

(I ran checkpatch.pl on this, and I do have several lines slightly more than
 80 chars long -- but that's all).

Jack

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7913b80..6254202 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -172,9 +172,14 @@ static int end_port(struct ib_device *device)
  */
 struct ib_device *ib_alloc_device(size_t size)
 {
+	struct ib_device *ibdev;
+
 	BUG_ON(size < sizeof (struct ib_device));
 
-	return kzalloc(size, GFP_KERNEL);
+	ibdev = kzalloc(size, GFP_KERNEL);
+	if (ibdev)
+		mutex_init(&ibdev->sysfs_mutex);
+	return ibdev;
 }
 EXPORT_SYMBOL(ib_alloc_device);
 
@@ -305,9 +310,10 @@ int ib_register_device(struct ib_device *device)
 		goto out;
 	}
 
+	mutex_lock(&device->sysfs_mutex);
 	list_add_tail(&device->core_list, &device_list);
-
 	device->reg_state = IB_DEV_REGISTERED;
+	mutex_unlock(&device->sysfs_mutex);
 
 	{
 		struct ib_client *client;
@@ -353,7 +359,9 @@ void ib_unregister_device(struct ib_device *device)
 		kfree(context);
 	spin_unlock_irqrestore(&device->client_data_lock, flags);
 
+	mutex_lock(&device->sysfs_mutex);
 	device->reg_state = IB_DEV_UNREGISTERED;
+	mutex_unlock(&device->sysfs_mutex);
 }
 EXPORT_SYMBOL(ib_unregister_device);
 
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b43f7d3..29f0ce1 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -94,7 +94,7 @@ static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
 			  char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
+	ssize_t ret = -ENODEV;
 
 	static const char *state_name[] = {
 		[IB_PORT_NOP]		= "NOP",
@@ -105,26 +105,33 @@ static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
 		[IB_PORT_ACTIVE_DEFER]	= "ACTIVE_DEFER"
 	};
 
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
-
-	return sprintf(buf, "%d: %s\n", attr.state,
-		       attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
-		       state_name[attr.state] : "UNKNOWN");
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "%d: %s\n", attr.state,
+				      attr.state >= 0 &&
+				      attr.state < ARRAY_SIZE(state_name) ?
+				      state_name[attr.state] : "UNKNOWN");
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused,
 			char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "0x%x\n", attr.lid);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "0x%x\n", attr.lid);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t lid_mask_count_show(struct ib_port *p,
@@ -132,52 +139,64 @@ static ssize_t lid_mask_count_show(struct ib_port *p,
 				   char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "%d\n", attr.lmc);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "%d\n", attr.lmc);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused,
 			   char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "0x%x\n", attr.sm_lid);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "0x%x\n", attr.sm_lid);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused,
 			  char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "%d\n", attr.sm_sl);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "%d\n", attr.sm_sl);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused,
 			     char *buf)
 {
 	struct ib_port_attr attr;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "0x%08x\n", attr.port_cap_flags);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret)
+			ret = sprintf(buf, "0x%08x\n", attr.port_cap_flags);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused,
@@ -186,24 +205,33 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused,
 	struct ib_port_attr attr;
 	char *speed = "";
 	int rate;
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
-
-	switch (attr.active_speed) {
-	case 2: speed = " DDR"; break;
-	case 4: speed = " QDR"; break;
+	ssize_t ret = -ENODEV;
+
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret) {
+			switch (attr.active_speed) {
+			case 2: speed = " DDR"; break;
+			case 4: speed = " QDR"; break;
+			}
+
+			rate = 25 * ib_width_enum_to_int(attr.active_width) *
+				attr.active_speed;
+			if (rate < 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			ret = sprintf(buf, "%d%s Gb/sec (%dX%s)\n",
+				      rate / 10, rate % 10 ? ".5" : "",
+				      ib_width_enum_to_int(attr.active_width),
+				      speed);
+		}
 	}
-
-	rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed;
-	if (rate < 0)
-		return -EINVAL;
-
-	return sprintf(buf, "%d%s Gb/sec (%dX%s)\n",
-		       rate / 10, rate % 10 ? ".5" : "",
-		       ib_width_enum_to_int(attr.active_width), speed);
+out:
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused,
@@ -211,22 +239,26 @@ static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused,
 {
 	struct ib_port_attr attr;
 
-	ssize_t ret;
-
-	ret = ib_query_port(p->ibdev, p->port_num, &attr);
-	if (ret)
-		return ret;
-
-	switch (attr.phys_state) {
-	case 1:  return sprintf(buf, "1: Sleep\n");
-	case 2:  return sprintf(buf, "2: Polling\n");
-	case 3:  return sprintf(buf, "3: Disabled\n");
-	case 4:  return sprintf(buf, "4: PortConfigurationTraining\n");
-	case 5:  return sprintf(buf, "5: LinkUp\n");
-	case 6:  return sprintf(buf, "6: LinkErrorRecovery\n");
-	case 7:  return sprintf(buf, "7: Phy Test\n");
-	default: return sprintf(buf, "%d: <unknown>\n", attr.phys_state);
+	ssize_t ret = -ENODEV;
+
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_port(p->ibdev, p->port_num, &attr);
+		if (!ret) {
+			switch (attr.phys_state) {
+			case 1:  ret = sprintf(buf, "1: Sleep\n"); break;
+			case 2:  ret = sprintf(buf, "2: Polling\n"); break;
+			case 3:  ret = sprintf(buf, "3: Disabled\n"); break;
+			case 4:  ret = sprintf(buf, "4: PortConfigurationTraining\n"); break;
+			case 5:  ret = sprintf(buf, "5: LinkUp\n"); break;
+			case 6:  ret = sprintf(buf, "6: LinkErrorRecovery\n"); break;
+			case 7:  ret = sprintf(buf, "7: Phy Test\n"); break;
+			default: ret = sprintf(buf, "%d: <unknown>\n", attr.phys_state);
+			}
+		}
 	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static PORT_ATTR_RO(state);
@@ -256,13 +288,16 @@ static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 	struct port_table_attribute *tab_attr =
 		container_of(attr, struct port_table_attribute, attr);
 	union ib_gid gid;
-	ssize_t ret;
-
-	ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "%pI6\n", gid.raw);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid);
+		if (!ret)
+			ret = sprintf(buf, "%pI6\n", gid.raw);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
@@ -271,13 +306,16 @@ static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 	struct port_table_attribute *tab_attr =
 		container_of(attr, struct port_table_attribute, attr);
 	u16 pkey;
-	ssize_t ret;
-
-	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
-	if (ret)
-		return ret;
+	ssize_t ret = -ENODEV;
 
-	return sprintf(buf, "0x%04x\n", pkey);
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (ibdev_is_alive(p->ibdev)) {
+		ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
+		if (!ret)
+			ret = sprintf(buf, "0x%04x\n", pkey);
+	}
+	mutex_unlock(&p->ibdev->sysfs_mutex);
+	return ret;
 }
 
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)			\
@@ -300,6 +338,12 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 	if (!p->ibdev->process_mad)
 		return sprintf(buf, "N/A (no PMA)\n");
 
+	mutex_lock(&p->ibdev->sysfs_mutex);
+	if (!ibdev_is_alive(p->ibdev)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
 	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
 	if (!in_mad || !out_mad) {
@@ -346,7 +390,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 out:
 	kfree(in_mad);
 	kfree(out_mad);
-
+	mutex_unlock(&p->ibdev->sysfs_mutex);
 	return ret;
 }
 
@@ -579,20 +623,20 @@ static ssize_t show_sys_image_guid(struct device *device,
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
 	struct ib_device_attr attr;
-	ssize_t ret;
-
-	if (!ibdev_is_alive(dev))
-		return -ENODEV;
-
-	ret = ib_query_device(dev, &attr);
-	if (ret)
-		return ret;
-
-	return sprintf(buf, "%04x:%04x:%04x:%04x\n",
-		       be16_to_cpu(((__be16 *) &attr.sys_image_guid)[0]),
-		       be16_to_cpu(((__be16 *) &attr.sys_image_guid)[1]),
-		       be16_to_cpu(((__be16 *) &attr.sys_image_guid)[2]),
-		       be16_to_cpu(((__be16 *) &attr.sys_image_guid)[3]));
+	ssize_t ret = -ENODEV;
+
+	mutex_lock(&dev->sysfs_mutex);
+	if (ibdev_is_alive(dev)) {
+		ret = ib_query_device(dev, &attr);
+		if (!ret)
+			ret = sprintf(buf, "%04x:%04x:%04x:%04x\n",
+				      be16_to_cpu(((__be16 *) &attr.sys_image_guid)[0]),
+				      be16_to_cpu(((__be16 *) &attr.sys_image_guid)[1]),
+				      be16_to_cpu(((__be16 *) &attr.sys_image_guid)[2]),
+				      be16_to_cpu(((__be16 *) &attr.sys_image_guid)[3]));
+	}
+	mutex_unlock(&dev->sysfs_mutex);
+	return ret;
 }
 
 static ssize_t show_node_guid(struct device *device,
@@ -624,17 +668,20 @@ static ssize_t set_node_desc(struct device *device,
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
 	struct ib_device_modify desc = {};
-	int ret;
+	int ret = -ENODEV;
 
 	if (!dev->modify_device)
 		return -EIO;
 
 	memcpy(desc.node_desc, buf, min_t(int, count, 64));
-	ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc);
-	if (ret)
-		return ret;
-
-	return count;
+	mutex_lock(&dev->sysfs_mutex);
+	if (ibdev_is_alive(dev)) {
+		ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc);
+		if (!ret)
+			ret = count;
+	}
+	mutex_unlock(&dev->sysfs_mutex);
+	return ret;
 }
 
 static DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL);
@@ -662,14 +709,18 @@ static ssize_t show_protocol_stat(const struct device *device,
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
 	union rdma_protocol_stats stats;
-	ssize_t ret;
-
-	ret = dev->get_protocol_stats(dev, &stats);
-	if (ret)
-		return ret;
-
-	return sprintf(buf, "%llu\n",
-		       (unsigned long long) ((u64 *) &stats)[offset]);
+	ssize_t ret = -ENODEV;
+
+	mutex_lock(&dev->sysfs_mutex);
+	if (ibdev_is_alive(dev)) {
+		ret = dev->get_protocol_stats(dev, &stats);
+		if (!ret)
+			ret = sprintf(buf, "%llu\n",
+				      (unsigned long long)
+				      ((u64 *) &stats)[offset]);
+	}
+	mutex_unlock(&dev->sysfs_mutex);
+	return ret;
 }
 
 /* generate a read-only iwarp statistics attribute */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 936e333..3b2768c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -47,6 +47,7 @@
 #include <linux/list.h>
 #include <linux/rwsem.h>
 #include <linux/scatterlist.h>
+#include <linux/mutex.h>
 
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
@@ -1143,6 +1144,7 @@ struct ib_device {
 		IB_DEV_REGISTERED,
 		IB_DEV_UNREGISTERED
 	}                            reg_state;
+	struct mutex		     sysfs_mutex;
 
 	u64			     uverbs_cmd_mask;
 	int			     uverbs_abi_ver;


From rdreier at cisco.com  Sun Feb 22 20:05:19 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Feb 2009 20:05:19 -0800
Subject: [ofa-general] [PATCH] IB/ipath: Fix memory leak in
	init_shadow_tids() error path
Message-ID: <adad4d9x0g0.fsf@cisco.com>

If the second vmalloc() fails, the wrong pointer is passed to vfree(), so
the first vmalloc() ends up getting leaked.

This was spotted by the Coverity checker (CID 2709).

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
Unless someone objects I'll merge this for 2.6.30.

 drivers/infiniband/hw/ipath/ipath_init_chip.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 64aeefb..077879c 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -455,7 +455,7 @@ static void init_shadow_tids(struct ipath_devdata *dd)
 	if (!addrs) {
 		ipath_dev_err(dd, "failed to allocate shadow dma handle "
 			      "array, no expected sends!\n");
-		vfree(dd->ipath_pageshadow);
+		vfree(pages);
 		dd->ipath_pageshadow = NULL;
 		return;
 	}
-- 
1.6.0.4
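
[Editor's illustration, not part of the patch] The error path described above can be sketched in userspace C. This is a minimal sketch under stated assumptions: malloc-family calls stand in for vmalloc()/vfree(), the struct and function names (shadow, shadow_init, shadow_free) are hypothetical, and the fail_second parameter exists only to simulate the second allocation failing. It assumes, as the patch suggests, that the struct field has not yet been assigned when the second allocation is attempted.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the two shadow arrays; not the ipath structures. */
struct shadow {
	void **pages;
	unsigned long *addrs;
};

/* Returns 0 on success, -1 on failure. On failure nothing stays allocated. */
static int shadow_init(struct shadow *s, size_t n, int fail_second)
{
	void **pages;
	unsigned long *addrs;

	pages = calloc(n, sizeof(*pages));
	if (!pages)
		return -1;

	/* fail_second simulates the second allocation failing. */
	addrs = fail_second ? NULL : calloc(n, sizeof(*addrs));
	if (!addrs) {
		/*
		 * Free the local that holds the first allocation. The
		 * struct field has not been assigned yet, so freeing
		 * s->pages here would leak the first buffer.
		 */
		free(pages);
		s->pages = NULL;
		s->addrs = NULL;
		return -1;
	}

	s->pages = pages;
	s->addrs = addrs;
	return 0;
}

/* Release both arrays and clear the fields. */
static void shadow_free(struct shadow *s)
{
	free(s->pages);
	free(s->addrs);
	s->pages = NULL;
	s->addrs = NULL;
}
```

Keeping both allocations in locals until both have succeeded makes it harder to free the wrong pointer in the error path.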


From rdreier at cisco.com  Sun Feb 22 20:17:00 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Feb 2009 20:17:00 -0800
Subject: [ofa-general] [PATCH] IB/ipath: Really run work in
	ipath_release_user_pages_on_close()
In-Reply-To: <adad4d9x0g0.fsf@cisco.com> (Roland Dreier's message of "Sun, 22
	Feb 2009 20:05:19 -0800")
References: <adad4d9x0g0.fsf@cisco.com>
Message-ID: <ada63j1wzwj.fsf@cisco.com>

ipath_release_user_pages_on_close() just allocated a structure to
schedule work with but just returned (leaking the structure) rather than
actually doing schedule_work().  Fix the logic to what was intended.

This was spotted by the Coverity checker (CID 2700).

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
I'm only 99% sure this patch is correct... so someone who knows please
review.

 drivers/infiniband/hw/ipath/ipath_user_pages.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index 0190edc..855911e 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -209,20 +209,20 @@ void ipath_release_user_pages_on_close(struct page **p, size_t num_pages)
 
 	mm = get_task_mm(current);
 	if (!mm)
-		goto bail;
+		return;
 
 	work = kmalloc(sizeof(*work), GFP_KERNEL);
 	if (!work)
 		goto bail_mm;
 
-	goto bail;
-
 	INIT_WORK(&work->work, user_pages_account);
 	work->mm = mm;
 	work->num_pages = num_pages;
 
+	schedule_work(&work->work);
+	return;
+
 bail_mm:
 	mmput(mm);
-bail:
 	return;
 }
-- 
1.6.0.4
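
[Editor's illustration, not part of the patch] The control flow the patch restores (allocate a work item, fill it in, hand it to a queue) can be sketched in userspace C. This is a toy model under stated assumptions: the singly linked toy_queue stands in for the kernel workqueue, and every name here (toy_work, toy_schedule, toy_run_queue, toy_release_on_close) is hypothetical. What the real user_pages_account() does is not modelled; the handler here just adds up page counts.

```c
#include <stdlib.h>

/* Toy work item: a callback plus the data it needs. */
struct toy_work {
	void (*fn)(struct toy_work *);
	long num_pages;
	struct toy_work *next;
};

static struct toy_work *toy_queue;	/* pending work, newest first */
static long toy_accounted;		/* pages seen by the handler */

static void toy_account(struct toy_work *w)
{
	toy_accounted += w->num_pages;
}

/* Hand the item to the queue; the queue owns it from here on. */
static void toy_schedule(struct toy_work *w)
{
	w->next = toy_queue;
	toy_queue = w;
}

/* Run and free every queued item; returns how many ran. */
static int toy_run_queue(void)
{
	int n = 0;

	while (toy_queue) {
		struct toy_work *w = toy_queue;

		toy_queue = w->next;
		w->fn(w);
		free(w);
		n++;
	}
	return n;
}

/* Allocate, initialise and schedule; returns 0 or -1 on allocation failure. */
static int toy_release_on_close(long num_pages)
{
	struct toy_work *w = malloc(sizeof(*w));

	if (!w)
		return -1;

	w->fn = toy_account;
	w->num_pages = num_pages;
	w->next = NULL;

	/* Returning before this call would leak w and never run the handler. */
	toy_schedule(w);
	return 0;
}
```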


From Jie.Cai at cs.anu.edu.au  Sun Feb 22 20:46:30 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Mon, 23 Feb 2009 15:46:30 +1100
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: <E3280858FA94444CA49D2BA02341C9833A83B978@orsmsx506.amr.corp.intel.com>
References: <499CBEF2.2010909@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A7C6173@orsmsx506.amr.corp.intel.com>
	<499E25DE.5020703@cs.anu.edu.au>
	<E3280858FA94444CA49D2BA02341C9833A83B978@orsmsx506.amr.corp.intel.com>
Message-ID: <49A22A26.50809@cs.anu.edu.au>


Davis, Arlin R wrote:
>  
>   
>>> Do you have receive's posted at the remote side for immed data?
>>>
>>>       
>> Nope, the remote side didn't got an event, (dat_evd_wait timed out).
>> The way to find out the immed data is to check the out going
>> parameter &event of dat_evd_wait function.
>>     
>
> I don't understand your answer. Do you have a receive buffer pre-posted
> on the EP to receive the inbound immediate data? Just waiting on the
> event is not enough. For immediate data you don't need a buffer associated
> with the work request but you do need the work request posted for each
> inbound rdma_write with immed that is expected.
>   
This does help. I forgot to pre-post receive for the immediate data.
> -arlin
>
>
>   
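
[Editor's illustration] The rule Arlin describes (each inbound RDMA write with immediate data consumes one pre-posted receive work request) can be shown with a toy model. This is only a sketch: it does not use the DAT or verbs APIs, all names (toy_ep, toy_post_recv, toy_inbound_write_imm) are hypothetical, and what a real transport does when no receive is posted (retries, errors on the sender) is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RQ_DEPTH 4

/* Toy endpoint: counts posted receives and records delivered immediates. */
struct toy_ep {
	unsigned posted_recvs;
	uint32_t completions[RQ_DEPTH];
	size_t ncompletions;
};

/* Post one receive work request; no data buffer is needed for immediates. */
static bool toy_post_recv(struct toy_ep *ep)
{
	if (ep->posted_recvs == RQ_DEPTH)
		return false;
	ep->posted_recvs++;
	return true;
}

/*
 * Inbound RDMA write with immediate data. It consumes one posted
 * receive; with none posted, no completion event is generated here.
 */
static bool toy_inbound_write_imm(struct toy_ep *ep, uint32_t imm)
{
	if (ep->posted_recvs == 0 || ep->ncompletions == RQ_DEPTH)
		return false;
	ep->posted_recvs--;
	ep->completions[ep->ncompletions++] = imm;
	return true;
}
```

In this model a waiter would only ever see an event for a write that found a posted receive, which matches the timeout reported earlier in the thread.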


From rdreier at cisco.com  Sun Feb 22 20:40:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Feb 2009 20:40:53 -0800
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <200902221337.01172.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sun, 22 Feb 2009 13:37:00 +0200")
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902220909.11784.jackm@dev.mellanox.co.il>
	<adahc2nvt6b.fsf@cisco.com>
	<200902221337.01172.jackm@dev.mellanox.co.il>
Message-ID: <ada1vtpwysq.fsf@cisco.com>

 > There is still a problem, which we do not see with ConnectX (because
 > of the separation between mlx4_ib and mlx4_core -- and we are
 > unloading only mlx4_ib, leaving all the mlx4_core infrastructure
 > intact).
 > 
 > I tried your test with a Sinai card (mthca), and got the following
 > Kernel Oops (on kernel 2.6.27.4). (Note that ib_mthca is still loaded,
 > but with "(-)" following.)

Oh I see... we leave the sysfs stuff around way too long, since we want
to use it for tracking the lifetime of our class device.  The patch
below fixes things for me here... there's still room for substantial
cleanup but I think this gets the crashes fixed at least:

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7913b80..d1fba41 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -193,7 +193,7 @@ void ib_dealloc_device(struct ib_device *device)
 
 	BUG_ON(device->reg_state != IB_DEV_UNREGISTERED);
 
-	ib_device_unregister_sysfs(device);
+	kobject_put(&device->dev.kobj);
 }
 EXPORT_SYMBOL(ib_dealloc_device);
 
@@ -348,6 +348,8 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+	ib_device_unregister_sysfs(device);
+
 	spin_lock_irqsave(&device->client_data_lock, flags);
 	list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
 		kfree(context);
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b43f7d3..5270aeb 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -848,6 +848,9 @@ void ib_device_unregister_sysfs(struct ib_device *device)
 	struct kobject *p, *t;
 	struct ib_port *port;
 
+	/* Hold kobject until ib_dealloc_device() */
+	kobject_get(&device->dev.kobj);
+
 	list_for_each_entry_safe(p, t, &device->port_list, entry) {
 		list_del(&p->entry);
 		port = container_of(p, struct ib_port, kobj);
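
[Editor's illustration, not part of the patch] The reference-count hand-off in the patch above can be sketched as a toy model. It rests on assumptions: toy_get()/toy_put() are hypothetical stand-ins for kobject_get()/kobject_put(), the registration is assumed to hold exactly one reference, and removing the sysfs entries is assumed to drop that reference. None of this is kernel code.

```c
#include <stdbool.h>

/* Toy refcounted object with a flag for "visible in sysfs". */
struct toy_obj {
	int refcount;
	bool sysfs_visible;
	bool released;
};

static void toy_get(struct toy_obj *o)
{
	o->refcount++;
}

static void toy_put(struct toy_obj *o)
{
	if (--o->refcount == 0)
		o->released = true;	/* release callback would run here */
}

/* Registration creates the sysfs entries and holds one reference. */
static void toy_register(struct toy_obj *o)
{
	o->refcount = 1;
	o->sysfs_visible = true;
	o->released = false;
}

/*
 * Unregister: take an extra reference first, then remove the sysfs
 * entries, which drops the reference owned by the registration.
 */
static void toy_unregister(struct toy_obj *o)
{
	toy_get(o);		/* held until toy_dealloc() */
	o->sysfs_visible = false;
	toy_put(o);
}

/* Dealloc drops the reference taken in toy_unregister(). */
static void toy_dealloc(struct toy_obj *o)
{
	toy_put(o);
}
```

The point of the ordering is that the sysfs entries disappear at unregister time while the object itself survives until dealloc.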


From YJia at tmriusa.com  Sun Feb 22 21:38:00 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Sun, 22 Feb 2009 23:38:00 -0600
Subject: [ofa-general] opensm 3.2.1 lock up problem during initialization
Message-ID: <OFC85D1626.6E837ED6-ON86257565.00154C50-86257566.001EEEEC@TMRIUSA.COM>

Hi Folks,

I ran into a lockup during the OpenSM initialization process. The 
version I am using is 3.2.1. I noticed that there is a patch that fixes a race 
condition in the main OpenSM flow for version 3.2.1: 
http://www.openfabrics.org/git/?p=~sashak/management.git;a=commit;h=adcdb327112c7261077cf4e4076a7499ce36c86f

But the OpenSM I am using is compiled without the HAVE_LIBPTHREAD macro, and 
the patch above is for HAVE_LIBPTHREAD code only. So my questions are:
1. What is the difference between code compiled with HAVE_LIBPTHREAD and 
without HAVE_LIBPTHREAD?
2. Could the race condition occur in an OpenSM that is compiled without 
the HAVE_LIBPTHREAD macro?

Thanks!

Yicheng Jia



From jackm at dev.mellanox.co.il  Sun Feb 22 23:28:34 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 23 Feb 2009 09:28:34 +0200
Subject: [ofa-general] build warnings on rhel4 U6
In-Reply-To: <1233949198.3257.19.camel@pc.interlinx.bc.ca>
References: <1233949198.3257.19.camel@pc.interlinx.bc.ca>
Message-ID: <200902230928.34846.jackm@dev.mellanox.co.il>

On Friday 06 February 2009 21:39, Brian J. Murrell wrote:
> I get these warnings trying to build with RHEL4U6 and ofa_kernel from OFED 1.4:
> 
> include/linux/jbd.h:1204:1: warning: "assert_spin_locked" redefined
> In file included from include/linux/wait.h:25,
>                  from include/linux/fs.h:12,
>                  from /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/fs.h:4,
>                  from /cache/build/BUILD/lustre-1.6.7.50/lustre/lvfs/fsfilt.c:42:
> /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/spinlock.h:8:1: warning: this is the location of the previous definition
> 
> The code in question is (from jbd.h):
> 
> #ifdef __KERNEL__
> 
> #ifdef CONFIG_SMP
> #define assert_spin_locked(lock)	J_ASSERT(spin_is_locked(lock))
> #else
> #define assert_spin_locked(lock)	do {} while(0)
> #endif
> 
> and (from the backport spinlock.h):
> 
> #ifndef BACKPORT_LINUX_SPINLOCK_H
> #define BACKPORT_LINUX_SPINLOCK_H
> 
> #include_next <linux/spinlock.h>
> 
> #define spin_lock_nested(lock, subclass) spin_lock(lock)
> 
> #define assert_spin_locked(lock)  do { (void)(lock); } while(0)
> 
> #endif
> 
> Any thoughts on how to resolve?
> 
> b.
In the backport spinlock.h file, try the following:

#ifndef assert_spin_locked
#define assert_spin_locked(lock)  do { (void)(lock); } while(0)
#endif

- Jack


From dorfman.eli at gmail.com  Sun Feb 22 23:40:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Mon, 23 Feb 2009 09:40:46 +0200
Subject: [ofa-general] ***SPAM*** opensm segmentation using git head
Message-ID: <49A252FE.4010006@gmail.com>

Command Line Arguments:
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.0_c4d9bcf

Entering DISCOVERING state

Using default GUID 0x2c9020022f019
 Loading Cached Option:qos_vlarb_high = 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
*** glibc detected *** ./sbin/opensm: double free or corruption (!prev): 0x000000001bd932b0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x371c871634]
/lib64/libc.so.6(cfree+0x8c)[0x371c874c5c]
./sbin/opensm[0x44e824]
./sbin/opensm(osm_subn_rescan_conf_files+0x20f)[0x4507bb]
./sbin/opensm[0x44d64c]
./sbin/opensm(osm_state_mgr_process+0xbc)[0x44de18]
./sbin/opensm[0x445d23]
./sbin/opensm[0x445e8b]
/tmp/mgmtbin/lib/libosmcomp.so.2[0x2b469b4e4472]
/lib64/libpthread.so.0[0x371d4062f7]
/lib64/libc.so.6(clone+0x6d)[0x371c8d1b6d]
======= Memory map: ========
00400000-004b5000 r-xp 00000000 08:01 655607                             /tmp/mgmtbin/sbin/opensm
006b4000-006b5000 rw-p 000b4000 08:01 655607                             /tmp/mgmtbin/sbin/opensm
006b5000-006ba000 rw-p 006b5000 00:00 0
1bd93000-1bdb4000 rw-p 1bd93000 00:00 0
40497000-40498000 ---p 40497000 00:00 0
40498000-40e98000 rw-p 40498000 00:00 0
4167a000-4167b000 ---p 4167a000 00:00 0
4167b000-4207b000 rw-p 4167b000 00:00 0
4207b000-4207c000 ---p 4207b000 00:00 0
4207c000-42a7c000 rw-p 4207c000 00:00 0
42a7c000-42a7d000 ---p 42a7c000 00:00 0
42a7d000-4347d000 rw-p 42a7d000 00:00 0
4347d000-4347e000 ---p 4347d000 00:00 0
4347e000-43e7e000 rw-p 4347e000 00:00 0
43e7e000-43e7f000 ---p 43e7e000 00:00 0
43e7f000-4487f000 rw-p 43e7f000 00:00 0
4487f000-44880000 ---p 4487f000 00:00 0
44880000-45280000 rw-p 44880000 00:00 0
45280000-45281000 ---p 45280000 00:00 0
45281000-45c81000 rw-p 45281000 00:00 0
45c81000-45c82000 ---p 45c81000 00:00 0
45c82000-46682000 rw-p 45c82000 00:00 0
46682000-46683000 ---p 46682000 00:00 0
46683000-47083000 rw-p 46683000 00:00 0
47083000-47084000 ---p 47083000 00:00 0
47084000-47a84000 rw-p 47084000 00:00 0
47a84000-47a85000 ---p 47a84000 00:00 0
47a85000-48485000 rw-p 47a85000 00:00 0
371c400000-371c41a000 r-xp 00000000 08:01 1769759                        /lib64/ld-2.5.so
371c61a000-371c61b000 r--p 0001a000 08:01 1769759                        /lib64/ld-2.5.so
371c61b000-371c61c000 rw-p 0001b000 08:01 1769759                        /lib64/ld-2.5.so
371c800000-371c94a000 r-xp 00000000 08:01 1769760                        /lib64/libc-2.5.so
371c94a000-371cb49000 ---p 0014a000 08:01 1769760                        /lib64/libc-2.5.so
371cb49000-371cb4d000 r--p 00149000 08:01 1769760                        /lib64/libc-2.5.so
371cb4d000-371cb4e000 rw-p 0014d000 08:01 1769760                        /lib64/libc-2.5.so
371cb4e000-371cb53000 rw-p 371cb4e000 00:00 0
371d000000-371d002000 r-xp 00000000 08:01 1769665                        /lib64/libdl-2.5.so
371d002000-371d202000 ---p 00002000 08:01 1769665                        /lib64/libdl-2.5.so
371d202000-371d203000 r--p 00002000 08:01 1769665                        /lib64/libdl-2.5.so
371d203000-371d204000 rw-p 00003000 08:01 1769665                        /lib64/libdl-2.5.so
371d400000-371d415000 r-xp 00000000 08:01 1769762                        /lib64/libpthread-2.5.so
371d415000-371d614000 ---p 00015000 08:01 1769762                        /lib64/libpthread-2.5.so
371d614000-371d615000 r--p 00014000 08:01 1769762                        /lib64/libpthread-2.5.so
371d615000-371d616000 rw-p 00015000 08:01 1769762                        /lib64/libpthread-2.5.so
371d616000-371d61a000 rw-p 371d616000 00:00 0
371ec00000-371ec0d000 r-xp 00000000 08:01 1769765                        /lib64/libgcc_s-4.1.2-20080102.so.1
371ec0d000-371ee0d000 ---p 0000d000 08:01 1769765                        /lib64/libgcc_s-4.1.2-20080102.so.1
371ee0d000-371ee0e000 rw-p 0000d000 08:01 1769765                        /lib64/libgcc_s-4.1.2-20080102.so.1
2aaaaaaab000-2aaaaaaad000 rw-p 2aaaaaaab000 00:00 0
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
2b469b2cf000-2b469b2d1000 rw-p 2b469b2cf000 00:00 0
2b469b2d1000-2b469b2d9000 r-xp 00000000 08:01 630370                     /tmp/mgmtbin/lib/libosmvendor.so.2.0.0
2b469b2d9000-2b469b4d9000 ---p 00008000 08:01 630370                     /tmp/mgmtbin/lib/libosmvendor.so.2.0.0
2b469b4d9000-2b469b4da000 rw-p 00008000 08:01 630370                     /tmp/mgmtbin/lib/libosmvendor.so.2.0.0
2b469b4da000-2b469b4ea000 r-xp 00000000 08:01 630322                     /tmp/mgmtbin/lib/libosmcomp.so.2.0.4
2b469b4ea000-2b469b6ea000 ---p 00010000 08:01 630322                     /tmp/mgmtbin/lib/libosmcomp.so.2.0.4
2b469b6ea000-2b469b6eb000 rw-p 00010000 08:01 630322                     /tmp/mgmtbin/lib/libosmcomp.so.2.0.4
2b469b6eb000-2b469b6fb000 r-xp 00000000 08:01 630374                     /tmp/mgmtbin/lib/libopensm.so.2.1.3
2b469b6fb000-2b469b8fa000 ---p 00010000 08:01 630374                     /tmp/mgmtbin/lib/libopensm.so.2.1.3
2b469b8fa000-2b469b8fc000 rw-p 0000f000 08:01 630374                     /tmp/mgmtbin/lib/libopensm.so.2.1.3
2b469b8fc000-2b469b8fd000 rw-p 2b469b8fc000 00:00 0
2b469b8fd000-2b469b903000 r-xp 00000000 08:01 630037                     /tmp/mgmtbin/lib/libibumad.so.1.0.3
2b469b903000-2b469bb03000 ---p 00006000 08:01 630037                     /tmp/mgmtbin/lib/libibumad.so.1.0.3
2b469bb03000-2b469bb04000 rw-p 00006000 08:01 630037                     /tmp/mgmtbin/lib/libibumad.so.1.0.3
2b469bb04000-2b469bb05000 rw-p 2b469bb04000 00:00 0
2b469bb18000-2b469bb1a000 rw-p 2b469bb18000 00:00 0
7fff0f7b5000-7fff0f7db000 rw-p 7fff0f7b5000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
Aborted


From vlad at lists.openfabrics.org  Mon Feb 23 03:25:49 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 23 Feb 2009 03:25:49 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090223-0200 daily build status
Message-ID: <20090223112549.DBCF9E60C4D@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From jackm at dev.mellanox.co.il  Mon Feb 23 03:30:29 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 23 Feb 2009 13:30:29 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <ada1vtpwysq.fsf@cisco.com>
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902221337.01172.jackm@dev.mellanox.co.il>
	<ada1vtpwysq.fsf@cisco.com>
Message-ID: <200902231330.29669.jackm@dev.mellanox.co.il>

On Monday 23 February 2009 06:40, Roland Dreier wrote:
> Oh I see... we leave the sysfs stuff around way too long, since we want
> to use it for tracking the lifetime of our class device.  the patch
> below fixes things for me here... there's still room for substantial
> cleanup but I think this gets the crashes fixed at least:
> 
I'm not sure that it does.  This does not make sysfs access atomic wrt module unloading.
I think an app can still lose its timeslice while inside the sysfs access, and module
unload can still occur while the app is waiting for a new time slice (although the code pages
will not be removed as yet -- see below).

While the module code pages will still be available, what prevents module cleanup from
deleting all the module's resources?  In this case, the app will succeed in invoking
the low-level driver (its code is still loaded), but may cause an Oops when that low-level
driver code attempts to access low-level driver data structures (which have been freed).

What about the patch I just submitted?
        http://lists.openfabrics.org/pipermail/general/2009-February/057565.html

([ofa-general] [PATCH] ib_core: avoid race condition between sysfs access and low-level module unload)

- Jack
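
[Editor's illustration] The approach in the referenced patch (take a per-device mutex and check that the device is still alive before calling into the low-level driver) can be sketched in userspace C. This is a sketch under assumptions: pthread mutexes stand in for the kernel mutex, and all names (toy_dev, toy_dev_show, toy_dev_teardown) are hypothetical. It is not the ib_core code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy device: a lock, a liveness flag and one readable attribute. */
struct toy_dev {
	pthread_mutex_t lock;
	bool alive;
	int value;
};

static void toy_dev_init(struct toy_dev *d, int value)
{
	pthread_mutex_init(&d->lock, NULL);
	d->alive = true;
	d->value = value;
}

/* Attribute read: only touches driver state while holding the lock. */
static int toy_dev_show(struct toy_dev *d, int *out)
{
	int ret = -ENODEV;

	pthread_mutex_lock(&d->lock);
	if (d->alive) {
		*out = d->value;
		ret = 0;
	}
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/*
 * Teardown: clear the flag under the same lock. A reader that is
 * mid-access finishes first; later readers get -ENODEV.
 */
static void toy_dev_teardown(struct toy_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->alive = false;
	pthread_mutex_unlock(&d->lock);
	/* driver resources may be freed only after this point */
}
```

Because the reader holds the lock for the whole access, teardown cannot free driver state underneath a reader that has been preempted in the middle of it.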


From eli at dev.mellanox.co.il  Mon Feb 23 05:20:08 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 23 Feb 2009 15:20:08 +0200
Subject: [ofa-general] Too many calls to mlx4_CLOSE_PORT()?
Message-ID: <20090223132008.GA1188@mtls03>

Roland,

browsing the code, I see that mlx4_CLOSE_PORT() gets called from,
seemingly, too many places. I would expect it to get called only from
__mlx4_ib_modify_qp() when QP0 gets closed, but mlx4_ib_remove() calls
it too even though it is soon to be called by __mlx4_ib_modify_qp()
due to destroying the MAD QP. It also gets called from
mlx4_remove_one() even though by the time this function gets called,
the port is already closed. Is there a reason for that?


From swise at opengridcomputing.com  Mon Feb 23 07:36:49 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 23 Feb 2009 09:36:49 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for
	active connections.
In-Reply-To: <adafxi9x89u.fsf@cisco.com>
References: <20090217215959.16117.17150.stgit@NTAC> <adafxi9x89u.fsf@cisco.com>
Message-ID: <49A2C291.20706@opengridcomputing.com>

Roland Dreier wrote:
>  > -	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
>  > +	return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
>
> minor but the parens around the function call are totally unnecessary.
> If we're touching the line anyway may as well leave them off.
>
>   

Sure.

>  > +static int iwch_post_qp_fatal(int id, void *p, void *data)
>  > +{
>  > +	struct ib_event event;
>  > +	struct iwch_qp *qhp = p;
>  > +
>  > +	event.event = IB_EVENT_DEVICE_FATAL;
>  > +	event.device = qhp->ibqp.device;
>  > +	event.element.qp = &qhp->ibqp;
>  > +	BUG_ON(qhp->rhp != data);
>  > +	BUG_ON(qhp->wq.qpid != id);
>  > +	if (qhp->ibqp.event_handler) {
>  > +		PDBG("%s posting DEVICE_FATAL for qpid %u\n",
>  > +			__func__, qhp->wq.qpid);
>  > +		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
>
> This doesn't match the IB driver behavior (or the IB spec) -- the
> DEVICE_FATAL event is unaffiliated and delivered for the adapter as a
> whole.  QP events are supposed to be for events connected to a single
> QP, not the whole adapter failing.
>
>   


I'll change this to QP_FATAL then.


> BTW I don't think you need the * here, do you?  Would be easier to read
> to just call it like
>
> 	qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context)
>
>   


Ok.


>  > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
>  > +{
>  > +	int	error=0;
>  > +	struct cxio_rdev *rdev;
>  > +
>  > +	rdev = (struct cxio_rdev *)tdev->ulp;
>  > +	if (rdev->flags) {
>
> Might be nice to wrap this rdev->flags test up in a trivial inline
> function (eg iwch_eeh_set() or something like that) in case other things
> get put into those flags later.
>   


Agreed.


>  > +		kfree_skb(skb);
>  > +		return -EIO;
>  > +	}
>  > +	error = l2t_send(tdev, skb, l2e);
>  > +	if (error)
>  > +		kfree_skb(skb);
>  > +	return error;
>  > +}
>
> The kfree_skb() calls here change behavior -- eg you have the change:
>
>  > -	l2t_send(ep->com.tdev, skb, ep->l2t);
>  > -	return 0;
>  > +	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
>
> and now if l2t_send() returns an error the skb is freed, where before it
> wasn't.
>   

In looking at the l2t_send code, it doesn't free on failure, so I 
believe this was a memory leak in the existing error path.

> Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it
> not make more sense for the cxgb3 l2t_send() to check the eeh state and
> always behave appropriately?  Or is it more complicated than that?
>
>   

Maybe.

Divy, what do you think?


Steve.


>  - R.
>   
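
[Editor's illustration] Two points from this review can be sketched together in userspace C: wrapping the flags test in a small inline helper, and a send wrapper that frees the buffer whenever the lower-level send does not take it. This is a toy model under assumptions: all names (toy_rdev, toy_buf, toy_eeh_set, toy_lower_send, toy_send) are hypothetical, the lower-level send is assumed to leave the buffer with the caller on failure, and the lower_fails parameter and toy_bufs_freed counter exist only so the behaviour can be observed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_FLAG_EEH 0x1u	/* hypothetical "device in error recovery" bit */

struct toy_rdev {
	unsigned flags;
};

struct toy_buf {
	int len;
};

static int toy_bufs_freed;	/* counts every buffer that has been released */

static void toy_buf_free(struct toy_buf *b)
{
	free(b);
	toy_bufs_freed++;
}

/* Helper that hides which bit is tested, in case more flags are added. */
static inline bool toy_eeh_set(const struct toy_rdev *r)
{
	return (r->flags & TOY_FLAG_EEH) != 0;
}

/*
 * Lower-level send. On success it consumes the buffer; on failure it
 * leaves the buffer with the caller (the assumption stated above).
 */
static int toy_lower_send(struct toy_buf *b, bool fail)
{
	if (fail)
		return -ENOMEM;
	toy_buf_free(b);
	return 0;
}

/* Wrapper: after it returns, the caller never owns the buffer. */
static int toy_send(struct toy_rdev *r, struct toy_buf *b, bool lower_fails)
{
	int err;

	if (toy_eeh_set(r)) {
		toy_buf_free(b);
		return -EIO;
	}
	err = toy_lower_send(b, lower_fails);
	if (err)
		toy_buf_free(b);
	return err;
}
```

With this contract, callers can return the wrapper's result directly without tracking whether the buffer still needs freeing.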


From stijn.deweirdt at ugent.be  Mon Feb 23 06:40:04 2009
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Mon, 23 Feb 2009 15:40:04 +0100
Subject: [ofa-general] el5.3 backport of 1.4(.0)
Message-ID: <1235400004.4588.43.camel@spike.ugent.be>

hi all,

i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
one thing we would also like to look at is switching from OFED 1.3.2 to
OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
fixes only exist in the current 1.4.1 daily snapshots.
did anyone already try to backport the el5.3 backport fixes from 1.4.1
to 1.4.0?

many thanks,

stijn


From tziporet at mellanox.co.il  Mon Feb 23 08:10:38 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 23 Feb 2009 18:10:38 +0200
Subject: [ofa-general] OFED (EWG) meeting agenda for today  (Feb 23)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com>


Hi All,
Due to an unexpected matter I cannot attend the meeting today :-(
I sent a mail to Gopal asking him to replace me but have had no response yet.
If he can't, maybe Woody or Betsy can.
 
In any case - these are the items that should be covered:
 
a. OFED 1.4.1 release:
	1. SLES 11 - backport progress - Jeff Becker 
	2. Open MPI 1.3.1 - Jeff Squyres 
	3. RDS with iWARP support - Steve Wise
	4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise
	5. Critical bugs:
	1287	maj	RHEL	jackm at mellanox.co.il	IPoIB datagram mode initial packet loss
	1516	cri	RHEL	andy.grover at oracle.com	Kernel panic on RHAS4.x loading RDS

Note: There is 1.4.1 release number in bugzilla - please change bug
release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1

b. Open discussion

Tziporet


From john.russo at qlogic.com  Mon Feb 23 08:11:13 2009
From: john.russo at qlogic.com (John Russo)
Date: Mon, 23 Feb 2009 10:11:13 -0600
Subject: [ofa-general] RE: OFED (EWG) meeting agenda for today  (Feb 23)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com>
	<5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com>
Message-ID: <A331668DC876334996266B5A7756A013134E355880@MNEXMB2.qlogic.org>

Betsy can't make it today.  I will be covering for her.  Worst case, I will cover the items that you listed.

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren
Sent: Monday, February 23, 2009 11:11 AM
To: ewg at lists.openfabrics.org
Cc: general at lists.openfabrics.org
Subject: [ewg] OFED (EWG) meeting agenda for today (Feb 23)


Hi All,
Due to an unexpected matter I cannot attend the meeting today :-(
I sent a mail to Gopal asking him to replace me but have had no response yet.
If he can't, maybe Woody or Betsy can.
 
In any case - these are the items that should be covered:
 
a. OFED 1.4.1 release:
	1. SLES 11 - backport progress - Jeff Becker 
	2. Open MPI 1.3.1 - Jeff Squyres 
	3. RDS with iWARP support - Steve Wise
	4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise
	5. Critical bugs:
	1287    	maj  	RHEL  	jackm at mellanox.co.il  	 IPoIB
datagram mode initial packet loss
	1516 	cri 	RHEL 	andy.grover at oracle.com Kernel panic on
RHAS4.x loading RDS 

Note: There is 1.4.1 release number in bugzilla - please change bug
release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1

b. Open discussion

Tziporet



From tziporet at dev.mellanox.co.il  Mon Feb 23 08:16:38 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 23 Feb 2009 18:16:38 +0200
Subject: [ofa-general] Re: [ewg] RE: OFED (EWG) meeting agenda for today (Feb
	23)
In-Reply-To: <A331668DC876334996266B5A7756A013134E355880@MNEXMB2.qlogic.org>
References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com>	<5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com>
	<A331668DC876334996266B5A7756A013134E355880@MNEXMB2.qlogic.org>
Message-ID: <49A2CBE6.7050903@mellanox.co.il>

John Russo wrote:
> Betsy can't make it today.  I will be covering for her.  Worst case, I will cover the items that you listed.
>
>
>   
Many thanks

Tziporet


From tom at opengridcomputing.com  Mon Feb 23 08:30:37 2009
From: tom at opengridcomputing.com (Tom Tucker)
Date: Mon, 23 Feb 2009 10:30:37 -0600
Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status,
In-Reply-To: <499FCA5F.5070604@mellanox.com>
References: <499FCA5F.5070604@mellanox.com>
Message-ID: <49A2CF2D.6020002@opengridcomputing.com>

Vu:

What memory registration model are you using?

Vu Pham wrote:
> Hi Tom,
> 
> I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
> nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and ConnectX 
> (mlx4_ib) HCAs
> I have seen several problems during my testing at NFS Connectathon 2009
> 
> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
> client can not mount. Talking to Tom Talpey and scanning the code, I saw 
> that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
> provider does not have the implementation for this verb.
> If I have client on mlx4_ib and server on ib_mthca, I hit the following 
> crash because of bad error handling in xprtrdma (see file attached - 
> mlx4_mount_problem.log)
> 
> Because of this problem, I use InfiniHost III (ib_mthca) for all of my 
> tests at Connectathon
> 
> 2. Testing the Linux nfsrdma client against both Linux and OpenSolaris 
> nfsrdma servers, I hit a hung-process problem during the 
> Connectathon lock test (see the attached files sync_page_1.log and 
> sync_page_2.log). I can only reproduce it when I run Connectathon for more 
> than 500 iterations (-N 1000).
> I can NOT reproduce the problem with nfs client/server over IPoIB
> 
> 3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
> hit the following BUG_ON() right away(see file attached - svcrdma_send.log)
> 
> thanks,
> -vu
> 


From sashak at voltaire.com  Mon Feb 23 09:03:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Feb 2009 19:03:42 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet: fix crash in qos string
	config parameters reloading
In-Reply-To: <49A252FE.4010006@gmail.com>
References: <49A252FE.4010006@gmail.com>
Message-ID: <20090223170342.GE7641@sashak.voltaire.com>


This fixes a double free() crash when reloading qos string config
parameters. Assuming that qos parameters can be specified only in the config
file, we always keep them in sync with the copy of the options loaded from
the file.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---

On 09:40 Mon 23 Feb     , Eli Dorfman (Voltaire) wrote:
> Command Line Arguments:
>  Log File: /var/log/opensm.log
> -------------------------------------------------
> OpenSM 3.3.0_c4d9bcf

[snip...]

> Using default GUID 0x2c9020022f019
>  Loading Cached Option:qos_vlarb_high = 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> *** glibc detected *** ./sbin/opensm: double free or corruption (!prev): 0x000000001bd932b0 ***

This happens because the qos string parameter is freed separately in
subn_init_qos_options() while its mirror pointer in the file config copy
still refers to the already freed memory. Thanks for finding this. The patch
should fix the issue.

Sasha

 opensm/opensm/osm_subnet.c |   29 ++++++++++++++++++-----------
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 01478be..b3100a4 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -640,7 +640,7 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
 	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
 }
 
-static void subn_init_qos_options(IN osm_qos_options_t * opt)
+static void subn_init_qos_options(osm_qos_options_t *opt, osm_qos_options_t *f)
 {
 	opt->max_vls = 0;
 	opt->high_limit = -1;
@@ -653,6 +653,8 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt)
 	if (opt->sl2vl)
 		free(opt->sl2vl);
 	opt->sl2vl = NULL;
+	if (f)
+		memcpy(f, opt, sizeof(*f));
 }
 
 /**********************************************************************
@@ -743,11 +745,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->no_clients_rereg = FALSE;
 	p_opt->prefix_routes_file = strdup(OSM_DEFAULT_PREFIX_ROUTES_FILE);
 	p_opt->consolidate_ipv6_snm_req = FALSE;
-	subn_init_qos_options(&p_opt->qos_options);
-	subn_init_qos_options(&p_opt->qos_ca_options);
-	subn_init_qos_options(&p_opt->qos_sw0_options);
-	subn_init_qos_options(&p_opt->qos_swe_options);
-	subn_init_qos_options(&p_opt->qos_rtr_options);
+	subn_init_qos_options(&p_opt->qos_options, NULL);
+	subn_init_qos_options(&p_opt->qos_ca_options, NULL);
+	subn_init_qos_options(&p_opt->qos_sw0_options, NULL);
+	subn_init_qos_options(&p_opt->qos_swe_options, NULL);
+	subn_init_qos_options(&p_opt->qos_rtr_options, NULL);
 }
 
 /**********************************************************************
@@ -1192,11 +1194,16 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 		return -1;
 	}
 
-	subn_init_qos_options(&p_opts->qos_options);
-	subn_init_qos_options(&p_opts->qos_ca_options);
-	subn_init_qos_options(&p_opts->qos_sw0_options);
-	subn_init_qos_options(&p_opts->qos_swe_options);
-	subn_init_qos_options(&p_opts->qos_rtr_options);
+	subn_init_qos_options(&p_opts->qos_options,
+			      &p_opts->file_opts->qos_options);
+	subn_init_qos_options(&p_opts->qos_ca_options,
+			      &p_opts->file_opts->qos_ca_options);
+	subn_init_qos_options(&p_opts->qos_sw0_options,
+			      &p_opts->file_opts->qos_sw0_options);
+	subn_init_qos_options(&p_opts->qos_swe_options,
+			      &p_opts->file_opts->qos_swe_options);
+	subn_init_qos_options(&p_opts->qos_rtr_options,
+			      &p_opts->file_opts->qos_rtr_options);
 
 	while (fgets(line, 1023, opts_file) != NULL) {
 		/* get the first token */
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Mon Feb 23 09:21:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Feb 2009 19:21:47 +0200
Subject: [ofa-general] [PATCH] opensm/main.c: remove enable_stack_dump() call
Message-ID: <20090223172147.GH7641@sashak.voltaire.com>


The enable_stack_dump() symbol was defined in the already removed
libibcommon. There is still a conditional (under #ifdef _DEBUG_) call to
this function in opensm/main.c, which breaks opensm linkage when
configured with --enable-debug. Remove it.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/main.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index e22c2c4..47fd658 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -596,9 +596,6 @@ int main(int argc, char *argv[])
 			osm_is_debug(), cl_is_debug());
 		exit(1);
 	}
-#if defined (_DEBUG_) && defined (OSM_VENDOR_INTF_OPENIB)
-	enable_stack_dump(1);
-#endif
 
 	printf("-------------------------------------------------\n");
 	printf("%s\n", OSM_VERSION);
-- 
1.6.1.2.319.gbd9e


From vuhuong at mellanox.com  Mon Feb 23 10:03:24 2009
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 23 Feb 2009 10:03:24 -0800
Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status,
In-Reply-To: <49A2CF2D.6020002@opengridcomputing.com>
References: <499FCA5F.5070604@mellanox.com>
	<49A2CF2D.6020002@opengridcomputing.com>
Message-ID: <49A2E4EC.7010202@mellanox.com>

Tom,

> Vu:
>
> What memory registration model are you using?

It is 6 (when the connection/mount established)


>
> Vu Pham wrote:
>> Hi Tom,
>>
>> I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
>> nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and 
>> ConnectX (mlx4_ib) HCAs
>> I have seen several problems during my testing at NFS Connectathon 2009
>>
>> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
>> client can not mount. Talking to Tom Talpey and scanning the code, I 
>> saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
>> provider does not have the implementation for this verb.
>> If I have client on mlx4_ib and server on ib_mthca, I hit the 
>> following crash because of bad error handling in xprtrdma (see file 
>> attached - mlx4_mount_problem.log)
>>
>> Because of this problem, I use InfiniHost III (ib_mthca) for all of 
>> my tests at Connectathon
>>
>> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
>> nfsrdma servers, I hit the process hung problem during the 
>> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
>> attached files). I can only reproduce it when I ran connectathon more 
>> than 500 iterations (-N 1000)
>> I can NOT reproduce the problem with nfs client/server over IPoIB
>>
>> 3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
>> hit the following BUG_ON() right away(see file attached - 
>> svcrdma_send.log)
>>
>> thanks,
>> -vu
>>
>


From tmtalpey at rcn.com  Mon Feb 23 10:10:33 2009
From: tmtalpey at rcn.com (Tom Talpey)
Date: Mon, 23 Feb 2009 13:10:33 -0500
Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status,
In-Reply-To: <49A2E4EC.7010202@mellanox.com>
References: <499FCA5F.5070604@mellanox.com>
	<49A2CF2D.6020002@opengridcomputing.com>
	<49A2E4EC.7010202@mellanox.com>
Message-ID: <20090223181737.90686E61019@openfabrics.org>

At 01:03 PM 2/23/2009, Vu Pham wrote:
>Tom,
>
>> Vu:
>>
>> What memory registration model are you using?
>
>It is 6 (when the connection/mount established)

i.e. all physical (get_dma_mr). Long chunk lists due to discontiguous
physical pages.

We'll try with ConnectX and frmr's later today here at Connectathon.
This will reduce the chunk lists to roughly three entries (head, pages,
tail).

With the two assertions disabled, we're again passing all general and
special tests from the OpenSolaris client, btw. :-)

Tom.

>
>
>>
>> Vu Pham wrote:
>>> Hi Tom,
>>>
>>> I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
>>> nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and 
>>> ConnectX (mlx4_ib) HCAs
>>> I have seen several problems during my testing at NFS Connectathon 2009
>>>
>>> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
>>> client can not mount. Talking to Tom Talpey and scanning the code, I 
>>> saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
>>> provider does not have the implementation for this verb.
>>> If I have client on mlx4_ib and server on ib_mthca, I hit the 
>>> following crash because of bad error handling in xprtrdma (see file 
>>> attached - mlx4_mount_problem.log)
>>>
>>> Because of this problem, I use InfiniHost III (ib_mthca) for all of 
>>> my tests at Connectathon
>>>
>>> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
>>> nfsrdma servers, I hit the process hung problem during the 
>>> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
>>> attached files). I can only reproduce it when I ran connectathon more 
>>> than 500 iterations (-N 1000)
>>> I can NOT reproduce the problem with nfs client/server over IPoIB
>>>
>>> 3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
>>> hit the following BUG_ON() right away(see file attached - 
>>> svcrdma_send.log)
>>>
>>> thanks,
>>> -vu
>>>
>>
>
>


From rdreier at cisco.com  Mon Feb 23 10:31:24 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Feb 2009 10:31:24 -0800
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <200902231330.29669.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 23 Feb 2009 13:30:29 +0200")
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902221337.01172.jackm@dev.mellanox.co.il>
	<ada1vtpwysq.fsf@cisco.com>
	<200902231330.29669.jackm@dev.mellanox.co.il>
Message-ID: <adaocwtuhs3.fsf@cisco.com>

 > I'm not sure that it does.  This does not make sysfs access atomic wrt module unloading.
 > I think an app can still lose it's timeslice while inside the sysfs access, and module
 > unload can still occur while the app is waiting for a new time slice (although the code pages
 > will not be removed as yet -- see below).

Not sure I follow... the low-level driver must handle requests until
ib_unregister_device() returns, and with the change I proposed,
ib_unregister_device() will not return until all sysfs files are gone
(and no open file handles remain).

 > What about the patch I just submitted?

I'd rather not add a superfluous mutex that adds complexity when a
simpler solution is available.

 - R.


From brian at sun.com  Mon Feb 23 12:08:07 2009
From: brian at sun.com (Brian J. Murrell)
Date: Mon, 23 Feb 2009 15:08:07 -0500
Subject: [ofa-general] build warnings on rhel4 U6
In-Reply-To: <200902230928.34846.jackm@dev.mellanox.co.il>
References: <1233949198.3257.19.camel@pc.interlinx.bc.ca>
	<200902230928.34846.jackm@dev.mellanox.co.il>
Message-ID: <1235419687.12136.111.camel@pc.interlinx.bc.ca>

On Mon, 2009-02-23 at 09:28 +0200, Jack Morgenstein wrote:
> In the backport spinlock.h file, try the following:
> 
> #ifndef assert_spin_locked
> #define assert_spin_locked(lock)  do { (void)(lock); } while(0)
> #endif

Indeed.  That would be a solution for the end-user but that doesn't help
us as a third-party software developer (i.e. being restricted to
building our software with "GA" releases of OFED -- so that our release
doesn't turn into a patching nightmare for our end-users).
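
For anyone who does carry the workaround locally, a small userspace
sketch of why the guarded no-op form is safe to drop in. The lock type
and demo_update() are made up for illustration; only the macro body is
taken from Jack's suggestion. The (void) cast keeps the argument "used"
so the compiler does not warn, and the do/while(0) wrapper keeps the
macro a single statement inside an unbraced if/else.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's spinlock_t. */
typedef struct { int locked; } demo_spinlock_t;

/* Only define the fallback when the kernel headers did not provide it. */
#ifndef assert_spin_locked
#define assert_spin_locked(lock)  do { (void)(lock); } while (0)
#endif

/* The macro expands to one statement, so the else still binds to the if. */
static int demo_update(demo_spinlock_t *lock, int value)
{
	if (value > 0)
		assert_spin_locked(lock);
	else
		value = -value;
	return value;
}
```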

Indeed, this probably should have been a BZ filing as my goal was
equally as much to alert somebody to the problem to ensure future
releases don't have the same problem.

Cheers and many thanks for the input.

b.


From john.russo at qlogic.com  Mon Feb 23 13:06:32 2009
From: john.russo at qlogic.com (John Russo)
Date: Mon, 23 Feb 2009 15:06:32 -0600
Subject: [ofa-general] ***SPAM*** OFED Minutes:  02/23/09
Message-ID: <A331668DC876334996266B5A7756A013134E3558D9@MNEXMB2.qlogic.org>

These are the OFED (EWG) meeting minutes for Feb 23 on OFED 1.4.1 release

Meeting Summary:
==============

1. Update on 1.4.1.
2. Update on 1.4.1. PRs
3. Update on Sonoma agenda

Details:
======

1. Update on 1.4.1:

   1. SLES 11 - backport progress - Jeff Becker

      Just received access to RC4 source and started to build on Sunday
      Basic IB builds without change.
         MTHCA builds
         Connect-X next, followed by ULPs

   2. Open MPI 1.3.1 - Jeff Squyres

      1.3.1 had not been released yet.  Weekly Open MPI on Tuesdays
         Could release in 1 or 2 days if things go well
         Will send email and upload to Vlad

   3. RDS with iWARP support - Steve Wise

      All of the latest updates pushed.  Will begin testing with Oracle this week.
         CRTEST on 4 node cluster.  Testing normally takes a couple of weeks
         Rupert asked about updating test plans for April event
            Steve will try to supply some info.
            Some tests are in OFED release
            May have to go to Oracle directly for other tests

   4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise

      2.6.25 & 2.6.22 backports pass basic tests.  Will try to push changes out this week
      RedHat 5.2: Most tests passing.  Will push after .25 and .26
      RedHat 5.3: In queue behind RedHat 5.2
      Rupert asked for tests on these changes also


2. Update on 1.4.1 PRs

   1287   maj   RHEL   jackm at mellanox.co.il     IPoIB datagram mode initial packet loss
          No one on the call to respond to this issue

   1516   cri   RHEL   andy.grover at oracle.com  Kernel panic on RHAS4.x loading RDS
          No one on the call to address this either.  Was told that Andy will be pinged and asked to respond

Numerous PRs are still listed in Bugzilla as Blocking or Critical.  John asked all participants to look at the PRs assigned to them and adjust their status as appropriate.

3. Sonoma updates from Bill Boas:
   Still struggling to get attendees and speakers
   Hope to extend early bird discounts into early May

   A side conversation started at this point, which diverted into general issues and a wishlist for OFED as well as other topics to be discussed at Sonoma.
   I will not capture the details of those discussions here.

   Rupert reminded everyone of the UNH-IOL testing of OFED 1.4.1 the week of March 16-20 and pushed us to have as many patches in place at that time as possible.


John Russo

__________________________
John F. Russo
Manager, Engineering
QLogic Corporation
780 Fifth Avenue, Suite 140
King of Prussia, PA 19406
Direct: 610-233-4866
Main: 610-233-4800
Fax: 610-233-4777
Cell: 610-246-9903
Email: John.Russo at qlogic.com
www.qlogic.com

True success is the undeniable truth that we have proved ourselves.
-Joe Luppino-Esposito

From vuhuong at mellanox.com  Mon Feb 23 15:21:04 2009
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 23 Feb 2009 15:21:04 -0800
Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status,
In-Reply-To: <49A2E4EC.7010202@mellanox.com>
References: <499FCA5F.5070604@mellanox.com>	<49A2CF2D.6020002@opengridcomputing.com>
	<49A2E4EC.7010202@mellanox.com>
Message-ID: <49A32F60.2010803@mellanox.com>

Tom,

>> What memory registration model are you using?
>
> It is 6 (when the connection/mount established)
>
>
>>
>> Vu Pham wrote:
>>>
>>>
>>> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
>>> nfsrdma servers, I hit the process hung problem during the 
>>> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
>>> attached files). I can only reproduce it when I ran connectathon 
>>> more than 500 iterations (-N 1000)
>>> I can NOT reproduce the problem with nfs client/server over IPoIB
With mem_reg=4, I cannot reproduce this problem (running against both 
OpenSolaris and Linux servers).


>>>
>>> 3. Testing openSolaris nfsrdma client against linux nfsrdma server, 
>>> I hit the following BUG_ON() right away(see file attached - 
>>> svcrdma_send.log)
>>>
After disabling the two BUG_ON()s, we can run the test multiple times 
without any problem so far.

-vu


From swise at opengridcomputing.com  Mon Feb 23 15:54:45 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 23 Feb 2009 17:54:45 -0600
Subject: [ofa-general] [PATCH v2] RDMA/cxgb3: Handle EEH events for active
	connections.
Message-ID: <20090223235445.21618.85001.stgit@build.ogc.int>

- wrap calls into cxgb3 and fail them if we're in the middle
  of an EEH event.

- correctly unwind and release endpoint and other resources when
  we are in an EEH event.

- post QP_FATAL event on all active QPs when cxgb3 notifies
  iw_cxgb3 of a fatal error.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
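
A userspace sketch of the wrapper pattern described above, for readers
who want the shape without the cxgb3 details. All names here (demo_dev,
demo_buf, demo_send, demo_lower_send) are hypothetical, and free() stands
in for kfree_skb(). The buffer-ownership rule of the stand-in lower layer
(consume on success, leave to the caller on failure) is an assumption of
the sketch, not a statement about cxgb3_ofld_send() or l2t_send().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_ERROR_FATAL 1	/* plays the role of CXIO_ERROR_FATAL */

/* Hypothetical stand-ins for the device and buffer types. */
struct demo_dev {
	unsigned flags;
	int sent;		/* buffers accepted by the lower layer */
};

struct demo_buf {
	int len;
};

static int demo_fatal_error(const struct demo_dev *dev)
{
	return (dev->flags & DEMO_ERROR_FATAL) != 0;
}

/*
 * Stand-in for the lower-layer send.  Assumed to consume the buffer on
 * success and to leave ownership with the caller on failure.
 */
static int demo_lower_send(struct demo_dev *dev, struct demo_buf *buf)
{
	if (buf->len <= 0)
		return -EINVAL;
	dev->sent++;
	free(buf);
	return 0;
}

/*
 * Wrapper: do not touch the device after a fatal error, and release the
 * buffer on every failure path so callers can simply return the result.
 */
static int demo_send(struct demo_dev *dev, struct demo_buf *buf)
{
	int error;

	assert(dev != NULL && buf != NULL);
	if (demo_fatal_error(dev)) {
		free(buf);
		return -EIO;
	}
	error = demo_lower_send(dev, buf);
	if (error)
		free(buf);
	return error;
}
```

Because the wrapper owns cleanup, callers such as the send_*() helpers
can just propagate its return value instead of unconditionally
returning 0.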

 drivers/infiniband/hw/cxgb3/cxio_hal.c |   10 ++--
 drivers/infiniband/hw/cxgb3/cxio_hal.h |    6 ++
 drivers/infiniband/hw/cxgb3/iwch.c     |   26 +++++++++
 drivers/infiniband/hw/cxgb3/iwch.h     |    5 ++
 drivers/infiniband/hw/cxgb3/iwch_cm.c  |   90 +++++++++++++++++++++++---------
 drivers/infiniband/hw/cxgb3/iwch_qp.c  |    4 +
 6 files changed, 107 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index eeae5f5..1db88dd 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -152,7 +152,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
 	sge_cmd = qpid << 8 | 3;
 	wqe->sge_cmd = cpu_to_be64(sge_cmd);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 }
 
 int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
@@ -571,7 +571,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
 	     (unsigned long long) rdev_p->ctrl_qp.dma_addr,
 	     rdev_p->ctrl_qp.workq, 1 << T3_CTRL_QP_SIZE_LOG2);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 err:
 	kfree_skb(skb);
 	return err;
@@ -701,7 +701,7 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
 	u32 stag_idx;
 	u32 wptr;
 
-	if (rdev_p->flags)
+	if (cxio_fatal_error(rdev_p))
 		return -EIO;
 
 	stag_state = stag_state > 0;
@@ -858,7 +858,7 @@ int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
 	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
 	wqe->irs = cpu_to_be32(attr->irs);
 	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 }
 
 void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
@@ -1024,9 +1024,9 @@ void cxio_rdev_close(struct cxio_rdev *rdev_p)
 		cxio_hal_pblpool_destroy(rdev_p);
 		cxio_hal_rqtpool_destroy(rdev_p);
 		list_del(&rdev_p->entry);
-		rdev_p->t3cdev_p->ulp = NULL;
 		cxio_hal_destroy_ctrl_qp(rdev_p);
 		cxio_hal_destroy_resource(rdev_p->rscp);
+		rdev_p->t3cdev_p->ulp = NULL;
 	}
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h
index 9ed65b0..2fd5d03 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h
@@ -112,6 +112,11 @@ struct cxio_rdev {
 #define	CXIO_ERROR_FATAL	1
 };
 
+static inline int cxio_fatal_error(struct cxio_rdev *rdev_p)
+{
+	return (rdev_p->flags & CXIO_ERROR_FATAL);
+}
+
 static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
 {
 	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
@@ -185,6 +190,7 @@ void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
 void cxio_flush_hw_cq(struct t3_cq *cq);
 int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
 		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb);
 
 #define MOD "iw_cxgb3: "
 #define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 37a4fc2..3548861 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -162,15 +162,37 @@ static void close_rnic_dev(struct t3cdev *tdev)
 	mutex_unlock(&dev_mutex);
 }
 
+static int iwch_post_qp_fatal(int id, void *p, void *data)
+{
+	struct ib_event event;
+	struct iwch_qp *qhp = p;
+
+	event.event = IB_EVENT_QP_FATAL;
+	event.device = qhp->ibqp.device;
+	event.element.qp = &qhp->ibqp;
+	BUG_ON(qhp->rhp != data);
+	BUG_ON(qhp->wq.qpid != id);
+	if (qhp->ibqp.event_handler) {
+		PDBG("%s posting QP_FATAL for qpid %u\n",
+			__func__, qhp->wq.qpid);
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+	}
+	return 0;
+}
+
 static void iwch_err_handler(struct t3cdev *tdev, u32 status, u32 error)
 {
 	struct cxio_rdev *rdev = tdev->ulp;
+	struct iwch_dev *rnicp = rdev_to_iwch_dev(rdev);
 
-	if (status == OFFLOAD_STATUS_DOWN)
+	if (status == OFFLOAD_STATUS_DOWN) {
 		rdev->flags = CXIO_ERROR_FATAL;
+		spin_lock_irq(&rnicp->lock);
+		idr_for_each(&rnicp->qpidr, iwch_post_qp_fatal, rnicp);
+		spin_unlock_irq(&rnicp->lock);
+	}
 
 	return;
-
 }
 
 static int __init iwch_init_module(void)
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index 3773453..8473550 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -117,6 +117,11 @@ static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
 	return container_of(ibdev, struct iwch_dev, ibdev);
 }
 
+static inline struct iwch_dev *rdev_to_iwch_dev(struct cxio_rdev *rdev)
+{
+	return container_of(rdev, struct iwch_dev, rdev);
+}
+
 static inline int t3b_device(const struct iwch_dev *rhp)
 {
 	return rhp->rdev.t3cdev_p->type == T3B;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 8699947..ad38c45 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -139,6 +139,38 @@ static void stop_ep_timer(struct iwch_ep *ep)
 	put_ep(&ep->com);
 }
 
+int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
+{
+	int	error=0;
+	struct cxio_rdev *rdev;
+
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (cxio_fatal_error(rdev)) {
+		kfree_skb(skb);
+		return -EIO;
+	}
+	error = l2t_send(tdev, skb, l2e);
+	if (error)
+		kfree_skb(skb);
+	return error;
+}
+
+int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb)
+{
+	int	error=0;
+	struct cxio_rdev *rdev;
+
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (cxio_fatal_error(rdev)) {
+		kfree_skb(skb);
+		return -EIO;
+	}
+	error = cxgb3_ofld_send(tdev, skb);
+	if (error)
+		kfree_skb(skb);
+	return error;
+}
+
 static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
 {
 	struct cpl_tid_release *req;
@@ -150,7 +182,7 @@ static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
 	skb->priority = CPL_PRIORITY_SETUP;
-	cxgb3_ofld_send(tdev, skb);
+	iwch_cxgb3_ofld_send(tdev, skb);
 	return;
 }
 
@@ -172,8 +204,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep)
 	req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE);
 
 	skb->priority = CPL_PRIORITY_DATA;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 int iwch_resume_tid(struct iwch_ep *ep)
@@ -194,8 +225,7 @@ int iwch_resume_tid(struct iwch_ep *ep)
 	req->val = 0;
 
 	skb->priority = CPL_PRIORITY_DATA;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static void set_emss(struct iwch_ep *ep, u16 opt)
@@ -382,7 +412,7 @@ static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
 
 	PDBG("%s t3cdev %p\n", __func__, dev);
 	req->cmd = CPL_ABORT_NO_RST;
-	cxgb3_ofld_send(dev, skb);
+	iwch_cxgb3_ofld_send(dev, skb);
 }
 
 static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
@@ -402,8 +432,7 @@ static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
 	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
@@ -424,8 +453,7 @@ static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
 	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
 	req->cmd = CPL_ABORT_SEND_RST;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_connect(struct iwch_ep *ep)
@@ -469,8 +497,7 @@ static int send_connect(struct iwch_ep *ep)
 	req->opt0l = htonl(opt0l);
 	req->params = 0;
 	req->opt2 = htonl(opt2);
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
@@ -527,7 +554,7 @@ static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
+	iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 	start_ep_timer(ep);
 	state_set(&ep->com, MPA_REQ_SENT);
 	return;
@@ -578,8 +605,7 @@ static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
@@ -630,8 +656,7 @@ static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
 	req->sndseq = htonl(ep->snd_seq);
 	ep->mpa_skb = skb;
 	state_set(&ep->com, MPA_REP_SENT);
-	l2t_send(ep->com.tdev, skb, ep->l2t);
-	return 0;
+	return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 }
 
 static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
@@ -795,7 +820,7 @@ static int update_rx_credits(struct iwch_ep *ep, u32 credits)
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
 	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
 	skb->priority = CPL_PRIORITY_ACK;
-	cxgb3_ofld_send(ep->com.tdev, skb);
+	iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 	return credits;
 }
 
@@ -1203,8 +1228,7 @@ static int listen_start(struct iwch_listen_ep *ep)
 	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
 
 	skb->priority = 1;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
@@ -1237,8 +1261,7 @@ static int listen_stop(struct iwch_listen_ep *ep)
 	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
-	cxgb3_ofld_send(ep->com.tdev, skb);
-	return 0;
+	return iwch_cxgb3_ofld_send(ep->com.tdev, skb);
 }
 
 static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
@@ -1286,7 +1309,7 @@ static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
 	rpl->opt2 = htonl(opt2);
 	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
 	skb->priority = CPL_PRIORITY_SETUP;
-	l2t_send(ep->com.tdev, skb, ep->l2t);
+	iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
 
 	return;
 }
@@ -1315,7 +1338,7 @@ static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
 		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
 		rpl->opt2 = 0;
 		rpl->rsvd = rpl->opt2;
-		cxgb3_ofld_send(tdev, skb);
+		iwch_cxgb3_ofld_send(tdev, skb);
 	}
 }
 
@@ -1613,7 +1636,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
 	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
 	rpl->cmd = CPL_ABORT_NO_RST;
-	cxgb3_ofld_send(ep->com.tdev, rpl_skb);
+	iwch_cxgb3_ofld_send(ep->com.tdev, rpl_skb);
 out:
 	if (release)
 		release_ep_resources(ep);
@@ -2017,8 +2040,11 @@ int iwch_destroy_listen(struct iw_cm_id *cm_id)
 	ep->com.rpl_done = 0;
 	ep->com.rpl_err = 0;
 	err = listen_stop(ep);
+	if (err)
+		goto done;
 	wait_event(ep->com.waitq, ep->com.rpl_done);
 	cxgb3_free_stid(ep->com.tdev, ep->stid);
+done:
 	err = ep->com.rpl_err;
 	cm_id->rem_ref(cm_id);
 	put_ep(&ep->com);
@@ -2030,12 +2056,22 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
 	int ret=0;
 	unsigned long flags;
 	int close = 0;
+	int fatal = 0;
+	struct t3cdev *tdev;
+	struct cxio_rdev *rdev;
 
 	spin_lock_irqsave(&ep->com.lock, flags);
 
 	PDBG("%s ep %p state %s, abrupt %d\n", __func__, ep,
 	     states[ep->com.state], abrupt);
 
+	tdev = (struct t3cdev *)ep->com.tdev;
+	rdev = (struct cxio_rdev *)tdev->ulp;
+	if (cxio_fatal_error(rdev)) {
+		fatal = 1;
+		close_complete_upcall(ep);
+		ep->com.state = DEAD;
+	}
 	switch (ep->com.state) {
 	case MPA_REQ_WAIT:
 	case MPA_REQ_SENT:
@@ -2075,7 +2111,11 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
 			ret = send_abort(ep, NULL, gfp);
 		else
 			ret = send_halfclose(ep, gfp);
+		if (ret)
+			fatal = 1;
 	}
+	if (fatal)
+		release_ep_resources(ep);
 	return ret;
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index aa72d18..9324aa1 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -751,7 +751,7 @@ int iwch_post_zb_read(struct iwch_qp *qhp)
 	wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)|
 						V_FW_RIWR_LEN(flit_cnt));
 	skb->priority = CPL_PRIORITY_DATA;
-	return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
+	return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
 }
 
 /*
@@ -783,7 +783,7 @@ int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
 			 V_FW_RIWR_FLAGS(T3_COMPLETION_FLAG | T3_NOTIFY_FLAG));
 	wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid));
 	skb->priority = CPL_PRIORITY_DATA;
-	return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
+	return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb);
 }
 
 /*


From sean.hefty at intel.com  Mon Feb 23 17:34:51 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 23 Feb 2009 17:34:51 -0800
Subject: [ofa-general] [PATCH] [ib-diag] saquery: add support for WinOF
Message-ID: <608DC3F308254BB78890D5FD91B7CB33@amr.corp.intel.com>

A lot of type casting with include fix-ups.  Luckily, because
the macro CHECK_AND_SET_VAL() was added, I could add type casts
into the macro and avoid sprinkling even more throughout the code.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
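
As background for the (char *) casts: arithmetic on a void pointer is a
GNU C extension, so compilers that do not implement it reject
"mad + offset". A minimal sketch of the portable form follows. The
function name and DEMO_DATA_OFFS are placeholders for illustration; the
value 56 is an assumption of the sketch, not taken from the libibmad
headers.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DATA_OFFS 56	/* placeholder header size, assumed */

/*
 * Return a pointer to record 'i' in a buffer that starts with a fixed
 * header followed by equally sized records.  Casting to char * makes
 * the byte arithmetic valid in standard C.
 */
static void *demo_get_rec(void *mad, unsigned i, unsigned rec_size)
{
	assert(mad != NULL);
	return (char *) mad + DEMO_DATA_OFFS + (size_t) i * rec_size;
}
```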


 infiniband-diags/src/saquery.c |   80 ++++++++++++++++++++++------------------
 1 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 9726d22..9d5f475 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -37,20 +37,25 @@
  *
  */
 
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
 #include <unistd.h>
 #include <stdio.h>
 #include <arpa/inet.h>
 #include <ctype.h>
 #include <string.h>
 #include <errno.h>
+#include <assert.h>
 
 #define _GNU_SOURCE
 #include <getopt.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/iba/ib_types.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <iba/ib_types.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -170,7 +175,7 @@ recv_mad:
 	if (ibdebug > 1)
 		xdump(stdout, "SA Response:\n", mad, len);
 
-	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
+	method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F);
 	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
 	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
 	result.p_result_madw = mad;
@@ -189,12 +194,12 @@ recv_mad:
 static void *get_query_rec(void *mad, unsigned i)
 {
 	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
-	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
+	return (char *) mad + IB_SA_DATA_OFFS + i * (offset << 3);
 }
 
 static unsigned valid_gid(ib_gid_t *gid)
 {
-	ib_gid_t zero_gid = { };
+	ib_gid_t zero_gid = { 0 };
 	return memcmp(&zero_gid, gid, sizeof(*gid));
 }
 
@@ -442,7 +447,7 @@ static void dump_multicast_member_record(void *data)
 	char gid_str2[INET6_ADDRSTRLEN];
 	ib_member_rec_t *p_mcmr = data;
 	uint16_t mlid = cl_ntoh16(p_mcmr->mlid);
-	int i = 0;
+	unsigned i = 0;
 	char *node_name = "<unknown>";
 
 	/* go through the node records searching for a port guid which matches
@@ -758,7 +763,7 @@ static void dump_one_mft_record(void *data)
 
 static void dump_results(struct query_res *r, void (*dump_func) (void *))
 {
-	int i;
+	unsigned i;
 	for (i = 0; i < r->result_cnt; i++) {
 		void *data = get_query_rec(r->p_result_madw, i);
 		dump_func(data);
@@ -768,7 +773,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *))
 static void return_mad(void)
 {
 	if (result.p_result_madw) {
-		free(result.p_result_madw - umad_size());
+		free((char *) result.p_result_madw - umad_size());
 		result.p_result_madw = NULL;
 	}
 }
@@ -839,7 +844,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid)
 {
 	ib_node_record_t *node_record = NULL;
 	ib_node_info_t *p_ni = NULL;
-	int i = 0, ret;
+	unsigned i;
+	int ret;
 
 	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
 	if (ret)
@@ -869,7 +875,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name)
 	if (isalpha(name[0]))
 		assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS);
 	else
-		rc_lid = atoi(name);
+		rc_lid = (uint16_t) atoi(name);
 	if (rc_lid == 0)
 		fprintf(stderr, "Failed to find lid for \"%s\"\n", name);
 	return rc_lid;
@@ -917,8 +923,8 @@ static int parse_lid_and_ports(bind_handle_t h,
 
 #define cl_hton8(x) (x)
 #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \
-	if (val > comp_with) { \
-		target = cl_hton##size(val); \
+	if ((uint##size##_t) val > (uint##size##_t) comp_with) { \
+		target = cl_hton##size((uint##size##_t) val); \
 		comp_mask |= IB_##name##_COMPMASK_##mask; \
 	}
 
@@ -951,7 +957,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask)
 
 static int print_node_records(bind_handle_t h)
 {
-	int i = 0, ret;
+	unsigned i;
+	int ret;
 
 	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
 	if (ret)
@@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h,
 	CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID);
 	CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT);
 	CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL);
-	pr.hop_flow_raw |= cl_hton32(flow << 8);
+	pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8);
 	CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS);
 	CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE);
 	CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR, NUMBPATH);
@@ -1089,7 +1096,7 @@ static int print_multicast_member_records(bind_handle_t h)
 
 return_mc:
 	if (mc_group_result.p_result_madw)
-		free(mc_group_result.p_result_madw - umad_size());
+		free((char *) mc_group_result.p_result_madw - umad_size());
 
 	return ret;
 }
@@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
 	memset(&pktr, 0, sizeof(pktr));
 	CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID);
 	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
-	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
+	CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK);
 
 	return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0,
 					comp_mask, &pktr, smkey,
@@ -1503,13 +1510,13 @@ static int process_opt(void *context, int ch, char *optarg)
 		query_type = IB_SA_ATTR_LINKRECORD;
 		break;
 	case 5:
-		p->slid = strtoul(optarg, NULL, 0);
+		p->slid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 6:
-		p->dlid = strtoul(optarg, NULL, 0);
+		p->dlid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 7:
-		p->mlid = strtoul(optarg, NULL, 0);
+		p->mlid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 14:
 		if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0)
@@ -1534,7 +1541,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->numb_path = strtoul(optarg, NULL, 0);
 		break;
 	case 18:
-		p->pkey = strtoul(optarg, NULL, 0);
+		p->pkey = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'Q':
 		p->qos_class = strtoul(optarg, NULL, 0);
@@ -1543,19 +1550,19 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->sl = strtoul(optarg, NULL, 0);
 		break;
 	case 'M':
-		p->mtu = strtoul(optarg, NULL, 0);
+		p->mtu = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'R':
-		p->rate = strtoul(optarg, NULL, 0);
+		p->rate = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 20:
-		p->pkt_life = strtoul(optarg, NULL, 0);
+		p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'q':
 		p->qkey = strtoul(optarg, NULL, 0);
 		break;
 	case 'T':
-		p->tclass = strtoul(optarg, NULL, 0);
+		p->tclass = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'F':
 		p->flow_label = strtoul(optarg, NULL, 0);
@@ -1564,10 +1571,10 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->hop_limit = strtoul(optarg, NULL, 0);
 		break;
 	case 21:
-		p->scope = strtoul(optarg, NULL, 0);
+		p->scope = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'J':
-		p->join_state = strtoul(optarg, NULL, 0);
+		p->join_state = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'X':
 		p->proxy_join = strtoul(optarg, NULL, 0);
@@ -1582,14 +1589,7 @@ int main(int argc, char **argv)
 {
 	char usage_args[1024];
 	bind_handle_t h;
-	struct query_params params = {
-		.hop_limit = -1,
-		.reversible = -1,
-		.numb_path = -1,
-		.qos_class = -1,
-		.sl = -1,
-		.proxy_join = -1,
-	};
+	struct query_params params;
 	const struct query_cmd *q;
 	ib_api_status_t status;
 	int n;
@@ -1643,9 +1643,17 @@ int main(int argc, char **argv)
 		{ "scope", 21, 1, NULL, "Scope (MCMemberRecord)" },
 		{ "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" },
 		{ "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" },
-		{}
+		{ 0 }
 	};
 
+	memset(&params, 0, sizeof params);
+	params.hop_limit = -1;
+	params.reversible = -1;
+	params.numb_path = -1;
+	params.qos_class = -1;
+	params.sl = -1;
+	params.proxy_join = -1;
+
 	n = sprintf(usage_args, "[query-name] [<name> | <lid> | <guid>]\n"
 		    "\nSupported query names (and aliases):\n");
 	for (q = query_cmds; q->name; q++) {
@@ -1680,7 +1688,7 @@ int main(int argc, char **argv)
 
 	if (argc) {
 		if (node_print_desc == NAME_OF_LID) {
-			requested_lid = strtoul(argv[0], NULL, 0);
+			requested_lid = (uint16_t) strtoul(argv[0], NULL, 0);
 			requested_lid_flag++;
 		} else if (node_print_desc == NAME_OF_GUID) {
 			requested_guid = strtoul(argv[0], NULL, 0);
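
One thing that may be worth checking when reviewing the reworked CHECK_AND_SET_VAL: several callers pass -1 as comp_with, and the defaults set in main() are also -1. The following is a minimal standalone sketch, not code from saquery.c. The helper names are hypothetical and the fields are assumed to be plain int. It compares the "is this value set?" test with and without both operands cast to an unsigned fixed-width type.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; they do not exist in saquery.c. */

/* Plain form of the test: compare the operands as given. */
static int is_set_plain(int val, int comp_with)
{
	return val > comp_with;
}

/* Both operands cast to uint8_t, as a size-8 expansion of the macro would do. */
static int is_set_u8(int val, int comp_with)
{
	return (uint8_t) val > (uint8_t) comp_with;
}

/* Both operands cast to uint16_t, as a size-16 expansion would do. */
static int is_set_u16(int val, int comp_with)
{
	return (uint16_t) val > (uint16_t) comp_with;
}
```

With a sentinel of 0 the two forms agree: is_set_u16(5, 0) and is_set_plain(5, 0) are both 1. With a sentinel of -1 the cast turns the right-hand side into the largest value of the type, so is_set_u8(3, -1) is 0 where is_set_plain(3, -1) is 1.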


From Jie.Cai at cs.anu.edu.au  Mon Feb 23 20:34:04 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Tue, 24 Feb 2009 15:34:04 +1100
Subject: [ofa-general] Bandwidth of performance with multirail IB
In-Reply-To: <20090223211155.730AFE28137@openfabrics.org>
References: <20090223211155.730AFE28137@openfabrics.org>
Message-ID: <49A378BC.5010806@cs.anu.edu.au>

I have implemented a uDAPL program to measure the bandwidth on IB with 
multirail connections.

The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two 
ports.

The program uses the two ports on each node of the cluster to build
multirail IB connections.

The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which 
is almost the same as single rail connections.

Does anyone have similar experience?


From jackm at dev.mellanox.co.il  Mon Feb 23 23:07:09 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 24 Feb 2009 09:07:09 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when
	unloading the driver
In-Reply-To: <adaocwtuhs3.fsf@cisco.com>
References: <200902171742.38223.jackm@dev.mellanox.co.il>
	<200902231330.29669.jackm@dev.mellanox.co.il>
	<adaocwtuhs3.fsf@cisco.com>
Message-ID: <200902240907.09398.jackm@dev.mellanox.co.il>

On Monday 23 February 2009 20:31, Roland Dreier wrote:
>  > I'm not sure that it does.  This does not make sysfs access atomic wrt module unloading.
>  > I think an app can still lose its timeslice while inside the sysfs access, and module
>  > unload can still occur while the app is waiting for a new time slice (although the code pages
>  > will not be removed as yet -- see below).
> 
> Not sure I follow... the low-level driver must handle requests until
> ib_unregister_device() returns, and with the change I proposed,
> ib_unregister_device() will not return until all sysfs files are gone
> (and no open file handles remain).
> 
>  > What about the patch I just submitted?
> 
> I'd rather not add a superfluous mutex that adds complexity when a
> simpler solution is available.

You're right, your solution does work.  I was just concerned that the unregister-sysfs calls
would simply prevent new accessors from seeing the files, but would return before the file reference count
reached zero (thus allowing low-level driver cleanup while current accessors were still in progress).
I checked, and this does not happen.  As you mention in your answer, the unregister-sysfs calls do not
return while someone still has an open file handle on these files.
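
The ordering described above (new accessors are refused first, and unregister returns only after the existing ones have drained) can be sketched as a toy userspace model. This is not the kernel's sysfs implementation: the struct and function names are hypothetical, and locking and sleeping are left out on purpose.

```c
#include <stdbool.h>

/* Toy model; names are hypothetical and are not kernel APIs. */
struct attr_file {
	int active;     /* accessors currently inside a show/store call */
	bool removing;  /* set once unregister has started */
};

/* An accessor may enter only while removal has not started. */
static bool accessor_enter(struct attr_file *f)
{
	if (f->removing)
		return false;
	f->active++;
	return true;
}

/* An accessor that entered successfully leaves again. */
static void accessor_exit(struct attr_file *f)
{
	f->active--;
}

/* Start removal: from now on new accessors are refused. */
static void unregister_begin(struct attr_file *f)
{
	f->removing = true;
}

/* Unregister may return, and driver cleanup may proceed, only when drained. */
static bool unregister_may_return(const struct attr_file *f)
{
	return f->removing && f->active == 0;
}
```

In this model the low-level driver is torn down only after unregister_may_return() is true, which is the property the extra mutex was meant to guarantee.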

- Jack


From cap at nsc.liu.se  Tue Feb 24 00:41:53 2009
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Tue, 24 Feb 2009 09:41:53 +0100
Subject: [ofa-general] Bandwidth of performance with multirail IB
In-Reply-To: <49A378BC.5010806@cs.anu.edu.au>
References: <20090223211155.730AFE28137@openfabrics.org>
	<49A378BC.5010806@cs.anu.edu.au>
Message-ID: <200902240941.58634.cap@nsc.liu.se>

On Tuesday 24 February 2009, Jie Cai wrote:
> I have implemented a uDAPL program to measure the bandwidth on IB with
> multirail connections.
>
> The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two
> ports.
>
> The program uses the two ports on each node of the cluster to build
> multirail IB connections.
>
> The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which
> is almost the same as single rail connections.

Assuming you have a 2.5 GT/s PCI Express x8 slot, that speed is a result of the
bus not being able to keep up with the HCA. Since the bus holds back even a
single DDR IB port, you see no improvement with two ports.

To fully drive a DDR IB port you need either x16 PCI Express at 2.5 GT/s or x8
at 5 GT/s. For one QDR port or two DDR ports you'll need even more...
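
The arithmetic behind these numbers can be sketched as below. The assumptions are mine, not stated in the thread: both PCI Express (at 2.5 and 5 GT/s) and IB (SDR/DDR/QDR) use 8b/10b encoding, the IB ports are 4x wide, and a DDR lane signals at 5 Gbit/s. The function name is hypothetical. Packet and protocol overhead are ignored, so real throughput will be lower than these figures.

```c
/*
 * Data rate in MB/s, per direction, for a link that uses 8b/10b encoding
 * (10 bits on the wire carry 8 bits of data).
 *   rate_mt: per-lane signalling rate in MT/s (PCIe) or Mbit/s (IB)
 *   lanes:   link width
 * Packet and protocol overhead are not modelled.
 */
static long link_mbytes_per_s(long rate_mt, int lanes)
{
	long data_mbit = rate_mt * 8 / 10;  /* strip the 8b/10b overhead */

	return data_mbit / 8 * lanes;       /* bits to bytes, times width */
}
```

Under these assumptions link_mbytes_per_s(2500, 8) gives 2000 MB/s for the x8 slot, the same as link_mbytes_per_s(5000, 4) for one 4x DDR port, so once PCIe packet overhead is subtracted the bus becomes the limit. An x16 slot at 2.5 GT/s or an x8 slot at 5 GT/s both come out at 4000 MB/s.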

/Peter

> Does anyone have similar experience?

From tziporet at dev.mellanox.co.il  Tue Feb 24 01:59:41 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 24 Feb 2009 11:59:41 +0200
Subject: [ofa-general] el5.3 backport of 1.4(.0)
In-Reply-To: <1235400004.4588.43.camel@spike.ugent.be>
References: <1235400004.4588.43.camel@spike.ugent.be>
Message-ID: <49A3C50D.4050609@mellanox.co.il>

Stijn De Weirdt wrote:
> hi all,
>
> i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
> one thing we would also like to look at is switching from OFED 1.3.2 to
> OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
> fixes only exist in the current 1.4.1 daily snapshots.
> did anyone already try to backport the el5.3 backport fixes from 1.4.1
> to 1.4.0?
>
> many thanks,
>
> stijn
>
>   
It's the same tree, so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too.

Tziporet


From stijn.deweirdt at ugent.be  Tue Feb 24 02:26:15 2009
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Tue, 24 Feb 2009 11:26:15 +0100
Subject: [ofa-general] el5.3 backport of 1.4(.0)
In-Reply-To: <49A3C50D.4050609@mellanox.co.il>
References: <1235400004.4588.43.camel@spike.ugent.be>
	<49A3C50D.4050609@mellanox.co.il>
Message-ID: <1235471175.21577.15.camel@spike.ugent.be>

> Stijn De Weirdt wrote:
> > hi all,
> >
> > i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
> > one thing we would also like to look at is switching from OFED 1.3.2 to
> > OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
> > fixes only exist in the current 1.4.1 daily snapshots.
> > did anyone already try to backport the el5.3 backport fixes from 1.4.1
> > to 1.4.0?
> >
> > many thanks,
> >
> > stijn
> >
> >   
> Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too
> 
hi tziporet,

i actually already tried that, moving the following files from a recent
1.4.1 daily to the 1.4.0 ofa_kernel src rpm
ofed_scripts/get_backport_dir.sh
kernel_addons/backport/2.6.18-EL5.3/
kernel_addons/backport/2.6.18-EL5.3/

but rebuilding this gave the following error:
(i have to say that the kernel i used was 2.6.18-128.1.1 instead of the
original el5.3 2.6.18-128)

        /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch
patching file drivers/net/mlx4/en_netdev.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
drivers/net/mlx4/en_netdev.c.rej
patching file drivers/net/mlx4/en_tx.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
4 out of 4 hunks ignored -- saving rejects to file
drivers/net/mlx4/en_tx.c.rej
patching file drivers/net/mlx4/mlx4_en.h
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file
drivers/net/mlx4/mlx4_en.h.rej
Failed to apply
patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch

it is also a patch file that doesn't exist in the el5.2 backport, so i
was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i
asked it here.

anyway, many thanks for looking into this!

stijn

> Tziporet
> 
-- 
The system will shutdown in 5 minutes.


From vlad at lists.openfabrics.org  Tue Feb 24 03:19:02 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 24 Feb 2009 03:19:02 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090224-0200 daily build status
Message-ID: <20090224111902.841FAE61203@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From vlad at dev.mellanox.co.il  Tue Feb 24 03:26:43 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 24 Feb 2009 13:26:43 +0200
Subject: [ofa-general] el5.3 backport of 1.4(.0)
In-Reply-To: <1235471175.21577.15.camel@spike.ugent.be>
References: <1235400004.4588.43.camel@spike.ugent.be>	<49A3C50D.4050609@mellanox.co.il>
	<1235471175.21577.15.camel@spike.ugent.be>
Message-ID: <49A3D973.9010601@dev.mellanox.co.il>

Stijn De Weirdt wrote:
>> Stijn De Weirdt wrote:
>>     
>>> hi all,
>>>
>>> i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
>>> one thing we would also like to look at is switching from OFED 1.3.2 to
>>> OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
>>> fixes only exist in the current 1.4.1 daily snapshots.
>>> did anyone already try to backport the el5.3 backport fixes from 1.4.1
>>> to 1.4.0?
>>>
>>> many thanks,
>>>
>>> stijn
>>>
>>>   
>>>       
>> Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too
>>
>>     
> hi tziporet,
>
> i actually already tried that, moving the following files from a recent
> 1.4.1 daily to the 1.4.0 ofa_kernel src rpm
> ofed_scripts/get_backport_dir.sh
> kernel_addons/backport/2.6.18-EL5.3/
> kernel_addons/backport/2.6.18-EL5.3/
>
> but rebuilding this gave the following error:
> (i have to say that the kernel i used was 2.6.18-128.1.1 instead the
> original el5.3 2.6.18-128)
>
>         /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch
> patching file drivers/net/mlx4/en_netdev.c
> Reversed (or previously applied) patch detected!  Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 2 out of 2 hunks ignored -- saving rejects to file
> drivers/net/mlx4/en_netdev.c.rej
> patching file drivers/net/mlx4/en_tx.c
> Reversed (or previously applied) patch detected!  Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 4 out of 4 hunks ignored -- saving rejects to file
> drivers/net/mlx4/en_tx.c.rej
> patching file drivers/net/mlx4/mlx4_en.h
> Reversed (or previously applied) patch detected!  Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 1 out of 1 hunk ignored -- saving rejects to file
> drivers/net/mlx4/mlx4_en.h.rej
> Failed to apply
> patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch
>
> it is also a patch file that doesn't exist in the el5.2 backport, so i
> was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i
> asked it here.
>
> anyway, many thanks for looking into this!
>
> stijn
>
>   
Hi Stijn,

You have probably copied the RHEL 5.3 backports into an ofa_kernel-1.4
directory where the patches (RHEL5.0) were already applied.

In any case, it is better to take the latest ofa_kernel src rpm instead
of updating the source rpm that comes with OFED-1.4.
The difference is RHEL5.3 support and some bug fixes (see git log).

Regards,
Vladimir


From stijn.deweirdt at ugent.be  Tue Feb 24 04:36:04 2009
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Tue, 24 Feb 2009 13:36:04 +0100
Subject: [ofa-general] el5.3 backport of 1.4(.0)
In-Reply-To: <49A3D973.9010601@dev.mellanox.co.il>
References: <1235400004.4588.43.camel@spike.ugent.be>
	<49A3C50D.4050609@mellanox.co.il>
	<1235471175.21577.15.camel@spike.ugent.be>
	<49A3D973.9010601@dev.mellanox.co.il>
Message-ID: <1235478964.21577.69.camel@spike.ugent.be>

On Tue, 2009-02-24 at 13:26 +0200, Vladimir Sokolovsky wrote:
> Stijn De Weirdt wrote:
> >> Stijn De Weirdt wrote:
> >>     
> >>> hi all,
> >>>
> >>> i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
> >>> one thing we would also like to look at is switching from OFED 1.3.2 to
> >>> OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
> >>> fixes only exist in the current 1.4.1 daily snapshots.
> >>> did anyone already try to backport the el5.3 backport fixes from 1.4.1
> >>> to 1.4.0?
> >>>
> >>> many thanks,
> >>>
> >>> stijn
> >>>
> >>>   
> >>>       
> >> Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too
> >>
> >>     
> > hi tziporet,
> >
> > i actually already tried that, moving the following files from a recent
> > 1.4.1 daily to the 1.4.0 ofa_kernel src rpm
> > ofed_scripts/get_backport_dir.sh
> > kernel_addons/backport/2.6.18-EL5.3/
> > kernel_addons/backport/2.6.18-EL5.3/
> >
> > but rebuilding this gave the following error:
> > (i have to say that the kernel i used was 2.6.18-128.1.1 instead the
> > original el5.3 2.6.18-128)
> >
> >         /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch
> > patching file drivers/net/mlx4/en_netdev.c
> > Reversed (or previously applied) patch detected!  Assume -R? [n]
> > Apply anyway? [n]
> > Skipping patch.
> > 2 out of 2 hunks ignored -- saving rejects to file
> > drivers/net/mlx4/en_netdev.c.rej
> > patching file drivers/net/mlx4/en_tx.c
> > Reversed (or previously applied) patch detected!  Assume -R? [n]
> > Apply anyway? [n]
> > Skipping patch.
> > 4 out of 4 hunks ignored -- saving rejects to file
> > drivers/net/mlx4/en_tx.c.rej
> > patching file drivers/net/mlx4/mlx4_en.h
> > Reversed (or previously applied) patch detected!  Assume -R? [n]
> > Apply anyway? [n]
> > Skipping patch.
> > 1 out of 1 hunk ignored -- saving rejects to file
> > drivers/net/mlx4/mlx4_en.h.rej
> > Failed to apply
> > patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch
> >
> > it is also a patch file that doesn't exist in the el5.2 backport, so i
> > was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i
> > asked it here.
> >
> > anyway, many thanks for looking into this!
> >
> > stijn
> >
> >   
> Hi Stijn,
> 
hi vladimir,

> You have, probably, copied RHEL 5.3 backports into ofa_kernel-1.4 
> directory where the patches (RHEL5.0) already were applied.
> 
i did what the ofed_patch.sh script does to make a new src.rpm, but
instead of patching i copied said file and directories.

> In any case, it is better to take the latest ofa_kernel src rpm instead 
> of updating source rpm coming from OFED-1.4.
> The difference is RHEL5.3 support and some bug fixes (see git log).
thanks (and ofa_kernel builds ok)

stijn

> Regards,
> Vladimir
-- 
The system will shutdown in 5 minutes.


From sashak at voltaire.com  Tue Feb 24 06:37:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Feb 2009 16:37:06 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
	for the newly discovered port of the known node
In-Reply-To: <499C7E2D.8050301@dev.mellanox.co.il>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218181955.GX5910@sashak.voltaire.com>
	<499C7E2D.8050301@dev.mellanox.co.il>
Message-ID: <20090224143706.GO7641@sashak.voltaire.com>

Hi Yevgeny,

On 23:31 Wed 18 Feb     , Yevgeny Kliteynik wrote:

[snip...]
>
> Good point.
> I'll repost the patch when we finish discussing it.

Let's go this way now. Please resend the patch.

After looking more closely at the scenario with the SwitchInfo/PortInfo
race, I'm thinking about two optimizations there:

1. Initialize all switch ports (and not only the local port and port 0) as
soon as the first NodeInfo is received (via osm_node_new()). This makes your
patch unnecessary, but it is a bigger change which will definitely require
some heavy testing, so it is fine IMO to do it subsequently.

2. Request PortInfo for all switch ports as soon as the first NodeInfo is
received (without waiting for SwitchInfo), in parallel with the SwitchInfo
request. This should simplify the subnet discovery flow and speed it up.
This will also require some heavy testing...

What do you think about (1) and (2)? Do you see any disadvantages?

Sasha


From cameron at harr.org  Tue Feb 24 09:18:33 2009
From: cameron at harr.org (Cameron Harr)
Date: Tue, 24 Feb 2009 10:18:33 -0700
Subject: [Scst-devel] [ofa-general]
	SRP/mlx4	interrupts	throttling	performance
In-Reply-To: <4995D1EE.4000807@vlnb.net>
References: <48E386F6.5040502@fusionio.com>	<48EBA72B.4000909@harr.org>	<48EBBDB1.1080203@harr.org>	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>
	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vlnb.net>
	<4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net>
Message-ID: <49A42BE9.4030603@harr.org>


Vladislav Bolkhovitin wrote:
>>
>> Vladislav Bolkhovitin wrote:
>>> Try the following variants:
>>>
>>> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 
>>> and 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 
>>> to CPU5, fct2-worker to CPU7
>>>
>>> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to 
>>> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for 
>>> other processes.
>>>
>>> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to 
>>> all CPUs, except CPU0 and CPU4, no affinity for other processes.
>> These are tests 1, 2 and 3, respectively
>>> Or other similar variants you'd like (even CPUs relate to physical 
>>> CPU0, odd CPUs relate to physical CPU1). For instance, you can try 
>>> to affine IRQs 169 and 177 to CPU1.
>> I did two other tests (Tests 4,5), that has the mlx4_core (comp) IRQ 
>> (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169, 
>> 177) pinned to CPU 4, fct0 and scsi_tgt0 on CPUs 2&3, fct1 and 
>> scsi_tgt1 on CPUs 4&6 (test 4) OR fct1 and scsi_tgt1 on CPUs 5&6.
>>> No points to run for srptthread=1, for it just produce a baseline 
>>> with no affinity at all.
>> I ran with these anyway to look at differences among the tests. 
>> Having this thread enabled always results in better performance.
>>> Please do each run several times and write down an average result 
>>> between runs and approximate variation between them in %%. Otherwise 
>>> we can't make any reliable conclusions.
>> I ran each test 3 times and took the averages. In order to get a 
>> quick look at performance per run, I added a column in the summary 
>> that sums the IOPs for each test with SRPT thread enabled and then 
>> not enabled. Test 4 seems to give the best results. Here's a brief 
>> summary of that summary with just SRPT thread=0:
>>
>> Baseline: 356226.39
>> Test 1:   371217.6533
>> Test 2:   370553.78
>> Test 3:   373295.2033
>> Test 4:   399385.2233
>> Test 5:   393204.5833
>
> Linux CPU scheduler does really impressive job!
>
> Interesting, will something change with:
>
> 1. The latest SVN. It has some changes, which might make a difference.
Sorry for the delay.
This is with SVN rev 673. I don't reach the peak I hit before, but at a
1.8% difference (with test 4), it's statistical noise.

Test 1: 390631.5133
Test 2: 386125.4133
Test 3: 356268.0267
Test 4: 392237.7867
Test 5: 390012.1467
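
For reference, the 1.8% figure can be reproduced from the two Test 4 results quoted in this thread. This is a minimal sketch; the function name is hypothetical and the difference is taken relative to the earlier result.

```c
/* Relative change between two measurements, in percent of the first one. */
static double rel_diff_percent(double before, double after)
{
	double diff = before - after;

	if (diff < 0)
		diff = -diff;
	return diff / before * 100.0;
}
```

Calling rel_diff_percent(399385.2233, 392237.7867) gives roughly 1.8. The same function can be applied to the other rows to compare them against the earlier run.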
>
> 2. Pass-through dev handler instead of BLOCKIO, which you are using.
>
The ioDrive driver doesn't provide a full SCSI emulation layer and shows 
up as /dev/fio[abc...]. From my understanding of the pass-through 
handler, I need to have the SCSI Host:Channel:ID:LUN and those aren't 
available to me.

Cameron
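
For anyone repeating the pinning used in tests 1 to 5, the IRQ side can be sketched as follows. It assumes the standard Linux /proc/irq/<n>/smp_affinity interface, which takes a hexadecimal CPU bitmask and needs root to write. The helper names are hypothetical. Pinning the scsi_tgt and fct worker threads would be done separately, for example with taskset or sched_setaffinity(2), and is not shown here.

```c
#include <stddef.h>
#include <stdio.h>

/* Build a CPU bitmask (bit N set means CPU N is allowed) from a CPU list. */
static unsigned long cpu_mask(const int *cpus, size_t n)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < n; i++)
		mask |= 1UL << cpus[i];
	return mask;
}

/*
 * Write the mask in hex to /proc/irq/<irq>/smp_affinity.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int set_irq_affinity(int irq, unsigned long mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lx\n", mask);
	return fclose(f) == 0 ? 0 : -1;
}
```

For example, pinning IRQs 169 and 177 to CPU 4 corresponds to a mask of 0x10, and allowing CPUs 2 and 3 corresponds to 0xc.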


From daniel.miles at rnanetworks.com  Tue Feb 24 09:48:03 2009
From: daniel.miles at rnanetworks.com (Daniel Miles)
Date: Tue, 24 Feb 2009 09:48:03 -0800
Subject: [ofa-general] how do I take IB interfaces offline?
Message-ID: <C5C972D3.2FF%daniel.miles@rnanetworks.com>

Hello, everyone. I wonder if anyone can tell me how to take an IB interface
offline on a running Linux (CentOS 5 with OFED 1.3.1) system? I can cause it
to lose its IP address with ifdown, but it seems that the IP address is only
involved in establishing new connections, and removing it doesn't prevent the
device from fielding traffic on established connections.

Does anybody know how this is done?

From cameron at harr.org  Tue Feb 24 09:50:55 2009
From: cameron at harr.org (Cameron Harr)
Date: Tue, 24 Feb 2009 10:50:55 -0700
Subject: [ofa-general] how do I take IB interfaces offline?
In-Reply-To: <C5C972D3.2FF%daniel.miles@rnanetworks.com>
References: <C5C972D3.2FF%daniel.miles@rnanetworks.com>
Message-ID: <49A4337F.8040304@harr.org>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090224/780c4240/attachment.html>

From vst at vlnb.net  Tue Feb 24 09:54:01 2009
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Tue, 24 Feb 2009 20:54:01 +0300
Subject: Re: [Scst-devel]
	[ofa-general]	SRP/mlx4	interrupts	throttling	performance
In-Reply-To: <49A42BE9.4030603@harr.org>
References: <48E386F6.5040502@fusionio.com>	<48EBA72B.4000909@harr.org>	<48EBBDB1.1080203@harr.org>	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>
	<4970F014.2030101@vlnb.net>	<4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net>
	<49A42BE9.4030603@harr.org>
Message-ID: <49A43439.7080405@vlnb.net>

Cameron Harr, on 02/24/2009 08:18 PM wrote:
> Vladislav Bolkhovitin wrote:
>>> Vladislav Bolkhovitin wrote:
>>>> Try the following variants:
>>>>
>>>> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 
>>>> and 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 
>>>> to CPU5, fct2-worker to CPU7
>>>>
>>>> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to 
>>>> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for 
>>>> other processes.
>>>>
>>>> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to 
>>>> all CPUs, except CPU0 and CPU4, no affinity for other processes.
>>> These are tests 1, 2 and 3, respectively
>>>> Or other similar variants you'd like (even CPUs relate to physical 
>>>> CPU0, odd CPUs relate to physical CPU1). For instance, you can try 
>>>> to affine IRQs 169 and 177 to CPU1.
>>> I did two other tests (Tests 4,5), that has the mlx4_core (comp) IRQ 
>>> (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169, 
>>> 177) pinned to CPU 4, fct0 and scsi_tgt0 on CPUs 2&3, fct1 and 
>>> scsi_tgt1 on CPUs 4&6 (test 4) OR fct1 and scsi_tgt1 on CPUs 5&6.
>>>> No points to run for srptthread=1, for it just produce a baseline 
>>>> with no affinity at all.
>>> I ran with these anyway to look at differences among the tests. 
>>> Having this thread enabled always results in better performance.
>>>> Please do each run several times and write down an average result 
>>>> between runs and approximate variation between them in %%. Otherwise 
>>>> we can't make any reliable conclusions.
>>> I ran each test 3 times and took the averages. In order to get a 
>>> quick look at performance per run, I added a column in the summary 
>>> that sums the IOPs for each test with SRPT thread enabled and then 
>>> not enabled. Test 4 seems to give the best results. Here's a brief 
>>> summary of that summary with just SRPT thread=0:
>>>
>>> Baseline: 356226.39
>>> Test 1:   371217.6533
>>> Test 2:   370553.78
>>> Test 3:   373295.2033
>>> Test 4:   399385.2233
>>> Test 5:   393204.5833
>> Linux CPU scheduler does really impressive job!
>>
>> Interesting, will something change with:
>>
>> 1. The latest SVN. It has some changes, which might make a difference.
> Sorry for the delay.
> This is with SVN rev 673. I don't hit the high I hit before, but at a 
> 1.8% difference (with test 4), it's statistically noise.
> 
> Test 1: 390631.5133
> Test 2: 386125.4133
> Test 3: 356268.0267
> Test 4: 392237.7867
> Test 5: 390012.1467
>> 2. Pass-through dev handler instead of BLOCKIO, which you are using.
>>
> The ioDrive driver doesn't provide a full SCSI emulation layer and shows 
> up as /dev/fio[abc...]. From my understanding of the pass-through 
> handler, I need to have the SCSI Host:Channel:ID:LUN and those aren't 
> available to me.

Yes. Although this is strange, because you use sdX devices, hence they 
should have full SCSI emulation and lsscsi should show the 
Host:Channel:ID:LUN numbers.

Thanks,
Vlad
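
The affinity variants quoted earlier in this message come down to CPU
bitmasks: /proc/irq/<N>/smp_affinity takes a hexadecimal mask of allowed
CPUs, and taskset accepts the same kind of mask for threads. Below is a
minimal sketch of how such masks could be computed. The helper name
cpu_mask and the count of 8 logical CPUs (inferred from CPU7 being the
highest CPU mentioned) are assumptions, not taken from the test scripts.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string that selects the given logical CPUs."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N set means CPU N is allowed
    return format(mask, "x")


NUM_CPUS = 8  # assumption: CPU0..CPU7, as in the variants above

# Variant 3: IRQ 82 on CPU0, IRQs 169 and 177 on CPU4,
# worker threads on every CPU except CPU0 and CPU4.
irq_82_mask = cpu_mask([0])
iodrive_irq_mask = cpu_mask([4])
worker_mask = cpu_mask([c for c in range(NUM_CPUS) if c not in (0, 4)])

print("IRQ 82 mask:       ", irq_82_mask)
print("IRQ 169/177 mask:  ", iodrive_irq_mask)
print("worker thread mask:", worker_mask)
```

The IRQ masks would be written to the smp_affinity file of each IRQ, and
the worker mask passed to taskset -p for each worker thread's PID.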


From cameron at harr.org  Tue Feb 24 09:55:25 2009
From: cameron at harr.org (Cameron Harr)
Date: Tue, 24 Feb 2009 10:55:25 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A43439.7080405@vlnb.net>
References: <48E386F6.5040502@fusionio.com>	<48EBBDB1.1080203@harr.org>	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vl
	nb.net>	<4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net>
	<49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net>
Message-ID: <49A4348D.6020303@harr.org>

Vladislav Bolkhovitin wrote:
>> 2. Pass-through dev handler instead of BLOCKIO, which you are using.
>>>
>> The ioDrive driver doesn't provide a full SCSI emulation layer and 
>> shows up as /dev/fio[abc...]. From my understanding of the 
>> pass-through handler, I need to have the SCSI Host:Channel:ID:LUN and 
>> those aren't available to me.
>
> Yes. Although this is strange, because you use sdX devices, hence they 
> should have full SCSI emulation and lsscsi should show the 
> Host:Channel:ID:LUN numbers.

I actually don't have sdX devices unless they are SRP targets on an 
initiator. On the target server, the native drive is /dev/fioX.
Cameron


From daniel.miles at rnanetworks.com  Tue Feb 24 09:56:13 2009
From: daniel.miles at rnanetworks.com (Daniel Miles)
Date: Tue, 24 Feb 2009 09:56:13 -0800
Subject: [ofa-general] how do I take IB interfaces offline?
In-Reply-To: <49A4337F.8040304@harr.org>
Message-ID: <C5C974BD.306%daniel.miles@rnanetworks.com>

Well, that would work, but it fails, telling me the mlx4_ib module is in
use. I suspect this is because there are active RDMA connections on it,
which is the reason I want to bring it down (I'm doing QA, I need to know
what happens if the card goes offline).


On 2/24/09 9:50 AM, "Cameron Harr" <cameron at harr.org> wrote:

> Have you tried /etc/init.d/openibd stop, or are you wanting something that
> doesn't shut down the whole IB system?
> 
> Daniel Miles wrote:
>> Hello, everyone. I wonder if anyone can tell me how to take an IB interface
>> offline on a running Linux (CentOS 5 with OFED 1.3.1) system? I can cause it
>> to lose its IP address with ifdown, but it seems that the IP address is only
>> involved in establishing new connections, and removing it doesn't prevent the
>> device from fielding traffic on established connections.
>>  
>> Does anybody know how this is done?
> 


From cameron at harr.org  Tue Feb 24 09:58:36 2009
From: cameron at harr.org (Cameron Harr)
Date: Tue, 24 Feb 2009 10:58:36 -0700
Subject: [ofa-general] how do I take IB interfaces offline?
In-Reply-To: <C5C974BD.306%daniel.miles@rnanetworks.com>
References: <C5C974BD.306%daniel.miles@rnanetworks.com>
Message-ID: <49A4354C.4050904@harr.org>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090224/93bd9518/attachment.html>

From weiny2 at llnl.gov  Tue Feb 24 11:14:37 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 24 Feb 2009 11:14:37 -0800
Subject: [ofa-general] how do I take IB interfaces offline?
In-Reply-To: <49A4354C.4050904@harr.org>
References: <C5C974BD.306%daniel.miles@rnanetworks.com>
	<49A4354C.4050904@harr.org>
Message-ID: <20090224111437.1f10eaa6.weiny2@llnl.gov>

On Tue, 24 Feb 2009 10:58:36 -0700
Cameron Harr <cameron at harr.org> wrote:

> This may be because you have the SM running. Try /etc/init.d/opensmd stop. If that doesn't work you'll want to find out what is actually using it. When you say RDMA, are you doing iSER or SRP? If that's the case, you'll need to free it up by removing it as a target or just unloading the modules.

Stopping the SM will not stop the traffic.  Try "ibportstate <switch_lid>
<port> disable" on the switch/port the HCA is plugged into.  This will
simulate the port going down.  You can then use "enable" to re-enable it.

Ira

> Cameron
> 
> Daniel Miles wrote: Well, that would work, but it fails, telling me the mlx4_ib module is in use. I suspect this is because there are active RDMA connections on it, which is the reason I want to bring it down (I’m doing QA, I need to know what happens if the card goes offline).
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From cameron at harr.org  Tue Feb 24 15:22:18 2009
From: cameron at harr.org (Cameron Harr)
Date: Tue, 24 Feb 2009 16:22:18 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A43439.7080405@vlnb.net>
References: <48E386F6.5040502@fusionio.com>	<48EBBDB1.1080203@harr.org>	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vl
	nb.net>	<4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net>
	<49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net>
Message-ID: <49A4812A.8050202@harr.org>

Vladislav Bolkhovitin wrote:
>>>> I ran each test 3 times and took the averages. In order to get a 
>>>> quick look at performance per run, I added a column in the summary 
>>>> that sums the IOPs for each test with SRPT thread enabled and then 
>>>> not enabled. Test 4 seems to give the best results. Here's a brief 
>>>> summary of that summary with just SRPT thread=0:
>>>>
>>>> Baseline: 356226.39
>>>> Test 1:   371217.6533
>>>> Test 2:   370553.78
>>>> Test 3:   373295.2033
>>>> Test 4:   399385.2233
>>>> Test 5:   393204.5833
>>> Linux CPU scheduler does really impressive job!
>>>
>>> Interesting, will something change with:
>>>
>>> 1. The latest SVN. It has some changes, which might make a difference.
>> Sorry for the delay.
>> This is with SVN rev 673. I don't hit the high I hit before, but at a 
>> 1.8% difference (with test 4), it's statistically noise.
>>
>> Test 1: 390631.5133
>> Test 2: 386125.4133
>> Test 3: 356268.0267
>> Test 4: 392237.7867
>> Test 5: 390012.1467 
I just ran again, this time with rev 680 and am a little concerned to 
see the drop in performance. I verified that debug is not on. I'll try 
to start another run on 680 to see if I get similar results.

Test 1: 368342.41
Test 2: 366787.2067
Test 3: 345334.68
Test 4: 372684.58
Test 5: 372184.8333
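
To put the drop in numbers, the relative change per test between the
rev 673 and rev 680 averages can be computed from the figures in this
thread. This is only a sketch: percent_change is a made-up helper name,
and without the per-run spread it cannot tell whether a difference lies
outside run-to-run noise.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a drop)."""
    return (new - old) / old * 100.0


# Averaged IOPS per test, copied from the results in this thread.
rev673 = {1: 390631.5133, 2: 386125.4133, 3: 356268.0267,
          4: 392237.7867, 5: 390012.1467}
rev680 = {1: 368342.41, 2: 366787.2067, 3: 345334.68,
          4: 372684.58, 5: 372184.8333}

for test in sorted(rev673):
    change = percent_change(rev673[test], rev680[test])
    print("Test %d: %+.2f%%" % (test, change))
```

For test 4 this comes out to a drop of roughly 5%, noticeably more than
the 1.8% difference seen between the earlier run and rev 673.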


From Jie.Cai at cs.anu.edu.au  Tue Feb 24 16:44:08 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Wed, 25 Feb 2009 11:44:08 +1100
Subject: [ofa-general] Bandwidth of performance with multirail IB
In-Reply-To: <200902240941.58634.cap@nsc.liu.se>
References: <20090223211155.730AFE28137@openfabrics.org>
	<49A378BC.5010806@cs.anu.edu.au>
	<200902240941.58634.cap@nsc.liu.se>
Message-ID: <49A49458.9070003@cs.anu.edu.au>


Peter Kjellstrom wrote:
> On Tuesday 24 February 2009, Jie Cai wrote:
>   
>> I have implemented a uDAPL program to measure the bandwidth on IB with
>> multirail connections.
>>
>> The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two
>> ports.
>>
>> The program utilize the two port on each node of cluster to build
>> multirail IB connections.
>>
>> The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which
>> is almost the same as single rail connections.
>>     
>
> Assuming you have a 2.5 GT/s pci-express x8 that speed is a result of the bus 
> not being able to keep up with the HCA. Since the bus is holding even a 
> single DDR IB port back you see no improvement with two ports.
>
>   
The HCA is in an x16 PCIe slot on each node.
However, I am trying to drive 2 ports simultaneously.

The workstation I am using is a Sun Ultra 24,
and the HCA is a Mellanox ConnectX MHGH28-XTC.
The specifications for the HCA and the Ultra 24 are:

MHGH28-XTC 
IB ports: Dual Copper 4X 20Gb/s 
Host Bus: PCIe 2.0 2.5GT/s

Ultra 24 workstation:

1333 MHz frontside bus with DDR2 memory support (up to 10.67 GB per
second bandwidth)
PCI Express Slots

    * Two full-length x16 Gen-2 slots (where the HCA has been connected to)
    * One full-length x8 slot
    * One full-length x1 slot

So the bus may not be the bottleneck.


> To fully drive a DDR IB port you need either 16x pci-express 2.5 GT/s or a 8x 
> 5 GT/s. For one QDR or two DDR you'll need even more...
>
>   

The PCIe slot in the Ultra 24 is PCI Express Gen2 x16. The data transfer
rate is 5 Gbps.

Will this be sufficient to drive the 2 DDR ports on the MHGH28-XTC?

Or are there any other possible reasons?
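
A back-of-the-envelope check may help here. The sketch below assumes
8b/10b line encoding on both PCIe Gen1/Gen2 and DDR InfiniBand links (so
80% of the signalling rate is payload) and ignores all packet and
protocol overhead, so real throughput will be lower. The function names
are illustrative, and the x8 link width for the HCA is the assumption
from the quoted reply, not a confirmed figure. A PCIe link runs at the
lower of what the slot and the card support, so the HCA's listed
2.5 GT/s host interface matters as much as the Gen2 slot.

```python
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 payload bits per 10 line bits


def pcie_gbytes_per_s(gt_per_s, lanes):
    """Theoretical one-direction PCIe Gen1/Gen2 payload rate in GB/s."""
    return gt_per_s * lanes * ENCODING_EFFICIENCY / 8


def ib_gbytes_per_s(lane_gbit_per_s, width=4, ports=1):
    """Theoretical one-direction InfiniBand payload rate in GB/s."""
    return lane_gbit_per_s * width * ports * ENCODING_EFFICIENCY / 8


DDR_LANE_GBIT = 5.0  # DDR signals at 5 Gb/s per lane, so 4X is 20 Gb/s

for gt, lanes in [(2.5, 8), (2.5, 16), (5.0, 8), (5.0, 16)]:
    print("PCIe %.1f GT/s x%-2d: %.1f GB/s"
          % (gt, lanes, pcie_gbytes_per_s(gt, lanes)))

print("one 4X DDR port:  %.1f GB/s" % ib_gbytes_per_s(DDR_LANE_GBIT))
print("two 4X DDR ports: %.1f GB/s" % ib_gbytes_per_s(DDR_LANE_GBIT, ports=2))
```

Under these assumptions a 2.5 GT/s x8 link tops out at the payload rate
of a single DDR port before any overhead, which would be consistent with
seeing no gain from the second port.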
> /Peter
>
>   
>> Does anyone have similar experience?
>>     


From andy.grover at oracle.com  Tue Feb 24 17:30:17 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:17 -0800
Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2
Message-ID: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>


Hi,

This patchset against net-next adds support for RDS sockets. RDS is an
Oracle-originated protocol used to send IPC datagrams (up to 1MB)
reliably, and is used currently in Oracle RAC and Exadata products. 
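
For readers unfamiliar with the interface, here is a rough userspace
sketch of what opening an RDS socket could look like. It relies on two
things visible in patch 01/26: sockets must be SOCK_SEQPACKET with
protocol 0, and addresses are ordinary IPv4 sockaddr_in values. The
helper name, the fallback value 21 for AF_RDS (its value in Linux's
socket.h), and the loopback address are assumptions for illustration.
The helper returns None on a machine without RDS support.

```python
import socket

# Use the constant from the socket module when the platform provides it;
# 21 is the Linux value and serves only as a fallback (assumption).
AF_RDS = getattr(socket, "AF_RDS", 21)


def open_rds_socket(local_ip, port=0):
    """Try to create and bind an RDS socket; return None if unsupported."""
    try:
        sock = socket.socket(AF_RDS, socket.SOCK_SEQPACKET, 0)
    except OSError:
        return None  # no RDS support in this kernel or Python build
    try:
        # RDS addresses are plain (IPv4 address, port) pairs.
        sock.bind((local_ip, port))
    except OSError:
        sock.close()
        return None
    return sock


sock = open_rds_socket("127.0.0.1")
if sock is None:
    print("RDS is not available on this host")
else:
    print("bound RDS socket:", sock.getsockname())
    sock.close()
```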

I've addressed all the issues from comments on take 1. (thanks!) This patchset
squashes the changes into the original changeset, but I've also included
a tree where the un-squashed changes since last time may be reviewed:
git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6.git
rds-broken-out-fixes

Major changes since last time include moving to net/rds, and the
additional inclusion of iwarp transport support.

shortlog for patchseries follows.

Thanks -- Regards -- Andy

Andy Grover (26):
      RDS: Socket interface
      RDS: Main header file
      RDS: Congestion-handling code
      RDS: Transport code
      RDS: Info and stats
      RDS: Connection handling
      RDS: loopback
      RDS: sysctls
      RDS: Message parsing
      RDS: send.c
      RDS: recv.c
      RDS: RDMA support
      RDS/IB: Infiniband transport
      RDS/IB: Ring-handling code.
      RDS/IB: Implement RDMA ops using FMRs
      RDS/IB: Implement IB-specific datagram send.
      RDS/IB: Receive datagrams via IB
      RDS/IB: Stats and sysctls
      RDS: Add iWARP support
      RDS: Common RDMA transport code
      RDS: Documentation
      RDS: Kconfig and Makefile
      RDS: Add AF and PF #defines for RDS sockets
      RDS: Add MAINTAINERS entry
      RDS: Add userspace header
      RDS: Add RDS to AF key strings

 Documentation/networking/rds.txt |  356 ++++++++++++++
 MAINTAINERS                      |    6 +
 include/linux/rds.h              |  250 ++++++++++
 include/linux/socket.h           |    5 +-
 net/Kconfig                      |    1 +
 net/Makefile                     |    1 +
 net/core/sock.c                  |    6 +-
 net/rds/Kconfig                  |   13 +
 net/rds/Makefile                 |   14 +
 net/rds/af_rds.c                 |  586 ++++++++++++++++++++++
 net/rds/bind.c                   |  199 ++++++++
 net/rds/cong.c                   |  402 +++++++++++++++
 net/rds/connection.c             |  487 ++++++++++++++++++
 net/rds/ib.c                     |  323 ++++++++++++
 net/rds/ib.h                     |  367 ++++++++++++++
 net/rds/ib_cm.c                  |  726 +++++++++++++++++++++++++++
 net/rds/ib_rdma.c                |  641 ++++++++++++++++++++++++
 net/rds/ib_recv.c                |  869 +++++++++++++++++++++++++++++++++
 net/rds/ib_ring.c                |  168 +++++++
 net/rds/ib_send.c                |  874 +++++++++++++++++++++++++++++++++
 net/rds/ib_stats.c               |   95 ++++
 net/rds/ib_sysctl.c              |  137 ++++++
 net/rds/info.c                   |  241 +++++++++
 net/rds/info.h                   |   30 ++
 net/rds/iw.c                     |  333 +++++++++++++
 net/rds/iw.h                     |  395 +++++++++++++++
 net/rds/iw_cm.c                  |  750 ++++++++++++++++++++++++++++
 net/rds/iw_rdma.c                |  888 +++++++++++++++++++++++++++++++++
 net/rds/iw_recv.c                |  869 +++++++++++++++++++++++++++++++++
 net/rds/iw_ring.c                |  169 +++++++
 net/rds/iw_send.c                |  975 ++++++++++++++++++++++++++++++++++++
 net/rds/iw_stats.c               |   95 ++++
 net/rds/iw_sysctl.c              |  137 ++++++
 net/rds/loop.c                   |  188 +++++++
 net/rds/loop.h                   |    9 +
 net/rds/message.c                |  402 +++++++++++++++
 net/rds/page.c                   |  221 +++++++++
 net/rds/rdma.c                   |  679 ++++++++++++++++++++++++++
 net/rds/rdma.h                   |   84 ++++
 net/rds/rdma_transport.c         |  214 ++++++++
 net/rds/rdma_transport.h         |   28 +
 net/rds/rds.h                    |  686 ++++++++++++++++++++++++++
 net/rds/recv.c                   |  542 ++++++++++++++++++++
 net/rds/send.c                   | 1003 ++++++++++++++++++++++++++++++++++++++
 net/rds/stats.c                  |  148 ++++++
 net/rds/sysctl.c                 |  122 +++++
 net/rds/threads.c                |  265 ++++++++++
 net/rds/transport.c              |  117 +++++
 48 files changed, 16112 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/networking/rds.txt
 create mode 100644 include/linux/rds.h
 create mode 100644 net/rds/Kconfig
 create mode 100644 net/rds/Makefile
 create mode 100644 net/rds/af_rds.c
 create mode 100644 net/rds/bind.c
 create mode 100644 net/rds/cong.c
 create mode 100644 net/rds/connection.c
 create mode 100644 net/rds/ib.c
 create mode 100644 net/rds/ib.h
 create mode 100644 net/rds/ib_cm.c
 create mode 100644 net/rds/ib_rdma.c
 create mode 100644 net/rds/ib_recv.c
 create mode 100644 net/rds/ib_ring.c
 create mode 100644 net/rds/ib_send.c
 create mode 100644 net/rds/ib_stats.c
 create mode 100644 net/rds/ib_sysctl.c
 create mode 100644 net/rds/info.c
 create mode 100644 net/rds/info.h
 create mode 100644 net/rds/iw.c
 create mode 100644 net/rds/iw.h
 create mode 100644 net/rds/iw_cm.c
 create mode 100644 net/rds/iw_rdma.c
 create mode 100644 net/rds/iw_recv.c
 create mode 100644 net/rds/iw_ring.c
 create mode 100644 net/rds/iw_send.c
 create mode 100644 net/rds/iw_stats.c
 create mode 100644 net/rds/iw_sysctl.c
 create mode 100644 net/rds/loop.c
 create mode 100644 net/rds/loop.h
 create mode 100644 net/rds/message.c
 create mode 100644 net/rds/page.c
 create mode 100644 net/rds/rdma.c
 create mode 100644 net/rds/rdma.h
 create mode 100644 net/rds/rdma_transport.c
 create mode 100644 net/rds/rdma_transport.h
 create mode 100644 net/rds/rds.h
 create mode 100644 net/rds/recv.c
 create mode 100644 net/rds/send.c
 create mode 100644 net/rds/stats.c
 create mode 100644 net/rds/sysctl.c
 create mode 100644 net/rds/threads.c
 create mode 100644 net/rds/transport.c

end


From andy.grover at oracle.com  Tue Feb 24 17:30:18 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:18 -0800
Subject: [ofa-general] [PATCH 01/26] RDS: Socket interface
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-2-git-send-email-andy.grover@oracle.com>

Implement the RDS (Reliable Datagram Sockets) interface.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/af_rds.c |  586 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/bind.c   |  199 ++++++++++++++++++
 2 files changed, 785 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/af_rds.c
 create mode 100644 net/rds/bind.c

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
new file mode 100644
index 0000000..20cf16f
--- /dev/null
+++ b/net/rds/af_rds.c
@@ -0,0 +1,586 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/version.h>
+#include <net/sock.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "rdma_transport.h"
+
+/* this is just used for stats gathering :/ */
+static DEFINE_SPINLOCK(rds_sock_lock);
+static unsigned long rds_sock_count;
+static LIST_HEAD(rds_sock_list);
+DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq);
+
+/*
+ * This is called as the final descriptor referencing this socket is closed.
+ * We have to unbind the socket so that another socket can be bound to the
+ * address it was using.
+ *
+ * We have to be careful about racing with the incoming path.  sock_orphan()
+ * sets SOCK_DEAD and we use that as an indicator to the rx path that new
+ * messages shouldn't be queued.
+ */
+static int rds_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	if (sk == NULL)
+		goto out;
+
+	rs = rds_sk_to_rs(sk);
+
+	sock_orphan(sk);
+	/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
+	 * that ensures the recv path has completed messing
+	 * with the socket. */
+	rds_clear_recv_queue(rs);
+	rds_cong_remove_socket(rs);
+	rds_remove_bound(rs);
+	rds_send_drop_to(rs, NULL);
+	rds_rdma_drop_keys(rs);
+	rds_notify_queue_get(rs, NULL);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+	list_del_init(&rs->rs_item);
+	rds_sock_count--;
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	sock->sk = NULL;
+	sock_put(sk);
+out:
+	return 0;
+}
+
+/*
+ * Careful not to race with rds_release -> sock_orphan which clears sk_sleep.
+ * _bh() isn't OK here, we're called from interrupt handlers.  It's probably OK
+ * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but
+ * this seems more conservative.
+ * NB - normally, one would use sk_callback_lock for this, but we can
+ * get here from interrupts, whereas the network code grabs sk_callback_lock
+ * with _lock_bh only - so relying on sk_callback_lock introduces livelocks.
+ */
+void rds_wake_sk_sleep(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	read_lock_irqsave(&rs->rs_recv_lock, flags);
+	__rds_wake_sk_sleep(rds_rs_to_sk(rs));
+	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+}
+
+static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
+		       int *uaddr_len, int peer)
+{
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+
+	memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+
+	/* racey, don't care */
+	if (peer) {
+		if (!rs->rs_conn_addr)
+			return -ENOTCONN;
+
+		sin->sin_port = rs->rs_conn_port;
+		sin->sin_addr.s_addr = rs->rs_conn_addr;
+	} else {
+		sin->sin_port = rs->rs_bound_port;
+		sin->sin_addr.s_addr = rs->rs_bound_addr;
+	}
+
+	sin->sin_family = AF_INET;
+
+	*uaddr_len = sizeof(*sin);
+	return 0;
+}
+
+/*
+ * RDS' poll is without a doubt the least intuitive part of the interface,
+ * as POLLIN and POLLOUT do not behave entirely as you would expect from
+ * a network protocol.
+ *
+ * POLLIN is asserted if
+ *  -	there is data on the receive queue.
+ *  -	to signal that a previously congested destination may have become
+ *	uncongested
+ *  -	A notification has been queued to the socket (this can be a congestion
+ *	update, or a RDMA completion).
+ *
+ * POLLOUT is asserted if there is room on the send queue. This does not mean
+ * however, that the next sendmsg() call will succeed. If the application tries
+ * to send to a congested destination, the system call may still fail (and
+ * return ENOBUFS).
+ */
+static unsigned int rds_poll(struct file *file, struct socket *sock,
+			     poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	unsigned int mask = 0;
+	unsigned long flags;
+
+	poll_wait(file, sk->sk_sleep, wait);
+
+	poll_wait(file, &rds_poll_waitq, wait);
+
+	read_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!rs->rs_cong_monitor) {
+		/* When a congestion map was updated, we signal POLLIN for
+		 * "historical" reasons. Applications can also poll for
+		 * WRBAND instead. */
+		if (rds_cong_updated_since(&rs->rs_cong_track))
+			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
+	} else {
+		spin_lock(&rs->rs_lock);
+		if (rs->rs_cong_notify)
+			mask |= (POLLIN | POLLRDNORM);
+		spin_unlock(&rs->rs_lock);
+	}
+	if (!list_empty(&rs->rs_recv_queue)
+	 || !list_empty(&rs->rs_notify_queue))
+		mask |= (POLLIN | POLLRDNORM);
+	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
+		mask |= (POLLOUT | POLLWRNORM);
+	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+	return mask;
+}
+
+static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+	return -ENOIOCTLCMD;
+}
+
+static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
+			      int len)
+{
+	struct sockaddr_in sin;
+	int ret = 0;
+
+	/* racing with another thread binding seems ok here */
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (len < sizeof(struct sockaddr_in)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (copy_from_user(&sin, optval, sizeof(sin))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	rds_send_drop_to(rs, &sin);
+out:
+	return ret;
+}
+
+static int rds_set_bool_option(unsigned char *optvar, char __user *optval,
+			       int optlen)
+{
+	int value;
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+	if (get_user(value, (int __user *) optval))
+		return -EFAULT;
+	*optvar = !!value;
+	return 0;
+}
+
+static int rds_cong_monitor(struct rds_sock *rs, char __user *optval,
+			    int optlen)
+{
+	int ret;
+
+	ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen);
+	if (ret == 0) {
+		if (rs->rs_cong_monitor) {
+			rds_cong_add_socket(rs);
+		} else {
+			rds_cong_remove_socket(rs);
+			rs->rs_cong_mask = 0;
+			rs->rs_cong_notify = 0;
+		}
+	}
+	return ret;
+}
+
+static int rds_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int optlen)
+{
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	int ret;
+
+	if (level != SOL_RDS) {
+		ret = -ENOPROTOOPT;
+		goto out;
+	}
+
+	switch (optname) {
+	case RDS_CANCEL_SENT_TO:
+		ret = rds_cancel_sent_to(rs, optval, optlen);
+		break;
+	case RDS_GET_MR:
+		ret = rds_get_mr(rs, optval, optlen);
+		break;
+	case RDS_FREE_MR:
+		ret = rds_free_mr(rs, optval, optlen);
+		break;
+	case RDS_RECVERR:
+		ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen);
+		break;
+	case RDS_CONG_MONITOR:
+		ret = rds_cong_monitor(rs, optval, optlen);
+		break;
+	default:
+		ret = -ENOPROTOOPT;
+	}
+out:
+	return ret;
+}
+
+static int rds_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	int ret = -ENOPROTOOPT, len;
+
+	if (level != SOL_RDS)
+		goto out;
+
+	if (get_user(len, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (optname) {
+	case RDS_INFO_FIRST ... RDS_INFO_LAST:
+		ret = rds_info_getsockopt(sock, optname, optval,
+					  optlen);
+		break;
+
+	case RDS_RECVERR:
+		if (len < sizeof(int))
+			ret = -EINVAL;
+		else
+		if (put_user(rs->rs_recverr, (int __user *) optval)
+		 || put_user(sizeof(int), optlen))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	default:
+		break;
+	}
+
+out:
+	return ret;
+
+}
+
+static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
+		       int addr_len, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	int ret = 0;
+
+	lock_sock(sk);
+
+	if (addr_len != sizeof(struct sockaddr_in)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (sin->sin_family != AF_INET) {
+		ret = -EAFNOSUPPORT;
+		goto out;
+	}
+
+	if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+		ret = -EDESTADDRREQ;
+		goto out;
+	}
+
+	rs->rs_conn_addr = sin->sin_addr.s_addr;
+	rs->rs_conn_port = sin->sin_port;
+
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static struct proto rds_proto = {
+	.name	  = "RDS",
+	.owner	  = THIS_MODULE,
+	.obj_size = sizeof(struct rds_sock),
+};
+
+static struct proto_ops rds_proto_ops = {
+	.family =	AF_RDS,
+	.owner =	THIS_MODULE,
+	.release =	rds_release,
+	.bind =		rds_bind,
+	.connect =	rds_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	rds_getname,
+	.poll =		rds_poll,
+	.ioctl =	rds_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	rds_setsockopt,
+	.getsockopt =	rds_getsockopt,
+	.sendmsg =	rds_sendmsg,
+	.recvmsg =	rds_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
+{
+	unsigned long flags;
+	struct rds_sock *rs;
+
+	sock_init_data(sock, sk);
+	sock->ops		= &rds_proto_ops;
+	sk->sk_protocol		= protocol;
+
+	rs = rds_sk_to_rs(sk);
+	spin_lock_init(&rs->rs_lock);
+	rwlock_init(&rs->rs_recv_lock);
+	INIT_LIST_HEAD(&rs->rs_send_queue);
+	INIT_LIST_HEAD(&rs->rs_recv_queue);
+	INIT_LIST_HEAD(&rs->rs_notify_queue);
+	INIT_LIST_HEAD(&rs->rs_cong_list);
+	spin_lock_init(&rs->rs_rdma_lock);
+	rs->rs_rdma_keys = RB_ROOT;
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+	list_add_tail(&rs->rs_item, &rds_sock_list);
+	rds_sock_count++;
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	return 0;
+}
+
+static int rds_create(struct net *net, struct socket *sock, int protocol)
+{
+	struct sock *sk;
+
+	if (sock->type != SOCK_SEQPACKET || protocol)
+		return -ESOCKTNOSUPPORT;
+
+	sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto);
+	if (!sk)
+		return -ENOMEM;
+
+	return __rds_create(sock, sk, protocol);
+}
+
+void rds_sock_addref(struct rds_sock *rs)
+{
+	sock_hold(rds_rs_to_sk(rs));
+}
+
+void rds_sock_put(struct rds_sock *rs)
+{
+	sock_put(rds_rs_to_sk(rs));
+}
+
+static struct net_proto_family rds_family_ops = {
+	.family =	AF_RDS,
+	.create =	rds_create,
+	.owner	=	THIS_MODULE,
+};
+
+static void rds_sock_inc_info(struct socket *sock, unsigned int len,
+			      struct rds_info_iterator *iter,
+			      struct rds_info_lengths *lens)
+{
+	struct rds_sock *rs;
+	struct sock *sk;
+	struct rds_incoming *inc;
+	unsigned long flags;
+	unsigned int total = 0;
+
+	len /= sizeof(struct rds_info_message);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sk = rds_rs_to_sk(rs);
+		read_lock(&rs->rs_recv_lock);
+
+		/* XXX too lazy to maintain counts.. */
+		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
+			total++;
+			if (total <= len)
+				rds_inc_info_copy(inc, iter, inc->i_saddr,
+						  rs->rs_bound_addr, 1);
+		}
+
+		read_unlock(&rs->rs_recv_lock);
+	}
+
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds_info_message);
+}
+
+static void rds_sock_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens)
+{
+	struct rds_info_socket sinfo;
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	len /= sizeof(struct rds_info_socket);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+
+	if (len < rds_sock_count)
+		goto out;
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sinfo.sndbuf = rds_sk_sndbuf(rs);
+		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
+		sinfo.bound_addr = rs->rs_bound_addr;
+		sinfo.connected_addr = rs->rs_conn_addr;
+		sinfo.bound_port = rs->rs_bound_port;
+		sinfo.connected_port = rs->rs_conn_port;
+		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
+
+		rds_info_copy(iter, &sinfo, sizeof(sinfo));
+	}
+
+out:
+	lens->nr = rds_sock_count;
+	lens->each = sizeof(struct rds_info_socket);
+
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+}
+
+static void __exit rds_exit(void)
+{
+	rds_rdma_exit();
+	sock_unregister(rds_family_ops.family);
+	proto_unregister(&rds_proto);
+	rds_conn_exit();
+	rds_cong_exit();
+	rds_sysctl_exit();
+	rds_threads_exit();
+	rds_stats_exit();
+	rds_page_exit();
+	rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
+	rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+}
+module_exit(rds_exit);
+
+static int __init rds_init(void)
+{
+	int ret;
+
+	ret = rds_conn_init();
+	if (ret)
+		goto out;
+	ret = rds_threads_init();
+	if (ret)
+		goto out_conn;
+	ret = rds_sysctl_init();
+	if (ret)
+		goto out_threads;
+	ret = rds_stats_init();
+	if (ret)
+		goto out_sysctl;
+	ret = proto_register(&rds_proto, 1);
+	if (ret)
+		goto out_stats;
+	ret = sock_register(&rds_family_ops);
+	if (ret)
+		goto out_proto;
+
+	rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
+	rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+
+	/* ib/iwarp transports currently compiled-in */
+	ret = rds_rdma_init();
+	if (ret)
+		goto out_sock;
+	goto out;
+
+out_sock:
+	sock_unregister(rds_family_ops.family);
+out_proto:
+	proto_unregister(&rds_proto);
+out_stats:
+	rds_stats_exit();
+out_sysctl:
+	rds_sysctl_exit();
+out_threads:
+	rds_threads_exit();
+out_conn:
+	rds_conn_exit();
+	rds_cong_exit();
+	rds_page_exit();
+out:
+	return ret;
+}
+module_init(rds_init);
+
+#define DRV_VERSION     "4.0"
+#define DRV_RELDATE     "Feb 12, 2009"
+
+MODULE_AUTHOR("Oracle Corporation <rds-devel@oss.oracle.com>");
+MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets"
+		   " v" DRV_VERSION " (" DRV_RELDATE ")");
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_ALIAS_NETPROTO(PF_RDS);
diff --git a/net/rds/bind.c b/net/rds/bind.c
new file mode 100644
index 0000000..c17cc39
--- /dev/null
+++ b/net/rds/bind.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+#include <linux/if_arp.h>
+#include "rds.h"
+
+/*
+ * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't
+ * particularly zippy.
+ *
+ * This is now called for every incoming frame so we arguably care much more
+ * about it than we used to.
+ */
+static DEFINE_SPINLOCK(rds_bind_lock);
+static struct rb_root rds_bind_tree = RB_ROOT;
+
+static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port,
+					   struct rds_sock *insert)
+{
+	struct rb_node **p = &rds_bind_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_sock *rs;
+	u64 cmp;
+	u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
+
+	while (*p) {
+		parent = *p;
+		rs = rb_entry(parent, struct rds_sock, rs_bound_node);
+
+		cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
+		      be16_to_cpu(rs->rs_bound_port);
+
+		if (needle < cmp)
+			p = &(*p)->rb_left;
+		else if (needle > cmp)
+			p = &(*p)->rb_right;
+		else
+			return rs;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->rs_bound_node, parent, p);
+		rb_insert_color(&insert->rs_bound_node, &rds_bind_tree);
+	}
+	return NULL;
+}
+
+/*
+ * Return the rds_sock bound at the given local address.
+ *
+ * The rx path can race with rds_release.  We notice if rds_release() has
+ * marked this socket and don't return a rs ref to the rx path.
+ */
+struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
+{
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+	rs = rds_bind_tree_walk(addr, port, NULL);
+	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
+		rds_sock_addref(rs);
+	else
+		rs = NULL;
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+
+	rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
+		ntohs(port));
+	return rs;
+}
+
+/* returns 0 on success (with *port set to the bound port) or -ve errno */
+static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
+{
+	unsigned long flags;
+	int ret = -EADDRINUSE;
+	u16 rover, last;
+
+	if (*port != 0) {
+		rover = be16_to_cpu(*port);
+		last = rover;
+	} else {
+		rover = max_t(u16, net_random(), 2);
+		last = rover - 1;
+	}
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+
+	do {
+		if (rover == 0)
+			rover++;
+		if (rds_bind_tree_walk(addr, cpu_to_be16(rover), rs) == NULL) {
+			*port = cpu_to_be16(rover);
+			ret = 0;
+			break;
+		}
+	} while (rover++ != last);
+
+	if (ret == 0)  {
+		rs->rs_bound_addr = addr;
+		rs->rs_bound_port = *port;
+		rds_sock_addref(rs);
+
+		rdsdebug("rs %p binding to %pI4:%d\n",
+		  rs, &addr, (int)ntohs(*port));
+	}
+
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+
+	return ret;
+}
+
+void rds_remove_bound(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+
+	if (rs->rs_bound_addr) {
+		rdsdebug("rs %p unbinding from %pI4:%d\n",
+		  rs, &rs->rs_bound_addr,
+		  ntohs(rs->rs_bound_port));
+
+		rb_erase(&rs->rs_bound_node, &rds_bind_tree);
+		rds_sock_put(rs);
+		rs->rs_bound_addr = 0;
+	}
+
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+}
+
+int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct rds_transport *trans;
+	int ret = 0;
+
+	lock_sock(sk);
+
+	if (addr_len != sizeof(struct sockaddr_in) ||
+	    sin->sin_family != AF_INET ||
+	    rs->rs_bound_addr ||
+	    sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port);
+	if (ret)
+		goto out;
+
+	trans = rds_trans_get_preferred(sin->sin_addr.s_addr);
+	if (trans == NULL) {
+		ret = -EADDRNOTAVAIL;
+		rds_remove_bound(rs);
+		goto out;
+	}
+
+	rs->rs_transport = trans;
+	ret = 0;
+
+out:
+	release_sock(sk);
+	return ret;
+}
-- 
1.5.6.3
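
Note on the bind.c lookup above: rds_bind_tree_walk() orders sockets by a
single 64-bit key (address in the upper 32 bits, port in the low 16), and
rds_add_bound() walks a 16-bit "rover" over candidate ports, skipping 0.
The following is a minimal userspace sketch of those two ideas, not the
kernel code. The names bind_key(), pick_port() and port_taken[] are
hypothetical, and the table stands in for the rb-tree lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the ordering key from host-order values,
 * address in the upper 32 bits and port in the low 16 bits. */
static uint64_t bind_key(uint32_t addr_host, uint16_t port_host)
{
	return ((uint64_t)addr_host << 32) | port_host;
}

/* Hypothetical stand-in for the tree: non-zero means the port is bound. */
static unsigned char port_taken[65536];

/* Mirror of the autobind rover loop: start at 'start' (clamped to at
 * least 2, as the patch does with max_t), skip port 0 on wrap-around,
 * and return the first free port, or -1 once every port was tried. */
static int pick_port(uint16_t start)
{
	uint16_t rover, last;

	if (start < 2)
		start = 2;
	rover = start;
	last = (uint16_t)(start - 1);

	do {
		if (rover == 0)
			rover++;
		assert(rover != 0);	/* port 0 is never handed out */
		if (!port_taken[rover])
			return rover;
	} while (rover++ != last);

	return -1;
}
```

Clamping the start to 2 keeps 'last' at 1 or above, so the loop always
reaches its stop value even though port 0 is skipped on wrap-around.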


From andy.grover at oracle.com  Tue Feb 24 17:30:19 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:19 -0800
Subject: [ofa-general] [PATCH 02/26] RDS: Main header file
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-3-git-send-email-andy.grover@oracle.com>

RDS's main data structure definitions and exported functions.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/rds.h |  686 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 686 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/rds.h
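
Note for readers of the header below: the protocol version is packed as
major in the high byte and minor in the low byte, and payloads are cut
into 4096-byte fragments (RDS_FRAG_SHIFT of 12). A minimal userspace
sketch of that arithmetic follows. The names PROTO*, proto_pack() and
frags_for() are hypothetical, and combining the header's ceil() macro
with the fragment size this way is an assumption about how the two are
used, not something this patch shows.

```c
#include <assert.h>

/* Same encoding as the RDS_PROTOCOL* macros, with every macro argument
 * parenthesized so that expression arguments expand safely. */
#define PROTO(maj, min)	(((maj) << 8) | (min))
#define PROTO_MAJOR(v)	((v) >> 8)
#define PROTO_MINOR(v)	((v) & 255)

#define FRAG_SHIFT	12
#define FRAG_SIZE	(1u << FRAG_SHIFT)

/* Pack a version, checking that each part fits in one byte. */
static unsigned int proto_pack(unsigned int maj, unsigned int min)
{
	assert(maj <= 255 && min <= 255);
	return PROTO(maj, min);
}

/* Number of fragments needed for 'bytes' of payload, rounding up. */
static unsigned long frags_for(unsigned long bytes)
{
	return (bytes + FRAG_SIZE - 1) / FRAG_SIZE;
}
```

With this encoding, version 3.1 packs to 0x0301, matching the value of
RDS_PROTOCOL_3_1 in the header.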

diff --git a/net/rds/rds.h b/net/rds/rds.h
new file mode 100644
index 0000000..0604007
--- /dev/null
+++ b/net/rds/rds.h
@@ -0,0 +1,686 @@
+#ifndef _RDS_RDS_H
+#define _RDS_RDS_H
+
+#include <net/sock.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <rdma/rdma_cm.h>
+#include <linux/mutex.h>
+#include <linux/rds.h>
+
+#include "info.h"
+
+/*
+ * RDS Network protocol version
+ */
+#define RDS_PROTOCOL_3_0	0x0300
+#define RDS_PROTOCOL_3_1	0x0301
+#define RDS_PROTOCOL_VERSION	RDS_PROTOCOL_3_1
+#define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
+#define RDS_PROTOCOL_MINOR(v)	((v) & 255)
+#define RDS_PROTOCOL(maj, min)	(((maj) << 8) | (min))
+
+/*
+ * XXX randomly chosen, but at least seems to be unused:
+ * #               18464-18768 Unassigned
+ * We should do better.  We want a reserved port to discourage unpriv'ed
+ * userspace from listening.
+ */
+#define RDS_PORT	18634
+
+#ifdef DEBUG
+#define rdsdebug(fmt, args...) pr_debug("%s(): " fmt, __func__ , ##args)
+#else
+/* sigh, pr_debug() causes unused variable warnings */
+static inline void __attribute__ ((format (printf, 1, 2)))
+rdsdebug(char *fmt, ...)
+{
+}
+#endif
+
+/* XXX is there one of these somewhere? */
+#define ceil(x, y) \
+	({ unsigned long __x = (x), __y = (y); (__x + __y - 1) / __y; })
+
+#define RDS_FRAG_SHIFT	12
+#define RDS_FRAG_SIZE	((unsigned int)(1 << RDS_FRAG_SHIFT))
+
+#define RDS_CONG_MAP_BYTES	(65536 / 8)
+#define RDS_CONG_MAP_LONGS	(RDS_CONG_MAP_BYTES / sizeof(unsigned long))
+#define RDS_CONG_MAP_PAGES	(PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE)
+#define RDS_CONG_MAP_PAGE_BITS	(PAGE_SIZE * 8)
+
+struct rds_cong_map {
+	struct rb_node		m_rb_node;
+	__be32			m_addr;
+	wait_queue_head_t	m_waitq;
+	struct list_head	m_conn_list;
+	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
+};
+
+
+/*
+ * This is how we will track the connection state:
+ * A connection is always in one of the following
+ * states. Updates to the state are atomic and imply
+ * a memory barrier.
+ */
+enum {
+	RDS_CONN_DOWN = 0,
+	RDS_CONN_CONNECTING,
+	RDS_CONN_DISCONNECTING,
+	RDS_CONN_UP,
+	RDS_CONN_ERROR,
+};
+
+/* Bits for c_flags */
+#define RDS_LL_SEND_FULL	0
+#define RDS_RECONNECT_PENDING	1
+
+struct rds_connection {
+	struct hlist_node	c_hash_node;
+	__be32			c_laddr;
+	__be32			c_faddr;
+	unsigned int		c_loopback:1;
+	struct rds_connection	*c_passive;
+
+	struct rds_cong_map	*c_lcong;
+	struct rds_cong_map	*c_fcong;
+
+	struct mutex		c_send_lock;	/* protect send ring */
+	struct rds_message	*c_xmit_rm;
+	unsigned long		c_xmit_sg;
+	unsigned int		c_xmit_hdr_off;
+	unsigned int		c_xmit_data_off;
+	unsigned int		c_xmit_rdma_sent;
+
+	spinlock_t		c_lock;		/* protect msg queues */
+	u64			c_next_tx_seq;
+	struct list_head	c_send_queue;
+	struct list_head	c_retrans;
+
+	u64			c_next_rx_seq;
+
+	struct rds_transport	*c_trans;
+	void			*c_transport_data;
+
+	atomic_t		c_state;
+	unsigned long		c_flags;
+	unsigned long		c_reconnect_jiffies;
+	struct delayed_work	c_send_w;
+	struct delayed_work	c_recv_w;
+	struct delayed_work	c_conn_w;
+	struct work_struct	c_down_w;
+	struct mutex		c_cm_lock;	/* protect conn state & cm */
+
+	struct list_head	c_map_item;
+	unsigned long		c_map_queued;
+	unsigned long		c_map_offset;
+	unsigned long		c_map_bytes;
+
+	unsigned int		c_unacked_packets;
+	unsigned int		c_unacked_bytes;
+
+	/* Protocol version */
+	unsigned int		c_version;
+};
+
+#define RDS_FLAG_CONG_BITMAP	0x01
+#define RDS_FLAG_ACK_REQUIRED	0x02
+#define RDS_FLAG_RETRANSMITTED	0x04
+#define RDS_MAX_ADV_CREDIT	127
+
+/*
+ * Maximum space available for extension headers.
+ */
+#define RDS_HEADER_EXT_SPACE	16
+
+struct rds_header {
+	__be64	h_sequence;
+	__be64	h_ack;
+	__be32	h_len;
+	__be16	h_sport;
+	__be16	h_dport;
+	u8	h_flags;
+	u8	h_credit;
+	u8	h_padding[4];
+	__sum16	h_csum;
+
+	u8	h_exthdr[RDS_HEADER_EXT_SPACE];
+};
+
+/*
+ * Reserved - indicates end of extensions
+ */
+#define RDS_EXTHDR_NONE		0
+
+/*
+ * This extension header is included in the very
+ * first message that is sent on a new connection,
+ * and identifies the protocol level. This will help
+ * rolling updates if a future change requires breaking
+ * the protocol.
+ * NB: This is no longer true for IB, where we do a version
+ * negotiation during the connection setup phase (protocol
+ * version information is included in the RDMA CM private data).
+ */
+#define RDS_EXTHDR_VERSION	1
+struct rds_ext_header_version {
+	__be32			h_version;
+};
+
+/*
+ * This extension header is included in the RDS message
+ * chasing an RDMA operation.
+ */
+#define RDS_EXTHDR_RDMA		2
+struct rds_ext_header_rdma {
+	__be32			h_rdma_rkey;
+};
+
+/*
+ * This extension header tells the peer about the
+ * destination <R_Key,offset> of the requested RDMA
+ * operation.
+ */
+#define RDS_EXTHDR_RDMA_DEST	3
+struct rds_ext_header_rdma_dest {
+	__be32			h_rdma_rkey;
+	__be32			h_rdma_offset;
+};
+
+#define __RDS_EXTHDR_MAX	16 /* for now */
+
+struct rds_incoming {
+	atomic_t		i_refcount;
+	struct list_head	i_item;
+	struct rds_connection	*i_conn;
+	struct rds_header	i_hdr;
+	unsigned long		i_rx_jiffies;
+	__be32			i_saddr;
+
+	rds_rdma_cookie_t	i_rdma_cookie;
+};
+
+/*
+ * m_sock_item and m_conn_item are on lists that are serialized under
+ * conn->c_lock.  m_sock_item has additional meaning in that once it is empty
+ * the message will not be put back on the retransmit list after being sent.
+ * messages that are canceled while being sent rely on this.
+ *
+ * m_inc is used by loopback so that it can pass an incoming message straight
+ * back up into the rx path.  It embeds a wire header which is also used by
+ * the send path, which is kind of awkward.
+ *
+ * m_sock_item indicates the message's presence on a socket's send or receive
+ * queue.  m_rs will point to that socket.
+ *
+ * m_daddr is used by cancellation to prune messages to a given destination.
+ *
+ * The RDS_MSG_ON_SOCK and RDS_MSG_ON_CONN flags are used to avoid lock
+ * nesting.  As paths iterate over messages on a sock, or conn, they must
+ * also lock the conn, or sock, to remove the message from those lists too.
+ * Testing the flag to determine if the message is still on the lists lets
+ * us avoid testing the list_head directly.  That means each path can use
+ * the message's list_head to keep it on a local list while juggling locks
+ * without confusing the other path.
+ *
+ * m_ack_seq is an optional field set by transports who need a different
+ * sequence number range to invalidate.  They can use this in a callback
+ * that they pass to rds_send_drop_acked() to see if each message has been
+ * acked.  The HAS_ACK_SEQ flag can be used to detect messages which haven't
+ * had ack_seq set yet.
+ */
+#define RDS_MSG_ON_SOCK		1
+#define RDS_MSG_ON_CONN		2
+#define RDS_MSG_HAS_ACK_SEQ	3
+#define RDS_MSG_ACK_REQUIRED	4
+#define RDS_MSG_RETRANSMITTED	5
+#define RDS_MSG_MAPPED		6
+#define RDS_MSG_PAGEVEC		7
+
+struct rds_message {
+	atomic_t		m_refcount;
+	struct list_head	m_sock_item;
+	struct list_head	m_conn_item;
+	struct rds_incoming	m_inc;
+	u64			m_ack_seq;
+	__be32			m_daddr;
+	unsigned long		m_flags;
+
+	/* Never access m_rs without holding m_rs_lock.
+	 * Lock nesting is
+	 *  rm->m_rs_lock
+	 *   -> rs->rs_lock
+	 */
+	spinlock_t		m_rs_lock;
+	struct rds_sock		*m_rs;
+	struct rds_rdma_op	*m_rdma_op;
+	rds_rdma_cookie_t	m_rdma_cookie;
+	struct rds_mr		*m_rdma_mr;
+	unsigned int		m_nents;
+	unsigned int		m_count;
+	struct scatterlist	m_sg[0];
+};
+
+/*
+ * The RDS notifier is used (optionally) to tell the application about
+ * completed RDMA operations. Rather than keeping the whole rds message
+ * around on the queue, we allocate a small notifier that is put on the
+ * socket's notifier_list. Notifications are delivered to the application
+ * through control messages.
+ */
+struct rds_notifier {
+	struct list_head	n_list;
+	uint64_t		n_user_token;
+	int			n_status;
+};
+
+/**
+ * struct rds_transport -  transport specific behavioural hooks
+ *
+ * @xmit: .xmit is called by rds_send_xmit() to tell the transport to send
+ *        part of a message.  The caller serializes on the send_sem so this
+ *        doesn't need to be reentrant for a given conn.  The header must be
+ *        sent before the data payload.  .xmit must be prepared to send a
+ *        message with no data payload.  .xmit should return the number of
+ *        bytes that were sent down the connection, including header bytes.
+ *        Returning 0 tells the caller that it doesn't need to perform any
+ *        additional work now.  This is usually the case when the transport has
+ *        filled the sending queue for its connection and will handle
+ *        triggering the rds thread to continue the send when space becomes
+ *        available.  Returning -EAGAIN tells the caller to retry the send
+ *        immediately.  Returning -ENOMEM tells the caller to retry the send at
+ *        some point in the future.
+ *
+ * @conn_shutdown: conn_shutdown stops traffic on the given connection.  Once
+ *                 it returns the connection can not call rds_recv_incoming().
+ *                 This will only be called once after conn_connect returns
+ *                 successfully.  The caller serializes this with the send
+ *                 and connecting paths (xmit_* and conn_*).  The transport
+ *                 is responsible for other serialization, including
+ *                 rds_recv_incoming().  This is called in process context but
+ *                 should try hard not to block.
+ *
+ * @xmit_cong_map: This asks the transport to send the local bitmap down the
+ * 		   given connection.  XXX get a better story about the bitmap
+ * 		   flag and header.
+ */
+
+struct rds_transport {
+	char			t_name[TRANSNAMSIZ];
+	struct list_head	t_item;
+	struct module		*t_owner;
+	unsigned int		t_prefer_loopback:1;
+
+	int (*laddr_check)(__be32 addr);
+	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
+	void (*conn_free)(void *data);
+	int (*conn_connect)(struct rds_connection *conn);
+	void (*conn_shutdown)(struct rds_connection *conn);
+	void (*xmit_prepare)(struct rds_connection *conn);
+	void (*xmit_complete)(struct rds_connection *conn);
+	int (*xmit)(struct rds_connection *conn, struct rds_message *rm,
+		    unsigned int hdr_off, unsigned int sg, unsigned int off);
+	int (*xmit_cong_map)(struct rds_connection *conn,
+			     struct rds_cong_map *map, unsigned long offset);
+	int (*xmit_rdma)(struct rds_connection *conn, struct rds_rdma_op *op);
+	int (*recv)(struct rds_connection *conn);
+	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iovec *iov,
+				size_t size);
+	void (*inc_purge)(struct rds_incoming *inc);
+	void (*inc_free)(struct rds_incoming *inc);
+
+	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
+				 struct rdma_cm_event *event);
+	int (*cm_initiate_connect)(struct rdma_cm_id *cm_id);
+	void (*cm_connect_complete)(struct rds_connection *conn,
+				    struct rdma_cm_event *event);
+
+	unsigned int (*stats_info_copy)(struct rds_info_iterator *iter,
+					unsigned int avail);
+	void (*exit)(void);
+	void *(*get_mr)(struct scatterlist *sg, unsigned long nr_sg,
+			struct rds_sock *rs, u32 *key_ret);
+	void (*sync_mr)(void *trans_private, int direction);
+	void (*free_mr)(void *trans_private, int invalidate);
+	void (*flush_mrs)(void);
+};
+
+struct rds_sock {
+	struct sock		rs_sk;
+
+	u64			rs_user_addr;
+	u64			rs_user_bytes;
+
+	/*
+	 * bound_addr used for both incoming and outgoing, no INADDR_ANY
+	 * support.
+	 */
+	struct rb_node		rs_bound_node;
+	__be32			rs_bound_addr;
+	__be32			rs_conn_addr;
+	__be16			rs_bound_port;
+	__be16			rs_conn_port;
+
+	/*
+	 * This is only used to communicate the transport between bind and
+	 * initiating connections.  All other trans use is referenced through
+	 * the connection.
+	 */
+	struct rds_transport    *rs_transport;
+
+	/*
+	 * rds_sendmsg caches the conn it used the last time around.
+	 * This helps avoid costly lookups.
+	 */
+	struct rds_connection	*rs_conn;
+
+	/* flag indicating we were congested or not */
+	int			rs_congested;
+
+	/* rs_lock protects all these adjacent members before the newline */
+	spinlock_t		rs_lock;
+	struct list_head	rs_send_queue;
+	u32			rs_snd_bytes;
+	int			rs_rcv_bytes;
+	struct list_head	rs_notify_queue;	/* currently used for failed RDMAs */
+
+	/* Congestion wake_up. If rs_cong_monitor is set, we use cong_mask
+	 * to decide whether the application should be woken up.
+	 * If not set, we use rs_cong_track to find out whether a cong map
+	 * update arrived.
+	 */
+	uint64_t		rs_cong_mask;
+	uint64_t		rs_cong_notify;
+	struct list_head	rs_cong_list;
+	unsigned long		rs_cong_track;
+
+	/*
+	 * rs_recv_lock protects the receive queue, and is
+	 * used to serialize with rds_release.
+	 */
+	rwlock_t		rs_recv_lock;
+	struct list_head	rs_recv_queue;
+
+	/* just for stats reporting */
+	struct list_head	rs_item;
+
+	/* these have their own lock */
+	spinlock_t		rs_rdma_lock;
+	struct rb_root		rs_rdma_keys;
+
+	/* Socket options - in case there will be more */
+	unsigned char		rs_recverr,
+				rs_cong_monitor;
+};
+
+static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
+{
+	return container_of(sk, struct rds_sock, rs_sk);
+}
+static inline struct sock *rds_rs_to_sk(struct rds_sock *rs)
+{
+	return &rs->rs_sk;
+}
+
+/*
+ * The stack assigns sk_sndbuf and sk_rcvbuf to twice the specified value
+ * to account for overhead.  We don't account for overhead, we just apply
+ * the number of payload bytes to the specified value.
+ */
+static inline int rds_sk_sndbuf(struct rds_sock *rs)
+{
+	return rds_rs_to_sk(rs)->sk_sndbuf / 2;
+}
+static inline int rds_sk_rcvbuf(struct rds_sock *rs)
+{
+	return rds_rs_to_sk(rs)->sk_rcvbuf / 2;
+}
+
+struct rds_statistics {
+	uint64_t	s_conn_reset;
+	uint64_t	s_recv_drop_bad_checksum;
+	uint64_t	s_recv_drop_old_seq;
+	uint64_t	s_recv_drop_no_sock;
+	uint64_t	s_recv_drop_dead_sock;
+	uint64_t	s_recv_deliver_raced;
+	uint64_t	s_recv_delivered;
+	uint64_t	s_recv_queued;
+	uint64_t	s_recv_immediate_retry;
+	uint64_t	s_recv_delayed_retry;
+	uint64_t	s_recv_ack_required;
+	uint64_t	s_recv_rdma_bytes;
+	uint64_t	s_recv_ping;
+	uint64_t	s_send_queue_empty;
+	uint64_t	s_send_queue_full;
+	uint64_t	s_send_sem_contention;
+	uint64_t	s_send_sem_queue_raced;
+	uint64_t	s_send_immediate_retry;
+	uint64_t	s_send_delayed_retry;
+	uint64_t	s_send_drop_acked;
+	uint64_t	s_send_ack_required;
+	uint64_t	s_send_queued;
+	uint64_t	s_send_rdma;
+	uint64_t	s_send_rdma_bytes;
+	uint64_t	s_send_pong;
+	uint64_t	s_page_remainder_hit;
+	uint64_t	s_page_remainder_miss;
+	uint64_t	s_copy_to_user;
+	uint64_t	s_copy_from_user;
+	uint64_t	s_cong_update_queued;
+	uint64_t	s_cong_update_received;
+	uint64_t	s_cong_send_error;
+	uint64_t	s_cong_send_blocked;
+};
+
+/* af_rds.c */
+void rds_sock_addref(struct rds_sock *rs);
+void rds_sock_put(struct rds_sock *rs);
+void rds_wake_sk_sleep(struct rds_sock *rs);
+static inline void __rds_wake_sk_sleep(struct sock *sk)
+{
+	wait_queue_head_t *waitq = sk->sk_sleep;
+
+	if (!sock_flag(sk, SOCK_DEAD) && waitq)
+		wake_up(waitq);
+}
+extern wait_queue_head_t rds_poll_waitq;
+
+
+/* bind.c */
+int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+void rds_remove_bound(struct rds_sock *rs);
+struct rds_sock *rds_find_bound(__be32 addr, __be16 port);
+
+/* cong.c */
+int rds_cong_get_maps(struct rds_connection *conn);
+void rds_cong_add_conn(struct rds_connection *conn);
+void rds_cong_remove_conn(struct rds_connection *conn);
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port);
+void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port);
+int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, struct rds_sock *rs);
+void rds_cong_queue_updates(struct rds_cong_map *map);
+void rds_cong_map_updated(struct rds_cong_map *map, uint64_t);
+int rds_cong_updated_since(unsigned long *recent);
+void rds_cong_add_socket(struct rds_sock *);
+void rds_cong_remove_socket(struct rds_sock *);
+void rds_cong_exit(void);
+struct rds_message *rds_cong_update_alloc(struct rds_connection *conn);
+
+/* conn.c */
+int __init rds_conn_init(void);
+void rds_conn_exit(void);
+struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp);
+struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr,
+			       struct rds_transport *trans, gfp_t gfp);
+void rds_conn_destroy(struct rds_connection *conn);
+void rds_conn_reset(struct rds_connection *conn);
+void rds_conn_drop(struct rds_connection *conn);
+void rds_for_each_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens,
+			  int (*visitor)(struct rds_connection *, void *),
+			  size_t item_len);
+void __rds_conn_error(struct rds_connection *conn, const char *, ...)
+				__attribute__ ((format (printf, 2, 3)));
+#define rds_conn_error(conn, fmt...) \
+	__rds_conn_error(conn, KERN_WARNING "RDS: " fmt)
+
+static inline int
+rds_conn_transition(struct rds_connection *conn, int old, int new)
+{
+	return atomic_cmpxchg(&conn->c_state, old, new) == old;
+}
+
+static inline int
+rds_conn_state(struct rds_connection *conn)
+{
+	return atomic_read(&conn->c_state);
+}
+
+static inline int
+rds_conn_up(struct rds_connection *conn)
+{
+	return atomic_read(&conn->c_state) == RDS_CONN_UP;
+}
+
+static inline int
+rds_conn_connecting(struct rds_connection *conn)
+{
+	return atomic_read(&conn->c_state) == RDS_CONN_CONNECTING;
+}
+
+/* message.c */
+struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
+struct rds_message *rds_message_copy_from_user(struct iovec *first_iov,
+					       size_t total_len);
+struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
+void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
+				 __be16 dport, u64 seq);
+int rds_message_add_extension(struct rds_header *hdr,
+			      unsigned int type, const void *data, unsigned int len);
+int rds_message_next_extension(struct rds_header *hdr,
+			       unsigned int *pos, void *buf, unsigned int *buflen);
+int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version);
+int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version);
+int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset);
+int rds_message_inc_copy_to_user(struct rds_incoming *inc,
+				 struct iovec *first_iov, size_t size);
+void rds_message_inc_purge(struct rds_incoming *inc);
+void rds_message_inc_free(struct rds_incoming *inc);
+void rds_message_addref(struct rds_message *rm);
+void rds_message_put(struct rds_message *rm);
+void rds_message_wait(struct rds_message *rm);
+void rds_message_unmapped(struct rds_message *rm);
+
+static inline void rds_message_make_checksum(struct rds_header *hdr)
+{
+	hdr->h_csum = 0;
+	hdr->h_csum = ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2);
+}
+
+static inline int rds_message_verify_checksum(const struct rds_header *hdr)
+{
+	return !hdr->h_csum || ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2) == 0;
+}
+
+
+/* page.c */
+int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
+			     gfp_t gfp);
+int rds_page_copy_user(struct page *page, unsigned long offset,
+		       void __user *ptr, unsigned long bytes,
+		       int to_user);
+#define rds_page_copy_to_user(page, offset, ptr, bytes) \
+	rds_page_copy_user(page, offset, ptr, bytes, 1)
+#define rds_page_copy_from_user(page, offset, ptr, bytes) \
+	rds_page_copy_user(page, offset, ptr, bytes, 0)
+void rds_page_exit(void);
+
+/* recv.c */
+void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
+		  __be32 saddr);
+void rds_inc_addref(struct rds_incoming *inc);
+void rds_inc_put(struct rds_incoming *inc);
+void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+		       struct rds_incoming *inc, gfp_t gfp, enum km_type km);
+int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t size, int msg_flags);
+void rds_clear_recv_queue(struct rds_sock *rs);
+int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msg);
+void rds_inc_info_copy(struct rds_incoming *inc,
+		       struct rds_info_iterator *iter,
+		       __be32 saddr, __be32 daddr, int flip);
+
+/* send.c */
+int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t payload_len);
+void rds_send_reset(struct rds_connection *conn);
+int rds_send_xmit(struct rds_connection *conn);
+struct sockaddr_in;
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest);
+typedef int (*is_acked_func)(struct rds_message *rm, uint64_t ack);
+void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
+			 is_acked_func is_acked);
+int rds_send_acked_before(struct rds_connection *conn, u64 seq);
+void rds_send_remove_from_sock(struct list_head *messages, int status);
+int rds_send_pong(struct rds_connection *conn, __be16 dport);
+struct rds_message *rds_send_get_message(struct rds_connection *,
+					 struct rds_rdma_op *);
+
+/* rdma.c */
+void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force);
+
+/* stats.c */
+DECLARE_PER_CPU(struct rds_statistics, rds_stats);
+#define rds_stats_inc_which(which, member) do {		\
+	per_cpu(which, get_cpu()).member++;		\
+	put_cpu();					\
+} while (0)
+#define rds_stats_inc(member) rds_stats_inc_which(rds_stats, member)
+#define rds_stats_add_which(which, member, count) do {		\
+	per_cpu(which, get_cpu()).member += count;	\
+	put_cpu();					\
+} while (0)
+#define rds_stats_add(member, count) rds_stats_add_which(rds_stats, member, count)
+int __init rds_stats_init(void);
+void rds_stats_exit(void);
+void rds_stats_info_copy(struct rds_info_iterator *iter,
+			 uint64_t *values, char **names, size_t nr);
+
+/* sysctl.c */
+int __init rds_sysctl_init(void);
+void rds_sysctl_exit(void);
+extern unsigned long rds_sysctl_sndbuf_min;
+extern unsigned long rds_sysctl_sndbuf_default;
+extern unsigned long rds_sysctl_sndbuf_max;
+extern unsigned long rds_sysctl_reconnect_min_jiffies;
+extern unsigned long rds_sysctl_reconnect_max_jiffies;
+extern unsigned int  rds_sysctl_max_unacked_packets;
+extern unsigned int  rds_sysctl_max_unacked_bytes;
+extern unsigned int  rds_sysctl_ping_enable;
+extern unsigned long rds_sysctl_trace_flags;
+extern unsigned int  rds_sysctl_trace_level;
+
+/* threads.c */
+int __init rds_threads_init(void);
+void rds_threads_exit(void);
+extern struct workqueue_struct *rds_wq;
+void rds_connect_worker(struct work_struct *);
+void rds_shutdown_worker(struct work_struct *);
+void rds_send_worker(struct work_struct *);
+void rds_recv_worker(struct work_struct *);
+void rds_connect_complete(struct rds_connection *conn);
+
+/* transport.c */
+int rds_trans_register(struct rds_transport *trans);
+void rds_trans_unregister(struct rds_transport *trans);
+struct rds_transport *rds_trans_get_preferred(__be32 addr);
+unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter,
+				       unsigned int avail);
+int __init rds_trans_init(void);
+void rds_trans_exit(void);
+
+#endif
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:20 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:20 -0800
Subject: [ofa-general] [PATCH 03/26] RDS: Congestion-handling code
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-4-git-send-email-andy.grover@oracle.com>

RDS handles per-socket congestion by updating peers with a complete
congestion map (8KB). This code keeps track of the local maps and of
the maps received from peers.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/cong.c |  402 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 402 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/cong.c
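
For readers following along, here is a minimal userspace sketch of the
bitmap indexing this patch relies on: one bit per 16-bit port, so 65536
bits = 8KB, split across pages. The 4KB page size, the sketch_* names
and the plain byte array are assumptions made for illustration; the
real constants (RDS_CONG_MAP_PAGES, RDS_CONG_MAP_PAGE_BITS) come from
rds.h, which is not part of this patch, and the kernel code uses the
generic little-endian bitops rather than the open-coded shifts below.
Ports are taken in host byte order here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values: 4KB pages, one bit per 16-bit port (65536 bits = 8KB). */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAP_BYTES (65536u / 8u)
#define SKETCH_MAP_PAGES (SKETCH_MAP_BYTES / SKETCH_PAGE_SIZE)
#define SKETCH_PAGE_BITS (SKETCH_PAGE_SIZE * 8u)

/* Hypothetical stand-in for struct rds_cong_map: only the bitmap pages. */
struct sketch_cong_map {
	uint8_t pages[SKETCH_MAP_PAGES][SKETCH_PAGE_SIZE];
};

static void sketch_cong_init(struct sketch_cong_map *map)
{
	memset(map, 0, sizeof(*map));
}

/* Split a host-order port into a page index and a bit offset in that page. */
static void sketch_locate(uint16_t port, unsigned int *page, unsigned int *off)
{
	*page = port / SKETCH_PAGE_BITS;
	*off = port % SKETCH_PAGE_BITS;
	assert(*page < SKETCH_MAP_PAGES);
}

/* Little-endian bit numbering: bit n lives in byte n / 8, at bit n % 8. */
static void sketch_cong_set_bit(struct sketch_cong_map *map, uint16_t port)
{
	unsigned int page, off;

	sketch_locate(port, &page, &off);
	map->pages[page][off / 8] |= (uint8_t)(1u << (off % 8));
}

static void sketch_cong_clear_bit(struct sketch_cong_map *map, uint16_t port)
{
	unsigned int page, off;

	sketch_locate(port, &page, &off);
	map->pages[page][off / 8] &= (uint8_t)~(1u << (off % 8));
}

static int sketch_cong_test_bit(const struct sketch_cong_map *map,
				uint16_t port)
{
	unsigned int page, off;

	sketch_locate(port, &page, &off);
	return (map->pages[page][off / 8] >> (off % 8)) & 1;
}
```

With these assumed sizes, port 40000 lands in page 1 at bit offset 7232,
and the sender-side check is a single bit test, which is the
"inexpensive" property the header comment in cong.c describes.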

diff --git a/net/rds/cong.c b/net/rds/cong.c
new file mode 100644
index 0000000..90e6b31
--- /dev/null
+++ b/net/rds/cong.c
@@ -0,0 +1,402 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+#include "rds.h"
+
+/*
+ * This file implements the receive side of the unconventional congestion
+ * management in RDS.
+ *
+ * Messages waiting in the receive queue on the receiving socket are accounted
+ * against the socket's SO_RCVBUF option value.  Only the payload bytes in the
+ * message are accounted for.  If the number of bytes queued equals or exceeds
+ * rcvbuf then the socket is congested.  All sends attempted to this socket's
+ * address should block or return -EWOULDBLOCK.
+ *
+ * Applications are expected to be reasonably tuned such that this situation
+ * very rarely occurs.  An application encountering this "back-pressure" is
+ * considered a bug.
+ *
+ * This is implemented by having each node maintain bitmaps which indicate
+ * which ports on bound addresses are congested.  As the bitmap changes it is
+ * sent through all the connections which terminate in the local address of the
+ * bitmap which changed.
+ *
+ * The bitmaps are allocated as connections are brought up.  This avoids
+ * allocation in the interrupt handling path which queues messages on sockets.
+ * The dense bitmaps let transports send the entire bitmap on any bitmap change
+ * reasonably efficiently.  This is much easier to implement than some
+ * finer-grained communication of per-port congestion.  The sender does a very
+ * inexpensive bit test to test if the port it's about to send to is congested
+ * or not.
+ */
+
+/*
+ * Interaction with poll is a tad tricky. We want all processes stuck in
+ * poll to wake up and check whether a congested destination became uncongested.
+ * The really sad thing is we have no idea which destinations the application
+ * wants to send to - we don't even know which rds_connections are involved.
+ * So until we implement a more flexible rds poll interface, we have to make
+ * do with this:
+ * We maintain a global counter that is incremented each time a congestion map
+ * update is received. Each rds socket tracks this value, and if rds_poll
+ * finds that the saved generation number is smaller than the global generation
+ * number, it wakes up the process.
+ */
+static atomic_t		rds_cong_generation = ATOMIC_INIT(0);
+
+/*
+ * Congestion monitoring
+ */
+static LIST_HEAD(rds_cong_monitor);
+static DEFINE_RWLOCK(rds_cong_monitor_lock);
+
+/*
+ * Yes, a global lock.  It's used so infrequently that it's worth keeping it
+ * global to simplify the locking.  It's only used in the following
+ * circumstances:
+ *
+ *  - on connection buildup to associate a conn with its maps
+ *  - on map changes to inform conns of a new map to send
+ *
+ *  It's sadly ordered under the socket callback lock and the connection lock.
+ *  Receive paths can mark ports congested from interrupt context so the
+ *  lock masks interrupts.
+ */
+static DEFINE_SPINLOCK(rds_cong_lock);
+static struct rb_root rds_cong_tree = RB_ROOT;
+
+static struct rds_cong_map *rds_cong_tree_walk(__be32 addr,
+					       struct rds_cong_map *insert)
+{
+	struct rb_node **p = &rds_cong_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_cong_map *map;
+
+	while (*p) {
+		parent = *p;
+		map = rb_entry(parent, struct rds_cong_map, m_rb_node);
+
+		if (addr < map->m_addr)
+			p = &(*p)->rb_left;
+		else if (addr > map->m_addr)
+			p = &(*p)->rb_right;
+		else
+			return map;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->m_rb_node, parent, p);
+		rb_insert_color(&insert->m_rb_node, &rds_cong_tree);
+	}
+	return NULL;
+}
+
+/*
+ * There is only ever one bitmap for any address.  Connections try and allocate
+ * these bitmaps in the process of getting pointers to them.  The bitmaps are
+ * only ever freed as the module is removed, after all connections are freed.
+ */
+static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
+{
+	struct rds_cong_map *map;
+	struct rds_cong_map *ret = NULL;
+	unsigned long zp;
+	unsigned long i;
+	unsigned long flags;
+
+	map = kzalloc(sizeof(struct rds_cong_map), GFP_KERNEL);
+	if (map == NULL)
+		return NULL;
+
+	map->m_addr = addr;
+	init_waitqueue_head(&map->m_waitq);
+	INIT_LIST_HEAD(&map->m_conn_list);
+
+	for (i = 0; i < RDS_CONG_MAP_PAGES; i++) {
+		zp = get_zeroed_page(GFP_KERNEL);
+		if (zp == 0)
+			goto out;
+		map->m_page_addrs[i] = zp;
+	}
+
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	ret = rds_cong_tree_walk(addr, map);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+
+	if (ret == NULL) {
+		ret = map;
+		map = NULL;
+	}
+
+out:
+	if (map) {
+		for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++)
+			free_page(map->m_page_addrs[i]);
+		kfree(map);
+	}
+
+	rdsdebug("map %p for addr %x\n", ret, be32_to_cpu(addr));
+
+	return ret;
+}
+
+/*
+ * Put the conn on its local map's list.  This is called when the conn is
+ * really added to the hash.  It's nested under the rds_conn_lock, sadly.
+ */
+void rds_cong_add_conn(struct rds_connection *conn)
+{
+	unsigned long flags;
+
+	rdsdebug("conn %p now on map %p\n", conn, conn->c_lcong);
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	list_add_tail(&conn->c_map_item, &conn->c_lcong->m_conn_list);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+void rds_cong_remove_conn(struct rds_connection *conn)
+{
+	unsigned long flags;
+
+	rdsdebug("removing conn %p from map %p\n", conn, conn->c_lcong);
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	list_del_init(&conn->c_map_item);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+int rds_cong_get_maps(struct rds_connection *conn)
+{
+	conn->c_lcong = rds_cong_from_addr(conn->c_laddr);
+	conn->c_fcong = rds_cong_from_addr(conn->c_faddr);
+
+	if (conn->c_lcong == NULL || conn->c_fcong == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void rds_cong_queue_updates(struct rds_cong_map *map)
+{
+	struct rds_connection *conn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_cong_lock, flags);
+
+	list_for_each_entry(conn, &map->m_conn_list, c_map_item) {
+		if (!test_and_set_bit(0, &conn->c_map_queued)) {
+			rds_stats_inc(s_cong_update_queued);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask)
+{
+	rdsdebug("waking map %p for %pI4\n",
+	  map, &map->m_addr);
+	rds_stats_inc(s_cong_update_received);
+	atomic_inc(&rds_cong_generation);
+	if (waitqueue_active(&map->m_waitq))
+		wake_up(&map->m_waitq);
+	if (waitqueue_active(&rds_poll_waitq))
+		wake_up_all(&rds_poll_waitq);
+
+	if (portmask && !list_empty(&rds_cong_monitor)) {
+		unsigned long flags;
+		struct rds_sock *rs;
+
+		read_lock_irqsave(&rds_cong_monitor_lock, flags);
+		list_for_each_entry(rs, &rds_cong_monitor, rs_cong_list) {
+			spin_lock(&rs->rs_lock);
+			rs->rs_cong_notify |= (rs->rs_cong_mask & portmask);
+			rs->rs_cong_mask &= ~portmask;
+			spin_unlock(&rs->rs_lock);
+			if (rs->rs_cong_notify)
+				rds_wake_sk_sleep(rs);
+		}
+		read_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+	}
+}
+
+int rds_cong_updated_since(unsigned long *recent)
+{
+	unsigned long gen = atomic_read(&rds_cong_generation);
+
+	if (likely(*recent == gen))
+		return 0;
+	*recent = gen;
+	return 1;
+}
+
+/*
+ * We're called under the locking that protects the socket's receive buffer
+ * consumption.  This makes it a lot easier for the caller to only call us
+ * when it knows that an existing set bit needs to be cleared, and vice versa.
+ * We can't block and we need to deal with concurrent sockets working against
+ * the same per-address map.
+ */
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	rdsdebug("setting congestion for %pI4:%u in map %p\n",
+	  &map->m_addr, ntohs(port), map);
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	generic___set_le_bit(off, (void *)map->m_page_addrs[i]);
+}
+
+void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	rdsdebug("clearing congestion for %pI4:%u in map %p\n",
+	  &map->m_addr, ntohs(port), map);
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	generic___clear_le_bit(off, (void *)map->m_page_addrs[i]);
+}
+
+static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	return generic_test_le_bit(off, (void *)map->m_page_addrs[i]);
+}
+
+void rds_cong_add_socket(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	write_lock_irqsave(&rds_cong_monitor_lock, flags);
+	if (list_empty(&rs->rs_cong_list))
+		list_add(&rs->rs_cong_list, &rds_cong_monitor);
+	write_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+}
+
+void rds_cong_remove_socket(struct rds_sock *rs)
+{
+	unsigned long flags;
+	struct rds_cong_map *map;
+
+	write_lock_irqsave(&rds_cong_monitor_lock, flags);
+	list_del_init(&rs->rs_cong_list);
+	write_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+
+	/* update congestion map for now-closed port */
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	map = rds_cong_tree_walk(rs->rs_bound_addr, NULL);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+
+	if (map && rds_cong_test_bit(map, rs->rs_bound_port)) {
+		rds_cong_clear_bit(map, rs->rs_bound_port);
+		rds_cong_queue_updates(map);
+	}
+}
+
+int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock,
+		  struct rds_sock *rs)
+{
+	if (!rds_cong_test_bit(map, port))
+		return 0;
+	if (nonblock) {
+		if (rs && rs->rs_cong_monitor) {
+			unsigned long flags;
+
+			/* It would have been nice to have an atomic set_bit on
+			 * a uint64_t. */
+			spin_lock_irqsave(&rs->rs_lock, flags);
+			rs->rs_cong_mask |= RDS_CONG_MONITOR_MASK(ntohs(port));
+			spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+			/* Test again - a congestion update may have arrived in
+			 * the meantime. */
+			if (!rds_cong_test_bit(map, port))
+				return 0;
+		}
+		rds_stats_inc(s_cong_send_error);
+		return -ENOBUFS;
+	}
+
+	rds_stats_inc(s_cong_send_blocked);
+	rdsdebug("waiting on map %p for port %u\n", map, be16_to_cpu(port));
+
+	return wait_event_interruptible(map->m_waitq,
+					!rds_cong_test_bit(map, port));
+}
+
+void rds_cong_exit(void)
+{
+	struct rb_node *node;
+	struct rds_cong_map *map;
+	unsigned long i;
+
+	while ((node = rb_first(&rds_cong_tree))) {
+		map = rb_entry(node, struct rds_cong_map, m_rb_node);
+		rdsdebug("freeing map %p\n", map);
+		rb_erase(&map->m_rb_node, &rds_cong_tree);
+		for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++)
+			free_page(map->m_page_addrs[i]);
+		kfree(map);
+	}
+}
+
+/*
+ * Allocate a RDS message containing a congestion update.
+ */
+struct rds_message *rds_cong_update_alloc(struct rds_connection *conn)
+{
+	struct rds_cong_map *map = conn->c_lcong;
+	struct rds_message *rm;
+
+	rm = rds_message_map_pages(map->m_page_addrs, RDS_CONG_MAP_BYTES);
+	if (!IS_ERR(rm))
+		rm->m_inc.i_hdr.h_flags = RDS_FLAG_CONG_BITMAP;
+
+	return rm;
+}
-- 
1.5.6.3
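
A note on the poll interaction in the patch above, as a minimal
userspace sketch. It uses C11 atomics in place of the kernel's atomic_t,
and the sketch_* names are hypothetical. The idea is that every received
map update bumps a global generation, and each socket remembers the last
generation it saw; note that the patch compares for inequality rather
than "smaller than", so counter wraparound is harmless.

```c
#include <assert.h>
#include <stdatomic.h>

/* Global generation, bumped on every congestion map update. */
static atomic_ulong sketch_cong_generation;

/* Stand-in for the atomic_inc() in rds_cong_map_updated(). */
static void sketch_cong_map_updated(void)
{
	atomic_fetch_add(&sketch_cong_generation, 1);
}

/*
 * Returns 1 and refreshes *recent if any update arrived since the
 * caller last looked, 0 otherwise.  *recent is per-socket state.
 */
static int sketch_cong_updated_since(unsigned long *recent)
{
	unsigned long gen = atomic_load(&sketch_cong_generation);

	if (*recent == gen)
		return 0;
	*recent = gen;
	return 1;
}
```

A poll implementation would call sketch_cong_updated_since() with the
socket's saved value and report writability to be re-checked whenever it
returns 1; this wakes every poller on every update, which is the cost
the original comment accepts.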


From andy.grover at oracle.com  Tue Feb 24 17:30:21 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:21 -0800
Subject: [ofa-general] [PATCH 04/26] RDS: Transport code
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-5-git-send-email-andy.grover@oracle.com>

RDS supports multiple transports. While this initial submission
only supports the InfiniBand transport, this abstraction allows others
to be added. We're working on an iWARP transport, and also see
UDP over DCB as another possibility.

This code handles transport registration.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/transport.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 117 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/transport.c
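
As a rough illustration of the registration and lookup flow in this
patch, here is a single-threaded userspace sketch. It drops the rwsem,
uses a hand-rolled singly linked list instead of list_head, and takes
addresses in host byte order; the sketch_* names and the two example
laddr_check callbacks are hypothetical. What it keeps from the patch:
transports are tried in registration order, the first one whose
laddr_check() returns 0 wins, and loopback addresses bypass the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_transport {
	const char *t_name;
	int (*laddr_check)(uint32_t addr); /* 0: "I can use this address" */
	struct sketch_transport *next;
};

/* Hypothetical check: claims addresses in 10.0.0.0/8 (host byte order). */
static int sketch_check_10net(uint32_t addr)
{
	return (addr >> 24) == 10 ? 0 : -1;
}

/* Hypothetical check: claims every address. */
static int sketch_check_any(uint32_t addr)
{
	(void)addr;
	return 0;
}

static struct sketch_transport sketch_loop_transport = {
	"loopback", sketch_check_any, NULL
};

static struct sketch_transport *sketch_transports;
static struct sketch_transport **sketch_tail = &sketch_transports;

/* Append to the tail so that lookup order matches registration order. */
static void sketch_trans_register(struct sketch_transport *trans)
{
	trans->next = NULL;
	*sketch_tail = trans;
	sketch_tail = &trans->next;
}

static struct sketch_transport *sketch_trans_get_preferred(uint32_t addr)
{
	struct sketch_transport *trans;

	if ((addr >> 24) == 127) /* 127.0.0.0/8 short-circuits to loopback */
		return &sketch_loop_transport;

	for (trans = sketch_transports; trans; trans = trans->next)
		if (trans->laddr_check(addr) == 0)
			return trans;
	return NULL;
}
```

Because the first match wins, registration order acts as the preference
order when more than one transport can reach the same local address.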

diff --git a/net/rds/transport.c b/net/rds/transport.c
new file mode 100644
index 0000000..767da61
--- /dev/null
+++ b/net/rds/transport.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "loop.h"
+
+static LIST_HEAD(rds_transports);
+static DECLARE_RWSEM(rds_trans_sem);
+
+int rds_trans_register(struct rds_transport *trans)
+{
+	BUG_ON(strlen(trans->t_name) + 1 > TRANSNAMSIZ);
+
+	down_write(&rds_trans_sem);
+
+	list_add_tail(&trans->t_item, &rds_transports);
+	printk(KERN_INFO "Registered RDS/%s transport\n", trans->t_name);
+
+	up_write(&rds_trans_sem);
+
+	return 0;
+}
+
+void rds_trans_unregister(struct rds_transport *trans)
+{
+	down_write(&rds_trans_sem);
+
+	list_del_init(&trans->t_item);
+	printk(KERN_INFO "Unregistered RDS/%s transport\n", trans->t_name);
+
+	up_write(&rds_trans_sem);
+}
+
+struct rds_transport *rds_trans_get_preferred(__be32 addr)
+{
+	struct rds_transport *trans;
+	struct rds_transport *ret = NULL;
+
+	if (IN_LOOPBACK(ntohl(addr)))
+		return &rds_loop_transport;
+
+	down_read(&rds_trans_sem);
+	list_for_each_entry(trans, &rds_transports, t_item) {
+		if (trans->laddr_check(addr) == 0) {
+			ret = trans;
+			break;
+		}
+	}
+	up_read(&rds_trans_sem);
+
+	return ret;
+}
+
+/*
+ * This returns the number of stats entries in the snapshot and only
+ * copies them using the iter if there is enough space for them.  The
+ * caller passes in the global stats so that we can size and copy while
+ * holding the lock.
+ */
+unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter,
+				       unsigned int avail)
+
+{
+	struct rds_transport *trans;
+	unsigned int total = 0;
+	unsigned int part;
+
+	rds_info_iter_unmap(iter);
+	down_read(&rds_trans_sem);
+
+	list_for_each_entry(trans, &rds_transports, t_item) {
+		if (trans->stats_info_copy == NULL)
+			continue;
+
+		part = trans->stats_info_copy(iter, avail);
+		avail -= min(avail, part);
+		total += part;
+	}
+
+	up_read(&rds_trans_sem);
+
+	return total;
+}
+
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:22 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:22 -0800
Subject: [ofa-general] [PATCH 05/26] RDS: Info and stats
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-6-git-send-email-andy.grover@oracle.com>

RDS currently generates a lot of stats that are accessible via
the rds-info utility. This code implements the support for this.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/info.c  |  241 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/info.h  |   30 +++++++
 net/rds/stats.c |  148 ++++++++++++++++++++++++++++++++++
 3 files changed, 419 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/info.c
 create mode 100644 net/rds/info.h
 create mode 100644 net/rds/stats.c
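
Two pieces of arithmetic in rds_info_getsockopt() are easy to misread,
so here is a small userspace sketch of both. The 4KB page size and the
sketch_* names are assumptions for illustration. The first helper
counts how many pages a user buffer touches (it can be one more than
len / PAGE_SIZE when the buffer is not page aligned); the second mirrors
the tail of the function, which always reports the snapshot size back
through the length and returns either -ENOSPC or the per-element size.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed 4KB pages; the kernel uses PAGE_SHIFT, PAGE_MASK, PAGE_ALIGN. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE ((uintptr_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK)

/* Number of pages spanned by the byte range [start, start + len). */
static unsigned long sketch_nr_pages(uintptr_t start, unsigned long len)
{
	return (SKETCH_PAGE_ALIGN(start + len) - (start & SKETCH_PAGE_MASK))
			>> SKETCH_PAGE_SHIFT;
}

/*
 * On entry *len is the user's buffer size.  On exit it is the size of
 * the snapshot (nr elements of each bytes).  Returns -ENOSPC if the
 * buffer was too small, else the element size.
 */
static int sketch_info_result(unsigned int nr, unsigned int each, int *len)
{
	int total = (int)(nr * each);
	int ret = (total > *len) ? -ENOSPC : (int)each;

	*len = total;
	return ret;
}
```

This is what makes the usual two-step pattern work from userspace: call
once with a zero length to learn the required size from the updated
length, allocate, then call again.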

diff --git a/net/rds/info.c b/net/rds/info.c
new file mode 100644
index 0000000..1d88553
--- /dev/null
+++ b/net/rds/info.c
@@ -0,0 +1,241 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+/*
+ * This file implements a getsockopt() call which copies a set of fixed
+ * sized structs into a user-specified buffer as a means of providing
+ * read-only information about RDS.
+ *
+ * For a given information source there are a given number of fixed sized
+ * structs at a given time.  The structs are only copied if the user-specified
+ * buffer is big enough.  The destination pages that make up the buffer
+ * are pinned for the duration of the copy.
+ *
+ * This gives us the following benefits:
+ *
+ * - simple implementation, no copy "position" across multiple calls
+ * - consistent snapshot of an info source
+ * - atomic copy works well with whatever locking info source has
+ * - one portable tool to get rds info across implementations
+ * - long-lived tool can get info without allocating
+ *
+ * at the following costs:
+ *
+ * - info source copy must be pinned, may be "large"
+ */
+
+struct rds_info_iterator {
+	struct page **pages;
+	void *addr;
+	unsigned long offset;
+};
+
+static DEFINE_SPINLOCK(rds_info_lock);
+static rds_info_func rds_info_funcs[RDS_INFO_LAST - RDS_INFO_FIRST + 1];
+
+void rds_info_register_func(int optname, rds_info_func func)
+{
+	int offset = optname - RDS_INFO_FIRST;
+
+	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
+
+	spin_lock(&rds_info_lock);
+	BUG_ON(rds_info_funcs[offset] != NULL);
+	rds_info_funcs[offset] = func;
+	spin_unlock(&rds_info_lock);
+}
+
+void rds_info_deregister_func(int optname, rds_info_func func)
+{
+	int offset = optname - RDS_INFO_FIRST;
+
+	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
+
+	spin_lock(&rds_info_lock);
+	BUG_ON(rds_info_funcs[offset] != func);
+	rds_info_funcs[offset] = NULL;
+	spin_unlock(&rds_info_lock);
+}
+
+/*
+ * Typically we hold an atomic kmap across multiple rds_info_copy() calls
+ * because the kmap is so expensive.  This must be called before using blocking
+ * operations while holding the mapping and as the iterator is torn down.
+ */
+void rds_info_iter_unmap(struct rds_info_iterator *iter)
+{
+	if (iter->addr != NULL) {
+		kunmap_atomic(iter->addr, KM_USER0);
+		iter->addr = NULL;
+	}
+}
+
+/*
+ * get_user_pages() called flush_dcache_page() on the pages for us.
+ */
+void rds_info_copy(struct rds_info_iterator *iter, void *data,
+		   unsigned long bytes)
+{
+	unsigned long this;
+
+	while (bytes) {
+		if (iter->addr == NULL)
+			iter->addr = kmap_atomic(*iter->pages, KM_USER0);
+
+		this = min(bytes, PAGE_SIZE - iter->offset);
+
+		rdsdebug("page %p addr %p offset %lu this %lu data %p "
+			  "bytes %lu\n", *iter->pages, iter->addr,
+			  iter->offset, this, data, bytes);
+
+		memcpy(iter->addr + iter->offset, data, this);
+
+		data += this;
+		bytes -= this;
+		iter->offset += this;
+
+		if (iter->offset == PAGE_SIZE) {
+			kunmap_atomic(iter->addr, KM_USER0);
+			iter->addr = NULL;
+			iter->offset = 0;
+			iter->pages++;
+		}
+	}
+}
+
+/*
+ * @optval points to the userspace buffer that the information snapshot
+ * will be copied into.
+ *
+ * @optlen on input is the size of the buffer in userspace.  @optlen
+ * on output is the size of the requested snapshot in bytes.
+ *
+ * This function returns -errno if there is a failure, particularly -ENOSPC
+ * if the given userspace buffer was not large enough to fit the snapshot.
+ * On success it returns the positive number of bytes of each array element
+ * in the snapshot.
+ */
+int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
+			int __user *optlen)
+{
+	struct rds_info_iterator iter;
+	struct rds_info_lengths lens;
+	unsigned long nr_pages = 0;
+	unsigned long start;
+	unsigned long i;
+	rds_info_func func;
+	struct page **pages = NULL;
+	int ret;
+	int len;
+	int total;
+
+	if (get_user(len, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* check for all kinds of wrapping and the like */
+	start = (unsigned long)optval;
+	if (len < 0 || len + PAGE_SIZE - 1 < len || start + len < start) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* a 0 len call is just trying to probe its length */
+	if (len == 0)
+		goto call_func;
+
+	nr_pages = (PAGE_ALIGN(start + len) - (start & PAGE_MASK))
+			>> PAGE_SHIFT;
+
+	pages = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, start, nr_pages, 1, 0,
+			     pages, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (ret != nr_pages) {
+		if (ret > 0)
+			nr_pages = ret;
+		else
+			nr_pages = 0;
+		ret = -EAGAIN; /* XXX ? */
+		goto out;
+	}
+
+	rdsdebug("len %d nr_pages %lu\n", len, nr_pages);
+
+call_func:
+	func = rds_info_funcs[optname - RDS_INFO_FIRST];
+	if (func == NULL) {
+		ret = -ENOPROTOOPT;
+		goto out;
+	}
+
+	iter.pages = pages;
+	iter.addr = NULL;
+	iter.offset = start & (PAGE_SIZE - 1);
+
+	func(sock, len, &iter, &lens);
+	BUG_ON(lens.each == 0);
+
+	total = lens.nr * lens.each;
+
+	rds_info_iter_unmap(&iter);
+
+	if (total > len) {
+		len = total;
+		ret = -ENOSPC;
+	} else {
+		len = total;
+		ret = lens.each;
+	}
+
+	if (put_user(len, optlen))
+		ret = -EFAULT;
+
+out:
+	for (i = 0; pages != NULL && i < nr_pages; i++)
+		put_page(pages[i]);
+	kfree(pages);
+
+	return ret;
+}
diff --git a/net/rds/info.h b/net/rds/info.h
new file mode 100644
index 0000000..b6c052c
--- /dev/null
+++ b/net/rds/info.h
@@ -0,0 +1,30 @@
+#ifndef _RDS_INFO_H
+#define _RDS_INFO_H
+
+struct rds_info_lengths {
+	unsigned int	nr;
+	unsigned int	each;
+};
+
+struct rds_info_iterator;
+
+/*
+ * These functions must fill in the fields of @lens to reflect the size
+ * of the available info source.  If the snapshot fits in @len then it
+ * should be copied using @iter.  The caller will deduce if it was copied
+ * or not by comparing the lengths.
+ */
+typedef void (*rds_info_func)(struct socket *sock, unsigned int len,
+			      struct rds_info_iterator *iter,
+			      struct rds_info_lengths *lens);
+
+void rds_info_register_func(int optname, rds_info_func func);
+void rds_info_deregister_func(int optname, rds_info_func func);
+int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
+			int __user *optlen);
+void rds_info_copy(struct rds_info_iterator *iter, void *data,
+		   unsigned long bytes);
+void rds_info_iter_unmap(struct rds_info_iterator *iter);
+
+
+#endif
diff --git a/net/rds/stats.c b/net/rds/stats.c
new file mode 100644
index 0000000..6371468
--- /dev/null
+++ b/net/rds/stats.c
@@ -0,0 +1,148 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_statistics, rds_stats);
+
+/* :.,$s/unsigned long\>.*\<s_\(.*\);/"\1",/g */
+
+static char *rds_stat_names[] = {
+	"conn_reset",
+	"recv_drop_bad_checksum",
+	"recv_drop_old_seq",
+	"recv_drop_no_sock",
+	"recv_drop_dead_sock",
+	"recv_deliver_raced",
+	"recv_delivered",
+	"recv_queued",
+	"recv_immediate_retry",
+	"recv_delayed_retry",
+	"recv_ack_required",
+	"recv_rdma_bytes",
+	"recv_ping",
+	"send_queue_empty",
+	"send_queue_full",
+	"send_sem_contention",
+	"send_sem_queue_raced",
+	"send_immediate_retry",
+	"send_delayed_retry",
+	"send_drop_acked",
+	"send_ack_required",
+	"send_queued",
+	"send_rdma",
+	"send_rdma_bytes",
+	"send_pong",
+	"page_remainder_hit",
+	"page_remainder_miss",
+	"copy_to_user",
+	"copy_from_user",
+	"cong_update_queued",
+	"cong_update_received",
+	"cong_send_error",
+	"cong_send_blocked",
+};
+
+void rds_stats_info_copy(struct rds_info_iterator *iter,
+			 uint64_t *values, char **names, size_t nr)
+{
+	struct rds_info_counter ctr;
+	size_t i;
+
+	for (i = 0; i < nr; i++) {
+		BUG_ON(strlen(names[i]) >= sizeof(ctr.name));
+		strncpy(ctr.name, names[i], sizeof(ctr.name) - 1);
+		ctr.value = values[i];
+
+		rds_info_copy(iter, &ctr, sizeof(ctr));
+	}
+}
+
+/*
+ * This gives global counters across all the transports.  The strings
+ * are copied in so that the tool doesn't need knowledge of the specific
+ * stats that we're exporting.  Some are pretty implementation dependent
+ * and may change over time.  That doesn't stop them from being useful.
+ *
+ * This is the only function in the chain that knows about the byte granular
+ * length in userspace.  It converts it to number of stat entries that the
+ * rest of the functions operate in.
+ */
+static void rds_stats_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	struct rds_statistics stats = {0, };
+	uint64_t *src;
+	uint64_t *sum;
+	size_t i;
+	int cpu;
+	unsigned int avail;
+
+	avail = len / sizeof(struct rds_info_counter);
+
+	if (avail < ARRAY_SIZE(rds_stat_names)) {
+		avail = 0;
+		goto trans;
+	}
+
+	for_each_online_cpu(cpu) {
+		src = (uint64_t *)&(per_cpu(rds_stats, cpu));
+		sum = (uint64_t *)&stats;
+		for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++)
+			*(sum++) += *(src++);
+	}
+
+	rds_stats_info_copy(iter, (uint64_t *)&stats, rds_stat_names,
+			    ARRAY_SIZE(rds_stat_names));
+	avail -= ARRAY_SIZE(rds_stat_names);
+
+trans:
+	lens->each = sizeof(struct rds_info_counter);
+	lens->nr = rds_trans_stats_info_copy(iter, avail) +
+		   ARRAY_SIZE(rds_stat_names);
+}
+
+void rds_stats_exit(void)
+{
+	rds_info_deregister_func(RDS_INFO_COUNTERS, rds_stats_info);
+}
+
+int __init rds_stats_init(void)
+{
+	rds_info_register_func(RDS_INFO_COUNTERS, rds_stats_info);
+	return 0;
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:23 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:23 -0800
Subject: [ofa-general] [PATCH 06/26] RDS: Connection handling
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-7-git-send-email-andy.grover@oracle.com>

While arguably the fact that the underlying transport needs a
connection to convey RDS's datagrams reliably is not important
to RDS proper, the transports implemented so far (IB and TCP)
have both been connection-oriented, and so the connection
state machine-related code is in the common RDS code.

This patch also includes several work items, to handle connecting,
sending, receiving, and shutdown.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/connection.c |  487 ++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/threads.c    |  265 +++++++++++++++++++++++++++
 2 files changed, 752 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/connection.c
 create mode 100644 net/rds/threads.c

diff --git a/net/rds/connection.c b/net/rds/connection.c
new file mode 100644
index 0000000..273f064
--- /dev/null
+++ b/net/rds/connection.c
@@ -0,0 +1,487 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <net/inet_hashtables.h>
+
+#include "rds.h"
+#include "loop.h"
+#include "rdma.h"
+
+#define RDS_CONNECTION_HASH_BITS 12
+#define RDS_CONNECTION_HASH_ENTRIES (1 << RDS_CONNECTION_HASH_BITS)
+#define RDS_CONNECTION_HASH_MASK (RDS_CONNECTION_HASH_ENTRIES - 1)
+
+/* converting this to RCU is a chore for another day.. */
+static DEFINE_SPINLOCK(rds_conn_lock);
+static unsigned long rds_conn_count;
+static struct hlist_head rds_conn_hash[RDS_CONNECTION_HASH_ENTRIES];
+static struct kmem_cache *rds_conn_slab;
+
+static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr)
+{
+	/* Pass NULL, don't need struct net for hash */
+	unsigned long hash = inet_ehashfn(NULL,
+					  be32_to_cpu(laddr), 0,
+					  be32_to_cpu(faddr), 0);
+	return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK];
+}
+
+#define rds_conn_info_set(var, test, suffix) do {		\
+	if (test)						\
+		var |= RDS_INFO_CONNECTION_FLAG_##suffix;	\
+} while (0)
+
+static inline int rds_conn_is_sending(struct rds_connection *conn)
+{
+	int ret = 0;
+
+	if (!mutex_trylock(&conn->c_send_lock))
+		ret = 1;
+	else
+		mutex_unlock(&conn->c_send_lock);
+
+	return ret;
+}
+
+static struct rds_connection *rds_conn_lookup(struct hlist_head *head,
+					      __be32 laddr, __be32 faddr,
+					      struct rds_transport *trans)
+{
+	struct rds_connection *conn, *ret = NULL;
+	struct hlist_node *pos;
+
+	hlist_for_each_entry(conn, pos, head, c_hash_node) {
+		if (conn->c_faddr == faddr && conn->c_laddr == laddr &&
+				conn->c_trans == trans) {
+			ret = conn;
+			break;
+		}
+	}
+	rdsdebug("returning conn %p for %pI4 -> %pI4\n", ret,
+		 &laddr, &faddr);
+	return ret;
+}
+
+/*
+ * This is called by transports as they're bringing down a connection.
+ * It clears partial message state so that the transport can start sending
+ * and receiving over this connection again in the future.  It is up to
+ * the transport to have serialized this call with its send and recv.
+ */
+void rds_conn_reset(struct rds_connection *conn)
+{
+	rdsdebug("connection %pI4 to %pI4 reset\n",
+	  &conn->c_laddr, &conn->c_faddr);
+
+	rds_stats_inc(s_conn_reset);
+	rds_send_reset(conn);
+	conn->c_flags = 0;
+
+	/* Do not clear next_rx_seq here, else we cannot distinguish
+	 * retransmitted packets from new packets, and will hand all
+	 * of them to the application. That is not consistent with the
+	 * reliability guarantees of RDS. */
+}
+
+/*
+ * There is only ever one 'conn' for a given pair of addresses in the
+ * system at a time.  They contain messages to be retransmitted and so
+ * span the lifetime of the actual underlying transport connections.
+ *
+ * For now they are not garbage collected once they're created.  They
+ * are torn down as the module is removed, if ever.
+ */
+static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp,
+				       int is_outgoing)
+{
+	struct rds_connection *conn, *tmp, *parent = NULL;
+	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+	conn = rds_conn_lookup(head, laddr, faddr, trans);
+	if (conn
+	 && conn->c_loopback
+	 && conn->c_trans != &rds_loop_transport
+	 && !is_outgoing) {
+		/* This is a looped back IB connection, and we're
+		 * called by the code handling the incoming connect.
+		 * We need a second connection object into which we
+		 * can stick the other QP. */
+		parent = conn;
+		conn = parent->c_passive;
+	}
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+	if (conn)
+		goto out;
+
+	conn = kmem_cache_alloc(rds_conn_slab, gfp);
+	if (conn == NULL) {
+		conn = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	memset(conn, 0, sizeof(*conn));
+
+	INIT_HLIST_NODE(&conn->c_hash_node);
+	conn->c_version = RDS_PROTOCOL_3_0;
+	conn->c_laddr = laddr;
+	conn->c_faddr = faddr;
+	spin_lock_init(&conn->c_lock);
+	conn->c_next_tx_seq = 1;
+
+	mutex_init(&conn->c_send_lock);
+	INIT_LIST_HEAD(&conn->c_send_queue);
+	INIT_LIST_HEAD(&conn->c_retrans);
+
+	ret = rds_cong_get_maps(conn);
+	if (ret) {
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = ERR_PTR(ret);
+		goto out;
+	}
+
+	/*
+	 * This is where a connection becomes loopback.  If *any* RDS sockets
+	 * can bind to the destination address then we'd rather the messages
+	 * flow through loopback than through either transport.
+	 */
+	if (rds_trans_get_preferred(faddr)) {
+		conn->c_loopback = 1;
+		if (is_outgoing && trans->t_prefer_loopback) {
+			/* "outgoing" connection - and the transport
+			 * says it wants the connection handled by the
+			 * loopback transport. This is what TCP does.
+			 */
+			trans = &rds_loop_transport;
+		}
+	}
+
+	conn->c_trans = trans;
+
+	ret = trans->conn_alloc(conn, gfp);
+	if (ret) {
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = ERR_PTR(ret);
+		goto out;
+	}
+
+	atomic_set(&conn->c_state, RDS_CONN_DOWN);
+	conn->c_reconnect_jiffies = 0;
+	INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker);
+	INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker);
+	INIT_DELAYED_WORK(&conn->c_conn_w, rds_connect_worker);
+	INIT_WORK(&conn->c_down_w, rds_shutdown_worker);
+	mutex_init(&conn->c_cm_lock);
+	conn->c_flags = 0;
+
+	rdsdebug("allocated conn %p for %pI4 -> %pI4 over %s %s\n",
+	  conn, &laddr, &faddr,
+	  trans->t_name ? trans->t_name : "[unknown]",
+	  is_outgoing ? "(outgoing)" : "");
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+	if (parent == NULL) {
+		tmp = rds_conn_lookup(head, laddr, faddr, trans);
+		if (tmp == NULL)
+			hlist_add_head(&conn->c_hash_node, head);
+	} else {
+		tmp = parent->c_passive;
+		if (!tmp)
+			parent->c_passive = conn;
+	}
+
+	if (tmp) {
+		trans->conn_free(conn->c_transport_data);
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = tmp;
+	} else {
+		rds_cong_add_conn(conn);
+		rds_conn_count++;
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+
+out:
+	return conn;
+}
+
+struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp)
+{
+	return __rds_conn_create(laddr, faddr, trans, gfp, 0);
+}
+
+struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp)
+{
+	return __rds_conn_create(laddr, faddr, trans, gfp, 1);
+}
+
+void rds_conn_destroy(struct rds_connection *conn)
+{
+	struct rds_message *rm, *rtmp;
+
+	rdsdebug("freeing conn %p for %pI4 -> "
+		 "%pI4\n", conn, &conn->c_laddr,
+		 &conn->c_faddr);
+
+	hlist_del_init(&conn->c_hash_node);
+
+	/* wait for the rds thread to shut it down */
+	atomic_set(&conn->c_state, RDS_CONN_ERROR);
+	cancel_delayed_work(&conn->c_conn_w);
+	queue_work(rds_wq, &conn->c_down_w);
+	flush_workqueue(rds_wq);
+
+	/* tear down queued messages */
+	list_for_each_entry_safe(rm, rtmp,
+				 &conn->c_send_queue,
+				 m_conn_item) {
+		list_del_init(&rm->m_conn_item);
+		BUG_ON(!list_empty(&rm->m_sock_item));
+		rds_message_put(rm);
+	}
+	if (conn->c_xmit_rm)
+		rds_message_put(conn->c_xmit_rm);
+
+	conn->c_trans->conn_free(conn->c_transport_data);
+
+	/*
+	 * The congestion maps aren't freed up here.  They're
+	 * freed by rds_cong_exit() after all the connections
+	 * have been freed.
+	 */
+	rds_cong_remove_conn(conn);
+
+	BUG_ON(!list_empty(&conn->c_retrans));
+	kmem_cache_free(rds_conn_slab, conn);
+
+	rds_conn_count--;
+}
+
+static void rds_conn_message_info(struct socket *sock, unsigned int len,
+				  struct rds_info_iterator *iter,
+				  struct rds_info_lengths *lens,
+				  int want_send)
+{
+	struct hlist_head *head;
+	struct hlist_node *pos;
+	struct list_head *list;
+	struct rds_connection *conn;
+	struct rds_message *rm;
+	unsigned long flags;
+	unsigned int total = 0;
+	size_t i;
+
+	len /= sizeof(struct rds_info_message);
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+
+	for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash);
+	     i++, head++) {
+		hlist_for_each_entry(conn, pos, head, c_hash_node) {
+			if (want_send)
+				list = &conn->c_send_queue;
+			else
+				list = &conn->c_retrans;
+
+			spin_lock(&conn->c_lock);
+
+			/* XXX too lazy to maintain counts.. */
+			list_for_each_entry(rm, list, m_conn_item) {
+				total++;
+				if (total <= len)
+					rds_inc_info_copy(&rm->m_inc, iter,
+							  conn->c_laddr,
+							  conn->c_faddr, 0);
+			}
+
+			spin_unlock(&conn->c_lock);
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds_info_message);
+}
+
+static void rds_conn_message_info_send(struct socket *sock, unsigned int len,
+				       struct rds_info_iterator *iter,
+				       struct rds_info_lengths *lens)
+{
+	rds_conn_message_info(sock, len, iter, lens, 1);
+}
+
+static void rds_conn_message_info_retrans(struct socket *sock,
+					  unsigned int len,
+					  struct rds_info_iterator *iter,
+					  struct rds_info_lengths *lens)
+{
+	rds_conn_message_info(sock, len, iter, lens, 0);
+}
+
+void rds_for_each_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens,
+			  int (*visitor)(struct rds_connection *, void *),
+			  size_t item_len)
+{
+	uint64_t buffer[(item_len + 7) / 8];
+	struct hlist_head *head;
+	struct hlist_node *pos;
+	struct hlist_node *tmp;
+	struct rds_connection *conn;
+	unsigned long flags;
+	size_t i;
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+
+	lens->nr = 0;
+	lens->each = item_len;
+
+	for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash);
+	     i++, head++) {
+		hlist_for_each_entry_safe(conn, pos, tmp, head, c_hash_node) {
+
+			/* XXX no c_lock usage.. */
+			if (!visitor(conn, buffer))
+				continue;
+
+			/* We copy as much as we can fit in the buffer,
+			 * but we count all items so that the caller
+			 * can resize the buffer. */
+			if (len >= item_len) {
+				rds_info_copy(iter, buffer, item_len);
+				len -= item_len;
+			}
+			lens->nr++;
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+}
+
+static int rds_conn_info_visitor(struct rds_connection *conn,
+				  void *buffer)
+{
+	struct rds_info_connection *cinfo = buffer;
+
+	cinfo->next_tx_seq = conn->c_next_tx_seq;
+	cinfo->next_rx_seq = conn->c_next_rx_seq;
+	cinfo->laddr = conn->c_laddr;
+	cinfo->faddr = conn->c_faddr;
+	strncpy(cinfo->transport, conn->c_trans->t_name,
+		sizeof(cinfo->transport));
+	cinfo->flags = 0;
+
+	rds_conn_info_set(cinfo->flags,
+			  rds_conn_is_sending(conn), SENDING);
+	/* XXX Future: return the state rather than these funky bits */
+	rds_conn_info_set(cinfo->flags,
+			  atomic_read(&conn->c_state) == RDS_CONN_CONNECTING,
+			  CONNECTING);
+	rds_conn_info_set(cinfo->flags,
+			  atomic_read(&conn->c_state) == RDS_CONN_UP,
+			  CONNECTED);
+	return 1;
+}
+
+static void rds_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens)
+{
+	rds_for_each_conn_info(sock, len, iter, lens,
+				rds_conn_info_visitor,
+				sizeof(struct rds_info_connection));
+}
+
+int __init rds_conn_init(void)
+{
+	rds_conn_slab = kmem_cache_create("rds_connection",
+					  sizeof(struct rds_connection),
+					  0, 0, NULL);
+	if (rds_conn_slab == NULL)
+		return -ENOMEM;
+
+	rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info);
+	rds_info_register_func(RDS_INFO_SEND_MESSAGES,
+			       rds_conn_message_info_send);
+	rds_info_register_func(RDS_INFO_RETRANS_MESSAGES,
+			       rds_conn_message_info_retrans);
+
+	return 0;
+}
+
+void rds_conn_exit(void)
+{
+	rds_loop_exit();
+
+	WARN_ON(!hlist_empty(rds_conn_hash));
+
+	kmem_cache_destroy(rds_conn_slab);
+
+	rds_info_deregister_func(RDS_INFO_CONNECTIONS, rds_conn_info);
+	rds_info_deregister_func(RDS_INFO_SEND_MESSAGES,
+				 rds_conn_message_info_send);
+	rds_info_deregister_func(RDS_INFO_RETRANS_MESSAGES,
+				 rds_conn_message_info_retrans);
+}
+
+/*
+ * Force a disconnect
+ */
+void rds_conn_drop(struct rds_connection *conn)
+{
+	atomic_set(&conn->c_state, RDS_CONN_ERROR);
+	queue_work(rds_wq, &conn->c_down_w);
+}
+
+/*
+ * An error occurred on the connection
+ */
+void
+__rds_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vprintk(fmt, ap);
+	va_end(ap);
+
+	rds_conn_drop(conn);
+}
diff --git a/net/rds/threads.c b/net/rds/threads.c
new file mode 100644
index 0000000..828a1bf
--- /dev/null
+++ b/net/rds/threads.c
@@ -0,0 +1,265 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/random.h>
+
+#include "rds.h"
+
+/*
+ * All of connection management is simplified by serializing it through
+ * work queues that execute in a connection managing thread.
+ *
+ * TCP wants to send acks through sendpage() in response to data_ready(),
+ * but it needs a process context to do so.
+ *
+ * The receive paths need to allocate but can't drop packets (!) so we have
+ * a thread around to block allocating if the receive fast path sees an
+ * allocation failure.
+ */
+
+/* Grand Unified Theory of connection life cycle:
+ * At any point in time, the connection can be in one of these states:
+ * DOWN, CONNECTING, UP, DISCONNECTING, ERROR
+ *
+ * The following transitions are possible:
+ *  ANY		  -> ERROR
+ *  UP		  -> DISCONNECTING
+ *  ERROR	  -> DISCONNECTING
+ *  DISCONNECTING -> DOWN
+ *  DOWN	  -> CONNECTING
+ *  CONNECTING	  -> UP
+ *
+ * Transition to state DISCONNECTING/DOWN:
+ *  -	Inside the shutdown worker; synchronizes with xmit path
+ *	through c_send_lock, and with connection management callbacks
+ *	via c_cm_lock.
+ *
+ *	For receive callbacks, we rely on the underlying transport
+ *	(TCP, IB/RDMA) to provide the necessary synchronisation.
+ */
+struct workqueue_struct *rds_wq;
+
+void rds_connect_complete(struct rds_connection *conn)
+{
+	if (!rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_UP)) {
+		printk(KERN_WARNING "%s: Cannot transition to state UP, "
+				"current state is %d\n",
+				__func__,
+				atomic_read(&conn->c_state));
+		atomic_set(&conn->c_state, RDS_CONN_ERROR);
+		queue_work(rds_wq, &conn->c_down_w);
+		return;
+	}
+
+	rdsdebug("conn %p for %pI4 to %pI4 complete\n",
+	  conn, &conn->c_laddr, &conn->c_faddr);
+
+	conn->c_reconnect_jiffies = 0;
+	set_bit(0, &conn->c_map_queued);
+	queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+	queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+}
+
+/*
+ * This random exponential backoff is relied on to eventually resolve racing
+ * connects.
+ *
+ * If connect attempts race then both parties drop both connections and come
+ * here to wait for a random amount of time before trying again.  Eventually
+ * the backoff range will be so much greater than the time it takes to
+ * establish a connection that one of the pair will establish the connection
+ * before the other's random delay fires.
+ *
+ * Connection attempts that arrive while a connection is already established
+ * are also considered to be racing connects.  This lets a connection from
+ * a rebooted machine replace an existing stale connection before the transport
+ * notices that the connection has failed.
+ *
+ * We should *always* start with a random backoff; otherwise a broken connection
+ * will always take several iterations to be re-established.
+ */
+static void rds_queue_reconnect(struct rds_connection *conn)
+{
+	unsigned long rand;
+
+	rdsdebug("conn %p for %pI4 to %pI4 reconnect jiffies %lu\n",
+	  conn, &conn->c_laddr, &conn->c_faddr,
+	  conn->c_reconnect_jiffies);
+
+	set_bit(RDS_RECONNECT_PENDING, &conn->c_flags);
+	if (conn->c_reconnect_jiffies == 0) {
+		conn->c_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies;
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+		return;
+	}
+
+	get_random_bytes(&rand, sizeof(rand));
+	rdsdebug("%lu delay %lu ceil conn %p for %pI4 -> %pI4\n",
+		 rand % conn->c_reconnect_jiffies, conn->c_reconnect_jiffies,
+		 conn, &conn->c_laddr, &conn->c_faddr);
+	queue_delayed_work(rds_wq, &conn->c_conn_w,
+			   rand % conn->c_reconnect_jiffies);
+
+	conn->c_reconnect_jiffies = min(conn->c_reconnect_jiffies * 2,
+					rds_sysctl_reconnect_max_jiffies);
+}
+
+void rds_connect_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_conn_w.work);
+	int ret;
+
+	clear_bit(RDS_RECONNECT_PENDING, &conn->c_flags);
+	if (rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
+		ret = conn->c_trans->conn_connect(conn);
+		rdsdebug("conn %p for %pI4 to %pI4 dispatched, ret %d\n",
+			conn, &conn->c_laddr, &conn->c_faddr, ret);
+
+		if (ret) {
+			if (rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_DOWN))
+				rds_queue_reconnect(conn);
+			else
+				rds_conn_error(conn, "RDS: connect failed\n");
+		}
+	}
+}
+
+void rds_shutdown_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_down_w);
+
+	/* shut it down unless it's down already */
+	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_DOWN)) {
+		/*
+		 * Quiesce the connection mgmt handlers before we start tearing
+		 * things down. We don't hold the mutex for the entire
+		 * duration of the shutdown operation, else we may be
+		 * deadlocking with the CM handler. Instead, the CM event
+		 * handler is supposed to check for state DISCONNECTING
+		 */
+		mutex_lock(&conn->c_cm_lock);
+		if (!rds_conn_transition(conn, RDS_CONN_UP, RDS_CONN_DISCONNECTING)
+		 && !rds_conn_transition(conn, RDS_CONN_ERROR, RDS_CONN_DISCONNECTING)) {
+			rds_conn_error(conn, "shutdown called in state %d\n",
+					atomic_read(&conn->c_state));
+			mutex_unlock(&conn->c_cm_lock);
+			return;
+		}
+		mutex_unlock(&conn->c_cm_lock);
+
+		mutex_lock(&conn->c_send_lock);
+		conn->c_trans->conn_shutdown(conn);
+		rds_conn_reset(conn);
+		mutex_unlock(&conn->c_send_lock);
+
+		if (!rds_conn_transition(conn, RDS_CONN_DISCONNECTING, RDS_CONN_DOWN)) {
+			/* This can happen - eg when we're in the middle of tearing
+			 * down the connection, and someone unloads the rds module.
+			 * Quite reproducible with loopback connections.
+			 * Mostly harmless.
+			 */
+			rds_conn_error(conn,
+				"%s: failed to transition to state DOWN, "
+				"current state is %d\n",
+				__func__,
+				atomic_read(&conn->c_state));
+			return;
+		}
+	}
+
+	/* Then reconnect if it's still live.
+	 * The passive side of an IB loopback connection is never added
+	 * to the conn hash, so we never trigger a reconnect on this
+	 * conn - the reconnect is always triggered by the active peer. */
+	cancel_delayed_work(&conn->c_conn_w);
+	if (!hlist_unhashed(&conn->c_hash_node))
+		rds_queue_reconnect(conn);
+}
+
+void rds_send_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_send_w.work);
+	int ret;
+
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		ret = rds_send_xmit(conn);
+		rdsdebug("conn %p ret %d\n", conn, ret);
+		switch (ret) {
+		case -EAGAIN:
+			rds_stats_inc(s_send_immediate_retry);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+			break;
+		case -ENOMEM:
+			rds_stats_inc(s_send_delayed_retry);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 2);
+		default:
+			break;
+		}
+	}
+}
+
+void rds_recv_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_recv_w.work);
+	int ret;
+
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		ret = conn->c_trans->recv(conn);
+		rdsdebug("conn %p ret %d\n", conn, ret);
+		switch (ret) {
+		case -EAGAIN:
+			rds_stats_inc(s_recv_immediate_retry);
+			queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+			break;
+		case -ENOMEM:
+			rds_stats_inc(s_recv_delayed_retry);
+			queue_delayed_work(rds_wq, &conn->c_recv_w, 2);
+		default:
+			break;
+		}
+	}
+}
+
+void rds_threads_exit(void)
+{
+	destroy_workqueue(rds_wq);
+}
+
+int __init rds_threads_init(void)
+{
+	rds_wq = create_singlethread_workqueue("krdsd");
+	if (rds_wq == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
-- 
1.5.6.3
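
The random exponential backoff described above rds_queue_reconnect() in
threads.c can be modeled in userspace roughly as follows. This is a
sketch under stated assumptions: struct backoff and its helpers are
hypothetical names, rand() stands in for get_random_bytes(), and the
min/max fields stand in for the rds_sysctl_reconnect_*_jiffies tunables.

```c
#include <stdlib.h>

/* Hypothetical model of the per-connection backoff state (units: ticks). */
struct backoff {
	unsigned long cur;  /* current ceiling; 0 means no failure since connect */
	unsigned long min;  /* ceiling installed by the first retry, must be > 0 */
	unsigned long max;  /* upper bound on the ceiling */
};

/* Returns the delay before the next connect attempt and advances the state. */
static unsigned long backoff_next_delay(struct backoff *b)
{
	unsigned long delay;

	if (b->cur == 0) {
		/* First retry after a completed connect: no delay. */
		b->cur = b->min;
		return 0;
	}

	/* Random delay in [0, cur). */
	delay = (unsigned long)rand() % b->cur;

	/* Double the ceiling for next time, clamped to max. */
	b->cur = (b->cur * 2 < b->max) ? b->cur * 2 : b->max;
	return delay;
}

/* Called when a connection completes, like rds_connect_complete(). */
static void backoff_reset(struct backoff *b)
{
	b->cur = 0;
}
```

Because each side draws its delay independently from a range that keeps
doubling, two peers whose connects collide become increasingly unlikely
to collide again on later attempts.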


From andy.grover at oracle.com  Tue Feb 24 17:30:24 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:24 -0800
Subject: [ofa-general] [PATCH 07/26] RDS: loopback
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-8-git-send-email-andy.grover@oracle.com>

A simple rds transport to handle loopback connections.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
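rds_loop_exit() in this patch detaches the whole loop_conns list while
holding the lock and only then destroys the entries, so that
rds_conn_destroy() is never called with interrupts off. A minimal
userspace sketch of that "splice under the lock, walk unlocked" pattern
follows. All names are hypothetical, a singly linked list replaces
list_head, and a C11 atomic_flag spinlock stands in for spin_lock_irq().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rds_loop_connection. */
struct node {
	struct node *next;
	int id;
};

static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct node *shared_head;	/* protected by list_lock */

static void lock_list(void)
{
	while (atomic_flag_test_and_set(&list_lock))
		;	/* spin until the flag was previously clear */
}

static void unlock_list(void)
{
	atomic_flag_clear(&list_lock);
}

static void shared_add(struct node *n)
{
	lock_list();
	n->next = shared_head;
	shared_head = n;
	unlock_list();
}

/* Example destructor: just marks the node as torn down. */
static void mark_destroyed(struct node *n)
{
	n->id = -1;
}

/*
 * Detach the whole list under the lock, then visit the entries with the
 * lock dropped. Returns the number of entries visited.
 */
static int drain_all(void (*destroy)(struct node *))
{
	struct node *private_head, *next;
	int visited = 0;

	lock_list();
	private_head = shared_head;
	shared_head = NULL;
	unlock_list();

	while (private_head) {
		next = private_head->next;	/* read before destroy() runs */
		destroy(private_head);
		private_head = next;
		visited++;
	}
	return visited;
}
```

The destructor may sleep or take other locks, since the list lock is
held only for the few instructions needed to detach the list.
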
 net/rds/loop.c |  188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/loop.h |    9 +++
 2 files changed, 197 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/loop.c
 create mode 100644 net/rds/loop.h

diff --git a/net/rds/loop.c b/net/rds/loop.c
new file mode 100644
index 0000000..4a61997
--- /dev/null
+++ b/net/rds/loop.c
@@ -0,0 +1,188 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "loop.h"
+
+static DEFINE_SPINLOCK(loop_conns_lock);
+static LIST_HEAD(loop_conns);
+
+/*
+ * This 'loopback' transport is a special case for flows that originate
+ * and terminate on the same machine.
+ *
+ * Connection build-up notices if the destination address is thought of
+ * as a local address by a transport.  At that time it decides to use the
+ * loopback transport instead of the bound transport of the sending socket.
+ *
+ * The loopback transport's sending path just hands the sent rds_message
+ * straight to the receiving path via an embedded rds_incoming.
+ */
+
+/*
+ * Usually a message transits both the sender and receiver's conns as it
+ * flows to the receiver.  In the loopback case, though, the receive path
+ * is handed the sending conn so the sense of the addresses is reversed.
+ */
+static int rds_loop_xmit(struct rds_connection *conn, struct rds_message *rm,
+			 unsigned int hdr_off, unsigned int sg,
+			 unsigned int off)
+{
+	BUG_ON(hdr_off || sg || off);
+
+	rds_inc_init(&rm->m_inc, conn, conn->c_laddr);
+	rds_message_addref(rm); /* for the inc */
+
+	rds_recv_incoming(conn, conn->c_laddr, conn->c_faddr, &rm->m_inc,
+			  GFP_KERNEL, KM_USER0);
+
+	rds_send_drop_acked(conn, be64_to_cpu(rm->m_inc.i_hdr.h_sequence),
+			    NULL);
+
+	rds_inc_put(&rm->m_inc);
+
+	return sizeof(struct rds_header) + be32_to_cpu(rm->m_inc.i_hdr.h_len);
+}
+
+static int rds_loop_xmit_cong_map(struct rds_connection *conn,
+				  struct rds_cong_map *map,
+				  unsigned long offset)
+{
+	unsigned long i;
+
+	BUG_ON(offset);
+	BUG_ON(map != conn->c_lcong);
+
+	for (i = 0; i < RDS_CONG_MAP_PAGES; i++) {
+		memcpy((void *)conn->c_fcong->m_page_addrs[i],
+		       (void *)map->m_page_addrs[i], PAGE_SIZE);
+	}
+
+	rds_cong_map_updated(conn->c_fcong, ~(u64) 0);
+
+	return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES;
+}
+
+/* we need to at least give the thread something to succeed */
+static int rds_loop_recv(struct rds_connection *conn)
+{
+	return 0;
+}
+
+struct rds_loop_connection {
+	struct list_head loop_node;
+	struct rds_connection *conn;
+};
+
+/*
+ * Even the loopback transport needs to keep track of its connections,
+ * so it can call rds_conn_destroy() on them on exit. N.B. there are
+ * 1+ loopback addresses (127.*.*.*) so it's not a bug to have
+ * multiple loopback conns allocated, although rather useless.
+ */
+static int rds_loop_conn_alloc(struct rds_connection *conn, gfp_t gfp)
+{
+	struct rds_loop_connection *lc;
+	unsigned long flags;
+
+	lc = kzalloc(sizeof(struct rds_loop_connection), gfp);
+	if (lc == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&lc->loop_node);
+	lc->conn = conn;
+	conn->c_transport_data = lc;
+
+	spin_lock_irqsave(&loop_conns_lock, flags);
+	list_add_tail(&lc->loop_node, &loop_conns);
+	spin_unlock_irqrestore(&loop_conns_lock, flags);
+
+	return 0;
+}
+
+static void rds_loop_conn_free(void *arg)
+{
+	struct rds_loop_connection *lc = arg;
+	rdsdebug("lc %p\n", lc);
+	list_del(&lc->loop_node);
+	kfree(lc);
+}
+
+static int rds_loop_conn_connect(struct rds_connection *conn)
+{
+	rds_connect_complete(conn);
+	return 0;
+}
+
+static void rds_loop_conn_shutdown(struct rds_connection *conn)
+{
+}
+
+void rds_loop_exit(void)
+{
+	struct rds_loop_connection *lc, *_lc;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&loop_conns_lock);
+	list_splice(&loop_conns, &tmp_list);
+	INIT_LIST_HEAD(&loop_conns);
+	spin_unlock_irq(&loop_conns_lock);
+
+	list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) {
+		WARN_ON(lc->conn->c_passive);
+		rds_conn_destroy(lc->conn);
+	}
+}
+
+/*
+ * .xmit and .xmit_cong_map hand messages and congestion maps straight to
+ * the receive side of the same conn, so .recv has nothing left to do.
+ * .listen_stop and .laddr_check are missing because transport.c doesn't
+ * iterate over rds_loop_transport.
+ */
+struct rds_transport rds_loop_transport = {
+	.xmit			= rds_loop_xmit,
+	.xmit_cong_map		= rds_loop_xmit_cong_map,
+	.recv			= rds_loop_recv,
+	.conn_alloc		= rds_loop_conn_alloc,
+	.conn_free		= rds_loop_conn_free,
+	.conn_connect		= rds_loop_conn_connect,
+	.conn_shutdown		= rds_loop_conn_shutdown,
+	.inc_copy_to_user	= rds_message_inc_copy_to_user,
+	.inc_purge		= rds_message_inc_purge,
+	.inc_free		= rds_message_inc_free,
+	.t_name			= "loopback",
+};
diff --git a/net/rds/loop.h b/net/rds/loop.h
new file mode 100644
index 0000000..f32b093
--- /dev/null
+++ b/net/rds/loop.h
@@ -0,0 +1,9 @@
+#ifndef _RDS_LOOP_H
+#define _RDS_LOOP_H
+
+/* loop.c */
+extern struct rds_transport rds_loop_transport;
+
+void rds_loop_exit(void);
+
+#endif
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:25 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:25 -0800
Subject: [ofa-general] [PATCH 08/26] RDS: sysctls
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-9-git-send-email-andy.grover@oracle.com>

RDS exposes a few tunable parameters via sysctls.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/sysctl.c |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 122 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/sysctl.c

diff --git a/net/rds/sysctl.c b/net/rds/sysctl.c
new file mode 100644
index 0000000..307dc5c
--- /dev/null
+++ b/net/rds/sysctl.c
@@ -0,0 +1,122 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/sysctl.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+static struct ctl_table_header *rds_sysctl_reg_table;
+
+static unsigned long rds_sysctl_reconnect_min = 1;
+static unsigned long rds_sysctl_reconnect_max = ~0UL;
+
+unsigned long rds_sysctl_reconnect_min_jiffies;
+unsigned long rds_sysctl_reconnect_max_jiffies = HZ;
+
+unsigned int  rds_sysctl_max_unacked_packets = 8;
+unsigned int  rds_sysctl_max_unacked_bytes = (16 << 20);
+
+unsigned int rds_sysctl_ping_enable = 1;
+
+static ctl_table rds_sysctl_rds_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "reconnect_min_delay_ms",
+		.data		= &rds_sysctl_reconnect_min_jiffies,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &rds_sysctl_reconnect_min,
+		.extra2		= &rds_sysctl_reconnect_max_jiffies,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "reconnect_max_delay_ms",
+		.data		= &rds_sysctl_reconnect_max_jiffies,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &rds_sysctl_reconnect_min_jiffies,
+		.extra2		= &rds_sysctl_reconnect_max,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_unacked_packets",
+		.data		= &rds_sysctl_max_unacked_packets,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_unacked_bytes",
+		.data		= &rds_sysctl_max_unacked_bytes,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ping_enable",
+		.data		= &rds_sysctl_ping_enable,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{ .ctl_name = 0}
+};
+
+static struct ctl_path rds_sysctl_path[] = {
+	{ .procname = "net", .ctl_name = CTL_NET, },
+	{ .procname = "rds", .ctl_name = CTL_UNNUMBERED, },
+	{ }
+};
+
+
+void rds_sysctl_exit(void)
+{
+	if (rds_sysctl_reg_table)
+		unregister_sysctl_table(rds_sysctl_reg_table);
+}
+
+int __init rds_sysctl_init(void)
+{
+	rds_sysctl_reconnect_min = msecs_to_jiffies(1);
+	rds_sysctl_reconnect_min_jiffies = rds_sysctl_reconnect_min;
+
+	rds_sysctl_reg_table = register_sysctl_paths(rds_sysctl_path, rds_sysctl_rds_table);
+	if (rds_sysctl_reg_table == NULL)
+		return -ENOMEM;
+	return 0;
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:26 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:26 -0800
Subject: [ofa-general] [PATCH 09/26] RDS: Message parsing
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-10-git-send-email-andy.grover@oracle.com>

Parsing of newly-received RDS message headers (including extension
headers) and copy-to/from-user routines.

page.c implements a per-cpu page remainder cache, to reduce the
number of allocations needed for small datagrams.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/message.c |  402 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/page.c    |  221 +++++++++++++++++++++++++++++
 2 files changed, 623 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/message.c
 create mode 100644 net/rds/page.c

diff --git a/net/rds/message.c b/net/rds/message.c
new file mode 100644
index 0000000..5a15dc8
--- /dev/null
+++ b/net/rds/message.c
@@ -0,0 +1,402 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+static DECLARE_WAIT_QUEUE_HEAD(rds_message_flush_waitq);
+
+static unsigned int	rds_exthdr_size[__RDS_EXTHDR_MAX] = {
+[RDS_EXTHDR_NONE]	= 0,
+[RDS_EXTHDR_VERSION]	= sizeof(struct rds_ext_header_version),
+[RDS_EXTHDR_RDMA]	= sizeof(struct rds_ext_header_rdma),
+[RDS_EXTHDR_RDMA_DEST]	= sizeof(struct rds_ext_header_rdma_dest),
+};
+
+
+void rds_message_addref(struct rds_message *rm)
+{
+	rdsdebug("addref rm %p ref %d\n", rm, atomic_read(&rm->m_refcount));
+	atomic_inc(&rm->m_refcount);
+}
+
+/*
+ * This relies on dma_map_sg() not touching sg[].page during merging.
+ */
+static void rds_message_purge(struct rds_message *rm)
+{
+	unsigned long i;
+
+	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
+		return;
+
+	for (i = 0; i < rm->m_nents; i++) {
+		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->m_sg[i]));
+		/* XXX will have to put_page for page refs */
+		__free_page(sg_page(&rm->m_sg[i]));
+	}
+	rm->m_nents = 0;
+
+	if (rm->m_rdma_op)
+		rds_rdma_free_op(rm->m_rdma_op);
+	if (rm->m_rdma_mr)
+		rds_mr_put(rm->m_rdma_mr);
+}
+
+void rds_message_inc_purge(struct rds_incoming *inc)
+{
+	struct rds_message *rm = container_of(inc, struct rds_message, m_inc);
+	rds_message_purge(rm);
+}
+
+void rds_message_put(struct rds_message *rm)
+{
+	rdsdebug("put rm %p ref %d\n", rm, atomic_read(&rm->m_refcount));
+
+	if (atomic_dec_and_test(&rm->m_refcount)) {
+		BUG_ON(!list_empty(&rm->m_sock_item));
+		BUG_ON(!list_empty(&rm->m_conn_item));
+		rds_message_purge(rm);
+
+		kfree(rm);
+	}
+}
+
+void rds_message_inc_free(struct rds_incoming *inc)
+{
+	struct rds_message *rm = container_of(inc, struct rds_message, m_inc);
+	rds_message_put(rm);
+}
+
+void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
+				 __be16 dport, u64 seq)
+{
+	hdr->h_flags = 0;
+	hdr->h_sport = sport;
+	hdr->h_dport = dport;
+	hdr->h_sequence = cpu_to_be64(seq);
+	hdr->h_exthdr[0] = RDS_EXTHDR_NONE;
+}
+
+int rds_message_add_extension(struct rds_header *hdr,
+		unsigned int type, const void *data, unsigned int len)
+{
+	unsigned int ext_len = sizeof(u8) + len;
+	unsigned char *dst;
+
+	/* For now, refuse to add more than one extension header */
+	if (hdr->h_exthdr[0] != RDS_EXTHDR_NONE)
+		return 0;
+
+	if (type >= __RDS_EXTHDR_MAX
+	 || len != rds_exthdr_size[type])
+		return 0;
+
+	if (ext_len >= RDS_HEADER_EXT_SPACE)
+		return 0;
+	dst = hdr->h_exthdr;
+
+	*dst++ = type;
+	memcpy(dst, data, len);
+
+	dst[len] = RDS_EXTHDR_NONE;
+	return 1;
+}
+
+/*
+ * If a message has extension headers, retrieve them here.
+ * Call like this:
+ *
+ * unsigned int pos = 0;
+ *
+ * while (1) {
+ *	buflen = sizeof(buffer);
+ *	type = rds_message_next_extension(hdr, &pos, buffer, &buflen);
+ *	if (type == RDS_EXTHDR_NONE)
+ *		break;
+ *	...
+ * }
+ */
+int rds_message_next_extension(struct rds_header *hdr,
+		unsigned int *pos, void *buf, unsigned int *buflen)
+{
+	unsigned int offset, ext_type, ext_len;
+	u8 *src = hdr->h_exthdr;
+
+	offset = *pos;
+	if (offset >= RDS_HEADER_EXT_SPACE)
+		goto none;
+
+	/* Get the extension type and length. For now, the
+	 * length is implied by the extension type. */
+	ext_type = src[offset++];
+
+	if (ext_type == RDS_EXTHDR_NONE || ext_type >= __RDS_EXTHDR_MAX)
+		goto none;
+	ext_len = rds_exthdr_size[ext_type];
+	if (offset + ext_len > RDS_HEADER_EXT_SPACE)
+		goto none;
+
+	*pos = offset + ext_len;
+	if (ext_len < *buflen)
+		*buflen = ext_len;
+	memcpy(buf, src + offset, *buflen);
+	return ext_type;
+
+none:
+	*pos = RDS_HEADER_EXT_SPACE;
+	*buflen = 0;
+	return RDS_EXTHDR_NONE;
+}
+
+int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version)
+{
+	struct rds_ext_header_version ext_hdr;
+
+	ext_hdr.h_version = cpu_to_be32(version);
+	return rds_message_add_extension(hdr, RDS_EXTHDR_VERSION, &ext_hdr, sizeof(ext_hdr));
+}
+
+int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version)
+{
+	struct rds_ext_header_version ext_hdr;
+	unsigned int pos = 0, len = sizeof(ext_hdr);
+
+	/* We assume the version extension is the only one present */
+	if (rds_message_next_extension(hdr, &pos, &ext_hdr, &len) != RDS_EXTHDR_VERSION)
+		return 0;
+	*version = be32_to_cpu(ext_hdr.h_version);
+	return 1;
+}
+
+int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset)
+{
+	struct rds_ext_header_rdma_dest ext_hdr;
+
+	ext_hdr.h_rdma_rkey = cpu_to_be32(r_key);
+	ext_hdr.h_rdma_offset = cpu_to_be32(offset);
+	return rds_message_add_extension(hdr, RDS_EXTHDR_RDMA_DEST, &ext_hdr, sizeof(ext_hdr));
+}
+
+struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp)
+{
+	struct rds_message *rm;
+
+	rm = kzalloc(sizeof(struct rds_message) +
+		     (nents * sizeof(struct scatterlist)), gfp);
+	if (!rm)
+		goto out;
+
+	if (nents)
+		sg_init_table(rm->m_sg, nents);
+	atomic_set(&rm->m_refcount, 1);
+	INIT_LIST_HEAD(&rm->m_sock_item);
+	INIT_LIST_HEAD(&rm->m_conn_item);
+	spin_lock_init(&rm->m_rs_lock);
+
+out:
+	return rm;
+}
+
+struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len)
+{
+	struct rds_message *rm;
+	unsigned int i;
+
+	rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL);
+	if (rm == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	set_bit(RDS_MSG_PAGEVEC, &rm->m_flags);
+	rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len);
+	rm->m_nents = ceil(total_len, PAGE_SIZE);
+
+	for (i = 0; i < rm->m_nents; ++i) {
+		sg_set_page(&rm->m_sg[i],
+				virt_to_page(page_addrs[i]),
+				PAGE_SIZE, 0);
+	}
+
+	return rm;
+}
+
+struct rds_message *rds_message_copy_from_user(struct iovec *first_iov,
+					       size_t total_len)
+{
+	unsigned long to_copy;
+	unsigned long iov_off;
+	unsigned long sg_off;
+	struct rds_message *rm;
+	struct iovec *iov;
+	struct scatterlist *sg;
+	int ret;
+
+	rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL);
+	if (rm == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len);
+
+	/*
+	 * now allocate and copy in the data payload.
+	 */
+	sg = rm->m_sg;
+	iov = first_iov;
+	iov_off = 0;
+	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
+
+	while (total_len) {
+		if (sg_page(sg) == NULL) {
+			ret = rds_page_remainder_alloc(sg, total_len,
+						       GFP_HIGHUSER);
+			if (ret)
+				goto out;
+			rm->m_nents++;
+			sg_off = 0;
+		}
+
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, sg->length - sg_off);
+		to_copy = min_t(size_t, to_copy, total_len);
+
+		rdsdebug("copying %lu bytes from user iov [%p, %zu] + %lu to "
+			 "sg [%p, %u, %u] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 (void *)sg_page(sg), sg->offset, sg->length, sg_off);
+
+		ret = rds_page_copy_from_user(sg_page(sg), sg->offset + sg_off,
+					      iov->iov_base + iov_off,
+					      to_copy);
+		if (ret)
+			goto out;
+
+		iov_off += to_copy;
+		total_len -= to_copy;
+		sg_off += to_copy;
+
+		if (sg_off == sg->length)
+			sg++;
+	}
+
+	ret = 0;
+out:
+	if (ret) {
+		if (rm)
+			rds_message_put(rm);
+		rm = ERR_PTR(ret);
+	}
+	return rm;
+}
+
+int rds_message_inc_copy_to_user(struct rds_incoming *inc,
+				 struct iovec *first_iov, size_t size)
+{
+	struct rds_message *rm;
+	struct iovec *iov;
+	struct scatterlist *sg;
+	unsigned long to_copy;
+	unsigned long iov_off;
+	unsigned long vec_off;
+	int copied;
+	int ret;
+	u32 len;
+
+	rm = container_of(inc, struct rds_message, m_inc);
+	len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	iov = first_iov;
+	iov_off = 0;
+	sg = rm->m_sg;
+	vec_off = 0;
+	copied = 0;
+
+	while (copied < size && copied < len) {
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, sg->length - vec_off);
+		to_copy = min_t(size_t, to_copy, size - copied);
+		to_copy = min_t(unsigned long, to_copy, len - copied);
+
+		rdsdebug("copying %lu bytes to user iov [%p, %zu] + %lu from "
+			 "sg [%p, %u, %u] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 sg_page(sg), sg->offset, sg->length, vec_off);
+
+		ret = rds_page_copy_to_user(sg_page(sg), sg->offset + vec_off,
+					    iov->iov_base + iov_off,
+					    to_copy);
+		if (ret) {
+			copied = ret;
+			break;
+		}
+
+		iov_off += to_copy;
+		vec_off += to_copy;
+		copied += to_copy;
+
+		if (vec_off == sg->length) {
+			vec_off = 0;
+			sg++;
+		}
+	}
+
+	return copied;
+}
+
+/*
+ * If the message is still on the send queue, wait until the transport
+ * is done with it. This is particularly important for RDMA operations.
+ */
+void rds_message_wait(struct rds_message *rm)
+{
+	wait_event(rds_message_flush_waitq,
+			!test_bit(RDS_MSG_MAPPED, &rm->m_flags));
+}
+
+void rds_message_unmapped(struct rds_message *rm)
+{
+	clear_bit(RDS_MSG_MAPPED, &rm->m_flags);
+	if (waitqueue_active(&rds_message_flush_waitq))
+		wake_up(&rds_message_flush_waitq);
+}
+
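
Reader's note (not part of the patch): page.c, which follows, hands out
sub-page fragments from a cached partially used page so that small
datagrams do not each consume a full page. The userspace sketch below
models only the bookkeeping, under loud assumptions: a single cache slot
instead of one per CPU, no locking or interrupt handling, malloc() in
place of alloc_page(), and hypothetical names throughout (model_page,
fragment, remainder_alloc).

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL

struct model_page { int refs; };
struct fragment { struct model_page *page; unsigned long offset, length; };

/* One cached partially used page; the kernel code keeps one per CPU. */
static struct { struct model_page *page; unsigned long offset; } rem;

static struct model_page *page_new(void)
{
	struct model_page *p = malloc(sizeof(*p));

	if (p)
		p->refs = 1;
	return p;
}

static void page_put(struct model_page *p)
{
	assert(p->refs > 0);
	p->refs -= 1;
	if (p->refs == 0)
		free(p);
}

/* Returns 0 on success, -1 if no memory is available. */
static int remainder_alloc(struct fragment *f, unsigned long bytes)
{
	if (bytes >= MODEL_PAGE_SIZE) {
		/* Full-page request: bypass the cache. */
		f->page = page_new();
		f->offset = 0;
		f->length = MODEL_PAGE_SIZE;
		return f->page ? 0 : -1;
	}
	/* Toss a cached page whose tail is too small for this request. */
	if (rem.page && bytes > MODEL_PAGE_SIZE - rem.offset) {
		page_put(rem.page);
		rem.page = NULL;
	}
	if (!rem.page) {
		rem.page = page_new();
		if (!rem.page)
			return -1;
		rem.offset = 0;
	}
	/* Hand out the next fragment; the caller gets its own reference. */
	f->page = rem.page;
	f->offset = rem.offset;
	f->length = bytes;
	rem.page->refs += 1;
	rem.offset += bytes;
	if (rem.offset == MODEL_PAGE_SIZE) {
		page_put(rem.page);	/* drop the cache's reference */
		rem.page = NULL;
	}
	return 0;
}
```

Two requests of 1000 and 3096 bytes share one page in this model, and the
cache lets go of the page once its last byte has been handed out.
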
diff --git a/net/rds/page.c b/net/rds/page.c
new file mode 100644
index 0000000..c460743
--- /dev/null
+++ b/net/rds/page.c
@@ -0,0 +1,221 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/highmem.h>
+
+#include "rds.h"
+
+struct rds_page_remainder {
+	struct page	*r_page;
+	unsigned long	r_offset;
+};
+
+DEFINE_PER_CPU(struct rds_page_remainder, rds_page_remainders) ____cacheline_aligned;
+
+/*
+ * returns 0 on success or -errno on failure.
+ *
+ * We don't have to worry about flush_dcache_page() as this only works
+ * with private pages.  If, say, we were to do directed receive to pinned
+ * user pages we'd have to worry more about cache coherence.  (Though
+ * the flush_dcache_page() in get_user_pages() would probably be enough).
+ */
+int rds_page_copy_user(struct page *page, unsigned long offset,
+		       void __user *ptr, unsigned long bytes,
+		       int to_user)
+{
+	unsigned long ret;
+	void *addr;
+
+	if (to_user)
+		rds_stats_add(s_copy_to_user, bytes);
+	else
+		rds_stats_add(s_copy_from_user, bytes);
+
+	addr = kmap_atomic(page, KM_USER0);
+	if (to_user)
+		ret = __copy_to_user_inatomic(ptr, addr + offset, bytes);
+	else
+		ret = __copy_from_user_inatomic(addr + offset, ptr, bytes);
+	kunmap_atomic(addr, KM_USER0);
+
+	if (ret) {
+		addr = kmap(page);
+		if (to_user)
+			ret = copy_to_user(ptr, addr + offset, bytes);
+		else
+			ret = copy_from_user(addr + offset, ptr, bytes);
+		kunmap(page);
+		if (ret)
+			return -EFAULT;
+	}
+
+	return 0;
+}
+
+/*
+ * Message allocation uses this to build up regions of a message.
+ *
+ * @bytes - the number of bytes needed.
+ * @gfp - the waiting behaviour of the allocation
+ *
+ * @gfp is always ored with __GFP_HIGHMEM.  Callers must be prepared to
+ * kmap the pages, etc.
+ *
+ * If @bytes is at least a full page then this just returns a page from
+ * alloc_page().
+ *
+ * If @bytes is a partial page then this stores the unused region of the
+ * page in a per-cpu structure.  Future partial-page allocations may be
+ * satisfied from that cached region.  This lets us waste less memory on
+ * small allocations with minimal complexity.  It works because the transmit
+ * path passes read-only page regions down to devices.  They hold a page
+ * reference until they are done with the region.
+ */
+int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
+			     gfp_t gfp)
+{
+	struct rds_page_remainder *rem;
+	unsigned long flags;
+	struct page *page;
+	int ret;
+
+	gfp |= __GFP_HIGHMEM;
+
+	/* jump straight to allocation if we need a full page or more */
+	if (bytes >= PAGE_SIZE) {
+		page = alloc_page(gfp);
+		if (page == NULL) {
+			ret = -ENOMEM;
+		} else {
+			sg_set_page(scat, page, PAGE_SIZE, 0);
+			ret = 0;
+		}
+		goto out;
+	}
+
+	rem = &per_cpu(rds_page_remainders, get_cpu());
+	local_irq_save(flags);
+
+	while (1) {
+		/* avoid a tiny region getting stuck by tossing it */
+		if (rem->r_page && bytes > (PAGE_SIZE - rem->r_offset)) {
+			rds_stats_inc(s_page_remainder_miss);
+			__free_page(rem->r_page);
+			rem->r_page = NULL;
+		}
+
+		/* hand out a fragment from the cached page */
+		if (rem->r_page && bytes <= (PAGE_SIZE - rem->r_offset)) {
+			sg_set_page(scat, rem->r_page, bytes, rem->r_offset);
+			get_page(sg_page(scat));
+
+			if (rem->r_offset != 0)
+				rds_stats_inc(s_page_remainder_hit);
+
+			rem->r_offset += bytes;
+			if (rem->r_offset == PAGE_SIZE) {
+				__free_page(rem->r_page);
+				rem->r_page = NULL;
+			}
+			ret = 0;
+			break;
+		}
+
+		/* alloc if there is nothing for us to use */
+		local_irq_restore(flags);
+		put_cpu();
+
+		page = alloc_page(gfp);
+
+		rem = &per_cpu(rds_page_remainders, get_cpu());
+		local_irq_save(flags);
+
+		if (page == NULL) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		/* did someone race to fill the remainder before us? */
+		if (rem->r_page) {
+			__free_page(page);
+			continue;
+		}
+
+		/* otherwise install our page and loop around to hand out a fragment */
+		rem->r_page = page;
+		rem->r_offset = 0;
+	}
+
+	local_irq_restore(flags);
+	put_cpu();
+out:
+	rdsdebug("bytes %lu ret %d %p %u %u\n", bytes, ret,
+		 ret ? NULL : sg_page(scat), ret ? 0 : scat->offset,
+		 ret ? 0 : scat->length);
+	return ret;
+}
+
+static int rds_page_remainder_cpu_notify(struct notifier_block *self,
+					 unsigned long action, void *hcpu)
+{
+	struct rds_page_remainder *rem;
+	long cpu = (long)hcpu;
+
+	rem = &per_cpu(rds_page_remainders, cpu);
+
+	rdsdebug("cpu %ld action 0x%lx\n", cpu, action);
+
+	switch (action) {
+	case CPU_DEAD:
+		if (rem->r_page)
+			__free_page(rem->r_page);
+		rem->r_page = NULL;
+		break;
+	}
+
+	return 0;
+}
+
+static struct notifier_block rds_page_remainder_nb = {
+	.notifier_call = rds_page_remainder_cpu_notify,
+};
+
+void rds_page_exit(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		rds_page_remainder_cpu_notify(&rds_page_remainder_nb,
+					      (unsigned long)CPU_DEAD,
+					      (void *)(long)i);
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:27 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:27 -0800
Subject: [ofa-general] [PATCH 10/26] RDS: send.c
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-11-git-send-email-andy.grover@oracle.com>

This is the code to send an RDS datagram.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/send.c | 1003 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1003 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/send.c
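
Reader's note (not part of the patch): rds_send_xmit() tolerates partial
sends. After each transport call it advances a header offset first, then
a scatterlist index and an offset within the current segment. The sketch
below reproduces only that accounting in userspace. HDR_SIZE, xmit_state,
xmit_advance and xmit_done are hypothetical names, and the 48-byte header
size is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define HDR_SIZE 48U	/* assumed header size, illustration only */

struct xmit_state {
	unsigned int hdr_off;	/* header bytes already sent */
	unsigned int sg;	/* index of the current data segment */
	unsigned int data_off;	/* bytes already sent from that segment */
};

/* Account for `sent` bytes reported by the transport: header bytes are
 * consumed first, then data segments in order. */
static void xmit_advance(struct xmit_state *st, const unsigned int *seg_len,
			 unsigned int nsegs, unsigned int sent)
{
	unsigned int step;

	if (st->hdr_off < HDR_SIZE) {
		step = HDR_SIZE - st->hdr_off;
		if (step > sent)
			step = sent;
		st->hdr_off += step;
		sent -= step;
	}
	while (sent) {
		assert(st->sg < nsegs);	/* transport must not over-report */
		step = seg_len[st->sg] - st->data_off;
		if (step > sent)
			step = sent;
		st->data_off += step;
		sent -= step;
		if (st->data_off == seg_len[st->sg]) {
			st->data_off = 0;
			st->sg++;
		}
	}
}

/* A message is finished once the header and every segment went out. */
static int xmit_done(const struct xmit_state *st, unsigned int nsegs)
{
	return st->hdr_off == HDR_SIZE && st->sg == nsegs;
}
```

For a message with segments of 100 and 50 bytes, three transport calls
reporting 30, 58 and 110 bytes leave the state at "done" in this model.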

diff --git a/net/rds/send.c b/net/rds/send.c
new file mode 100644
index 0000000..1b37364
--- /dev/null
+++ b/net/rds/send.c
@@ -0,0 +1,1003 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+#include <linux/list.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+/* When transmitting messages in rds_send_xmit, we need to emerge from
+ * time to time and briefly release the CPU. Otherwise the softlockup watchdog
+ * will kick our shin.
+ * Also, it seems fairer to not let one busy connection stall all the
+ * others.
+ *
+ * send_batch_count is the number of times we'll loop in send_xmit. Setting
+ * it to 0 will restore the old behavior (where we looped until we had
+ * drained the queue).
+ */
+static int send_batch_count = 64;
+module_param(send_batch_count, int, 0444);
+MODULE_PARM_DESC(send_batch_count, " batch factor when working the send queue");
+
+/*
+ * Reset the send state. Caller must hold c_send_lock when calling here.
+ */
+void rds_send_reset(struct rds_connection *conn)
+{
+	struct rds_message *rm, *tmp;
+	unsigned long flags;
+
+	if (conn->c_xmit_rm) {
+		/* Tell the user the RDMA op is no longer mapped by the
+		 * transport. This isn't entirely true (it's flushed out
+		 * independently) but as the connection is down, there's
+		 * no ongoing RDMA to/from that memory */
+		rds_message_unmapped(conn->c_xmit_rm);
+		rds_message_put(conn->c_xmit_rm);
+		conn->c_xmit_rm = NULL;
+	}
+	conn->c_xmit_sg = 0;
+	conn->c_xmit_hdr_off = 0;
+	conn->c_xmit_data_off = 0;
+	conn->c_xmit_rdma_sent = 0;
+
+	conn->c_map_queued = 0;
+
+	conn->c_unacked_packets = rds_sysctl_max_unacked_packets;
+	conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes;
+
+	/* Mark messages as retransmissions, and move them to the send q */
+	spin_lock_irqsave(&conn->c_lock, flags);
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+		set_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags);
+	}
+	list_splice_init(&conn->c_retrans, &conn->c_send_queue);
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+}
+
+/*
+ * We're making the conscious trade-off here to only send one message
+ * down the connection at a time.
+ *   Pro:
+ *      - tx queueing is a simple fifo list
+ *      - reassembly is optional and easily done by transports per conn
+ *      - no per flow rx lookup at all, straight to the socket
+ *      - less per-frag memory and wire overhead
+ *   Con:
+ *      - queued acks can be delayed behind large messages
+ *   Depends:
+ *      - small message latency is higher behind queued large messages
+ *      - large message latency isn't starved by intervening small sends
+ */
+int rds_send_xmit(struct rds_connection *conn)
+{
+	struct rds_message *rm;
+	unsigned long flags;
+	unsigned int tmp;
+	unsigned int send_quota = send_batch_count;
+	struct scatterlist *sg;
+	int ret = 0;
+	int was_empty = 0;
+	LIST_HEAD(to_be_dropped);
+
+	/*
+	 * sendmsg calls here after having queued its message on the send
+	 * queue.  We only have one task feeding the connection at a time.  If
+	 * another thread is already feeding the queue then we back off.  This
+	 * avoids blocking the caller and trading per-connection data between
+	 * caches per message.
+	 *
+	 * The lock holder will issue a retry if they notice that someone queued
+	 * a message after they stopped walking the send queue but before they
+	 * dropped the lock.
+	 */
+	if (!mutex_trylock(&conn->c_send_lock)) {
+		rds_stats_inc(s_send_sem_contention);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (conn->c_trans->xmit_prepare)
+		conn->c_trans->xmit_prepare(conn);
+
+	/*
+	 * spin trying to push headers and data down the connection until
+	 * the connection doesn't make forward progress.
+	 */
+	while (--send_quota) {
+		/*
+		 * See if we need to send a congestion map update if we're
+		 * between sending messages.  The c_send_lock protects our sole
+		 * use of c_map_offset and _bytes.
+		 * Note this is used only by transports that define a special
+		 * xmit_cong_map function. For all others, we allocate
+		 * a cong_map message and treat it just like any other send.
+		 */
+		if (conn->c_map_bytes) {
+			ret = conn->c_trans->xmit_cong_map(conn, conn->c_lcong,
+						conn->c_map_offset);
+			if (ret <= 0)
+				break;
+
+			conn->c_map_offset += ret;
+			conn->c_map_bytes -= ret;
+			if (conn->c_map_bytes)
+				continue;
+		}
+
+		/* If we're done sending the current message, clear the
+		 * offset and S/G temporaries.
+		 */
+		rm = conn->c_xmit_rm;
+		if (rm != NULL &&
+		    conn->c_xmit_hdr_off == sizeof(struct rds_header) &&
+		    conn->c_xmit_sg == rm->m_nents) {
+			conn->c_xmit_rm = NULL;
+			conn->c_xmit_sg = 0;
+			conn->c_xmit_hdr_off = 0;
+			conn->c_xmit_data_off = 0;
+			conn->c_xmit_rdma_sent = 0;
+
+			/* Release the reference to the previous message. */
+			rds_message_put(rm);
+			rm = NULL;
+		}
+
+		/* If we're asked to send a cong map update, do so.
+		 */
+		if (rm == NULL && test_and_clear_bit(0, &conn->c_map_queued)) {
+			if (conn->c_trans->xmit_cong_map != NULL) {
+				conn->c_map_offset = 0;
+				conn->c_map_bytes = sizeof(struct rds_header) +
+					RDS_CONG_MAP_BYTES;
+				continue;
+			}
+
+			rm = rds_cong_update_alloc(conn);
+			if (IS_ERR(rm)) {
+				ret = PTR_ERR(rm);
+				break;
+			}
+
+			conn->c_xmit_rm = rm;
+		}
+
+		/*
+		 * Grab the next message from the send queue, if there is one.
+		 *
+		 * c_xmit_rm holds a ref while we're sending this message down
+		 * the connection.  We can use this ref while holding the
+		 * c_send_lock; rds_send_reset() is serialized with it.
+		 */
+		if (rm == NULL) {
+			unsigned int len;
+
+			spin_lock_irqsave(&conn->c_lock, flags);
+
+			if (!list_empty(&conn->c_send_queue)) {
+				rm = list_entry(conn->c_send_queue.next,
+						struct rds_message,
+						m_conn_item);
+				rds_message_addref(rm);
+
+				/*
+				 * Move the message from the send queue to the retransmit
+				 * list right away.
+				 */
+				list_move_tail(&rm->m_conn_item, &conn->c_retrans);
+			}
+
+			spin_unlock_irqrestore(&conn->c_lock, flags);
+
+			if (rm == NULL) {
+				was_empty = 1;
+				break;
+			}
+
+			/* Unfortunately, the way InfiniBand deals with
+			 * RDMA to a bad MR key is by moving the entire
+			 * queue pair to error state. We could possibly
+			 * recover from that, but right now we drop the
+			 * connection.
+			 * Therefore, we never retransmit messages with RDMA ops.
+			 */
+			if (rm->m_rdma_op
+			 && test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) {
+				spin_lock_irqsave(&conn->c_lock, flags);
+				if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags))
+					list_move(&rm->m_conn_item, &to_be_dropped);
+				spin_unlock_irqrestore(&conn->c_lock, flags);
+				rds_message_put(rm);
+				continue;
+			}
+
+			/* Require an ACK every once in a while */
+			len = ntohl(rm->m_inc.i_hdr.h_len);
+			if (conn->c_unacked_packets == 0
+			 || conn->c_unacked_bytes < len) {
+				__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+
+				conn->c_unacked_packets = rds_sysctl_max_unacked_packets;
+				conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes;
+				rds_stats_inc(s_send_ack_required);
+			} else {
+				conn->c_unacked_bytes -= len;
+				conn->c_unacked_packets--;
+			}
+
+			conn->c_xmit_rm = rm;
+		}
+
+		/*
+		 * Try and send an rdma message.  Let's see if we can
+		 * keep this simple and require that the transport either
+		 * send the whole rdma or none of it.
+		 */
+		if (rm->m_rdma_op && !conn->c_xmit_rdma_sent) {
+			ret = conn->c_trans->xmit_rdma(conn, rm->m_rdma_op);
+			if (ret)
+				break;
+			conn->c_xmit_rdma_sent = 1;
+			/* The transport owns the mapped memory for now.
+			 * You can't unmap it while it's on the send queue */
+			set_bit(RDS_MSG_MAPPED, &rm->m_flags);
+		}
+
+		if (conn->c_xmit_hdr_off < sizeof(struct rds_header) ||
+		    conn->c_xmit_sg < rm->m_nents) {
+			ret = conn->c_trans->xmit(conn, rm,
+						  conn->c_xmit_hdr_off,
+						  conn->c_xmit_sg,
+						  conn->c_xmit_data_off);
+			if (ret <= 0)
+				break;
+
+			if (conn->c_xmit_hdr_off < sizeof(struct rds_header)) {
+				tmp = min_t(int, ret,
+					    sizeof(struct rds_header) -
+					    conn->c_xmit_hdr_off);
+				conn->c_xmit_hdr_off += tmp;
+				ret -= tmp;
+			}
+
+			sg = &rm->m_sg[conn->c_xmit_sg];
+			while (ret) {
+				tmp = min_t(int, ret, sg->length -
+						      conn->c_xmit_data_off);
+				conn->c_xmit_data_off += tmp;
+				ret -= tmp;
+				if (conn->c_xmit_data_off == sg->length) {
+					conn->c_xmit_data_off = 0;
+					sg++;
+					conn->c_xmit_sg++;
+					BUG_ON(ret != 0 &&
+					       conn->c_xmit_sg == rm->m_nents);
+				}
+			}
+		}
+	}
+
+	/* Nuke any messages we decided not to retransmit. */
+	if (!list_empty(&to_be_dropped))
+		rds_send_remove_from_sock(&to_be_dropped, RDS_RDMA_DROPPED);
+
+	if (conn->c_trans->xmit_complete)
+		conn->c_trans->xmit_complete(conn);
+
+	/*
+	 * We might be racing with another sender who queued a message but
+	 * backed off on noticing that we held the c_send_lock.  If we check
+	 * for queued messages after dropping the lock then either we'll
+	 * see the queued message or the queuer will get the lock.  If we
+	 * notice the queued message then we trigger an immediate retry.
+	 *
+	 * We need to be careful only to do this when we stopped processing
+	 * the send queue because it was empty.  It's the only way we
+	 * stop processing the loop when the transport hasn't taken
+	 * responsibility for forward progress.
+	 */
+	mutex_unlock(&conn->c_send_lock);
+
+	if (conn->c_map_bytes || (send_quota == 0 && !was_empty)) {
+		/* We exhausted the send quota, but there's work left to
+		 * do. Return and (re-)schedule the send worker.
+		 */
+		ret = -EAGAIN;
+	}
+
+	if (ret == 0 && was_empty) {
+		/* A simple bit test would be way faster than taking the
+		 * spin lock */
+		spin_lock_irqsave(&conn->c_lock, flags);
+		if (!list_empty(&conn->c_send_queue)) {
+			rds_stats_inc(s_send_sem_queue_raced);
+			ret = -EAGAIN;
+		}
+		spin_unlock_irqrestore(&conn->c_lock, flags);
+	}
+out:
+	return ret;
+}
+
+static void rds_send_sndbuf_remove(struct rds_sock *rs, struct rds_message *rm)
+{
+	u32 len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	assert_spin_locked(&rs->rs_lock);
+
+	BUG_ON(rs->rs_snd_bytes < len);
+	rs->rs_snd_bytes -= len;
+
+	if (rs->rs_snd_bytes == 0)
+		rds_stats_inc(s_send_queue_empty);
+}
+
+static inline int rds_send_is_acked(struct rds_message *rm, u64 ack,
+				    is_acked_func is_acked)
+{
+	if (is_acked)
+		return is_acked(rm, ack);
+	return be64_to_cpu(rm->m_inc.i_hdr.h_sequence) <= ack;
+}
+
+/*
+ * Returns true if neither the retransmit queue nor the send queue holds a
+ * message with a sequence number less than the given sequence number.
+ * Only the message at the head of each queue is examined.
+ */
+int rds_send_acked_before(struct rds_connection *conn, u64 seq)
+{
+	struct rds_message *rm, *tmp;
+	int ret = 1;
+
+	spin_lock(&conn->c_lock);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq)
+			ret = 0;
+		break;
+	}
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) {
+		if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq)
+			ret = 0;
+		break;
+	}
+
+	spin_unlock(&conn->c_lock);
+
+	return ret;
+}
+
+/*
+ * This is pretty similar to what happens below in the ACK
+ * handling code - except that we call here as soon as we get
+ * the IB send completion on the RDMA op and the accompanying
+ * message.
+ */
+void rds_rdma_send_complete(struct rds_message *rm, int status)
+{
+	struct rds_sock *rs = NULL;
+	struct rds_rdma_op *ro;
+	struct rds_notifier *notifier;
+
+	spin_lock(&rm->m_rs_lock);
+
+	ro = rm->m_rdma_op;
+	if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags)
+	 && ro && ro->r_notify && ro->r_notifier) {
+		notifier = ro->r_notifier;
+		rs = rm->m_rs;
+		sock_hold(rds_rs_to_sk(rs));
+
+		notifier->n_status = status;
+		spin_lock(&rs->rs_lock);
+		list_add_tail(&notifier->n_list, &rs->rs_notify_queue);
+		spin_unlock(&rs->rs_lock);
+
+		ro->r_notifier = NULL;
+	}
+
+	spin_unlock(&rm->m_rs_lock);
+
+	if (rs) {
+		rds_wake_sk_sleep(rs);
+		sock_put(rds_rs_to_sk(rs));
+	}
+}
+
+/*
+ * This is the same as rds_rdma_send_complete except we
+ * don't do any locking - we have all the ingredients (message,
+ * socket, socket lock) and can just move the notifier.
+ */
+static inline void
+__rds_rdma_send_complete(struct rds_sock *rs, struct rds_message *rm, int status)
+{
+	struct rds_rdma_op *ro;
+
+	ro = rm->m_rdma_op;
+	if (ro && ro->r_notify && ro->r_notifier) {
+		ro->r_notifier->n_status = status;
+		list_add_tail(&ro->r_notifier->n_list, &rs->rs_notify_queue);
+		ro->r_notifier = NULL;
+	}
+
+	/* No need to wake the app - caller does this */
+}
+
+/*
+ * This is called from the IB send completion when we detect
+ * an RDMA operation that failed with a remote access error.
+ * So speed is not an issue here.
+ */
+struct rds_message *rds_send_get_message(struct rds_connection *conn,
+					 struct rds_rdma_op *op)
+{
+	struct rds_message *rm, *tmp, *found = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (rm->m_rdma_op == op) {
+			atomic_inc(&rm->m_refcount);
+			found = rm;
+			goto out;
+		}
+	}
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) {
+		if (rm->m_rdma_op == op) {
+			atomic_inc(&rm->m_refcount);
+			found = rm;
+			break;
+		}
+	}
+
+out:
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	return found;
+}
+
+/*
+ * This removes messages from the socket's list if they're on it.  The list
+ * argument must be private to the caller, we must be able to modify it
+ * without locks.  The messages must have a reference held for their
+ * position on the list.  This function will drop that reference after
+ * removing the messages from the 'messages' list regardless of whether it
+ * found the messages on the socket list or not.
+ */
+void rds_send_remove_from_sock(struct list_head *messages, int status)
+{
+	unsigned long flags = 0; /* silence gcc :P */
+	struct rds_sock *rs = NULL;
+	struct rds_message *rm;
+
+	local_irq_save(flags);
+	while (!list_empty(messages)) {
+		rm = list_entry(messages->next, struct rds_message,
+				m_conn_item);
+		list_del_init(&rm->m_conn_item);
+
+		/*
+		 * If we see this flag cleared then we're *sure* that someone
+		 * else beat us to removing it from the sock.  If we race
+		 * with their flag update we'll get the lock and then really
+		 * see that the flag has been cleared.
+		 *
+		 * The message spinlock makes sure nobody clears rm->m_rs
+		 * while we're messing with it. It does not prevent the
+		 * message from being removed from the socket, though.
+		 */
+		spin_lock(&rm->m_rs_lock);
+		if (!test_bit(RDS_MSG_ON_SOCK, &rm->m_flags))
+			goto unlock_and_drop;
+
+		if (rs != rm->m_rs) {
+			if (rs) {
+				spin_unlock(&rs->rs_lock);
+				rds_wake_sk_sleep(rs);
+				sock_put(rds_rs_to_sk(rs));
+			}
+			rs = rm->m_rs;
+			spin_lock(&rs->rs_lock);
+			sock_hold(rds_rs_to_sk(rs));
+		}
+
+		if (test_and_clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) {
+			struct rds_rdma_op *ro = rm->m_rdma_op;
+			struct rds_notifier *notifier;
+
+			list_del_init(&rm->m_sock_item);
+			rds_send_sndbuf_remove(rs, rm);
+
+			if (ro && ro->r_notifier
+			   && (status || ro->r_notify)) {
+				notifier = ro->r_notifier;
+				list_add_tail(&notifier->n_list,
+						&rs->rs_notify_queue);
+				if (!notifier->n_status)
+					notifier->n_status = status;
+				rm->m_rdma_op->r_notifier = NULL;
+			}
+			rds_message_put(rm);
+			rm->m_rs = NULL;
+		}
+
+unlock_and_drop:
+		spin_unlock(&rm->m_rs_lock);
+		rds_message_put(rm);
+	}
+
+	if (rs) {
+		spin_unlock(&rs->rs_lock);
+		rds_wake_sk_sleep(rs);
+		sock_put(rds_rs_to_sk(rs));
+	}
+	local_irq_restore(flags);
+}
+
+/*
+ * Transports call here when they've determined that the receiver queued
+ * messages up to, and including, the given sequence number.  Messages are
+ * moved to the retrans queue when rds_send_xmit picks them off the send
+ * queue. This means that in the TCP case, the message may not have been
+ * assigned the m_ack_seq yet - but that's fine as long as tcp_is_acked
+ * checks the RDS_MSG_HAS_ACK_SEQ bit.
+ *
+ * XXX It's not clear to me how this is safely serialized with socket
+ * destruction.  Maybe it should bail if it sees SOCK_DEAD.
+ */
+void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
+			 is_acked_func is_acked)
+{
+	struct rds_message *rm, *tmp;
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (!rds_send_is_acked(rm, ack, is_acked))
+			break;
+
+		list_move(&rm->m_conn_item, &list);
+		clear_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+	}
+
+	/* order flag updates with spin locks */
+	if (!list_empty(&list))
+		smp_mb__after_clear_bit();
+
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	/* now remove the messages from the sock list as needed */
+	rds_send_remove_from_sock(&list, RDS_RDMA_SUCCESS);
+}
+
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
+{
+	struct rds_message *rm, *tmp;
+	struct rds_connection *conn;
+	unsigned long flags;
+	LIST_HEAD(list);
+	int wake = 0;
+
+	/* get all the messages we're dropping under the rs lock */
+	spin_lock_irqsave(&rs->rs_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &rs->rs_send_queue, m_sock_item) {
+		if (dest && (dest->sin_addr.s_addr != rm->m_daddr ||
+			     dest->sin_port != rm->m_inc.i_hdr.h_dport))
+			continue;
+
+		wake = 1;
+		list_move(&rm->m_sock_item, &list);
+		rds_send_sndbuf_remove(rs, rm);
+		clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
+
+		/* If this is an RDMA operation, notify the app. */
+		__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED);
+	}
+
+	/* order flag updates with the rs lock */
+	if (wake)
+		smp_mb__after_clear_bit();
+
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	if (wake)
+		rds_wake_sk_sleep(rs);
+
+	conn = NULL;
+
+	/* now remove the messages from the conn list as needed */
+	list_for_each_entry(rm, &list, m_sock_item) {
+		/* We do this here rather than in the loop above, so that
+		 * we don't have to nest m_rs_lock under rs->rs_lock */
+		spin_lock(&rm->m_rs_lock);
+		rm->m_rs = NULL;
+		spin_unlock(&rm->m_rs_lock);
+
+		/*
+		 * If we see this flag cleared then we're *sure* that someone
+		 * else beat us to removing it from the conn.  If we race
+		 * with their flag update we'll get the lock and then really
+		 * see that the flag has been cleared.
+		 */
+		if (!test_bit(RDS_MSG_ON_CONN, &rm->m_flags))
+			continue;
+
+		if (conn != rm->m_inc.i_conn) {
+			if (conn)
+				spin_unlock_irqrestore(&conn->c_lock, flags);
+			conn = rm->m_inc.i_conn;
+			spin_lock_irqsave(&conn->c_lock, flags);
+		}
+
+		if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
+			list_del_init(&rm->m_conn_item);
+			rds_message_put(rm);
+		}
+	}
+
+	if (conn)
+		spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	while (!list_empty(&list)) {
+		rm = list_entry(list.next, struct rds_message, m_sock_item);
+		list_del_init(&rm->m_sock_item);
+
+		rds_message_wait(rm);
+		rds_message_put(rm);
+	}
+}
+
+/*
+ * we only want this to fire once so we use the caller's 'queued'.  It's
+ * possible that another thread can race with us and remove the
+ * message from the flow with RDS_CANCEL_SENT_TO.
+ */
+static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
+			     struct rds_message *rm, __be16 sport,
+			     __be16 dport, int *queued)
+{
+	unsigned long flags;
+	u32 len;
+
+	if (*queued)
+		goto out;
+
+	len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	/* this is the only place which holds both the socket's rs_lock
+	 * and the connection's c_lock */
+	spin_lock_irqsave(&rs->rs_lock, flags);
+
+	/*
+	 * If there is a little space in sndbuf, we don't queue anything,
+	 * and userspace gets -EAGAIN. But poll() indicates there's send
+	 * room. This can lead to bad behavior (spinning) if snd_bytes isn't
+	 * freed up by incoming acks. So we check the *old* value of
+	 * rs_snd_bytes here to allow the last msg to exceed the buffer,
+	 * and poll() now knows no more data can be sent.
+	 */
+	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) {
+		rs->rs_snd_bytes += len;
+
+		/* let recv side know we are close to send space exhaustion.
+		 * This is probably not the optimal way to do it, as this
+		 * means we set the flag on *all* messages as soon as our
+		 * throughput hits a certain threshold.
+		 */
+		if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
+			__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+
+		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
+		set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
+		rds_message_addref(rm);
+		rm->m_rs = rs;
+
+		/* The code ordering is a little weird, but we're
+		   trying to minimize the time we hold c_lock */
+		rds_message_populate_header(&rm->m_inc.i_hdr, sport, dport, 0);
+		rm->m_inc.i_conn = conn;
+		rds_message_addref(rm);
+
+		spin_lock(&conn->c_lock);
+		rm->m_inc.i_hdr.h_sequence = cpu_to_be64(conn->c_next_tx_seq++);
+		list_add_tail(&rm->m_conn_item, &conn->c_send_queue);
+		set_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+		spin_unlock(&conn->c_lock);
+
+		rdsdebug("queued msg %p len %d, rs %p bytes %d seq %llu\n",
+			 rm, len, rs, rs->rs_snd_bytes,
+			 (unsigned long long)be64_to_cpu(rm->m_inc.i_hdr.h_sequence));
+
+		*queued = 1;
+	}
+
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+out:
+	return *queued;
+}
+
+static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
+			 struct msghdr *msg, int *allocated_mr)
+{
+	struct cmsghdr *cmsg;
+	int ret = 0;
+
+	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+		if (!CMSG_OK(msg, cmsg))
+			return -EINVAL;
+
+		if (cmsg->cmsg_level != SOL_RDS)
+			continue;
+
+		/* As a side effect, RDMA_DEST and RDMA_MAP will set
+		 * rm->m_rdma_cookie and rm->m_rdma_mr.
+		 */
+		switch (cmsg->cmsg_type) {
+		case RDS_CMSG_RDMA_ARGS:
+			ret = rds_cmsg_rdma_args(rs, rm, cmsg);
+			break;
+
+		case RDS_CMSG_RDMA_DEST:
+			ret = rds_cmsg_rdma_dest(rs, rm, cmsg);
+			break;
+
+		case RDS_CMSG_RDMA_MAP:
+			ret = rds_cmsg_rdma_map(rs, rm, cmsg);
+			if (!ret)
+				*allocated_mr = 1;
+			break;
+
+		default:
+			return -EINVAL;
+		}
+
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t payload_len)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
+	__be32 daddr;
+	__be16 dport;
+	struct rds_message *rm = NULL;
+	struct rds_connection *conn;
+	int ret = 0;
+	int queued = 0, allocated_mr = 0;
+	int nonblock = msg->msg_flags & MSG_DONTWAIT;
+	long timeo = sock_sndtimeo(sk, nonblock);
+
+	/* Mirror Linux UDP's mirroring of BSD error message compatibility */
+	/* XXX: Perhaps MSG_MORE someday */
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
+		printk(KERN_INFO "msg_flags 0x%08X\n", msg->msg_flags);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (msg->msg_namelen) {
+		/* XXX fail non-unicast destination IPs? */
+		if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) {
+			ret = -EINVAL;
+			goto out;
+		}
+		daddr = usin->sin_addr.s_addr;
+		dport = usin->sin_port;
+	} else {
+		/* We only care about consistency with ->connect() */
+		lock_sock(sk);
+		daddr = rs->rs_conn_addr;
+		dport = rs->rs_conn_port;
+		release_sock(sk);
+	}
+
+	/* racing with another thread binding seems ok here */
+	if (daddr == 0 || rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	rm = rds_message_copy_from_user(msg->msg_iov, payload_len);
+	if (IS_ERR(rm)) {
+		ret = PTR_ERR(rm);
+		rm = NULL;
+		goto out;
+	}
+
+	rm->m_daddr = daddr;
+
+	/* Parse any control messages the user may have included. */
+	ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
+	if (ret)
+		goto out;
+
+	/* rds_conn_create has a spinlock that runs with IRQ off.
+	 * Caching the conn in the socket helps a lot. */
+	if (rs->rs_conn && rs->rs_conn->c_faddr == daddr)
+		conn = rs->rs_conn;
+	else {
+		conn = rds_conn_create_outgoing(rs->rs_bound_addr, daddr,
+					rs->rs_transport,
+					sock->sk->sk_allocation);
+		if (IS_ERR(conn)) {
+			ret = PTR_ERR(conn);
+			goto out;
+		}
+		rs->rs_conn = conn;
+	}
+
+	if ((rm->m_rdma_cookie || rm->m_rdma_op)
+	 && conn->c_trans->xmit_rdma == NULL) {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n",
+				rm->m_rdma_op, conn->c_trans->xmit_rdma);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* If the connection is down, trigger a connect. We may
+	 * have scheduled a delayed reconnect however - in this case
+	 * we should not interfere.
+	 */
+	if (rds_conn_state(conn) == RDS_CONN_DOWN
+	 && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+
+	ret = rds_cong_wait(conn->c_fcong, dport, nonblock, rs);
+	if (ret)
+		goto out;
+
+	while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port,
+				  dport, &queued)) {
+		rds_stats_inc(s_send_queue_full);
+		/* XXX make sure this is reasonable */
+		if (payload_len > rds_sk_sndbuf(rs)) {
+			ret = -EMSGSIZE;
+			goto out;
+		}
+		if (nonblock) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+					rds_send_queue_rm(rs, conn, rm,
+							  rs->rs_bound_port,
+							  dport,
+							  &queued),
+					timeo);
+		rdsdebug("sendmsg woke queued %d timeo %ld\n", queued, timeo);
+		if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT)
+			continue;
+
+		ret = timeo;
+		if (ret == 0)
+			ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	/*
+	 * By now we've committed to the send.  We reuse rds_send_worker()
+	 * to retry sends in the rds thread if the transport asks us to.
+	 */
+	rds_stats_inc(s_send_queued);
+
+	if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+		rds_send_worker(&conn->c_send_w.work);
+
+	rds_message_put(rm);
+	return payload_len;
+
+out:
+	/* If the user included an RDMA_MAP cmsg, we allocated an MR on the fly.
+	 * If the sendmsg goes through, we keep the MR. If it fails with EAGAIN
+	 * or in any other way, we need to destroy the MR again */
+	if (allocated_mr)
+		rds_rdma_unuse(rs, rds_rdma_cookie_key(rm->m_rdma_cookie), 1);
+
+	if (rm)
+		rds_message_put(rm);
+	return ret;
+}
+
+/*
+ * Reply to a ping packet.
+ */
+int
+rds_send_pong(struct rds_connection *conn, __be16 dport)
+{
+	struct rds_message *rm;
+	unsigned long flags;
+	int ret = 0;
+
+	rm = rds_message_alloc(0, GFP_ATOMIC);
+	if (rm == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rm->m_daddr = conn->c_faddr;
+
+	/* If the connection is down, trigger a connect. We may
+	 * have scheduled a delayed reconnect however - in this case
+	 * we should not interfere.
+	 */
+	if (rds_conn_state(conn) == RDS_CONN_DOWN
+	 && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+
+	ret = rds_cong_wait(conn->c_fcong, dport, 1, NULL);
+	if (ret)
+		goto out;
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+	list_add_tail(&rm->m_conn_item, &conn->c_send_queue);
+	set_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+	rds_message_addref(rm);
+	rm->m_inc.i_conn = conn;
+
+	rds_message_populate_header(&rm->m_inc.i_hdr, 0, dport,
+				    conn->c_next_tx_seq);
+	conn->c_next_tx_seq++;
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	rds_stats_inc(s_send_queued);
+	rds_stats_inc(s_send_pong);
+
+	queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+	rds_message_put(rm);
+	return 0;
+
+out:
+	if (rm)
+		rds_message_put(rm);
+	return ret;
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:28 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:28 -0800
Subject: [ofa-general] [PATCH 11/26] RDS: recv.c
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-12-git-send-email-andy.grover@oracle.com>

Upon receiving a datagram from the transport, RDS parses the
headers and potentially queues an ACK.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/recv.c |  542 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 542 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/recv.c
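
Reviewer aid, not part of the patch: a minimal userspace sketch of the
hysteresis that rds_recv_rcvbuf_delta() in this file applies when it decides
whether to report a port as congested. A port becomes congested once the
queued bytes exceed the receive buffer, and is reported uncongested again
only after usage drops below half the buffer. struct rcvbuf and
rcvbuf_delta() are hypothetical names chosen for illustration; the congestion
map updates and locking done by the real function are left out.

```c
/* Userspace sketch only; struct rcvbuf and rcvbuf_delta() are hypothetical
 * names and do not exist in net/rds. */
#include <assert.h>
#include <stddef.h>

struct rcvbuf {
	int bytes;      /* bytes currently queued (rs_rcv_bytes in the patch) */
	int limit;      /* receive buffer size (rds_sk_rcvbuf() in the patch) */
	int congested;  /* last reported state (rs_congested in the patch) */
};

/*
 * Apply a signed change to the queued byte count.
 * Returns +1 if the state just changed to congested, -1 if it just
 * changed to uncongested, and 0 if the reported state did not change.
 */
static int rcvbuf_delta(struct rcvbuf *rb, int delta)
{
	assert(rb != NULL);

	if (delta == 0)
		return 0;

	rb->bytes += delta;

	/* wasn't -> am congested */
	if (!rb->congested && rb->bytes > rb->limit) {
		rb->congested = 1;
		return 1;
	}
	/* was -> aren't congested: require usage below half the limit */
	if (rb->congested && rb->bytes < rb->limit / 2) {
		rb->congested = 0;
		return -1;
	}
	return 0;
}
```

With a limit of 100 bytes, the state flips to congested at 101 queued bytes
and does not flip back until fewer than 50 bytes remain queued, which is what
keeps the state from bouncing on every message.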

diff --git a/net/rds/recv.c b/net/rds/recv.c
new file mode 100644
index 0000000..f2118c5
--- /dev/null
+++ b/net/rds/recv.c
@@ -0,0 +1,542 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
+		  __be32 saddr)
+{
+	atomic_set(&inc->i_refcount, 1);
+	INIT_LIST_HEAD(&inc->i_item);
+	inc->i_conn = conn;
+	inc->i_saddr = saddr;
+	inc->i_rdma_cookie = 0;
+}
+
+void rds_inc_addref(struct rds_incoming *inc)
+{
+	rdsdebug("addref inc %p ref %d\n", inc, atomic_read(&inc->i_refcount));
+	atomic_inc(&inc->i_refcount);
+}
+
+void rds_inc_put(struct rds_incoming *inc)
+{
+	rdsdebug("put inc %p ref %d\n", inc, atomic_read(&inc->i_refcount));
+	if (atomic_dec_and_test(&inc->i_refcount)) {
+		BUG_ON(!list_empty(&inc->i_item));
+
+		inc->i_conn->c_trans->inc_free(inc);
+	}
+}
+
+static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
+				  struct rds_cong_map *map,
+				  int delta, __be16 port)
+{
+	int now_congested;
+
+	if (delta == 0)
+		return;
+
+	rs->rs_rcv_bytes += delta;
+	now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
+
+	rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
+	  "now_cong %d delta %d\n",
+	  rs, &rs->rs_bound_addr,
+	  ntohs(rs->rs_bound_port), rs->rs_rcv_bytes,
+	  rds_sk_rcvbuf(rs), now_congested, delta);
+
+	/* wasn't -> am congested */
+	if (!rs->rs_congested && now_congested) {
+		rs->rs_congested = 1;
+		rds_cong_set_bit(map, port);
+		rds_cong_queue_updates(map);
+	}
+	/* was -> aren't congested */
+	/* Require more free space before reporting uncongested to prevent
+	   bouncing cong/uncong state too often */
+	else if (rs->rs_congested && (rs->rs_rcv_bytes < (rds_sk_rcvbuf(rs)/2))) {
+		rs->rs_congested = 0;
+		rds_cong_clear_bit(map, port);
+		rds_cong_queue_updates(map);
+	}
+
+	/* do nothing if no change in cong state */
+}
+
+/*
+ * Process all extension headers that come with this message.
+ */
+static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock *rs)
+{
+	struct rds_header *hdr = &inc->i_hdr;
+	unsigned int pos = 0, type, len;
+	union {
+		struct rds_ext_header_version version;
+		struct rds_ext_header_rdma rdma;
+		struct rds_ext_header_rdma_dest rdma_dest;
+	} buffer;
+
+	while (1) {
+		len = sizeof(buffer);
+		type = rds_message_next_extension(hdr, &pos, &buffer, &len);
+		if (type == RDS_EXTHDR_NONE)
+			break;
+		/* Process extension header here */
+		switch (type) {
+		case RDS_EXTHDR_RDMA:
+			rds_rdma_unuse(rs, be32_to_cpu(buffer.rdma.h_rdma_rkey), 0);
+			break;
+
+		case RDS_EXTHDR_RDMA_DEST:
+			/* We ignore the size for now. We could stash it
+			 * somewhere and use it for error checking. */
+			inc->i_rdma_cookie = rds_rdma_make_cookie(
+					be32_to_cpu(buffer.rdma_dest.h_rdma_rkey),
+					be32_to_cpu(buffer.rdma_dest.h_rdma_offset));
+
+			break;
+		}
+	}
+}
+
+/*
+ * The transport must make sure that this is serialized against other
+ * rx and conn reset on this specific conn.
+ *
+ * We currently assert that only one fragmented message will be sent
+ * down a connection at a time.  This lets us reassemble in the conn
+ * instead of per-flow which means that we don't have to go digging through
+ * flows to tear down partial reassembly progress on conn failure and
+ * we save flow lookup and locking for each frag arrival.  It does mean
+ * that small messages will wait behind large ones.  Fragmenting at all
+ * is only to reduce the memory consumption of pre-posted buffers.
+ *
+ * The caller passes in saddr and daddr instead of us getting it from the
+ * conn.  This lets loopback, which only has one conn for both directions,
+ * tell us which roles the addrs in the conn are playing for this message.
+ */
+void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+		       struct rds_incoming *inc, gfp_t gfp, enum km_type km)
+{
+	struct rds_sock *rs = NULL;
+	struct sock *sk;
+	unsigned long flags;
+
+	inc->i_conn = conn;
+	inc->i_rx_jiffies = jiffies;
+
+	rdsdebug("conn %p next %llu inc %p seq %llu len %u sport %u dport %u "
+		 "flags 0x%x rx_jiffies %lu\n", conn,
+		 (unsigned long long)conn->c_next_rx_seq,
+		 inc,
+		 (unsigned long long)be64_to_cpu(inc->i_hdr.h_sequence),
+		 be32_to_cpu(inc->i_hdr.h_len),
+		 be16_to_cpu(inc->i_hdr.h_sport),
+		 be16_to_cpu(inc->i_hdr.h_dport),
+		 inc->i_hdr.h_flags,
+		 inc->i_rx_jiffies);
+
+	/*
+	 * Sequence numbers should only increase.  Messages get their
+	 * sequence number as they're queued in a sending conn.  They
+	 * can be dropped, though, if the sending socket is closed before
+	 * they hit the wire.  So sequence numbers can skip forward
+	 * under normal operation.  They can also drop back in the conn
+	 * failover case as previously sent messages are resent down the
+	 * new instance of a conn.  We drop those, otherwise we have
+	 * to assume that the next valid seq does not come after a
+	 * hole in the fragment stream.
+	 *
+	 * The headers don't give us a way to realize if fragments of
+	 * a message have been dropped.  We assume that frags that arrive
+	 * to a flow are part of the current message on the flow that is
+	 * being reassembled.  This means that senders can't drop messages
+	 * from the sending conn until all their frags are sent.
+	 *
+	 * XXX we could spend more on the wire to get more robust failure
+	 * detection, arguably worth it to avoid data corruption.
+	 */
+	if (be64_to_cpu(inc->i_hdr.h_sequence) < conn->c_next_rx_seq
+	 && (inc->i_hdr.h_flags & RDS_FLAG_RETRANSMITTED)) {
+		rds_stats_inc(s_recv_drop_old_seq);
+		goto out;
+	}
+	conn->c_next_rx_seq = be64_to_cpu(inc->i_hdr.h_sequence) + 1;
+
+	if (rds_sysctl_ping_enable && inc->i_hdr.h_dport == 0) {
+		rds_stats_inc(s_recv_ping);
+		rds_send_pong(conn, inc->i_hdr.h_sport);
+		goto out;
+	}
+
+	rs = rds_find_bound(daddr, inc->i_hdr.h_dport);
+	if (rs == NULL) {
+		rds_stats_inc(s_recv_drop_no_sock);
+		goto out;
+	}
+
+	/* Process extension headers */
+	rds_recv_incoming_exthdrs(inc, rs);
+
+	/* We can be racing with rds_release() which marks the socket dead. */
+	sk = rds_rs_to_sk(rs);
+
+	/* serialize with rds_release -> sock_orphan */
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!sock_flag(sk, SOCK_DEAD)) {
+		rdsdebug("adding inc %p to rs %p's recv queue\n", inc, rs);
+		rds_stats_inc(s_recv_queued);
+		rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+				      be32_to_cpu(inc->i_hdr.h_len),
+				      inc->i_hdr.h_dport);
+		rds_inc_addref(inc);
+		list_add_tail(&inc->i_item, &rs->rs_recv_queue);
+		__rds_wake_sk_sleep(sk);
+	} else {
+		rds_stats_inc(s_recv_drop_dead_sock);
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+out:
+	if (rs)
+		rds_sock_put(rs);
+}
+
+/*
+ * Be very careful here.  This is called as the condition in
+ * wait_event_*(), so it needs to cope with being called many times.
+ */
+static int rds_next_incoming(struct rds_sock *rs, struct rds_incoming **inc)
+{
+	unsigned long flags;
+
+	if (*inc == NULL) {
+		read_lock_irqsave(&rs->rs_recv_lock, flags);
+		if (!list_empty(&rs->rs_recv_queue)) {
+			*inc = list_entry(rs->rs_recv_queue.next,
+					  struct rds_incoming,
+					  i_item);
+			rds_inc_addref(*inc);
+		}
+		read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+	}
+
+	return *inc != NULL;
+}
+
+static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc,
+			    int drop)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	int ret = 0;
+	unsigned long flags;
+
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!list_empty(&inc->i_item)) {
+		ret = 1;
+		if (drop) {
+			/* XXX make sure this i_conn is reliable */
+			rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+					      -be32_to_cpu(inc->i_hdr.h_len),
+					      inc->i_hdr.h_dport);
+			list_del_init(&inc->i_item);
+			rds_inc_put(inc);
+		}
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+	rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop);
+	return ret;
+}
+
+/*
+ * Pull errors off the error queue.
+ * If msghdr is NULL, we will just purge the error queue.
+ */
+int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msghdr)
+{
+	struct rds_notifier *notifier;
+	struct rds_rdma_notify cmsg;
+	unsigned int count = 0, max_messages = ~0U;
+	unsigned long flags;
+	LIST_HEAD(copy);
+	int err = 0;
+
+
+	/* put_cmsg copies to user space and thus may sleep. We can't do this
+	 * with rs_lock held, so first grab as many notifications as we can stuff
+	 * in the user provided cmsg buffer. We don't try to copy more, to avoid
+	 * losing notifications - except when the buffer is so small that it wouldn't
+	 * even hold a single notification. Then we give the caller as much of this
+	 * single msg as we can squeeze in, and set MSG_CTRUNC.
+	 */
+	if (msghdr) {
+		max_messages = msghdr->msg_controllen / CMSG_SPACE(sizeof(cmsg));
+		if (!max_messages)
+			max_messages = 1;
+	}
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	while (!list_empty(&rs->rs_notify_queue) && count < max_messages) {
+		notifier = list_entry(rs->rs_notify_queue.next,
+				struct rds_notifier, n_list);
+		list_move(&notifier->n_list, &copy);
+		count++;
+	}
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	if (!count)
+		return 0;
+
+	while (!list_empty(&copy)) {
+		notifier = list_entry(copy.next, struct rds_notifier, n_list);
+
+		if (msghdr) {
+			cmsg.user_token = notifier->n_user_token;
+			cmsg.status  = notifier->n_status;
+
+			err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_RDMA_STATUS,
+					sizeof(cmsg), &cmsg);
+			if (err)
+				break;
+		}
+
+		list_del_init(&notifier->n_list);
+		kfree(notifier);
+	}
+
+	/* If we bailed out because of an error in put_cmsg,
+	 * we may be left with one or more notifications that we
+	 * didn't process. Return them to the head of the list. */
+	if (!list_empty(&copy)) {
+		spin_lock_irqsave(&rs->rs_lock, flags);
+		list_splice(&copy, &rs->rs_notify_queue);
+		spin_unlock_irqrestore(&rs->rs_lock, flags);
+	}
+
+	return err;
+}
+
+/*
+ * Queue a congestion notification
+ */
+static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr)
+{
+	uint64_t notify = rs->rs_cong_notify;
+	unsigned long flags;
+	int err;
+
+	err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_CONG_UPDATE,
+			sizeof(notify), &notify);
+	if (err)
+		return err;
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	rs->rs_cong_notify &= ~notify;
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	return 0;
+}
+
+/*
+ * Receive any control messages.
+ */
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+{
+	int ret = 0;
+
+	if (inc->i_rdma_cookie) {
+		ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST,
+				sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t size, int msg_flags)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	long timeo;
+	int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
+	struct sockaddr_in *sin;
+	struct rds_incoming *inc = NULL;
+
+	/* udp_recvmsg()->sock_rcvtimeo() gets away without locking too.. */
+	timeo = sock_rcvtimeo(sk, nonblock);
+
+	rdsdebug("size %zu flags 0x%x timeo %ld\n", size, msg_flags, timeo);
+
+	if (msg_flags & MSG_OOB)
+		goto out;
+
+	/* If there are pending notifications, do those - and nothing else */
+	if (!list_empty(&rs->rs_notify_queue)) {
+		ret = rds_notify_queue_get(rs, msg);
+		goto out;
+	}
+
+	if (rs->rs_cong_notify) {
+		ret = rds_notify_cong(rs, msg);
+		goto out;
+	}
+
+	while (1) {
+		if (!rds_next_incoming(rs, &inc)) {
+			if (nonblock) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+						rds_next_incoming(rs, &inc),
+						timeo);
+			rdsdebug("recvmsg woke inc %p timeo %ld\n", inc,
+				 timeo);
+			if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT)
+				continue;
+
+			ret = timeo;
+			if (ret == 0)
+				ret = -ETIMEDOUT;
+			break;
+		}
+
+		rdsdebug("copying inc %p from %pI4:%u to user\n", inc,
+			 &inc->i_conn->c_faddr,
+			 ntohs(inc->i_hdr.h_sport));
+		ret = inc->i_conn->c_trans->inc_copy_to_user(inc, msg->msg_iov,
+							     size);
+		if (ret < 0)
+			break;
+
+		/*
+		 * if the message we just copied isn't at the head of the
+		 * recv queue then someone else raced us to return it, try
+		 * to get the next message.
+		 */
+		if (!rds_still_queued(rs, inc, !(msg_flags & MSG_PEEK))) {
+			rds_inc_put(inc);
+			inc = NULL;
+			rds_stats_inc(s_recv_deliver_raced);
+			continue;
+		}
+
+		if (ret < be32_to_cpu(inc->i_hdr.h_len)) {
+			if (msg_flags & MSG_TRUNC)
+				ret = be32_to_cpu(inc->i_hdr.h_len);
+			msg->msg_flags |= MSG_TRUNC;
+		}
+
+		if (rds_cmsg_recv(inc, msg)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		rds_stats_inc(s_recv_delivered);
+
+		sin = (struct sockaddr_in *)msg->msg_name;
+		if (sin) {
+			sin->sin_family = AF_INET;
+			sin->sin_port = inc->i_hdr.h_sport;
+			sin->sin_addr.s_addr = inc->i_saddr;
+			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+		}
+		break;
+	}
+
+	if (inc)
+		rds_inc_put(inc);
+
+out:
+	return ret;
+}
+
+/*
+ * The socket is being shut down and we're asked to drop messages that were
+ * queued for recvmsg.  The caller has unbound the socket so the receive path
+ * won't queue any more incoming fragments or messages on the socket.
+ */
+void rds_clear_recv_queue(struct rds_sock *rs)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct rds_incoming *inc, *tmp;
+	unsigned long flags;
+
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	list_for_each_entry_safe(inc, tmp, &rs->rs_recv_queue, i_item) {
+		rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+				      -be32_to_cpu(inc->i_hdr.h_len),
+				      inc->i_hdr.h_dport);
+		list_del_init(&inc->i_item);
+		rds_inc_put(inc);
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+}
+
+/*
+ * inc->i_saddr isn't used here because it is only set in the receive
+ * path.
+ */
+void rds_inc_info_copy(struct rds_incoming *inc,
+		       struct rds_info_iterator *iter,
+		       __be32 saddr, __be32 daddr, int flip)
+{
+	struct rds_info_message minfo;
+
+	minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence);
+	minfo.len = be32_to_cpu(inc->i_hdr.h_len);
+
+	if (flip) {
+		minfo.laddr = daddr;
+		minfo.faddr = saddr;
+		minfo.lport = inc->i_hdr.h_dport;
+		minfo.fport = inc->i_hdr.h_sport;
+	} else {
+		minfo.laddr = saddr;
+		minfo.faddr = daddr;
+		minfo.lport = inc->i_hdr.h_sport;
+		minfo.fport = inc->i_hdr.h_dport;
+	}
+
+	rds_info_copy(iter, &minfo, sizeof(minfo));
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:30 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:30 -0800
Subject: [ofa-general] [PATCH 13/26] RDS/IB: Infiniband transport
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-14-git-send-email-andy.grover@oracle.com>

Registers as an RDS transport and an IB client, and uses IB CM
API to allocate ids, queue pairs, and the rest of that fun stuff.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
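Note on the flow-control word in ib.h: the IB_GET/SET_SEND_CREDITS and
IB_GET/SET_POST_CREDITS macros pack two 16-bit counters into the single
atomic_t i_credits so that both can be updated with one compare-and-swap.
The following is only a userspace sketch of that idea using C11 atomics.
The helper name grab_credits() and its exact policy (take up to `wanted`
send credits, claim all posted credits) are assumptions made for
illustration and are not the code in ib_send.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Same bit layout as the macros in ib.h: send credits in the low
 * 16 bits, posted receive credits in the high 16 bits. */
#define GET_SEND_CREDITS(v)	((v) & 0xffffu)
#define GET_POST_CREDITS(v)	((v) >> 16)
#define SET_SEND_CREDITS(v)	((v) & 0xffffu)
#define SET_POST_CREDITS(v)	((uint32_t)(v) << 16)

/*
 * Hypothetical helper: atomically take up to `wanted` send credits and
 * claim every posted credit so it can be advertised to the peer.
 * Returns the number of send credits granted; *advertise receives the
 * posted credits that were claimed.
 */
uint32_t grab_credits(_Atomic uint32_t *credits, uint32_t wanted,
		      uint32_t *advertise)
{
	uint32_t oldval = atomic_load(credits);
	uint32_t newval, got;

	assert(wanted <= 0xffffu);

	do {
		uint32_t avail = GET_SEND_CREDITS(oldval);

		got = wanted < avail ? wanted : avail;
		*advertise = GET_POST_CREDITS(oldval);
		/* Remaining send credits stay; the posted half drops to 0. */
		newval = SET_SEND_CREDITS(avail - got);
		/* On failure oldval is reloaded and the loop recomputes. */
	} while (!atomic_compare_exchange_weak(credits, &oldval, newval));

	return got;
}
```

Because both halves live in one word, a concurrent update to either
counter makes the compare-and-swap fail and the loop retries with the
fresh value, which is what lets the design avoid a spinlock here.
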
 net/rds/ib.c    |  323 ++++++++++++++++++++++++
 net/rds/ib.h    |  367 ++++++++++++++++++++++++++++
 net/rds/ib_cm.c |  726 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1416 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib.c
 create mode 100644 net/rds/ib.h
 create mode 100644 net/rds/ib_cm.c
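
Note on the FMR page geometry in rds_ib_add_one(): the shift is taken
from the lowest bit set in the device's page_size_cap, clamped to at
least 9 (512 bytes), and the size and mask follow from it. The sketch
below reproduces that arithmetic in plain userspace C so the values can
be checked by hand. The struct fmr_geom and the function names are
hypothetical, and a portable loop stands in for the kernel's ffs().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three derived values. */
struct fmr_geom {
	int		shift;
	uint64_t	size;
	uint64_t	mask;
};

/* Index of the lowest set bit, or -1 if v is 0 (same as ffs(v) - 1). */
static int lowest_set_bit(uint32_t v)
{
	int i;

	for (i = 0; i < 32; i++)
		if (v & (UINT32_C(1) << i))
			return i;
	return -1;
}

/* Mirrors: shift = max(9, ffs(cap) - 1); size = 1 << shift;
 * mask = ~((u64) size - 1). */
struct fmr_geom fmr_geometry(uint32_t page_size_cap)
{
	struct fmr_geom g;
	int lowest = lowest_set_bit(page_size_cap);

	g.shift = lowest > 9 ? lowest : 9;
	assert(g.shift >= 9 && g.shift <= 31);

	g.size = UINT64_C(1) << g.shift;
	g.mask = ~(g.size - 1);
	return g;
}
```

For a capability mask of 0xfffff000 (smallest supported page 4 KiB) this
gives a shift of 12, a size of 4096 and a mask that clears the low 12
bits. A device reporting 0 or only pages smaller than 512 bytes falls
back to the 512-byte minimum.
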

diff --git a/net/rds/ib.c b/net/rds/ib.c
new file mode 100644
index 0000000..06a7b79
--- /dev/null
+++ b/net/rds/ib.c
@@ -0,0 +1,323 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/if.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/if_arp.h>
+#include <linux/delay.h>
+
+#include "rds.h"
+#include "ib.h"
+
+unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
+unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */
+
+module_param(fmr_pool_size, uint, 0444);
+MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA");
+module_param(fmr_message_size, uint, 0444);
+MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer");
+
+struct list_head rds_ib_devices;
+
+DEFINE_SPINLOCK(ib_nodev_conns_lock);
+LIST_HEAD(ib_nodev_conns);
+
+void rds_ib_add_one(struct ib_device *device)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct ib_device_attr *dev_attr;
+
+	/* Only handle IB (no iWARP) devices */
+	if (device->node_type != RDMA_NODE_IB_CA)
+		return;
+
+	dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
+	if (!dev_attr)
+		return;
+
+	if (ib_query_device(device, dev_attr)) {
+		rdsdebug("Query device failed for %s\n", device->name);
+		goto free_attr;
+	}
+
+	rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL);
+	if (!rds_ibdev)
+		goto free_attr;
+
+	spin_lock_init(&rds_ibdev->spinlock);
+
+	rds_ibdev->max_wrs = dev_attr->max_qp_wr;
+	rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
+
+	rds_ibdev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1);
+	rds_ibdev->fmr_page_size  = 1 << rds_ibdev->fmr_page_shift;
+	rds_ibdev->fmr_page_mask  = ~((u64) rds_ibdev->fmr_page_size - 1);
+	rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
+	rds_ibdev->max_fmrs = dev_attr->max_fmr ?
+			min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+			fmr_pool_size;
+
+	rds_ibdev->dev = device;
+	rds_ibdev->pd = ib_alloc_pd(device);
+	if (IS_ERR(rds_ibdev->pd))
+		goto free_dev;
+
+	rds_ibdev->mr = ib_get_dma_mr(rds_ibdev->pd,
+				      IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(rds_ibdev->mr))
+		goto err_pd;
+
+	rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
+	if (IS_ERR(rds_ibdev->mr_pool)) {
+		rds_ibdev->mr_pool = NULL;
+		goto err_mr;
+	}
+
+	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
+	INIT_LIST_HEAD(&rds_ibdev->conn_list);
+	list_add_tail(&rds_ibdev->list, &rds_ib_devices);
+
+	ib_set_client_data(device, &rds_ib_client, rds_ibdev);
+
+	goto free_attr;
+
+err_mr:
+	ib_dereg_mr(rds_ibdev->mr);
+err_pd:
+	ib_dealloc_pd(rds_ibdev->pd);
+free_dev:
+	kfree(rds_ibdev);
+free_attr:
+	kfree(dev_attr);
+}
+
+void rds_ib_remove_one(struct ib_device *device)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_ipaddr *i_ipaddr, *i_next;
+
+	rds_ibdev = ib_get_client_data(device, &rds_ib_client);
+	if (!rds_ibdev)
+		return;
+
+	list_for_each_entry_safe(i_ipaddr, i_next, &rds_ibdev->ipaddr_list, list) {
+		list_del(&i_ipaddr->list);
+		kfree(i_ipaddr);
+	}
+
+	rds_ib_remove_conns(rds_ibdev);
+
+	if (rds_ibdev->mr_pool)
+		rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+
+	ib_dereg_mr(rds_ibdev->mr);
+
+	while (ib_dealloc_pd(rds_ibdev->pd)) {
+		rdsdebug("Failed to dealloc pd %p\n", rds_ibdev->pd);
+		msleep(1);
+	}
+
+	list_del(&rds_ibdev->list);
+	kfree(rds_ibdev);
+}
+
+struct ib_client rds_ib_client = {
+	.name   = "rds_ib",
+	.add    = rds_ib_add_one,
+	.remove = rds_ib_remove_one
+};
+
+static int rds_ib_conn_info_visitor(struct rds_connection *conn,
+				    void *buffer)
+{
+	struct rds_info_rdma_connection *iinfo = buffer;
+	struct rds_ib_connection *ic;
+
+	/* We will only ever look at IB transports */
+	if (conn->c_trans != &rds_ib_transport)
+		return 0;
+
+	iinfo->src_addr = conn->c_laddr;
+	iinfo->dst_addr = conn->c_faddr;
+
+	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
+	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		struct rds_ib_device *rds_ibdev;
+		struct rdma_dev_addr *dev_addr;
+
+		ic = conn->c_transport_data;
+		dev_addr = &ic->i_cm_id->route.addr.dev_addr;
+
+		ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid);
+		ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid);
+
+		rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+		iinfo->max_send_wr = ic->i_send_ring.w_nr;
+		iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
+		iinfo->max_send_sge = rds_ibdev->max_sge;
+		rds_ib_get_mr_info(rds_ibdev, iinfo);
+	}
+	return 1;
+}
+
+static void rds_ib_ic_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	rds_for_each_conn_info(sock, len, iter, lens,
+				rds_ib_conn_info_visitor,
+				sizeof(struct rds_info_rdma_connection));
+}
+
+
+/*
+ * Early RDS/IB was built to only bind to an address if there is an IPoIB
+ * device with that address set.
+ *
+ * If it were me, I'd advocate for something more flexible.  Sending and
+ * receiving should be device-agnostic.  Transports would try and maintain
+ * connections between peers who have messages queued.  Userspace would be
+ * allowed to influence which paths have priority.  We could call userspace
+ * asserting this policy "routing".
+ */
+static int rds_ib_laddr_check(__be32 addr)
+{
+	int ret;
+	struct rdma_cm_id *cm_id;
+	struct sockaddr_in sin;
+
+	/* Create a CMA ID and try to bind it. This catches both
+	 * IB and iWARP capable NICs.
+	 */
+	cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP);
+	if (IS_ERR(cm_id))
+		return -EADDRNOTAVAIL;
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = addr;
+
+	/* rdma_bind_addr will only succeed for IB & iWARP devices */
+	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	/* we would also claim iWARP devices unless we check node_type */
+	if (ret || !cm_id->device ||
+	    cm_id->device->node_type != RDMA_NODE_IB_CA)
+		ret = -EADDRNOTAVAIL;
+
+	rdsdebug("addr %pI4 ret %d node type %d\n",
+		&addr, ret,
+		cm_id->device ? cm_id->device->node_type : -1);
+
+	rdma_destroy_id(cm_id);
+
+	return ret;
+}
+
+void rds_ib_exit(void)
+{
+	rds_info_deregister_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+	rds_ib_remove_nodev_conns();
+	ib_unregister_client(&rds_ib_client);
+	rds_ib_sysctl_exit();
+	rds_ib_recv_exit();
+	rds_trans_unregister(&rds_ib_transport);
+}
+
+struct rds_transport rds_ib_transport = {
+	.laddr_check		= rds_ib_laddr_check,
+	.xmit_complete		= rds_ib_xmit_complete,
+	.xmit			= rds_ib_xmit,
+	.xmit_cong_map		= NULL,
+	.xmit_rdma		= rds_ib_xmit_rdma,
+	.recv			= rds_ib_recv,
+	.conn_alloc		= rds_ib_conn_alloc,
+	.conn_free		= rds_ib_conn_free,
+	.conn_connect		= rds_ib_conn_connect,
+	.conn_shutdown		= rds_ib_conn_shutdown,
+	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
+	.inc_purge		= rds_ib_inc_purge,
+	.inc_free		= rds_ib_inc_free,
+	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
+	.cm_handle_connect	= rds_ib_cm_handle_connect,
+	.cm_connect_complete	= rds_ib_cm_connect_complete,
+	.stats_info_copy	= rds_ib_stats_info_copy,
+	.exit			= rds_ib_exit,
+	.get_mr			= rds_ib_get_mr,
+	.sync_mr		= rds_ib_sync_mr,
+	.free_mr		= rds_ib_free_mr,
+	.flush_mrs		= rds_ib_flush_mrs,
+	.t_owner		= THIS_MODULE,
+	.t_name			= "infiniband",
+};
+
+int __init rds_ib_init(void)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&rds_ib_devices);
+
+	ret = ib_register_client(&rds_ib_client);
+	if (ret)
+		goto out;
+
+	ret = rds_ib_sysctl_init();
+	if (ret)
+		goto out_ibreg;
+
+	ret = rds_ib_recv_init();
+	if (ret)
+		goto out_sysctl;
+
+	ret = rds_trans_register(&rds_ib_transport);
+	if (ret)
+		goto out_recv;
+
+	rds_info_register_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+
+	goto out;
+
+out_recv:
+	rds_ib_recv_exit();
+out_sysctl:
+	rds_ib_sysctl_exit();
+out_ibreg:
+	ib_unregister_client(&rds_ib_client);
+out:
+	return ret;
+}
+
+MODULE_LICENSE("GPL");
+
diff --git a/net/rds/ib.h b/net/rds/ib.h
new file mode 100644
index 0000000..8be563a
--- /dev/null
+++ b/net/rds/ib.h
@@ -0,0 +1,367 @@
+#ifndef _RDS_IB_H
+#define _RDS_IB_H
+
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include "rds.h"
+#include "rdma_transport.h"
+
+#define RDS_FMR_SIZE			256
+#define RDS_FMR_POOL_SIZE		4096
+
+#define RDS_IB_MAX_SGE			8
+#define RDS_IB_RECV_SGE 		2
+
+#define RDS_IB_DEFAULT_RECV_WR		1024
+#define RDS_IB_DEFAULT_SEND_WR		256
+
+#define RDS_IB_SUPPORTED_PROTOCOLS	0x00000003	/* minor versions supported */
+
+extern struct list_head rds_ib_devices;
+
+/*
+ * IB posts RDS_FRAG_SIZE fragments of pages to the receive queues to
+ * try and minimize the amount of memory tied up in both the device and
+ * socket receive queues.
+ */
+/* page offset of the final full frag that fits in the page */
+#define RDS_PAGE_LAST_OFF (((PAGE_SIZE  / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE)
+struct rds_page_frag {
+	struct list_head	f_item;
+	struct page		*f_page;
+	unsigned long		f_offset;
+	dma_addr_t 		f_mapped;
+};
+
+struct rds_ib_incoming {
+	struct list_head	ii_frags;
+	struct rds_incoming	ii_inc;
+};
+
+struct rds_ib_connect_private {
+	/* Add new fields at the end, and don't permute existing fields. */
+	__be32			dp_saddr;
+	__be32			dp_daddr;
+	u8			dp_protocol_major;
+	u8			dp_protocol_minor;
+	__be16			dp_protocol_minor_mask; /* bitmask */
+	__be32			dp_reserved1;
+	__be64			dp_ack_seq;
+	__be32			dp_credit;		/* non-zero enables flow ctl */
+};
+
+struct rds_ib_send_work {
+	struct rds_message	*s_rm;
+	struct rds_rdma_op	*s_op;
+	struct ib_send_wr	s_wr;
+	struct ib_sge		s_sge[RDS_IB_MAX_SGE];
+	unsigned long		s_queued;
+};
+
+struct rds_ib_recv_work {
+	struct rds_ib_incoming 	*r_ibinc;
+	struct rds_page_frag	*r_frag;
+	struct ib_recv_wr	r_wr;
+	struct ib_sge		r_sge[2];
+};
+
+struct rds_ib_work_ring {
+	u32		w_nr;
+	u32		w_alloc_ptr;
+	u32		w_alloc_ctr;
+	u32		w_free_ptr;
+	atomic_t	w_free_ctr;
+};
+
+struct rds_ib_device;
+
+struct rds_ib_connection {
+
+	struct list_head	ib_node;
+	struct rds_ib_device	*rds_ibdev;
+	struct rds_connection	*conn;
+
+	/* alphabet soup, IBTA style */
+	struct rdma_cm_id	*i_cm_id;
+	struct ib_pd		*i_pd;
+	struct ib_mr		*i_mr;
+	struct ib_cq		*i_send_cq;
+	struct ib_cq		*i_recv_cq;
+
+	/* tx */
+	struct rds_ib_work_ring	i_send_ring;
+	struct rds_message	*i_rm;
+	struct rds_header	*i_send_hdrs;
+	u64			i_send_hdrs_dma;
+	struct rds_ib_send_work *i_sends;
+
+	/* rx */
+	struct mutex		i_recv_mutex;
+	struct rds_ib_work_ring	i_recv_ring;
+	struct rds_ib_incoming	*i_ibinc;
+	u32			i_recv_data_rem;
+	struct rds_header	*i_recv_hdrs;
+	u64			i_recv_hdrs_dma;
+	struct rds_ib_recv_work *i_recvs;
+	struct rds_page_frag	i_frag;
+	u64			i_ack_recv;	/* last ACK received */
+
+	/* sending acks */
+	unsigned long		i_ack_flags;
+	u64			i_ack_next;	/* next ACK to send */
+	struct rds_header	*i_ack;
+	struct ib_send_wr	i_ack_wr;
+	struct ib_sge		i_ack_sge;
+	u64			i_ack_dma;
+	unsigned long		i_ack_queued;
+
+	/* Flow control related information
+	 *
+	 * Our algorithm uses a pair of variables that we need to access
+	 * atomically - one for the send credits, and one for the posted
+	 * recv credits we need to transfer to the remote.
+	 * Rather than protect them using a slow spinlock, we put both into
+	 * a single atomic_t and update it using cmpxchg.
+	 */
+	atomic_t		i_credits;
+
+	/* Protocol version specific information */
+	unsigned int		i_flowctl:1;	/* enable/disable flow ctl */
+
+	/* Batched completions */
+	unsigned int		i_unsignaled_wrs;
+	long			i_unsignaled_bytes;
+};
+
+/* This assumes that atomic_t is at least 32 bits */
+#define IB_GET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_GET_POST_CREDITS(v)	((v) >> 16)
+#define IB_SET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_SET_POST_CREDITS(v)	((v) << 16)
+
+struct rds_ib_ipaddr {
+	struct list_head	list;
+	__be32			ipaddr;
+};
+
+struct rds_ib_device {
+	struct list_head	list;
+	struct list_head	ipaddr_list;
+	struct list_head	conn_list;
+	struct ib_device	*dev;
+	struct ib_pd		*pd;
+	struct ib_mr		*mr;
+	struct rds_ib_mr_pool	*mr_pool;
+	int			fmr_page_shift;
+	int			fmr_page_size;
+	u64			fmr_page_mask;
+	unsigned int		fmr_max_remaps;
+	unsigned int		max_fmrs;
+	int			max_sge;
+	unsigned int		max_wrs;
+	spinlock_t		spinlock;	/* protect the above */
+};
+
+/* bits for i_ack_flags */
+#define IB_ACK_IN_FLIGHT	0
+#define IB_ACK_REQUESTED	1
+
+/* Magic WR_ID for ACKs */
+#define RDS_IB_ACK_WR_ID	(~(u64) 0)
+
+struct rds_ib_statistics {
+	uint64_t	s_ib_connect_raced;
+	uint64_t	s_ib_listen_closed_stale;
+	uint64_t	s_ib_tx_cq_call;
+	uint64_t	s_ib_tx_cq_event;
+	uint64_t	s_ib_tx_ring_full;
+	uint64_t	s_ib_tx_throttle;
+	uint64_t	s_ib_tx_sg_mapping_failure;
+	uint64_t	s_ib_tx_stalled;
+	uint64_t	s_ib_tx_credit_updates;
+	uint64_t	s_ib_rx_cq_call;
+	uint64_t	s_ib_rx_cq_event;
+	uint64_t	s_ib_rx_ring_empty;
+	uint64_t	s_ib_rx_refill_from_cq;
+	uint64_t	s_ib_rx_refill_from_thread;
+	uint64_t	s_ib_rx_alloc_limit;
+	uint64_t	s_ib_rx_credit_updates;
+	uint64_t	s_ib_ack_sent;
+	uint64_t	s_ib_ack_send_failure;
+	uint64_t	s_ib_ack_send_delayed;
+	uint64_t	s_ib_ack_send_piggybacked;
+	uint64_t	s_ib_ack_received;
+	uint64_t	s_ib_rdma_mr_alloc;
+	uint64_t	s_ib_rdma_mr_free;
+	uint64_t	s_ib_rdma_mr_used;
+	uint64_t	s_ib_rdma_mr_pool_flush;
+	uint64_t	s_ib_rdma_mr_pool_wait;
+	uint64_t	s_ib_rdma_mr_pool_depleted;
+};
+
+extern struct workqueue_struct *rds_ib_wq;
+
+/*
+ * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h
+ * doesn't define it.
+ */
+static inline void rds_ib_dma_sync_sg_for_cpu(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_cpu(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_cpu	rds_ib_dma_sync_sg_for_cpu
+
+static inline void rds_ib_dma_sync_sg_for_device(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_device(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_device	rds_ib_dma_sync_sg_for_device
+
+
+/* ib.c */
+extern struct rds_transport rds_ib_transport;
+extern void rds_ib_add_one(struct ib_device *device);
+extern void rds_ib_remove_one(struct ib_device *device);
+extern struct ib_client rds_ib_client;
+
+extern unsigned int fmr_pool_size;
+extern unsigned int fmr_message_size;
+
+extern spinlock_t ib_nodev_conns_lock;
+extern struct list_head ib_nodev_conns;
+
+/* ib_cm.c */
+int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp);
+void rds_ib_conn_free(void *arg);
+int rds_ib_conn_connect(struct rds_connection *conn);
+void rds_ib_conn_shutdown(struct rds_connection *conn);
+void rds_ib_state_change(struct sock *sk);
+int __init rds_ib_listen_init(void);
+void rds_ib_listen_stop(void);
+void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...);
+int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
+			     struct rdma_cm_event *event);
+int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id);
+void rds_ib_cm_connect_complete(struct rds_connection *conn,
+				struct rdma_cm_event *event);
+
+
+#define rds_ib_conn_error(conn, fmt...) \
+	__rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt)
+
+/* ib_rdma.c */
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr);
+int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
+void rds_ib_remove_nodev_conns(void);
+void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev);
+struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *);
+void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo);
+void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *);
+void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret);
+void rds_ib_sync_mr(void *trans_private, int dir);
+void rds_ib_free_mr(void *trans_private, int invalidate);
+void rds_ib_flush_mrs(void);
+
+/* ib_recv.c */
+int __init rds_ib_recv_init(void);
+void rds_ib_recv_exit(void);
+int rds_ib_recv(struct rds_connection *conn);
+int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill);
+void rds_ib_inc_purge(struct rds_incoming *inc);
+void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov,
+			     size_t size);
+void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_recv_init_ring(struct rds_ib_connection *ic);
+void rds_ib_recv_clear_ring(struct rds_ib_connection *ic);
+void rds_ib_recv_init_ack(struct rds_ib_connection *ic);
+void rds_ib_attempt_ack(struct rds_ib_connection *ic);
+void rds_ib_ack_send_complete(struct rds_ib_connection *ic);
+u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic);
+
+/* ib_ring.c */
+void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr);
+void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr);
+u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos);
+void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val);
+void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val);
+int rds_ib_ring_empty(struct rds_ib_work_ring *ring);
+int rds_ib_ring_low(struct rds_ib_work_ring *ring);
+u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring);
+u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest);
+extern wait_queue_head_t rds_ib_ring_empty_wait;
+
+/* ib_send.c */
+void rds_ib_xmit_complete(struct rds_connection *conn);
+int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off);
+void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_send_init_ring(struct rds_ib_connection *ic);
+void rds_ib_send_clear_ring(struct rds_ib_connection *ic);
+int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op);
+void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits);
+void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted);
+int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted,
+			     u32 *adv_credits, int need_posted);
+
+/* ib_stats.c */
+DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
+#define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member)
+unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail);
+
+/* ib_sysctl.c */
+int __init rds_ib_sysctl_init(void);
+void rds_ib_sysctl_exit(void);
+extern unsigned long rds_ib_sysctl_max_send_wr;
+extern unsigned long rds_ib_sysctl_max_recv_wr;
+extern unsigned long rds_ib_sysctl_max_unsig_wrs;
+extern unsigned long rds_ib_sysctl_max_unsig_bytes;
+extern unsigned long rds_ib_sysctl_max_recv_allocation;
+extern unsigned int rds_ib_sysctl_flow_control;
+extern ctl_table rds_ib_sysctl_table[];
+
+/*
+ * Helper functions for getting/setting the header and data SGEs in
+ * RDS packets (not RDMA)
+ */
+static inline struct ib_sge *
+rds_ib_header_sge(struct rds_ib_connection *ic, struct ib_sge *sge)
+{
+	return &sge[0];
+}
+
+static inline struct ib_sge *
+rds_ib_data_sge(struct rds_ib_connection *ic, struct ib_sge *sge)
+{
+	return &sge[1];
+}
+
+static inline void rds_ib_set_64bit(u64 *ptr, u64 val)
+{
+#if BITS_PER_LONG == 64
+	*ptr = val;
+#else
+	set_64bit(ptr, val);
+#endif
+}
+
+#endif
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
new file mode 100644
index 0000000..0532237
--- /dev/null
+++ b/net/rds/ib_cm.c
@@ -0,0 +1,726 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/vmalloc.h>
+
+#include "rds.h"
+#include "ib.h"
+
+/*
+ * Set the selected protocol version
+ */
+static void rds_ib_set_protocol(struct rds_connection *conn, unsigned int version)
+{
+	conn->c_version = version;
+}
+
+/*
+ * Set up flow control
+ */
+static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (rds_ib_sysctl_flow_control && credits != 0) {
+		/* We're doing flow control */
+		ic->i_flowctl = 1;
+		rds_ib_send_add_credits(conn, credits);
+	} else {
+		ic->i_flowctl = 0;
+	}
+}
+
+/*
+ * Tune RNR behavior. Without flow control, we use a rather
+ * low timeout, but not the absolute minimum - this should
+ * be tunable.
+ *
+ * We already set the RNR retry count to 7 (which is the
+ * smallest infinite number :-) above.
+ * If flow control is off, we want to change this back to 0
+ * so that we learn quickly when our credit accounting is
+ * buggy.
+ *
+ * Caller passes in a qp_attr pointer - don't waste stack space
+ * by allocating this twice.
+ */
+static void
+rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr)
+{
+	int ret;
+
+	attr->min_rnr_timer = IB_RNR_TIMER_000_32;
+	ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER);
+	if (ret)
+		printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret);
+}
+
+/*
+ * Connection established.
+ * We get here for both outgoing and incoming connections.
+ */
+void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event)
+{
+	const struct rds_ib_connect_private *dp = NULL;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_device *rds_ibdev;
+	struct ib_qp_attr qp_attr;
+	int err;
+
+	if (event->param.conn.private_data_len) {
+		dp = event->param.conn.private_data;
+
+		rds_ib_set_protocol(conn,
+				RDS_PROTOCOL(dp->dp_protocol_major,
+					dp->dp_protocol_minor));
+		rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+	}
+
+	printk(KERN_NOTICE "RDS/IB: connected to %pI4 version %u.%u%s\n",
+			&conn->c_faddr,
+			RDS_PROTOCOL_MAJOR(conn->c_version),
+			RDS_PROTOCOL_MINOR(conn->c_version),
+			ic->i_flowctl ? ", flow control" : "");
+
+	/* Tune RNR behavior */
+	rds_ib_tune_rnr(ic, &qp_attr);
+
+	qp_attr.qp_state = IB_QPS_RTS;
+	err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE);
+	if (err)
+		printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err);
+
+	/* update ib_device with this local ipaddr & conn */
+	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+	err = rds_ib_update_ipaddr(rds_ibdev, conn->c_laddr);
+	if (err)
+		printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n", err);
+	err = rds_ib_add_conn(rds_ibdev, conn);
+	if (err)
+		printk(KERN_ERR "rds_ib_add_conn failed (%d)\n", err);
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp && dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	rds_connect_complete(conn);
+}
+
+static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
+			struct rdma_conn_param *conn_param,
+			struct rds_ib_connect_private *dp,
+			u32 protocol_version)
+{
+	memset(conn_param, 0, sizeof(struct rdma_conn_param));
+	/* XXX tune these? */
+	conn_param->responder_resources = 1;
+	conn_param->initiator_depth = 1;
+	conn_param->retry_count = 7;
+	conn_param->rnr_retry_count = 7;
+
+	if (dp) {
+		struct rds_ib_connection *ic = conn->c_transport_data;
+
+		memset(dp, 0, sizeof(*dp));
+		dp->dp_saddr = conn->c_laddr;
+		dp->dp_daddr = conn->c_faddr;
+		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
+		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
+		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
+		dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+
+		/* Advertise flow control */
+		if (ic->i_flowctl) {
+			unsigned int credits;
+
+			credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits));
+			dp->dp_credit = cpu_to_be32(credits);
+			atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits);
+		}
+
+		conn_param->private_data = dp;
+		conn_param->private_data_len = sizeof(*dp);
+	}
+}
+
+static void rds_ib_cq_event_handler(struct ib_event *event, void *data)
+{
+	rdsdebug("event %u data %p\n", event->event, data);
+}
+
+static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
+{
+	struct rds_connection *conn = data;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event);
+
+	switch (event->event) {
+	case IB_EVENT_COMM_EST:
+		rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		printk(KERN_WARNING "RDS/IB: unhandled QP event %u "
+		       "on connection to %pI4\n", event->event,
+		       &conn->c_faddr);
+		break;
+	}
+}
+
+/*
+ * This needs to be very careful to not leave IS_ERR pointers around for
+ * cleanup to trip over.
+ */
+static int rds_ib_setup_qp(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct ib_qp_init_attr attr;
+	struct rds_ib_device *rds_ibdev;
+	int ret;
+
+	/* rds_ib_add_one creates a rds_ib_device object per IB device,
+	 * and allocates a protection domain, memory range and FMR pool
+	 * for each.  If that fails for any reason, it will not register
+	 * the rds_ibdev at all.
+	 */
+	rds_ibdev = ib_get_client_data(dev, &rds_ib_client);
+	if (rds_ibdev == NULL) {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n",
+					dev->name);
+		return -EOPNOTSUPP;
+	}
+
+	if (rds_ibdev->max_wrs < ic->i_send_ring.w_nr + 1)
+		rds_ib_ring_resize(&ic->i_send_ring, rds_ibdev->max_wrs - 1);
+	if (rds_ibdev->max_wrs < ic->i_recv_ring.w_nr + 1)
+		rds_ib_ring_resize(&ic->i_recv_ring, rds_ibdev->max_wrs - 1);
+
+	/* Protection domain and memory range */
+	ic->i_pd = rds_ibdev->pd;
+	ic->i_mr = rds_ibdev->mr;
+
+	ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
+				     rds_ib_cq_event_handler, conn,
+				     ic->i_send_ring.w_nr + 1, 0);
+	if (IS_ERR(ic->i_send_cq)) {
+		ret = PTR_ERR(ic->i_send_cq);
+		ic->i_send_cq = NULL;
+		rdsdebug("ib_create_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler,
+				     rds_ib_cq_event_handler, conn,
+				     ic->i_recv_ring.w_nr, 0);
+	if (IS_ERR(ic->i_recv_cq)) {
+		ret = PTR_ERR(ic->i_recv_cq);
+		ic->i_recv_cq = NULL;
+		rdsdebug("ib_create_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+	/* XXX negotiate max send/recv with remote? */
+	memset(&attr, 0, sizeof(attr));
+	attr.event_handler = rds_ib_qp_event_handler;
+	attr.qp_context = conn;
+	/* + 1 to allow for the single ack message */
+	attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+	attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
+	attr.cap.max_send_sge = rds_ibdev->max_sge;
+	attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
+	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	attr.qp_type = IB_QPT_RC;
+	attr.send_cq = ic->i_send_cq;
+	attr.recv_cq = ic->i_recv_cq;
+
+	/*
+	 * XXX this can fail if max_*_wr is too large?  Are we supposed
+	 * to back off until we get a value that the hardware can support?
+	 */
+	ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
+	if (ret) {
+		rdsdebug("rdma_create_qp failed: %d\n", ret);
+		goto out;
+	}
+
+	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_send_hdrs_dma, GFP_KERNEL);
+	if (ic->i_send_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent send failed\n");
+		goto out;
+	}
+
+	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_recv_hdrs_dma, GFP_KERNEL);
+	if (ic->i_recv_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent recv failed\n");
+		goto out;
+	}
+
+	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
+				       &ic->i_ack_dma, GFP_KERNEL);
+	if (ic->i_ack == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent ack failed\n");
+		goto out;
+	}
+
+	ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
+	if (ic->i_sends == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("send allocation failed\n");
+		goto out;
+	}
+	rds_ib_send_init_ring(ic);
+
+	ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work));
+	if (ic->i_recvs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("recv allocation failed\n");
+		goto out;
+	}
+
+	rds_ib_recv_init_ring(ic);
+	rds_ib_recv_init_ack(ic);
+
+	/* Post receive buffers - as a side effect, this will update
+	 * the posted credit count. */
+	rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1);
+
+	rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr,
+		 ic->i_send_cq, ic->i_recv_cq);
+
+out:
+	return ret;
+}
+
+static u32 rds_ib_protocol_compatible(const struct rds_ib_connect_private *dp)
+{
+	u16 common;
+	u32 version = 0;
+
+	/* rdma_cm private data is odd - when there is any private data in the
+	 * request, we are handed a fairly large buffer without being told the
+	 * original size. The only way to tell the difference is by looking at
+	 * the contents, which are initialized to zero.
+	 * If the protocol version fields aren't set, this is a connection attempt
+	 * from an older version. This could be 3.0 or 2.0 - we can't tell.
+	 * We really should have changed this for OFED 1.3 :-( */
+	if (dp->dp_protocol_major == 0)
+		return RDS_PROTOCOL_3_0;
+
+	common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IB_SUPPORTED_PROTOCOLS;
+	if (dp->dp_protocol_major == 3 && common) {
+		version = RDS_PROTOCOL_3_0;
+		while ((common >>= 1) != 0)
+			version++;
+	} else if (printk_ratelimit()) {
+		printk(KERN_NOTICE "RDS: Connection from %pI4 using "
+			"incompatible protocol version %u.%u\n",
+			&dp->dp_saddr,
+			dp->dp_protocol_major,
+			dp->dp_protocol_minor);
+	}
+	return version;
+}
+
+int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
+				    struct rdma_cm_event *event)
+{
+	__be64 lguid = cm_id->route.path_rec->sgid.global.interface_id;
+	__be64 fguid = cm_id->route.path_rec->dgid.global.interface_id;
+	const struct rds_ib_connect_private *dp = event->param.conn.private_data;
+	struct rds_ib_connect_private dp_rep;
+	struct rds_connection *conn = NULL;
+	struct rds_ib_connection *ic = NULL;
+	struct rdma_conn_param conn_param;
+	u32 version;
+	int err, destroy = 1;
+
+	/* Check whether the remote protocol version matches ours. */
+	version = rds_ib_protocol_compatible(dp);
+	if (!version)
+		goto out;
+
+	rdsdebug("saddr %pI4 daddr %pI4 RDSv%u.%u lguid 0x%llx fguid "
+		 "0x%llx\n", &dp->dp_saddr, &dp->dp_daddr,
+		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version),
+		 (unsigned long long)be64_to_cpu(lguid),
+		 (unsigned long long)be64_to_cpu(fguid));
+
+	conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport,
+			       GFP_KERNEL);
+	if (IS_ERR(conn)) {
+		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
+		conn = NULL;
+		goto out;
+	}
+
+	/*
+	 * The connection request may occur while the
+	 * previous connection exists, e.g. in case of failover.
+	 * But as connections may be initiated simultaneously
+	 * by both hosts, we have a random backoff mechanism -
+	 * see the comment above rds_queue_reconnect()
+	 */
+	mutex_lock(&conn->c_cm_lock);
+	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
+		if (rds_conn_state(conn) == RDS_CONN_UP) {
+			rdsdebug("incoming connect while connecting\n");
+			rds_conn_drop(conn);
+			rds_ib_stats_inc(s_ib_listen_closed_stale);
+		} else
+		if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
+			/* Wait and see - our connect may still be succeeding */
+			rds_ib_stats_inc(s_ib_connect_raced);
+		}
+		mutex_unlock(&conn->c_cm_lock);
+		goto out;
+	}
+
+	ic = conn->c_transport_data;
+
+	rds_ib_set_protocol(conn, version);
+	rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	BUG_ON(cm_id->context);
+	BUG_ON(ic->i_cm_id);
+
+	ic->i_cm_id = cm_id;
+	cm_id->context = conn;
+
+	/* We got halfway through setting up the ib_connection, if we
+	 * fail now, we have to take the long route out of this mess. */
+	destroy = 0;
+
+	err = rds_ib_setup_qp(conn);
+	if (err) {
+		rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", err);
+		goto out;
+	}
+
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp_rep, version);
+
+	/* rdma_accept() calls rdma_reject() internally if it fails */
+	err = rdma_accept(cm_id, &conn_param);
+	mutex_unlock(&conn->c_cm_lock);
+	if (err) {
+		rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
+		goto out;
+	}
+
+	return 0;
+
+out:
+	rdma_reject(cm_id, NULL, 0);
+	return destroy;
+}
+
+
+int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
+{
+	struct rds_connection *conn = cm_id->context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rdma_conn_param conn_param;
+	struct rds_ib_connect_private dp;
+	int ret;
+
+	/* If the peer doesn't do protocol negotiation, we must
+	 * default to RDSv3.0 */
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0);
+	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
+
+	ret = rds_ib_setup_qp(conn);
+	if (ret) {
+		rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", ret);
+		goto out;
+	}
+
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION);
+
+	ret = rdma_connect(cm_id, &conn_param);
+	if (ret)
+		rds_ib_conn_error(conn, "rdma_connect failed (%d)\n", ret);
+
+out:
+	/* Beware - returning non-zero tells the rdma_cm to destroy
+	 * the cm_id. We should certainly not do it as long as we still
+	 * "own" the cm_id. */
+	if (ret) {
+		if (ic->i_cm_id == cm_id)
+			ret = 0;
+	}
+	return ret;
+}
+
+int rds_ib_conn_connect(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct sockaddr_in src, dest;
+	int ret;
+
+	/* XXX I wonder what effect the port space has */
+	/* delegate cm event handler to rdma_transport */
+	ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn,
+				     RDMA_PS_TCP);
+	if (IS_ERR(ic->i_cm_id)) {
+		ret = PTR_ERR(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+		rdsdebug("rdma_create_id() failed: %d\n", ret);
+		goto out;
+	}
+
+	rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn);
+
+	src.sin_family = AF_INET;
+	src.sin_addr.s_addr = (__force u32)conn->c_laddr;
+	src.sin_port = (__force u16)htons(0);
+
+	dest.sin_family = AF_INET;
+	dest.sin_addr.s_addr = (__force u32)conn->c_faddr;
+	dest.sin_port = (__force u16)htons(RDS_PORT);
+
+	ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src,
+				(struct sockaddr *)&dest,
+				RDS_RDMA_RESOLVE_TIMEOUT_MS);
+	if (ret) {
+		rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id,
+			 ret);
+		rdma_destroy_id(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * This is so careful about only cleaning up resources that were built up
+ * so that it can be called at any point during startup.  In fact it
+ * can be called multiple times for a given connection.
+ */
+void rds_ib_conn_shutdown(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	int err = 0;
+
+	rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id,
+		 ic->i_pd, ic->i_send_cq, ic->i_recv_cq,
+		 ic->i_cm_id ? ic->i_cm_id->qp : NULL);
+
+	if (ic->i_cm_id) {
+		struct ib_device *dev = ic->i_cm_id->device;
+
+		rdsdebug("disconnecting cm %p\n", ic->i_cm_id);
+		err = rdma_disconnect(ic->i_cm_id);
+		if (err) {
+			/* Actually this may happen quite frequently, when
+			 * an outgoing connect raced with an incoming connect.
+			 */
+			rdsdebug("failed to disconnect, cm: %p err %d\n",
+				ic->i_cm_id, err);
+		}
+
+		wait_event(rds_ib_ring_empty_wait,
+			rds_ib_ring_empty(&ic->i_send_ring) &&
+			rds_ib_ring_empty(&ic->i_recv_ring));
+
+		if (ic->i_send_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_send_hdrs,
+					   ic->i_send_hdrs_dma);
+
+		if (ic->i_recv_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_recv_hdrs,
+					   ic->i_recv_hdrs_dma);
+
+		if (ic->i_ack)
+			ib_dma_free_coherent(dev, sizeof(struct rds_header),
+					     ic->i_ack, ic->i_ack_dma);
+
+		if (ic->i_sends)
+			rds_ib_send_clear_ring(ic);
+		if (ic->i_recvs)
+			rds_ib_recv_clear_ring(ic);
+
+		if (ic->i_cm_id->qp)
+			rdma_destroy_qp(ic->i_cm_id);
+		if (ic->i_send_cq)
+			ib_destroy_cq(ic->i_send_cq);
+		if (ic->i_recv_cq)
+			ib_destroy_cq(ic->i_recv_cq);
+		rdma_destroy_id(ic->i_cm_id);
+
+		/*
+		 * Move connection back to the nodev list.
+		 */
+		if (ic->rds_ibdev) {
+
+			spin_lock_irq(&ic->rds_ibdev->spinlock);
+			BUG_ON(list_empty(&ic->ib_node));
+			list_del(&ic->ib_node);
+			spin_unlock_irq(&ic->rds_ibdev->spinlock);
+
+			spin_lock_irq(&ib_nodev_conns_lock);
+			list_add_tail(&ic->ib_node, &ib_nodev_conns);
+			spin_unlock_irq(&ib_nodev_conns_lock);
+			ic->rds_ibdev = NULL;
+		}
+
+		ic->i_cm_id = NULL;
+		ic->i_pd = NULL;
+		ic->i_mr = NULL;
+		ic->i_send_cq = NULL;
+		ic->i_recv_cq = NULL;
+		ic->i_send_hdrs = NULL;
+		ic->i_recv_hdrs = NULL;
+		ic->i_ack = NULL;
+	}
+	BUG_ON(ic->rds_ibdev);
+
+	/* Clear pending transmit */
+	if (ic->i_rm) {
+		rds_message_put(ic->i_rm);
+		ic->i_rm = NULL;
+	}
+
+	/* Clear the ACK state */
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+	rds_ib_set_64bit(&ic->i_ack_next, 0);
+	ic->i_ack_recv = 0;
+
+	/* Clear flow control state */
+	ic->i_flowctl = 0;
+	atomic_set(&ic->i_credits, 0);
+
+	rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr);
+	rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr);
+
+	if (ic->i_ibinc) {
+		rds_inc_put(&ic->i_ibinc->ii_inc);
+		ic->i_ibinc = NULL;
+	}
+
+	vfree(ic->i_sends);
+	ic->i_sends = NULL;
+	vfree(ic->i_recvs);
+	ic->i_recvs = NULL;
+}
+
+int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
+{
+	struct rds_ib_connection *ic;
+	unsigned long flags;
+
+	/* XXX too lazy? */
+	ic = kzalloc(sizeof(struct rds_ib_connection), GFP_KERNEL);
+	if (ic == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&ic->ib_node);
+	mutex_init(&ic->i_recv_mutex);
+
+	/*
+	 * rds_ib_conn_shutdown() waits for these to be emptied so they
+	 * must be initialized before it can be called.
+	 */
+	rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr);
+	rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr);
+
+	ic->conn = conn;
+	conn->c_transport_data = ic;
+
+	spin_lock_irqsave(&ib_nodev_conns_lock, flags);
+	list_add_tail(&ic->ib_node, &ib_nodev_conns);
+	spin_unlock_irqrestore(&ib_nodev_conns_lock, flags);
+
+
+	rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data);
+	return 0;
+}
+
+void rds_ib_conn_free(void *arg)
+{
+	struct rds_ib_connection *ic = arg;
+	rdsdebug("ic %p\n", ic);
+	list_del(&ic->ib_node);
+	kfree(ic);
+}
+
+
+/*
+ * An error occurred on the connection
+ */
+void
+__rds_ib_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{
+	va_list ap;
+
+	rds_conn_drop(conn);
+
+	va_start(ap, fmt);
+	vprintk(fmt, ap);
+	va_end(ap);
+}
-- 
1.5.6.3
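
Note on the version negotiation in rds_ib_protocol_compatible() above:
the peer sends a bitmask of the minor versions it supports, and the
receiver picks the highest minor that both sides have in common. The
following is a minimal user-space sketch of that selection, not the
kernel code itself. The sketch_* names and the two macro values are
assumptions made for illustration (version encoded as major << 8 | minor,
bit n of the mask meaning "minor n supported"); the real definitions live
in the RDS headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings, for illustration only. */
#define SKETCH_PROTOCOL_3_0        0x0300u
#define SKETCH_SUPPORTED_PROTOCOLS 0x0003u /* bit n set => minor n supported */

/*
 * Return the negotiated version, or 0 if the peer is incompatible.
 * peer_mask is expected in host byte order.
 */
static uint32_t sketch_negotiate(uint8_t peer_major, uint16_t peer_mask)
{
	uint16_t common;
	uint32_t version = 0;

	/* Zeroed private data: an older peer that does not negotiate. */
	if (peer_major == 0)
		return SKETCH_PROTOCOL_3_0;

	common = peer_mask & SKETCH_SUPPORTED_PROTOCOLS;
	if (peer_major == 3 && common) {
		version = SKETCH_PROTOCOL_3_0;
		/* Every shift that leaves a nonzero value means a higher
		 * minor is still present in the common set. */
		while ((common >>= 1) != 0)
			version++;
	}
	return version;
}
```

Under these assumptions a peer advertising mask 0x0003 negotiates 3.1, a
peer advertising only 0x0001 negotiates 3.0, and a peer with a different
major version or no common bit is rejected with 0.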


From andy.grover at oracle.com  Tue Feb 24 17:30:31 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:31 -0800
Subject: [ofa-general] [PATCH 14/26] RDS/IB: Ring-handling code.
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-15-git-send-email-andy.grover@oracle.com>

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/ib_ring.c |  168 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 168 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib_ring.c
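
Note on the counter scheme described in the locking comment of
ib_ring.c: the allocation and free counters run freely and are allowed
to wrap, and the number of used entries is their unsigned difference.
The following user-space sketch mirrors that idea so the wrap-around
behaviour can be tried outside the kernel. It is an illustration under
assumptions, not the patch's code: the sketch_* names are hypothetical,
the free counter is a plain integer here (the patch uses atomic_t), and
no locking or wakeups are modelled.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_ring {
	uint32_t nr;        /* number of slots in the ring */
	uint32_t alloc_ptr; /* next slot to hand out, in [0, nr) */
	uint32_t alloc_ctr; /* free-running count of allocated slots */
	uint32_t free_ctr;  /* free-running count of freed slots */
};

/* Used slots: unsigned subtraction stays correct across counter wrap. */
static uint32_t sketch_ring_used(const struct sketch_ring *r)
{
	uint32_t diff = r->alloc_ctr - r->free_ctr;

	assert(diff <= r->nr); /* plays the role of BUG_ON() */
	return diff;
}

/* Allocate up to 'want' slots; returns how many were granted and
 * stores the first granted slot index in *pos. */
static uint32_t sketch_ring_alloc(struct sketch_ring *r, uint32_t want,
				  uint32_t *pos)
{
	uint32_t avail = r->nr - sketch_ring_used(r);
	uint32_t got = 0;

	if (want && avail) {
		got = want < avail ? want : avail;
		*pos = r->alloc_ptr;
		r->alloc_ptr = (r->alloc_ptr + got) % r->nr;
		r->alloc_ctr += got;
	}
	return got;
}

/* Return 'n' slots to the ring. */
static void sketch_ring_free(struct sketch_ring *r, uint32_t n)
{
	r->free_ctr += n;
}
```

Starting both counters just below 2^32 is a quick way to confirm that
the used count stays right when the counters wrap.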

diff --git a/net/rds/ib_ring.c b/net/rds/ib_ring.c
new file mode 100644
index 0000000..99a6cca
--- /dev/null
+++ b/net/rds/ib_ring.c
@@ -0,0 +1,168 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "ib.h"
+
+/*
+ * Locking for IB rings.
+ * We assume that allocation is always protected by a mutex
+ * in the caller (this is a valid assumption for the current
+ * implementation).
+ *
+ * Freeing always happens in an interrupt, and hence only
+ * races with allocations, but not with other free()s.
+ *
+ * The interaction between allocation and freeing is that
+ * the alloc code has to determine the number of free entries.
+ * To this end, we maintain two counters; an allocation counter
+ * and a free counter. Both are allowed to run freely, and wrap
+ * around.
+ * The number of used entries is always alloc_ctr - free_ctr (mod 2^32).
+ *
+ * The current implementation makes free_ctr atomic. When the
+ * caller finds an allocation fails, it should set an "alloc fail"
+ * bit and retry the allocation. The "alloc fail" bit essentially tells
+ * the CQ completion handlers to wake it up after freeing some
+ * more entries.
+ */
+
+/*
+ * This only happens on shutdown.
+ */
+DECLARE_WAIT_QUEUE_HEAD(rds_ib_ring_empty_wait);
+
+void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr)
+{
+	memset(ring, 0, sizeof(*ring));
+	ring->w_nr = nr;
+	rdsdebug("ring %p nr %u\n", ring, ring->w_nr);
+}
+
+static inline u32 __rds_ib_ring_used(struct rds_ib_work_ring *ring)
+{
+	u32 diff;
+
+	/* This assumes that atomic_t has at least as many bits as u32 */
+	diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr);
+	BUG_ON(diff > ring->w_nr);
+
+	return diff;
+}
+
+void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr)
+{
+	/* We only ever get called from the connection setup code,
+	 * prior to creating the QP. */
+	BUG_ON(__rds_ib_ring_used(ring));
+	ring->w_nr = nr;
+}
+
+static int __rds_ib_ring_empty(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_used(ring) == 0;
+}
+
+u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos)
+{
+	u32 ret = 0, avail;
+
+	avail = ring->w_nr - __rds_ib_ring_used(ring);
+
+	rdsdebug("ring %p val %u next %u free %u\n", ring, val,
+		 ring->w_alloc_ptr, avail);
+
+	if (val && avail) {
+		ret = min(val, avail);
+		*pos = ring->w_alloc_ptr;
+
+		ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr;
+		ring->w_alloc_ctr += ret;
+	}
+
+	return ret;
+}
+
+void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val)
+{
+	ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr;
+	atomic_add(val, &ring->w_free_ctr);
+
+	if (__rds_ib_ring_empty(ring) &&
+	    waitqueue_active(&rds_ib_ring_empty_wait))
+		wake_up(&rds_ib_ring_empty_wait);
+}
+
+void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val)
+{
+	ring->w_alloc_ptr = (ring->w_alloc_ptr - val) % ring->w_nr;
+	ring->w_alloc_ctr -= val;
+}
+
+int rds_ib_ring_empty(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_empty(ring);
+}
+
+int rds_ib_ring_low(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_used(ring) <= (ring->w_nr >> 2);
+}
+
+/*
+ * returns the oldest alloced ring entry.  This will be the next one
+ * freed.  This can't be called if there are none allocated.
+ */
+u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring)
+{
+	return ring->w_free_ptr;
+}
+
+/*
+ * returns the number of completed work requests.
+ */
+
+u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest)
+{
+	u32 ret;
+
+	if (oldest <= (unsigned long long)wr_id)
+		ret = (unsigned long long)wr_id - oldest + 1;
+	else
+		ret = ring->w_nr - oldest + (unsigned long long)wr_id + 1;
+
+	rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret,
+		 wr_id, oldest);
+	return ret;
+}
-- 
1.5.6.3
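
Note on rds_ib_ring_completed() above: a completion for slot wr_id
covers every slot from the oldest outstanding one up to and including
wr_id, and the range may wrap past the end of the ring. A small
user-space sketch of the same arithmetic follows; the function name and
the explicit nr parameter are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of ring slots covered by a completion for slot 'wr_id' when
 * 'oldest' is the oldest outstanding slot. Both are indices in [0, nr).
 */
static uint32_t sketch_ring_completed(uint32_t nr, uint32_t wr_id,
				      uint32_t oldest)
{
	assert(wr_id < nr && oldest < nr);

	if (oldest <= wr_id)
		return wr_id - oldest + 1;  /* contiguous range */
	return nr - oldest + wr_id + 1;     /* range wraps past slot nr-1 */
}
```

For a ring of 8 slots, oldest = 6 and wr_id = 1 cover slots 6, 7, 0 and
1, so the result is 4.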


From andy.grover at oracle.com  Tue Feb 24 17:30:29 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:29 -0800
Subject: [ofa-general] [PATCH 12/26] RDS: RDMA support
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-13-git-send-email-andy.grover@oracle.com>

Some transports may support RDMA features. This handles the
non-transport-specific parts, like pinning user pages and
tracking mapped regions.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/rdma.c |  679 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/rdma.h |   84 +++++++
 2 files changed, 763 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/rdma.c
 create mode 100644 net/rds/rdma.h
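
Note on two pieces of arithmetic in rdma.c: rds_pages_in_vec() counts
the pages an (addr, bytes) range touches and rejects ranges that wrap or
exceed an unsigned int, and __rds_rdma_map() packs the R_Key together
with the in-page offset into a 64-bit cookie. The user-space sketch
below illustrates both. Everything named sketch_* is hypothetical, the
4 KiB page size is an assumption, and the cookie layout shown (key in
the low 32 bits, offset in the high 32 bits) is only one possible
packing; the authoritative layout is whatever rdma.h in this patch
defines.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ull << SKETCH_PAGE_SHIFT)

/* Pages touched by [addr, addr + bytes); 0 means the range is invalid. */
static unsigned int sketch_pages_in_vec(uint64_t addr, uint64_t bytes)
{
	/* Reject empty or wrapping ranges and lengths above 32 bits. */
	if (addr + bytes <= addr || bytes > (uint64_t)UINT32_MAX)
		return 0;

	return (unsigned int)(((addr + bytes + SKETCH_PAGE_SIZE - 1)
			       >> SKETCH_PAGE_SHIFT) -
			      (addr >> SKETCH_PAGE_SHIFT));
}

/* Assumed packing: low 32 bits = key, high 32 bits = page offset. */
static uint64_t sketch_make_cookie(uint32_t r_key, uint32_t offset)
{
	return (uint64_t)r_key | ((uint64_t)offset << 32);
}

static uint32_t sketch_cookie_key(uint64_t cookie)
{
	return (uint32_t)cookie;
}

static uint32_t sketch_cookie_offset(uint64_t cookie)
{
	return (uint32_t)(cookie >> 32);
}
```

A 0x20-byte buffer starting at 0x1ff0 straddles a page boundary and so
counts as two pages, while a page-aligned 4 KiB buffer counts as one.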

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
new file mode 100644
index 0000000..eaeeb91
--- /dev/null
+++ b/net/rds/rdma.c
@@ -0,0 +1,679 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/pagemap.h>
+#include <linux/rbtree.h>
+#include <linux/dma-mapping.h> /* for DMA_*_DEVICE */
+
+#include "rdma.h"
+
+/*
+ * XXX
+ *  - build with sparse
+ *  - should we limit the size of a mr region?  let transport return failure?
+ *  - should we detect duplicate keys on a socket?  hmm.
+ *  - an rdma is an mlock, apply rlimit?
+ */
+
+/*
+ * get the number of pages by looking at the page indices that the start and
+ * end addresses fall in.
+ *
+ * Returns 0 if the vec is invalid.  It is invalid if the number of bytes
+ * causes the address to wrap or overflows an unsigned int.  This comes
+ * from being stored in the 'length' member of 'struct scatterlist'.
+ */
+static unsigned int rds_pages_in_vec(struct rds_iovec *vec)
+{
+	if ((vec->addr + vec->bytes <= vec->addr) ||
+	    (vec->bytes > (u64)UINT_MAX))
+		return 0;
+
+	return ((vec->addr + vec->bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+		(vec->addr >> PAGE_SHIFT);
+}
+
+static struct rds_mr *rds_mr_tree_walk(struct rb_root *root, u64 key,
+				       struct rds_mr *insert)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_mr *mr;
+
+	while (*p) {
+		parent = *p;
+		mr = rb_entry(parent, struct rds_mr, r_rb_node);
+
+		if (key < mr->r_key)
+			p = &(*p)->rb_left;
+		else if (key > mr->r_key)
+			p = &(*p)->rb_right;
+		else
+			return mr;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->r_rb_node, parent, p);
+		rb_insert_color(&insert->r_rb_node, root);
+		atomic_inc(&insert->r_refcount);
+	}
+	return NULL;
+}
+
+/*
+ * Destroy the transport-specific part of a MR.
+ */
+static void rds_destroy_mr(struct rds_mr *mr)
+{
+	struct rds_sock *rs = mr->r_sock;
+	void *trans_private = NULL;
+	unsigned long flags;
+
+	rdsdebug("RDS: destroy mr key is %x refcnt %u\n",
+			mr->r_key, atomic_read(&mr->r_refcount));
+
+	if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state))
+		return;
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	if (!RB_EMPTY_NODE(&mr->r_rb_node))
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+	trans_private = mr->r_trans_private;
+	mr->r_trans_private = NULL;
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (trans_private)
+		mr->r_trans->free_mr(trans_private, mr->r_invalidate);
+}
+
+void __rds_put_mr_final(struct rds_mr *mr)
+{
+	rds_destroy_mr(mr);
+	kfree(mr);
+}
+
+/*
+ * By the time this is called we can't have any more ioctls called on
+ * the socket so we don't need to worry about racing with others.
+ */
+void rds_rdma_drop_keys(struct rds_sock *rs)
+{
+	struct rds_mr *mr;
+	struct rb_node *node;
+
+	/* Release any MRs associated with this socket */
+	while ((node = rb_first(&rs->rs_rdma_keys))) {
+		mr = container_of(node, struct rds_mr, r_rb_node);
+		if (mr->r_trans == rs->rs_transport)
+			mr->r_invalidate = 0;
+		rds_mr_put(mr);
+	}
+
+	if (rs->rs_transport && rs->rs_transport->flush_mrs)
+		rs->rs_transport->flush_mrs();
+}
+
+/*
+ * Helper function to pin user pages.
+ */
+static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages,
+			struct page **pages, int write)
+{
+	int ret;
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, user_addr,
+			     nr_pages, write, 0, pages, NULL);
+	up_read(&current->mm->mmap_sem);
+
+	if (0 <= ret && (unsigned) ret < nr_pages) {
+		while (ret--)
+			put_page(pages[ret]);
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
+				u64 *cookie_ret, struct rds_mr **mr_ret)
+{
+	struct rds_mr *mr = NULL, *found;
+	unsigned int nr_pages;
+	struct page **pages = NULL;
+	struct scatterlist *sg;
+	void *trans_private;
+	unsigned long flags;
+	rds_rdma_cookie_t cookie;
+	unsigned int nents;
+	long i;
+	int ret;
+
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (rs->rs_transport->get_mr == NULL) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	nr_pages = rds_pages_in_vec(&args->vec);
+	if (nr_pages == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n",
+		args->vec.addr, args->vec.bytes, nr_pages);
+
+	/* XXX clamp nr_pages to limit the size of this alloc? */
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mr = kzalloc(sizeof(struct rds_mr), GFP_KERNEL);
+	if (mr == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	atomic_set(&mr->r_refcount, 1);
+	RB_CLEAR_NODE(&mr->r_rb_node);
+	mr->r_trans = rs->rs_transport;
+	mr->r_sock = rs;
+
+	if (args->flags & RDS_RDMA_USE_ONCE)
+		mr->r_use_once = 1;
+	if (args->flags & RDS_RDMA_INVALIDATE)
+		mr->r_invalidate = 1;
+	if (args->flags & RDS_RDMA_READWRITE)
+		mr->r_write = 1;
+
+	/*
+	 * Pin the pages that make up the user buffer and transfer the page
+	 * pointers to the mr's sg array.  We check to see if we've mapped
+	 * the whole region after transferring the partial page references
+	 * to the sg array so that we can have one page ref cleanup path.
+	 *
+	 * For now we have no flag that tells us whether the mapping is
+	 * r/o or r/w. We need to assume r/w, or we'll do a lot of RDMA to
+	 * the zero page.
+	 */
+	ret = rds_pin_pages(args->vec.addr & PAGE_MASK, nr_pages, pages, 1);
+	if (ret < 0)
+		goto out;
+
+	nents = ret;
+	sg = kcalloc(nents, sizeof(*sg), GFP_KERNEL);
+	if (sg == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	WARN_ON(!nents);
+	sg_init_table(sg, nents);
+
+	/* Stick all pages into the scatterlist */
+	for (i = 0 ; i < nents; i++)
+		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);
+
+	rdsdebug("RDS: trans_private nents is %u\n", nents);
+
+	/* Obtain a transport specific MR. If this succeeds, the
+	 * s/g list is now owned by the MR.
+	 * Note that dma_map() implies that pending writes are
+	 * flushed to RAM, so no dma_sync is needed here. */
+	trans_private = rs->rs_transport->get_mr(sg, nents, rs,
+						 &mr->r_key);
+
+	if (IS_ERR(trans_private)) {
+		for (i = 0 ; i < nents; i++)
+			put_page(sg_page(&sg[i]));
+		kfree(sg);
+		ret = PTR_ERR(trans_private);
+		goto out;
+	}
+
+	mr->r_trans_private = trans_private;
+
+	rdsdebug("RDS: get_mr put_user key is %x cookie_addr %p\n",
+	       mr->r_key, (void *)(unsigned long) args->cookie_addr);
+
+	/* The user may pass us an unaligned address, but we can only
+	 * map page aligned regions. So we keep the offset, and build
+	 * a 64bit cookie containing <R_Key, offset> and pass that
+	 * around. */
+	cookie = rds_rdma_make_cookie(mr->r_key, args->vec.addr & ~PAGE_MASK);
+	if (cookie_ret)
+		*cookie_ret = cookie;
+
+	if (args->cookie_addr && put_user(cookie, (u64 __user *)(unsigned long) args->cookie_addr)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* Inserting the new MR into the rbtree bumps its
+	 * reference count. */
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	found = rds_mr_tree_walk(&rs->rs_rdma_keys, mr->r_key, mr);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	BUG_ON(found && found != mr);
+
+	rdsdebug("RDS: get_mr key is %x\n", mr->r_key);
+	if (mr_ret) {
+		atomic_inc(&mr->r_refcount);
+		*mr_ret = mr;
+	}
+
+	ret = 0;
+out:
+	kfree(pages);
+	if (mr)
+		rds_mr_put(mr);
+	return ret;
+}
+
+int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen)
+{
+	struct rds_get_mr_args args;
+
+	if (optlen != sizeof(struct rds_get_mr_args))
+		return -EINVAL;
+
+	if (copy_from_user(&args, (struct rds_get_mr_args __user *)optval,
+			   sizeof(struct rds_get_mr_args)))
+		return -EFAULT;
+
+	return __rds_rdma_map(rs, &args, NULL, NULL);
+}
+
+/*
+ * Free the MR indicated by the given R_Key
+ */
+int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen)
+{
+	struct rds_free_mr_args args;
+	struct rds_mr *mr;
+	unsigned long flags;
+
+	if (optlen != sizeof(struct rds_free_mr_args))
+		return -EINVAL;
+
+	if (copy_from_user(&args, (struct rds_free_mr_args __user *)optval,
+			   sizeof(struct rds_free_mr_args)))
+		return -EFAULT;
+
+	/* Special case - a null cookie means flush all unused MRs */
+	if (args.cookie == 0) {
+		if (!rs->rs_transport || !rs->rs_transport->flush_mrs)
+			return -EINVAL;
+		rs->rs_transport->flush_mrs();
+		return 0;
+	}
+
+	/* Look up the MR given its R_key and remove it from the rbtree
+	 * so nobody else finds it.
+	 * This should also prevent races with rds_rdma_unuse.
+	 */
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, rds_rdma_cookie_key(args.cookie), NULL);
+	if (mr) {
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+		RB_CLEAR_NODE(&mr->r_rb_node);
+		if (args.flags & RDS_RDMA_INVALIDATE)
+			mr->r_invalidate = 1;
+	}
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (!mr)
+		return -EINVAL;
+
+	/*
+	 * call rds_destroy_mr() ourselves so that we're sure it's done by the time
+	 * we return.  If we let rds_mr_put() do it it might not happen until
+	 * someone else drops their ref.
+	 */
+	rds_destroy_mr(mr);
+	rds_mr_put(mr);
+	return 0;
+}
+
+/*
+ * This is called when we receive an extension header that
+ * tells us this MR was used. It allows us to implement
+ * use_once semantics
+ */
+void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force)
+{
+	struct rds_mr *mr;
+	unsigned long flags;
+	int zot_me = 0;
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
+	if (mr && (mr->r_use_once || force)) {
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+		RB_CLEAR_NODE(&mr->r_rb_node);
+		zot_me = 1;
+	} else if (mr)
+		atomic_inc(&mr->r_refcount);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	/* May have to issue a dma_sync on this memory region.
+	 * Note we could avoid this if the operation was a RDMA READ,
+	 * but at this point we can't tell. */
+	if (mr != NULL) {
+		if (mr->r_trans->sync_mr)
+			mr->r_trans->sync_mr(mr->r_trans_private, DMA_FROM_DEVICE);
+
+		/* If the MR was marked as invalidate, this will
+		 * trigger an async flush. */
+		if (zot_me)
+			rds_destroy_mr(mr);
+		rds_mr_put(mr);
+	}
+}
+
+void rds_rdma_free_op(struct rds_rdma_op *ro)
+{
+	unsigned int i;
+
+	for (i = 0; i < ro->r_nents; i++) {
+		struct page *page = sg_page(&ro->r_sg[i]);
+
+		/* Mark page dirty if it was possibly modified, which
+		 * is the case for a RDMA_READ which copies from remote
+		 * to local memory */
+		if (!ro->r_write)
+			set_page_dirty(page);
+		put_page(page);
+	}
+
+	kfree(ro->r_notifier);
+	kfree(ro);
+}
+
+/*
+ * args is a pointer to an in-kernel copy in the sendmsg cmsg.
+ */
+static struct rds_rdma_op *rds_rdma_prepare(struct rds_sock *rs,
+					    struct rds_rdma_args *args)
+{
+	struct rds_iovec vec;
+	struct rds_rdma_op *op = NULL;
+	unsigned int nr_pages;
+	unsigned int max_pages;
+	unsigned int nr_bytes;
+	struct page **pages = NULL;
+	struct rds_iovec __user *local_vec;
+	struct scatterlist *sg;
+	unsigned int nr;
+	unsigned int i, j;
+	int ret;
+
+
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (args->nr_local > (u64)UINT_MAX) {
+		ret = -EMSGSIZE;
+		goto out;
+	}
+
+	nr_pages = 0;
+	max_pages = 0;
+
+	local_vec = (struct rds_iovec __user *)(unsigned long) args->local_vec_addr;
+
+	/* figure out the number of pages in the vector */
+	for (i = 0; i < args->nr_local; i++) {
+		if (copy_from_user(&vec, &local_vec[i],
+				   sizeof(struct rds_iovec))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		nr = rds_pages_in_vec(&vec);
+		if (nr == 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		max_pages = max(nr, max_pages);
+		nr_pages += nr;
+	}
+
+	pages = kcalloc(max_pages, sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	op = kzalloc(offsetof(struct rds_rdma_op, r_sg[nr_pages]), GFP_KERNEL);
+	if (op == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	op->r_write = !!(args->flags & RDS_RDMA_READWRITE);
+	op->r_fence = !!(args->flags & RDS_RDMA_FENCE);
+	op->r_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME);
+	op->r_recverr = rs->rs_recverr;
+	WARN_ON(!nr_pages);
+	sg_init_table(op->r_sg, nr_pages);
+
+	if (op->r_notify || op->r_recverr) {
+		/* We allocate an uninitialized notifier here, because
+		 * we don't want to do that in the completion handler. We
+		 * would have to use GFP_ATOMIC there, and don't want to deal
+		 * with failed allocations.
+		 */
+		op->r_notifier = kmalloc(sizeof(struct rds_notifier), GFP_KERNEL);
+		if (!op->r_notifier) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		op->r_notifier->n_user_token = args->user_token;
+		op->r_notifier->n_status = RDS_RDMA_SUCCESS;
+	}
+
+	/* The cookie contains the R_Key of the remote memory region, and
+	 * optionally an offset into it. This is how we implement RDMA into
+	 * unaligned memory.
+	 * When setting up the RDMA, we need to add that offset to the
+	 * destination address (which is really an offset into the MR)
+	 * FIXME: We may want to move this into ib_rdma.c
+	 */
+	op->r_key = rds_rdma_cookie_key(args->cookie);
+	op->r_remote_addr = args->remote_vec.addr + rds_rdma_cookie_offset(args->cookie);
+
+	nr_bytes = 0;
+
+	rdsdebug("RDS: rdma prepare nr_local %llu rva %llx rkey %x\n",
+	       (unsigned long long)args->nr_local,
+	       (unsigned long long)args->remote_vec.addr,
+	       op->r_key);
+
+	for (i = 0; i < args->nr_local; i++) {
+		if (copy_from_user(&vec, &local_vec[i],
+				   sizeof(struct rds_iovec))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		nr = rds_pages_in_vec(&vec);
+		if (nr == 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		rs->rs_user_addr = vec.addr;
+		rs->rs_user_bytes = vec.bytes;
+
+		/* did the user change the vec under us? */
+		if (nr > max_pages || op->r_nents + nr > nr_pages) {
+			ret = -EINVAL;
+			goto out;
+		}
+		/* If it's a WRITE operation, we want to pin the pages for reading.
+		 * If it's a READ operation, we need to pin the pages for writing.
+		 */
+		ret = rds_pin_pages(vec.addr & PAGE_MASK, nr, pages, !op->r_write);
+		if (ret < 0)
+			goto out;
+
+		rdsdebug("RDS: nr_bytes %u nr %u vec.bytes %llu vec.addr %llx\n",
+		       nr_bytes, nr, vec.bytes, vec.addr);
+
+		nr_bytes += vec.bytes;
+
+		for (j = 0; j < nr; j++) {
+			unsigned int offset = vec.addr & ~PAGE_MASK;
+
+			sg = &op->r_sg[op->r_nents + j];
+			sg_set_page(sg, pages[j],
+					min_t(unsigned int, vec.bytes, PAGE_SIZE - offset),
+					offset);
+
+			rdsdebug("RDS: sg->offset %x sg->len %x vec.addr %llx vec.bytes %llu\n",
+			       sg->offset, sg->length, vec.addr, vec.bytes);
+
+			vec.addr += sg->length;
+			vec.bytes -= sg->length;
+		}
+
+		op->r_nents += nr;
+	}
+
+
+	if (nr_bytes > args->remote_vec.bytes) {
+		rdsdebug("RDS nr_bytes %u remote_bytes %u do not match\n",
+				nr_bytes,
+				(unsigned int) args->remote_vec.bytes);
+		ret = -EINVAL;
+		goto out;
+	}
+	op->r_bytes = nr_bytes;
+
+	ret = 0;
+out:
+	kfree(pages);
+	if (ret) {
+		if (op)
+			rds_rdma_free_op(op);
+		op = ERR_PTR(ret);
+	}
+	return op;
+}
+
+/*
+ * The application asks for a RDMA transfer.
+ * Extract all arguments and set up the rdma_op
+ */
+int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	struct rds_rdma_op *op;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_rdma_args))
+	 || rm->m_rdma_op != NULL)
+		return -EINVAL;
+
+	op = rds_rdma_prepare(rs, CMSG_DATA(cmsg));
+	if (IS_ERR(op))
+		return PTR_ERR(op);
+	rds_stats_inc(s_send_rdma);
+	rm->m_rdma_op = op;
+	return 0;
+}
+
+/*
+ * The application wants us to pass an RDMA destination (aka MR)
+ * to the remote
+ */
+int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	unsigned long flags;
+	struct rds_mr *mr;
+	u32 r_key;
+	int err = 0;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(rds_rdma_cookie_t))
+	 || rm->m_rdma_cookie != 0)
+		return -EINVAL;
+
+	memcpy(&rm->m_rdma_cookie, CMSG_DATA(cmsg), sizeof(rm->m_rdma_cookie));
+
+	/* We are reusing a previously mapped MR here. Most likely, the
+	 * application has written to the buffer, so we need to explicitly
+	 * flush those writes to RAM. Otherwise the HCA may not see them
+	 * when doing a DMA from that buffer.
+	 */
+	r_key = rds_rdma_cookie_key(rm->m_rdma_cookie);
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
+	if (mr == NULL)
+		err = -EINVAL;	/* invalid r_key */
+	else
+		atomic_inc(&mr->r_refcount);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (mr) {
+		mr->r_trans->sync_mr(mr->r_trans_private, DMA_TO_DEVICE);
+		rm->m_rdma_mr = mr;
+	}
+	return err;
+}
+
+/*
+ * The application passes us an address range it wants to enable RDMA
+ * to/from. We map the area, and save the <R_Key,offset> pair
+ * in rm->m_rdma_cookie. This causes it to be sent along to the peer
+ * in an extension header.
+ */
+int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_get_mr_args))
+	 || rm->m_rdma_cookie != 0)
+		return -EINVAL;
+
+	return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie, &rm->m_rdma_mr);
+}
diff --git a/net/rds/rdma.h b/net/rds/rdma.h
new file mode 100644
index 0000000..4255120
--- /dev/null
+++ b/net/rds/rdma.h
@@ -0,0 +1,84 @@
+#ifndef _RDS_RDMA_H
+#define _RDS_RDMA_H
+
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/scatterlist.h>
+
+#include "rds.h"
+
+struct rds_mr {
+	struct rb_node		r_rb_node;
+	atomic_t		r_refcount;
+	u32			r_key;
+
+	/* A copy of the creation flags */
+	unsigned int		r_use_once:1;
+	unsigned int		r_invalidate:1;
+	unsigned int		r_write:1;
+
+	/* This is for RDS_MR_DEAD.
+	 * It would be nice & consistent to make this part of the above
+	 * bit field here, but we need to use test_and_set_bit.
+	 */
+	unsigned long		r_state;
+	struct rds_sock		*r_sock; /* back pointer to the socket that owns us */
+	struct rds_transport	*r_trans;
+	void			*r_trans_private;
+};
+
+/* Flags for mr->r_state */
+#define RDS_MR_DEAD		0
+
+struct rds_rdma_op {
+	u32			r_key;
+	u64			r_remote_addr;
+	unsigned int		r_write:1;
+	unsigned int		r_fence:1;
+	unsigned int		r_notify:1;
+	unsigned int		r_recverr:1;
+	unsigned int		r_mapped:1;
+	struct rds_notifier	*r_notifier;
+	unsigned int		r_bytes;
+	unsigned int		r_nents;
+	unsigned int		r_count;
+	struct scatterlist	r_sg[0];
+};
+
+static inline rds_rdma_cookie_t rds_rdma_make_cookie(u32 r_key, u32 offset)
+{
+	return r_key | (((u64) offset) << 32);
+}
+
+static inline u32 rds_rdma_cookie_key(rds_rdma_cookie_t cookie)
+{
+	return cookie;
+}
+
+static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
+{
+	return cookie >> 32;
+}
+
+int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen);
+int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen);
+void rds_rdma_drop_keys(struct rds_sock *rs);
+int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+void rds_rdma_free_op(struct rds_rdma_op *ro);
+void rds_rdma_send_complete(struct rds_message *rm, int);
+
+extern void __rds_put_mr_final(struct rds_mr *mr);
+static inline void rds_mr_put(struct rds_mr *mr)
+{
+	if (atomic_dec_and_test(&mr->r_refcount))
+		__rds_put_mr_final(mr);
+}
+
+#endif
-- 
1.5.6.3
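
The cookie helpers in rdma.h pack an R_Key into the low 32 bits and a byte
offset into the high 32 bits. A minimal userspace sketch of that encoding
follows, assuming the cookie is a 64-bit integer. The names cookie_t,
make_cookie, cookie_key, cookie_offset and remote_addr_sketch are stand-ins
for illustration, not the identifiers from the patch.

```c
#include <assert.h>	/* only needed when trying the checks below */
#include <stdint.h>

typedef uint64_t cookie_t;	/* assumed stand-in for rds_rdma_cookie_t */

/* Low 32 bits carry the R_Key, high 32 bits carry the offset. */
static cookie_t make_cookie(uint32_t r_key, uint32_t offset)
{
	return (cookie_t)r_key | ((cookie_t)offset << 32);
}

static uint32_t cookie_key(cookie_t cookie)
{
	return (uint32_t)cookie;	/* truncation keeps the low half */
}

static uint32_t cookie_offset(cookie_t cookie)
{
	return (uint32_t)(cookie >> 32);
}

/* Mirrors how rds_rdma_prepare() adds the cookie offset to the
 * remote vector address before issuing the RDMA. */
static uint64_t remote_addr_sketch(uint64_t vec_addr, cookie_t cookie)
{
	return vec_addr + cookie_offset(cookie);
}
```

With these definitions, make_cookie(0x1234, 0x10) round-trips through
cookie_key() and cookie_offset(), and a remote vector address of 0x1000
becomes 0x1010 after the offset is applied.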


From andy.grover at oracle.com  Tue Feb 24 17:30:32 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:32 -0800
Subject: [ofa-general] [PATCH 15/26] RDS/IB: Implement RDMA ops using FMRs
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-16-git-send-email-andy.grover@oracle.com>

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
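rds_ib_alloc_fmr() enforces the driver's FMR limit by incrementing
item_count first and backing the increment out if the new value exceeds
max_items. A minimal userspace sketch of that reserve-then-check pattern
follows, assuming C11 atomics in place of the kernel's atomic_t. The names
pool_sketch, pool_reserve and pool_release are hypothetical and do not
appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pool_sketch {
	atomic_int item_count;	/* slots currently reserved */
	int max_items;		/* hard limit */
};

/* Try to reserve one slot. Returns true on success. */
static bool pool_reserve(struct pool_sketch *p)
{
	/* atomic_fetch_add() returns the old value, so adding 1 gives the
	 * new count, which is what the kernel's atomic_inc_return() yields. */
	if (atomic_fetch_add(&p->item_count, 1) + 1 <= p->max_items)
		return true;

	/* Over the limit: undo the optimistic increment. */
	atomic_fetch_sub(&p->item_count, 1);
	return false;
}

/* Give a previously reserved slot back. */
static void pool_release(struct pool_sketch *p)
{
	int old = atomic_fetch_sub(&p->item_count, 1);

	assert(old > 0);
	(void)old;	/* silence unused warning when NDEBUG is set */
}
```

The counter may briefly read above max_items while a failed reservation is
being backed out, but a caller only proceeds when its own increment landed
at or below the limit. In the patch, a failed reservation is followed by a
pool flush and a retry rather than an immediate error.
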
 net/rds/ib_rdma.c |  641 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 641 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib_rdma.c
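
Before mapping, rds_ib_map_fmr() walks the DMA-mapped segments, rejects
layouts where anything but the first segment starts unaligned or anything
but the last segment ends unaligned, and computes a page count used to size
the page array. The sketch below mirrors that first loop in userspace,
assuming a fixed 4 KiB FMR page size. seg_sketch, count_pages_sketch and the
SKETCH_* macros are hypothetical names, and -1 stands in for -EINVAL.

```c
#include <assert.h>	/* only needed when trying the checks below */
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

struct seg_sketch {
	uint64_t addr;	/* DMA address of the segment */
	uint32_t len;	/* length in bytes */
};

/* Returns the page count, or -1 if an interior boundary is unaligned. */
static int count_pages_sketch(const struct seg_sketch *segs, int nsegs)
{
	uint32_t len = 0;
	int page_cnt = 0;
	int i;

	for (i = 0; i < nsegs; i++) {
		/* Only the first segment may start in the middle of a page. */
		if (segs[i].addr & ~SKETCH_PAGE_MASK) {
			if (i > 0)
				return -1;
			page_cnt++;
		}
		/* Only the last segment may end in the middle of a page. */
		if ((segs[i].addr + segs[i].len) & ~SKETCH_PAGE_MASK) {
			if (i < nsegs - 1)
				return -1;
			page_cnt++;
		}
		len += segs[i].len;
	}

	/* Whole pages covered by the total length, plus the partial ends. */
	return page_cnt + (int)(len >> SKETCH_PAGE_SHIFT);
}
```

A single aligned 12 KiB segment counts as three pages, while a two-segment
list whose second segment starts at an unaligned address is rejected.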

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
new file mode 100644
index 0000000..69a6289
--- /dev/null
+++ b/net/rds/ib_rdma.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "ib.h"
+
+
+/*
+ * This is stored as mr->r_trans_private.
+ */
+struct rds_ib_mr {
+	struct rds_ib_device	*device;
+	struct rds_ib_mr_pool	*pool;
+	struct ib_fmr		*fmr;
+	struct list_head	list;
+	unsigned int		remap_count;
+
+	struct scatterlist	*sg;
+	unsigned int		sg_len;
+	u64			*dma;
+	int			sg_dma_len;
+};
+
+/*
+ * Our own little FMR pool
+ */
+struct rds_ib_mr_pool {
+	struct mutex		flush_lock;		/* serialize fmr invalidate */
+	struct work_struct	flush_worker;		/* flush worker */
+
+	spinlock_t		list_lock;		/* protect variables below */
+	atomic_t		item_count;		/* total # of MRs */
+	atomic_t		dirty_count;		/* # of dirty MRs */
+	struct list_head	drop_list;		/* MRs that have reached their max_maps limit */
+	struct list_head	free_list;		/* unused MRs */
+	struct list_head	clean_list;		/* unused & unmapped MRs */
+	atomic_t		free_pinned;		/* memory pinned by free MRs */
+	unsigned long		max_items;
+	unsigned long		max_items_soft;
+	unsigned long		max_free_pinned;
+	struct ib_fmr_attr	fmr_attr;
+};
+
+static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all);
+static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr);
+static void rds_ib_mr_pool_flush_worker(struct work_struct *work);
+
+static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_ipaddr *i_ipaddr;
+
+	list_for_each_entry(rds_ibdev, &rds_ib_devices, list) {
+		spin_lock_irq(&rds_ibdev->spinlock);
+		list_for_each_entry(i_ipaddr, &rds_ibdev->ipaddr_list, list) {
+			if (i_ipaddr->ipaddr == ipaddr) {
+				spin_unlock_irq(&rds_ibdev->spinlock);
+				return rds_ibdev;
+			}
+		}
+		spin_unlock_irq(&rds_ibdev->spinlock);
+	}
+
+	return NULL;
+}
+
+static int rds_ib_add_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_ipaddr *i_ipaddr;
+
+	i_ipaddr = kmalloc(sizeof *i_ipaddr, GFP_KERNEL);
+	if (!i_ipaddr)
+		return -ENOMEM;
+
+	i_ipaddr->ipaddr = ipaddr;
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_add_tail(&i_ipaddr->list, &rds_ibdev->ipaddr_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	return 0;
+}
+
+static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_ipaddr *i_ipaddr, *next;
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_for_each_entry_safe(i_ipaddr, next, &rds_ibdev->ipaddr_list, list) {
+		if (i_ipaddr->ipaddr == ipaddr) {
+			list_del(&i_ipaddr->list);
+			kfree(i_ipaddr);
+			break;
+		}
+	}
+	spin_unlock_irq(&rds_ibdev->spinlock);
+}
+
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_device *rds_ibdev_old;
+
+	rds_ibdev_old = rds_ib_get_device(ipaddr);
+	if (rds_ibdev_old)
+		rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr);
+
+	return rds_ib_add_ipaddr(rds_ibdev, ipaddr);
+}
+
+int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	/* conn was previously on the nodev_conns_list */
+	spin_lock_irq(&ib_nodev_conns_lock);
+	BUG_ON(list_empty(&ib_nodev_conns));
+	BUG_ON(list_empty(&ic->ib_node));
+	list_del(&ic->ib_node);
+	spin_unlock_irq(&ib_nodev_conns_lock);
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_add_tail(&ic->ib_node, &rds_ibdev->conn_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	ic->rds_ibdev = rds_ibdev;
+
+	return 0;
+}
+
+void rds_ib_remove_nodev_conns(void)
+{
+	struct rds_ib_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&ib_nodev_conns_lock);
+	list_splice(&ib_nodev_conns, &tmp_list);
+	INIT_LIST_HEAD(&ib_nodev_conns);
+	spin_unlock_irq(&ib_nodev_conns_lock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_splice(&rds_ibdev->conn_list, &tmp_list);
+	INIT_LIST_HEAD(&rds_ibdev->conn_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_mr_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&pool->free_list);
+	INIT_LIST_HEAD(&pool->drop_list);
+	INIT_LIST_HEAD(&pool->clean_list);
+	mutex_init(&pool->flush_lock);
+	spin_lock_init(&pool->list_lock);
+	INIT_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker);
+
+	pool->fmr_attr.max_pages = fmr_message_size;
+	pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps;
+	pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift;
+	pool->max_free_pinned = rds_ibdev->max_fmrs * fmr_message_size / 4;
+
+	/* We never allow more than max_items MRs to be allocated.
+	 * When we exceed max_items_soft, we start freeing
+	 * items more aggressively.
+	 * Make sure that max_items > max_items_soft > max_items / 2
+	 */
+	pool->max_items_soft = rds_ibdev->max_fmrs * 3 / 4;
+	pool->max_items = rds_ibdev->max_fmrs;
+
+	return pool;
+}
+
+void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo)
+{
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+	iinfo->rdma_mr_max = pool->max_items;
+	iinfo->rdma_mr_size = pool->fmr_attr.max_pages;
+}
+
+void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *pool)
+{
+	flush_workqueue(rds_wq);
+	rds_ib_flush_mr_pool(pool, 1);
+	BUG_ON(atomic_read(&pool->item_count));
+	BUG_ON(atomic_read(&pool->free_pinned));
+	kfree(pool);
+}
+
+static inline struct rds_ib_mr *rds_ib_reuse_fmr(struct rds_ib_mr_pool *pool)
+{
+	struct rds_ib_mr *ibmr = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	if (!list_empty(&pool->clean_list)) {
+		ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, list);
+		list_del_init(&ibmr->list);
+	}
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	return ibmr;
+}
+
+static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+	struct rds_ib_mr *ibmr = NULL;
+	int err = 0, iter = 0;
+
+	while (1) {
+		ibmr = rds_ib_reuse_fmr(pool);
+		if (ibmr)
+			return ibmr;
+
+		/* No clean MRs - now we have the choice of either
+		 * allocating a fresh MR up to the limit imposed by the
+		 * driver, or flushing any dirty unused MRs.
+		 * We try to avoid stalling in the send path if possible,
+		 * so we allocate as long as we're allowed to.
+		 *
+		 * We're fussy with enforcing the FMR limit, though. If the driver
+		 * tells us we can't use more than N fmrs, we shouldn't start
+		 * arguing with it */
+		if (atomic_inc_return(&pool->item_count) <= pool->max_items)
+			break;
+
+		atomic_dec(&pool->item_count);
+
+		if (++iter > 2) {
+			rds_ib_stats_inc(s_ib_rdma_mr_pool_depleted);
+			return ERR_PTR(-EAGAIN);
+		}
+
+		/* We do have some empty MRs. Flush them out. */
+		rds_ib_stats_inc(s_ib_rdma_mr_pool_wait);
+		rds_ib_flush_mr_pool(pool, 0);
+	}
+
+	ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
+	if (!ibmr) {
+		err = -ENOMEM;
+		goto out_no_cigar;
+	}
+
+	ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+			(IB_ACCESS_LOCAL_WRITE |
+			 IB_ACCESS_REMOTE_READ |
+			 IB_ACCESS_REMOTE_WRITE),
+			&pool->fmr_attr);
+	if (IS_ERR(ibmr->fmr)) {
+		err = PTR_ERR(ibmr->fmr);
+		ibmr->fmr = NULL;
+		printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err);
+		goto out_no_cigar;
+	}
+
+	rds_ib_stats_inc(s_ib_rdma_mr_alloc);
+	return ibmr;
+
+out_no_cigar:
+	if (ibmr) {
+		if (ibmr->fmr)
+			ib_dealloc_fmr(ibmr->fmr);
+		kfree(ibmr);
+	}
+	atomic_dec(&pool->item_count);
+	return ERR_PTR(err);
+}
+
+static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
+	       struct scatterlist *sg, unsigned int nents)
+{
+	struct ib_device *dev = rds_ibdev->dev;
+	struct scatterlist *scat = sg;
+	u64 io_addr = 0;
+	u64 *dma_pages;
+	u32 len;
+	int page_cnt, sg_dma_len;
+	int i, j;
+	int ret;
+
+	sg_dma_len = ib_dma_map_sg(dev, sg, nents,
+				 DMA_BIDIRECTIONAL);
+	if (unlikely(!sg_dma_len)) {
+		printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n");
+		return -EBUSY;
+	}
+
+	len = 0;
+	page_cnt = 0;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &scat[i]);
+
+		if (dma_addr & ~rds_ibdev->fmr_page_mask) {
+			if (i > 0)
+				return -EINVAL;
+			else
+				++page_cnt;
+		}
+		if ((dma_addr + dma_len) & ~rds_ibdev->fmr_page_mask) {
+			if (i < sg_dma_len - 1)
+				return -EINVAL;
+			else
+				++page_cnt;
+		}
+
+		len += dma_len;
+	}
+
+	page_cnt += len >> rds_ibdev->fmr_page_shift;
+	if (page_cnt > fmr_message_size)
+		return -EINVAL;
+
+	dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC);
+	if (!dma_pages)
+		return -ENOMEM;
+
+	page_cnt = 0;
+	for (i = 0; i < sg_dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &scat[i]);
+
+		for (j = 0; j < dma_len; j += rds_ibdev->fmr_page_size)
+			dma_pages[page_cnt++] =
+				(dma_addr & rds_ibdev->fmr_page_mask) + j;
+	}
+
+	ret = ib_map_phys_fmr(ibmr->fmr,
+				   dma_pages, page_cnt, io_addr);
+	if (ret)
+		goto out;
+
+	/* Success - we successfully remapped the MR, so we can
+	 * safely tear down the old mapping. */
+	rds_ib_teardown_mr(ibmr);
+
+	ibmr->sg = scat;
+	ibmr->sg_len = nents;
+	ibmr->sg_dma_len = sg_dma_len;
+	ibmr->remap_count++;
+
+	rds_ib_stats_inc(s_ib_rdma_mr_used);
+	ret = 0;
+
+out:
+	kfree(dma_pages);
+
+	return ret;
+}
+
+void rds_ib_sync_mr(void *trans_private, int direction)
+{
+	struct rds_ib_mr *ibmr = trans_private;
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+
+	switch (direction) {
+	case DMA_FROM_DEVICE:
+		ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->sg,
+			ibmr->sg_dma_len, DMA_BIDIRECTIONAL);
+		break;
+	case DMA_TO_DEVICE:
+		ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->sg,
+			ibmr->sg_dma_len, DMA_BIDIRECTIONAL);
+		break;
+	}
+}
+
+static void __rds_ib_teardown_mr(struct rds_ib_mr *ibmr)
+{
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+
+	if (ibmr->sg_dma_len) {
+		ib_dma_unmap_sg(rds_ibdev->dev,
+				ibmr->sg, ibmr->sg_len,
+				DMA_BIDIRECTIONAL);
+		ibmr->sg_dma_len = 0;
+	}
+
+	/* Release the s/g list */
+	if (ibmr->sg_len) {
+		unsigned int i;
+
+		for (i = 0; i < ibmr->sg_len; ++i) {
+			struct page *page = sg_page(&ibmr->sg[i]);
+
+			/* FIXME we need a way to tell a r/w MR
+			 * from a r/o MR */
+			set_page_dirty(page);
+			put_page(page);
+		}
+		kfree(ibmr->sg);
+
+		ibmr->sg = NULL;
+		ibmr->sg_len = 0;
+	}
+}
+
+static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr)
+{
+	unsigned int pinned = ibmr->sg_len;
+
+	__rds_ib_teardown_mr(ibmr);
+	if (pinned) {
+		struct rds_ib_device *rds_ibdev = ibmr->device;
+		struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+		atomic_sub(pinned, &pool->free_pinned);
+	}
+}
+
+static inline unsigned int rds_ib_flush_goal(struct rds_ib_mr_pool *pool, int free_all)
+{
+	unsigned int item_count;
+
+	item_count = atomic_read(&pool->item_count);
+	if (free_all)
+		return item_count;
+
+	return 0;
+}
+
+/*
+ * Flush our pool of MRs.
+ * At a minimum, all currently unused MRs are unmapped.
+ * If the number of MRs allocated exceeds the limit, we also try
+ * to free as many MRs as needed to get back to this limit.
+ */
+static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all)
+{
+	struct rds_ib_mr *ibmr, *next;
+	LIST_HEAD(unmap_list);
+	LIST_HEAD(fmr_list);
+	unsigned long unpinned = 0;
+	unsigned long flags;
+	unsigned int nfreed = 0, ncleaned = 0, free_goal;
+	int ret = 0;
+
+	rds_ib_stats_inc(s_ib_rdma_mr_pool_flush);
+
+	mutex_lock(&pool->flush_lock);
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	/* Get the list of all MRs to be dropped. Ordering matters -
+	 * we want to put drop_list ahead of free_list. */
+	list_splice_init(&pool->free_list, &unmap_list);
+	list_splice_init(&pool->drop_list, &unmap_list);
+	if (free_all)
+		list_splice_init(&pool->clean_list, &unmap_list);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	free_goal = rds_ib_flush_goal(pool, free_all);
+
+	if (list_empty(&unmap_list))
+		goto out;
+
+	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
+	list_for_each_entry(ibmr, &unmap_list, list)
+		list_add(&ibmr->fmr->list, &fmr_list);
+	ret = ib_unmap_fmr(&fmr_list);
+	if (ret)
+		printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret);
+
+	/* Now we can destroy the DMA mapping and unpin any pages */
+	list_for_each_entry_safe(ibmr, next, &unmap_list, list) {
+		unpinned += ibmr->sg_len;
+		__rds_ib_teardown_mr(ibmr);
+		if (nfreed < free_goal || ibmr->remap_count >= pool->fmr_attr.max_maps) {
+			rds_ib_stats_inc(s_ib_rdma_mr_free);
+			list_del(&ibmr->list);
+			ib_dealloc_fmr(ibmr->fmr);
+			kfree(ibmr);
+			nfreed++;
+		}
+		ncleaned++;
+	}
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	list_splice(&unmap_list, &pool->clean_list);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	atomic_sub(unpinned, &pool->free_pinned);
+	atomic_sub(ncleaned, &pool->dirty_count);
+	atomic_sub(nfreed, &pool->item_count);
+
+out:
+	mutex_unlock(&pool->flush_lock);
+	return ret;
+}
+
+static void rds_ib_mr_pool_flush_worker(struct work_struct *work)
+{
+	struct rds_ib_mr_pool *pool = container_of(work, struct rds_ib_mr_pool, flush_worker);
+
+	rds_ib_flush_mr_pool(pool, 0);
+}
+
+void rds_ib_free_mr(void *trans_private, int invalidate)
+{
+	struct rds_ib_mr *ibmr = trans_private;
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+	unsigned long flags;
+
+	rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len);
+
+	/* Return it to the pool's free list */
+	spin_lock_irqsave(&pool->list_lock, flags);
+	if (ibmr->remap_count >= pool->fmr_attr.max_maps)
+		list_add(&ibmr->list, &pool->drop_list);
+	else
+		list_add(&ibmr->list, &pool->free_list);
+
+	atomic_add(ibmr->sg_len, &pool->free_pinned);
+	atomic_inc(&pool->dirty_count);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	/* If we've pinned too many pages, request a flush */
+	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned
+	 || atomic_read(&pool->dirty_count) >= pool->max_items / 10)
+		queue_work(rds_wq, &pool->flush_worker);
+
+	if (invalidate) {
+		if (likely(!in_interrupt())) {
+			rds_ib_flush_mr_pool(pool, 0);
+		} else {
+			/* We get here if the user created a MR marked
+			 * as use_once and invalidate at the same time. */
+			queue_work(rds_wq, &pool->flush_worker);
+		}
+	}
+}
+
+void rds_ib_flush_mrs(void)
+{
+	struct rds_ib_device *rds_ibdev;
+
+	list_for_each_entry(rds_ibdev, &rds_ib_devices, list) {
+		struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+		if (pool)
+			rds_ib_flush_mr_pool(pool, 0);
+	}
+}
+
+void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_mr *ibmr = NULL;
+	int ret;
+
+	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
+	if (!rds_ibdev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	if (!rds_ibdev->mr_pool) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ibmr = rds_ib_alloc_fmr(rds_ibdev);
+	if (IS_ERR(ibmr))
+		return ibmr;
+
+	ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+	if (ret == 0)
+		*key_ret = ibmr->fmr->rkey;
+	else
+		printk(KERN_WARNING "RDS/IB: map_fmr failed (errno=%d)\n", ret);
+
+	ibmr->device = rds_ibdev;
+
+ out:
+	if (ret) {
+		if (ibmr)
+			rds_ib_free_mr(ibmr, 0);
+		ibmr = ERR_PTR(ret);
+	}
+	return ibmr;
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:33 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:33 -0800
Subject: [ofa-general] [PATCH 16/26] RDS/IB: Implement IB-specific datagram
	send.
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-17-git-send-email-andy.grover@oracle.com>

Specific to IB is a credits-based flow control mechanism, in
addition to the expected usage of the IB API to package outgoing
data into work requests.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
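rds_ib_send_rdma_complete() translates a work-completion status into the
status reported to the application, and reports nothing at all for flushed
work requests. A minimal userspace sketch of that mapping follows. The enums
wc_sketch and notify_sketch are hypothetical stand-ins for the IB_WC_* and
RDS_RDMA_* constants, and their numeric values are not meant to match the
real ones.

```c
#include <assert.h>	/* only needed when trying the checks below */

/* Hypothetical stand-ins for a few IB_WC_* completion codes. */
enum wc_sketch {
	WC_SUCCESS,
	WC_REM_ACCESS_ERR,
	WC_WR_FLUSH_ERR,
	WC_RETRY_EXC_ERR,
};

/* Hypothetical stand-ins for the RDS_RDMA_* notification codes. */
enum notify_sketch {
	NOTIFY_SUCCESS,
	NOTIFY_REMOTE_ERROR,
	NOTIFY_OTHER_ERROR,
};

/* Returns 1 and stores the status to report in *out, or returns 0
 * when the completion should not be reported. */
static int wc_to_notify_sketch(enum wc_sketch wc, enum notify_sketch *out)
{
	switch (wc) {
	case WC_WR_FLUSH_ERR:
		return 0;	/* flushed work request: stay silent */
	case WC_SUCCESS:
		*out = NOTIFY_SUCCESS;
		return 1;
	case WC_REM_ACCESS_ERR:
		*out = NOTIFY_REMOTE_ERROR;
		return 1;
	default:
		*out = NOTIFY_OTHER_ERROR;
		return 1;
	}
}
```

Any completion code that is not listed explicitly, such as the retry-exceeded
stand-in above, falls through to the generic error, matching the default
branch in the patch.
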
 net/rds/ib_send.c |  874 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 874 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib_send.c

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
new file mode 100644
index 0000000..cb6c52c
--- /dev/null
+++ b/net/rds/ib_send.c
@@ -0,0 +1,874 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/device.h>
+#include <linux/dmapool.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "ib.h"
+
+static void rds_ib_send_rdma_complete(struct rds_message *rm,
+				      int wc_status)
+{
+	int notify_status;
+
+	switch (wc_status) {
+	case IB_WC_WR_FLUSH_ERR:
+		return;
+
+	case IB_WC_SUCCESS:
+		notify_status = RDS_RDMA_SUCCESS;
+		break;
+
+	case IB_WC_REM_ACCESS_ERR:
+		notify_status = RDS_RDMA_REMOTE_ERROR;
+		break;
+
+	default:
+		notify_status = RDS_RDMA_OTHER_ERROR;
+		break;
+	}
+	rds_rdma_send_complete(rm, notify_status);
+}
+
+static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
+				   struct rds_rdma_op *op)
+{
+	if (op->r_mapped) {
+		ib_dma_unmap_sg(ic->i_cm_id->device,
+			op->r_sg, op->r_nents,
+			op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		op->r_mapped = 0;
+	}
+}
+
+static void rds_ib_send_unmap_rm(struct rds_ib_connection *ic,
+			  struct rds_ib_send_work *send,
+			  int wc_status)
+{
+	struct rds_message *rm = send->s_rm;
+
+	rdsdebug("ic %p send %p rm %p\n", ic, send, rm);
+
+	ib_dma_unmap_sg(ic->i_cm_id->device,
+		     rm->m_sg, rm->m_nents,
+		     DMA_TO_DEVICE);
+
+	if (rm->m_rdma_op != NULL) {
+		rds_ib_send_unmap_rdma(ic, rm->m_rdma_op);
+
+		/* If the user asked for a completion notification on this
+		 * message, we can implement three different semantics:
+		 *  1.	Notify when we received the ACK on the RDS message
+		 *	that was queued with the RDMA. This provides reliable
+		 *	notification of RDMA status at the expense of a one-way
+		 *	packet delay.
+		 *  2.	Notify when the IB stack gives us the completion event for
+		 *	the RDMA operation.
+		 *  3.	Notify when the IB stack gives us the completion event for
+		 *	the accompanying RDS messages.
+		 * Here, we implement approach #3. To implement approach #2,
+		 * call rds_rdma_send_complete from the cq_handler. To implement #1,
+		 * don't call rds_rdma_send_complete at all, and fall back to the notify
+		 * handling in the ACK processing code.
+		 *
+		 * Note: There's no need to explicitly sync any RDMA buffers using
+		 * ib_dma_sync_sg_for_cpu - the completion for the RDMA
+		 * operation itself unmapped the RDMA buffers, which takes care
+		 * of synching.
+		 */
+		rds_ib_send_rdma_complete(rm, wc_status);
+
+		if (rm->m_rdma_op->r_write)
+			rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes);
+		else
+			rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes);
+	}
+
+	/* If anyone waited for this message to get flushed out, wake
+	 * them up now */
+	rds_message_unmapped(rm);
+
+	rds_message_put(rm);
+	send->s_rm = NULL;
+}
+
+void rds_ib_send_init_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		struct ib_sge *sge;
+
+		send->s_rm = NULL;
+		send->s_op = NULL;
+
+		send->s_wr.wr_id = i;
+		send->s_wr.sg_list = send->s_sge;
+		send->s_wr.num_sge = 1;
+		send->s_wr.opcode = IB_WR_SEND;
+		send->s_wr.send_flags = 0;
+		send->s_wr.ex.imm_data = 0;
+
+		sge = rds_ib_data_sge(ic, send->s_sge);
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, send->s_sge);
+		sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = ic->i_mr->lkey;
+	}
+}
+
+void rds_ib_send_clear_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		if (send->s_wr.opcode == 0xdead)
+			continue;
+		if (send->s_rm)
+			rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR);
+		if (send->s_op)
+			rds_ib_send_unmap_rdma(ic, send->s_op);
+	}
+}
+
+/*
+ * The _oldest/_free ring operations here race cleanly with the alloc/unalloc
+ * operations performed in the send path.  As the sender allocs and potentially
+ * unallocs the next free entry in the ring it doesn't alter which is
+ * the next to be freed, which is what this is concerned with.
+ */
+void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_ib_send_work *send;
+	u32 completed;
+	u32 oldest;
+	u32 i = 0;
+	int ret;
+
+	rdsdebug("cq %p conn %p\n", cq, conn);
+	rds_ib_stats_inc(s_ib_tx_cq_call);
+	ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	if (ret)
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_ib_stats_inc(s_ib_tx_cq_event);
+
+		if (wc.wr_id == RDS_IB_ACK_WR_ID) {
+			if (time_after(jiffies, ic->i_ack_queued + HZ/2))
+				rds_ib_stats_inc(s_ib_tx_stalled);
+			rds_ib_ack_send_complete(ic);
+			continue;
+		}
+
+		oldest = rds_ib_ring_oldest(&ic->i_send_ring);
+
+		completed = rds_ib_ring_completed(&ic->i_send_ring, wc.wr_id, oldest);
+
+		for (i = 0; i < completed; i++) {
+			send = &ic->i_sends[oldest];
+
+			/* In the error case, wc.opcode sometimes contains garbage */
+			switch (send->s_wr.opcode) {
+			case IB_WR_SEND:
+				if (send->s_rm)
+					rds_ib_send_unmap_rm(ic, send, wc.status);
+				break;
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_READ:
+				/* Nothing to be done - the SG list will be unmapped
+				 * when the SEND completes. */
+				break;
+			default:
+				if (printk_ratelimit())
+					printk(KERN_NOTICE
+						"RDS/IB: %s: unexpected opcode 0x%x in WR!\n",
+						__func__, send->s_wr.opcode);
+				break;
+			}
+
+			send->s_wr.opcode = 0xdead;
+			send->s_wr.num_sge = 1;
+			if (time_after(jiffies, send->s_queued + HZ/2))
+				rds_ib_stats_inc(s_ib_tx_stalled);
+
+			/* If an RDMA operation produced an error, signal this right
+			 * away. If we don't, the subsequent SEND that goes with this
+			 * RDMA will be canceled with IB_WC_WR_FLUSH_ERR, and the
+			 * application will never learn that the RDMA failed. */
+			if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) {
+				struct rds_message *rm;
+
+				rm = rds_send_get_message(conn, send->s_op);
+				if (rm)
+					rds_ib_send_rdma_complete(rm, wc.status);
+			}
+
+			oldest = (oldest + 1) % ic->i_send_ring.w_nr;
+		}
+
+		rds_ib_ring_free(&ic->i_send_ring, completed);
+
+		if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)
+		 || test_bit(0, &conn->c_map_queued))
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+		/* We expect errors as the qp is drained during shutdown */
+		if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) {
+			rds_ib_conn_error(conn,
+				"send completion on %pI4 "
+				"had status %u, disconnecting and reconnecting\n",
+				&conn->c_faddr, wc.status);
+		}
+	}
+}
+
+/*
+ * This is the main function for allocating credits when sending
+ * messages.
+ *
+ * Conceptually, we have two counters:
+ *  -	send credits: this tells us how many WRs we're allowed
+ *	to submit without overrunning the receiver's queue. For
+ *	each SEND WR we post, we decrement this by one.
+ *
+ *  -	posted credits: this tells us how many WRs we recently
+ *	posted to the receive queue. This value is transferred
+ *	to the peer as a "credit update" in a RDS header field.
+ *	Every time we transmit credits to the peer, we subtract
+ *	the amount of transferred credits from this counter.
+ *
+ * It is essential that we avoid situations where both sides have
+ * exhausted their send credits, and are unable to send new credits
+ * to the peer. We achieve this by requiring that we send at least
+ * one credit update to the peer before exhausting our credits.
+ * When new credits arrive, we subtract one credit that is withheld
+ * until we've posted new buffers and are ready to transmit these
+ * credits (see rds_ib_send_add_credits below).
+ *
+ * The RDS send code is essentially single-threaded; rds_send_xmit
+ * grabs c_send_lock to ensure exclusive access to the send ring.
+ * However, the ACK sending code is independent and can race with
+ * message SENDs.
+ *
+ * In the send path, we need to update the counters for send credits
+ * and the counter of posted buffers atomically - when we use the
+ * last available credit, we cannot allow another thread to race us
+ * and grab the posted credits counter.  Hence, we have to use a
+ * spinlock to protect the credit counter, or use atomics.
+ *
+ * Spinlocks shared between the send and the receive path are bad,
+ * because they create unnecessary delays. An early implementation
+ * using a spinlock showed a 5% degradation in throughput at some
+ * loads.
+ *
+ * This implementation avoids spinlocks completely, putting both
+ * counters into a single atomic, and updating that atomic using
+ * atomic_add (in the receive path, when receiving fresh credits),
+ * and using atomic_cmpxchg when updating the two counters.
+ */
+int rds_ib_send_grab_credits(struct rds_ib_connection *ic,
+			     u32 wanted, u32 *adv_credits, int need_posted)
+{
+	unsigned int avail, posted, got = 0, advertise;
+	long oldval, newval;
+
+	*adv_credits = 0;
+	if (!ic->i_flowctl)
+		return wanted;
+
+try_again:
+	advertise = 0;
+	oldval = newval = atomic_read(&ic->i_credits);
+	posted = IB_GET_POST_CREDITS(oldval);
+	avail = IB_GET_SEND_CREDITS(oldval);
+
+	rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n",
+			wanted, avail, posted);
+
+	/* The last credit must be used to send a credit update. */
+	if (avail && !posted)
+		avail--;
+
+	if (avail < wanted) {
+		struct rds_connection *conn = ic->i_cm_id->context;
+
+		/* Oops, there aren't that many credits left! */
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		got = avail;
+	} else {
+		/* Sometimes you get what you want, lalala. */
+		got = wanted;
+	}
+	newval -= IB_SET_SEND_CREDITS(got);
+
+	/*
+	 * If need_posted is non-zero, then the caller wants
+	 * the posted credits regardless of whether any send
+	 * credits are available.
+	 */
+	if (posted && (got || need_posted)) {
+		advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT);
+		newval -= IB_SET_POST_CREDITS(advertise);
+	}
+
+	/* Finally bill everything */
+	if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval)
+		goto try_again;
+
+	*adv_credits = advertise;
+	return got;
+}
+
+void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (credits == 0)
+		return;
+
+	rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n",
+			credits,
+			IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)),
+			test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : "");
+
+	atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits);
+	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+	WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384);
+
+	rds_ib_stats_inc(s_ib_rx_credit_updates);
+}
+
+void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (posted == 0)
+		return;
+
+	atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits);
+
+	/* Decide whether to send an update to the peer now.
+	 * If we sent a credit update for every single buffer we
+	 * post, we would end up with an ACK storm (ACK arrives,
+	 * consumes buffer, we refill the ring, send ACK to remote
+	 * advertising the newly posted buffer... ad infinitum).
+	 *
+	 * Performance depends heavily on how often we send credit
+	 * updates - too frequent updates mean lots of ACKs.
+	 * Too infrequent updates, and the peer will run out of
+	 * credits and have to throttle.
+	 * For the time being, 16 seems to be a good compromise.
+	 */
+	if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16)
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+}
+
+static inline void
+rds_ib_xmit_populate_wr(struct rds_ib_connection *ic,
+		struct rds_ib_send_work *send, unsigned int pos,
+		unsigned long buffer, unsigned int length,
+		int send_flags)
+{
+	struct ib_sge *sge;
+
+	WARN_ON(pos != send - ic->i_sends);
+
+	send->s_wr.send_flags = send_flags;
+	send->s_wr.opcode = IB_WR_SEND;
+	send->s_wr.num_sge = 2;
+	send->s_wr.next = NULL;
+	send->s_queued = jiffies;
+	send->s_op = NULL;
+
+	if (length != 0) {
+		sge = rds_ib_data_sge(ic, send->s_sge);
+		sge->addr = buffer;
+		sge->length = length;
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, send->s_sge);
+	} else {
+		/* We're sending a packet with no payload. There is only
+		 * one SGE */
+		send->s_wr.num_sge = 1;
+		sge = &send->s_sge[0];
+	}
+
+	sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header));
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = ic->i_mr->lkey;
+}
+
+/*
+ * This can be called multiple times for a given message.  The first time
+ * we see a message we map its scatterlist into the IB device so that
+ * we can provide that mapped address to the IB scatter gather entries
+ * in the IB work requests.  We translate the scatterlist into a series
+ * of work requests that fragment the message.  These work requests complete
+ * in order so we pass ownership of the message to the completion handler
+ * once we send the final fragment.
+ *
+ * The RDS core uses the c_send_lock to only enter this function once
+ * per connection.  This makes sure that the tx ring alloc/unalloc pairs
+ * don't get out of sync and confuse the ring.
+ */
+int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct rds_ib_send_work *send = NULL;
+	struct rds_ib_send_work *first;
+	struct rds_ib_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct scatterlist *scat;
+	u32 pos;
+	u32 i;
+	u32 work_alloc;
+	u32 credit_alloc;
+	u32 posted;
+	u32 adv_credits = 0;
+	int send_flags = 0;
+	int sent;
+	int ret;
+	int flow_controlled = 0;
+
+	BUG_ON(off % RDS_FRAG_SIZE);
+	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
+
+	/* FIXME we may overallocate here */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0)
+		i = 1;
+	else
+		i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE);
+
+	work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc == 0) {
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		rds_ib_stats_inc(s_ib_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	credit_alloc = work_alloc;
+	if (ic->i_flowctl) {
+		credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &posted, 0);
+		adv_credits += posted;
+		if (credit_alloc < work_alloc) {
+			rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc);
+			work_alloc = credit_alloc;
+			flow_controlled++;
+		}
+		if (work_alloc == 0) {
+			rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+			rds_ib_stats_inc(s_ib_tx_throttle);
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/* map the message the first time we see it */
+	if (ic->i_rm == NULL) {
+		/*
+		printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(rm->m_inc.i_hdr.h_dport),
+				rm->m_inc.i_hdr.h_flags,
+				be32_to_cpu(rm->m_inc.i_hdr.h_len));
+		   */
+		if (rm->m_nents) {
+			rm->m_count = ib_dma_map_sg(dev,
+					 rm->m_sg, rm->m_nents, DMA_TO_DEVICE);
+			rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count);
+			if (rm->m_count == 0) {
+				rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
+				rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+				ret = -ENOMEM; /* XXX ? */
+				goto out;
+			}
+		} else {
+			rm->m_count = 0;
+		}
+
+		ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+		ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes;
+		rds_message_addref(rm);
+		ic->i_rm = rm;
+
+		/* Finalize the header */
+		if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED;
+		if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED;
+
+		/* If it has a RDMA op, tell the peer we did it. This is
+		 * used by the peer to release use-once RDMA MRs. */
+		if (rm->m_rdma_op) {
+			struct rds_ext_header_rdma ext_hdr;
+
+			ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key);
+			rds_message_add_extension(&rm->m_inc.i_hdr,
+					RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr));
+		}
+		if (rm->m_rdma_cookie) {
+			rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr,
+					rds_rdma_cookie_key(rm->m_rdma_cookie),
+					rds_rdma_cookie_offset(rm->m_rdma_cookie));
+		}
+
+		/* Note - rds_ib_piggyb_ack clears the ACK_REQUIRED bit, so
+		 * we should not do this unless we have a chance of at least
+		 * sticking the header into the send ring. Which is why we
+		 * should call rds_ib_ring_alloc first. */
+		rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_ib_piggyb_ack(ic));
+		rds_message_make_checksum(&rm->m_inc.i_hdr);
+
+		/*
+		 * Update adv_credits since we reset the ACK_REQUIRED bit.
+		 */
+		rds_ib_send_grab_credits(ic, 0, &posted, 1);
+		adv_credits += posted;
+		BUG_ON(adv_credits > 255);
+	} else if (ic->i_rm != rm)
+		BUG();
+
+	send = &ic->i_sends[pos];
+	first = send;
+	prev = NULL;
+	scat = &rm->m_sg[sg];
+	sent = 0;
+	i = 0;
+
+	/* Sometimes you want to put a fence between an RDMA
+	 * READ and the following SEND.
+	 * We could either do this all the time
+	 * or when requested by the user. Right now, we let
+	 * the application choose.
+	 */
+	if (rm->m_rdma_op && rm->m_rdma_op->r_fence)
+		send_flags = IB_SEND_FENCE;
+
+	/*
+	 * We could be copying the header into the unused tail of the page.
+	 * That would need to be changed in the future when those pages might
+	 * be mapped userspace pages or page cache pages.  So instead we always
+	 * use a second sge and our long-lived ring of mapped headers.  We send
+	 * the header after the data so that the data payload can be aligned on
+	 * the receiver.
+	 */
+
+	/* handle a 0-len message */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) {
+		rds_ib_xmit_populate_wr(ic, send, pos, 0, 0, send_flags);
+		goto add_header;
+	}
+
+	/* if there's data reference it with a chain of work reqs */
+	for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) {
+		unsigned int len;
+
+		send = &ic->i_sends[pos];
+
+		len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off);
+		rds_ib_xmit_populate_wr(ic, send, pos,
+				ib_sg_dma_address(dev, scat) + off, len,
+				send_flags);
+
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time
+		 * on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		ic->i_unsignaled_bytes -= len;
+		if (ic->i_unsignaled_bytes <= 0) {
+			ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		/*
+		 * Always signal the last one if we're stopping due to flow control.
+		 */
+		if (flow_controlled && i == (work_alloc-1))
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			 &send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		sent += len;
+		off += len;
+		if (off == ib_sg_dma_len(dev, scat)) {
+			scat++;
+			off = 0;
+		}
+
+add_header:
+		/* Tack on the header after the data. The header SGE should already
+		 * have been set up to point to the right header buffer. */
+		memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header));
+
+		if (0) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(hdr->h_dport),
+				hdr->h_flags,
+				be32_to_cpu(hdr->h_len));
+		}
+		if (adv_credits) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			/* add credit and redo the header checksum */
+			hdr->h_credit = adv_credits;
+			rds_message_make_checksum(hdr);
+			adv_credits = 0;
+			rds_ib_stats_inc(s_ib_tx_credit_updates);
+		}
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+		prev = send;
+
+		pos = (pos + 1) % ic->i_send_ring.w_nr;
+	}
+
+	/* Account the RDS header in the number of bytes we sent, but just once.
+	 * The caller has no concept of fragmentation. */
+	if (hdr_off == 0)
+		sent += sizeof(struct rds_header);
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &rm->m_sg[rm->m_count]) {
+		prev->s_rm = ic->i_rm;
+		prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		ic->i_rm = NULL;
+	}
+
+	if (i < work_alloc) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+	if (ic->i_flowctl && i < credit_alloc)
+		rds_ib_send_add_credits(conn, credit_alloc - i);
+
+	/* XXX need to worry about failed_wr and partial sends. */
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IB: ib_post_send to %pI4 "
+		       "returned %d\n", &conn->c_faddr, ret);
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		if (prev->s_rm) {
+			ic->i_rm = prev->s_rm;
+			prev->s_rm = NULL;
+		}
+		/* Finesse this later */
+		BUG();
+		goto out;
+	}
+
+	ret = sent;
+out:
+	BUG_ON(adv_credits);
+	return ret;
+}
+
+int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_send_work *send = NULL;
+	struct rds_ib_send_work *first;
+	struct rds_ib_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct rds_ib_device *rds_ibdev;
+	struct scatterlist *scat;
+	unsigned long len;
+	u64 remote_addr = op->r_remote_addr;
+	u32 pos;
+	u32 work_alloc;
+	u32 i;
+	u32 j;
+	int sent;
+	int ret;
+	int num_sge;
+
+	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+
+	/* map the message the first time we see it */
+	if (!op->r_mapped) {
+		op->r_count = ib_dma_map_sg(ic->i_cm_id->device,
+					op->r_sg, op->r_nents, (op->r_write) ?
+					DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count);
+		if (op->r_count == 0) {
+			rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
+			ret = -ENOMEM; /* XXX ? */
+			goto out;
+		}
+
+		op->r_mapped = 1;
+	}
+
+	/*
+	 * Instead of knowing how to return a partial rdma read/write we insist that there
+	 * be enough work requests to send the entire message.
+	 */
+	i = ceil(op->r_count, rds_ibdev->max_sge);
+
+	work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc != i) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		rds_ib_stats_inc(s_ib_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	send = &ic->i_sends[pos];
+	first = send;
+	prev = NULL;
+	scat = &op->r_sg[0];
+	sent = 0;
+	num_sge = op->r_count;
+
+	for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) {
+		send->s_wr.send_flags = 0;
+		send->s_queued = jiffies;
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags = IB_SEND_SIGNALED;
+		}
+
+		send->s_wr.opcode = op->r_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ;
+		send->s_wr.wr.rdma.remote_addr = remote_addr;
+		send->s_wr.wr.rdma.rkey = op->r_key;
+		send->s_op = op;
+
+		if (num_sge > rds_ibdev->max_sge) {
+			send->s_wr.num_sge = rds_ibdev->max_sge;
+			num_sge -= rds_ibdev->max_sge;
+		} else {
+			send->s_wr.num_sge = num_sge;
+		}
+
+		send->s_wr.next = NULL;
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+
+		for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) {
+			len = ib_sg_dma_len(ic->i_cm_id->device, scat);
+			send->s_sge[j].addr =
+				 ib_sg_dma_address(ic->i_cm_id->device, scat);
+			send->s_sge[j].length = len;
+			send->s_sge[j].lkey = ic->i_mr->lkey;
+
+			sent += len;
+			rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, (unsigned long long)remote_addr);
+
+			remote_addr += len;
+			scat++;
+		}
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			&send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		prev = send;
+		if (++send == &ic->i_sends[ic->i_send_ring.w_nr])
+			send = ic->i_sends;
+	}
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &op->r_sg[op->r_count])
+		prev->s_wr.send_flags = IB_SEND_SIGNALED;
+
+	if (i < work_alloc) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %pI4 "
+		       "returned %d\n", &conn->c_faddr, ret);
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		goto out;
+	}
+
+	if (unlikely(failed_wr != &first->s_wr)) {
+		printk(KERN_WARNING "RDS/IB: ib_post_send() rc=%d, but failed_wqe updated!\n", ret);
+		BUG_ON(failed_wr != &first->s_wr);
+	}
+
+
+out:
+	return ret;
+}
+
+void rds_ib_xmit_complete(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	/* We may have a pending ACK or window update we were unable
+	 * to send previously (due to flow control). Try again. */
+	rds_ib_attempt_ack(ic);
+}
-- 
1.5.6.3
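
The lock-free credit accounting described in the comment above
rds_ib_send_grab_credits() can be sketched in userspace C11 as shown
below. This is an illustrative sketch only, not the kernel code. The
16/16 bit split of the counter, the macro names, struct credit_state,
the helper names and the MAX_ADV limit of 255 are all assumptions made
for the example; the real definitions live in ib.h, which is not part
of this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: low 16 bits hold send credits, high 16 bits hold
 * posted (not yet advertised) credits. */
#define GET_SEND(v)	((uint32_t)(v) & 0xffffu)
#define GET_POST(v)	((uint32_t)(v) >> 16)
#define SET_SEND(n)	((uint32_t)(n) & 0xffffu)
#define SET_POST(n)	((uint32_t)(n) << 16)
#define MAX_ADV		255u	/* assumed cap on one credit update */

/* Hypothetical stand-in for the i_credits field of the connection. */
struct credit_state {
	_Atomic uint32_t credits;
};

/* Receive path: the peer granted us n more send credits. */
static void add_send_credits(struct credit_state *cs, uint32_t n)
{
	assert(n <= 0xffffu);
	atomic_fetch_add(&cs->credits, SET_SEND(n));
}

/* Receive path: we posted n more receive buffers. */
static void add_posted_credits(struct credit_state *cs, uint32_t n)
{
	assert(n <= 0xffffu);
	atomic_fetch_add(&cs->credits, SET_POST(n));
}

/* Send path: take up to 'wanted' send credits and, if appropriate,
 * claim posted credits to advertise to the peer. Both fields change
 * in one compare-and-swap, so no other thread can take the posted
 * credits between the two updates. */
static uint32_t grab_credits(struct credit_state *cs, uint32_t wanted,
			     uint32_t *adv, int need_posted)
{
	uint32_t oldval = atomic_load(&cs->credits);
	uint32_t newval, avail, posted, got, advertise;

	do {
		avail = GET_SEND(oldval);
		posted = GET_POST(oldval);
		advertise = 0;

		/* Withhold the last credit unless we can piggyback
		 * a credit update on it. */
		if (avail && !posted)
			avail--;

		got = avail < wanted ? avail : wanted;
		newval = oldval - SET_SEND(got);

		if (posted && (got || need_posted)) {
			advertise = posted < MAX_ADV ? posted : MAX_ADV;
			newval -= SET_POST(advertise);
		}
		/* On failure oldval is reloaded and we recompute. */
	} while (!atomic_compare_exchange_weak(&cs->credits, &oldval, newval));

	*adv = advertise;
	return got;
}
```

As a usage example under the same assumptions: starting from an empty
counter, granting 4 send credits and then asking for 10 yields 3,
because the last credit is withheld while there is nothing to
advertise. Once receive buffers have been posted, the withheld credit
becomes usable and the posted count (capped at MAX_ADV) is handed to
the caller to place in the header.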


From andy.grover at oracle.com  Tue Feb 24 17:30:34 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:34 -0800
Subject: [ofa-general] [PATCH 17/26] RDS/IB: Receive datagrams via IB
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-18-git-send-email-andy.grover@oracle.com>

Implements header parsing and receive ring refill. Incoming data is
placed in an rds_incoming struct, which is passed up to rds-core.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/ib_recv.c |  869 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 869 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib_recv.c

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
new file mode 100644
index 0000000..5061b55
--- /dev/null
+++ b/net/rds/ib_recv.c
@@ -0,0 +1,869 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/dma-mapping.h>
+#include <rdma/rdma_cm.h>
+
+#include "rds.h"
+#include "ib.h"
+
+static struct kmem_cache *rds_ib_incoming_slab;
+static struct kmem_cache *rds_ib_frag_slab;
+static atomic_t	rds_ib_allocation = ATOMIC_INIT(0);
+
+static void rds_ib_frag_drop_page(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	__free_page(frag->f_page);
+	frag->f_page = NULL;
+}
+
+static void rds_ib_frag_free(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	BUG_ON(frag->f_page != NULL);
+	kmem_cache_free(rds_ib_frag_slab, frag);
+}
+
+/*
+ * We map a page at a time.  Its fragments are posted in order.  This
+ * is called in fragment order as the fragments get receive completion
+ * events.  Only the last frag in the page performs the unmapping.
+ *
+ * It's OK for ring cleanup to call this in whatever order it likes because
+ * DMA is not in flight and so we can unmap while other ring entries still
+ * hold page references in their frags.
+ */
+static void rds_ib_recv_unmap_page(struct rds_ib_connection *ic,
+				   struct rds_ib_recv_work *recv)
+{
+	struct rds_page_frag *frag = recv->r_frag;
+
+	rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page);
+	if (frag->f_mapped)
+		ib_dma_unmap_page(ic->i_cm_id->device,
+			       frag->f_mapped,
+			       RDS_FRAG_SIZE, DMA_FROM_DEVICE);
+	frag->f_mapped = 0;
+}
+
+void rds_ib_recv_init_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_recv_work *recv;
+	u32 i;
+
+	for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) {
+		struct ib_sge *sge;
+
+		recv->r_ibinc = NULL;
+		recv->r_frag = NULL;
+
+		recv->r_wr.next = NULL;
+		recv->r_wr.wr_id = i;
+		recv->r_wr.sg_list = recv->r_sge;
+		recv->r_wr.num_sge = RDS_IB_RECV_SGE;
+
+		sge = rds_ib_data_sge(ic, recv->r_sge);
+		sge->addr = 0;
+		sge->length = RDS_FRAG_SIZE;
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, recv->r_sge);
+		sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = ic->i_mr->lkey;
+	}
+}
+
+static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
+				  struct rds_ib_recv_work *recv)
+{
+	if (recv->r_ibinc) {
+		rds_inc_put(&recv->r_ibinc->ii_inc);
+		recv->r_ibinc = NULL;
+	}
+	if (recv->r_frag) {
+		rds_ib_recv_unmap_page(ic, recv);
+		if (recv->r_frag->f_page)
+			rds_ib_frag_drop_page(recv->r_frag);
+		rds_ib_frag_free(recv->r_frag);
+		recv->r_frag = NULL;
+	}
+}
+
+void rds_ib_recv_clear_ring(struct rds_ib_connection *ic)
+{
+	u32 i;
+
+	for (i = 0; i < ic->i_recv_ring.w_nr; i++)
+		rds_ib_recv_clear_one(ic, &ic->i_recvs[i]);
+
+	if (ic->i_frag.f_page)
+		rds_ib_frag_drop_page(&ic->i_frag);
+}
+
+static int rds_ib_recv_refill_one(struct rds_connection *conn,
+				  struct rds_ib_recv_work *recv,
+				  gfp_t kptr_gfp, gfp_t page_gfp)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	dma_addr_t dma_addr;
+	struct ib_sge *sge;
+	int ret = -ENOMEM;
+
+	if (recv->r_ibinc == NULL) {
+		if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
+			rds_ib_stats_inc(s_ib_rx_alloc_limit);
+			goto out;
+		}
+		recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
+						 kptr_gfp);
+		if (recv->r_ibinc == NULL)
+			goto out;
+		atomic_inc(&rds_ib_allocation);
+		INIT_LIST_HEAD(&recv->r_ibinc->ii_frags);
+		rds_inc_init(&recv->r_ibinc->ii_inc, conn, conn->c_faddr);
+	}
+
+	if (recv->r_frag == NULL) {
+		recv->r_frag = kmem_cache_alloc(rds_ib_frag_slab, kptr_gfp);
+		if (recv->r_frag == NULL)
+			goto out;
+		INIT_LIST_HEAD(&recv->r_frag->f_item);
+		recv->r_frag->f_page = NULL;
+	}
+
+	if (ic->i_frag.f_page == NULL) {
+		ic->i_frag.f_page = alloc_page(page_gfp);
+		if (ic->i_frag.f_page == NULL)
+			goto out;
+		ic->i_frag.f_offset = 0;
+	}
+
+	dma_addr = ib_dma_map_page(ic->i_cm_id->device,
+				  ic->i_frag.f_page,
+				  ic->i_frag.f_offset,
+				  RDS_FRAG_SIZE,
+				  DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr))
+		goto out;
+
+	/*
+	 * Once we get the RDS_PAGE_LAST_OFF frag then rds_ib_recv_unmap_page()
+	 * must be called on this recv.  This happens as completions hit
+	 * in order or on connection shutdown.
+	 */
+	recv->r_frag->f_page = ic->i_frag.f_page;
+	recv->r_frag->f_offset = ic->i_frag.f_offset;
+	recv->r_frag->f_mapped = dma_addr;
+
+	sge = rds_ib_data_sge(ic, recv->r_sge);
+	sge->addr = dma_addr;
+	sge->length = RDS_FRAG_SIZE;
+
+	sge = rds_ib_header_sge(ic, recv->r_sge);
+	sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header);
+	sge->length = sizeof(struct rds_header);
+
+	get_page(recv->r_frag->f_page);
+
+	if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) {
+		ic->i_frag.f_offset += RDS_FRAG_SIZE;
+	} else {
+		put_page(ic->i_frag.f_page);
+		ic->i_frag.f_page = NULL;
+		ic->i_frag.f_offset = 0;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+/*
+ * This tries to allocate and post unused work requests after making sure that
+ * they have all the allocations they need to queue received fragments into
+ * sockets.  The i_recv_mutex is held here so that ring_alloc and _unalloc
+ * pairs don't go unmatched.
+ *
+ * -1 is returned if posting fails due to temporary resource exhaustion.
+ */
+int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_recv_work *recv;
+	struct ib_recv_wr *failed_wr;
+	unsigned int posted = 0;
+	int ret = 0;
+	u32 pos;
+
+	while ((prefill || rds_conn_up(conn))
+			&& rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) {
+		if (pos >= ic->i_recv_ring.w_nr) {
+			printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n",
+					pos);
+			ret = -EINVAL;
+			break;
+		}
+
+		recv = &ic->i_recvs[pos];
+		ret = rds_ib_recv_refill_one(conn, recv, kptr_gfp, page_gfp);
+		if (ret) {
+			ret = -1;
+			break;
+		}
+
+		/* XXX when can this fail? */
+		ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+		rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
+			 recv->r_ibinc, recv->r_frag->f_page,
+			 (unsigned long) recv->r_frag->f_mapped, ret);
+		if (ret) {
+			rds_ib_conn_error(conn, "recv post on "
+			       "%pI4 returned %d, disconnecting and "
+			       "reconnecting\n", &conn->c_faddr,
+			       ret);
+			ret = -1;
+			break;
+		}
+
+		posted++;
+	}
+
+	/* We're doing flow control - update the window. */
+	if (ic->i_flowctl && posted)
+		rds_ib_advertise_credits(conn, posted);
+
+	if (ret)
+		rds_ib_ring_unalloc(&ic->i_recv_ring, 1);
+	return ret;
+}
+
+void rds_ib_inc_purge(struct rds_incoming *inc)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+	struct rds_page_frag *pos;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	rdsdebug("purging ibinc %p inc %p\n", ibinc, inc);
+
+	list_for_each_entry_safe(frag, pos, &ibinc->ii_frags, f_item) {
+		list_del_init(&frag->f_item);
+		rds_ib_frag_drop_page(frag);
+		rds_ib_frag_free(frag);
+	}
+}
+
+void rds_ib_inc_free(struct rds_incoming *inc)
+{
+	struct rds_ib_incoming *ibinc;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+
+	rds_ib_inc_purge(inc);
+	rdsdebug("freeing ibinc %p inc %p\n", ibinc, inc);
+	BUG_ON(!list_empty(&ibinc->ii_frags));
+	kmem_cache_free(rds_ib_incoming_slab, ibinc);
+	atomic_dec(&rds_ib_allocation);
+	BUG_ON(atomic_read(&rds_ib_allocation) < 0);
+}
+
+int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov,
+			    size_t size)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+	struct iovec *iov = first_iov;
+	unsigned long to_copy;
+	unsigned long frag_off = 0;
+	unsigned long iov_off = 0;
+	int copied = 0;
+	int ret;
+	u32 len;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	len = be32_to_cpu(inc->i_hdr.h_len);
+
+	while (copied < size && copied < len) {
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off);
+		to_copy = min_t(size_t, to_copy, size - copied);
+		to_copy = min_t(unsigned long, to_copy, len - copied);
+
+		rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag "
+			 "[%p, %lu] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 frag->f_page, frag->f_offset, frag_off);
+
+		/* XXX needs + offset for multiple recvs per page */
+		ret = rds_page_copy_to_user(frag->f_page,
+					    frag->f_offset + frag_off,
+					    iov->iov_base + iov_off,
+					    to_copy);
+		if (ret) {
+			copied = ret;
+			break;
+		}
+
+		iov_off += to_copy;
+		frag_off += to_copy;
+		copied += to_copy;
+	}
+
+	return copied;
+}
+
+/* ic starts out kzalloc()ed */
+void rds_ib_recv_init_ack(struct rds_ib_connection *ic)
+{
+	struct ib_send_wr *wr = &ic->i_ack_wr;
+	struct ib_sge *sge = &ic->i_ack_sge;
+
+	sge->addr = ic->i_ack_dma;
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = ic->i_mr->lkey;
+
+	wr->sg_list = sge;
+	wr->num_sge = 1;
+	wr->opcode = IB_WR_SEND;
+	wr->wr_id = RDS_IB_ACK_WR_ID;
+	wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+}
+
+/*
+ * You'd think that with reliable IB connections you wouldn't need to ack
+ * messages that have been received.  The problem is that IB hardware generates
+ * an ack message before it has DMAed the message into memory.  This creates a
+ * potential message loss if the HCA is disabled for any reason between when it
+ * sends the ack and before the message is DMAed and processed.  This is only a
+ * potential issue if another HCA is available for fail-over.
+ *
+ * When the remote host receives our ack they'll free the sent message from
+ * their send queue.  To decrease the latency of this we always send an ack
+ * immediately after we've received messages.
+ *
+ * For simplicity, we only have one ack in flight at a time.  This puts
+ * pressure on senders to have deep enough send queues to absorb the latency of
+ * a single ack frame being in flight.  This might not be good enough.
+ *
+ * This is implemented by having a long-lived send_wr and sge which point to a
+ * statically allocated ack frame.  This ack wr does not fall under the ring
+ * accounting that the tx and rx wrs do.  The QP attribute specifically makes
+ * room for it beyond the ring size.  Send completion notices its special
+ * wr_id and avoids working with the ring in that case.
+ */
+static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq,
+				int ack_required)
+{
+	rds_ib_set_64bit(&ic->i_ack_next, seq);
+	if (ack_required) {
+		smp_mb__before_clear_bit();
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	}
+}
+
+static u64 rds_ib_get_ack(struct rds_ib_connection *ic)
+{
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	smp_mb__after_clear_bit();
+
+	return ic->i_ack_next;
+}
+
+static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits)
+{
+	struct rds_header *hdr = ic->i_ack;
+	struct ib_send_wr *failed_wr;
+	u64 seq;
+	int ret;
+
+	seq = rds_ib_get_ack(ic);
+
+	rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq);
+	rds_message_populate_header(hdr, 0, 0, 0);
+	hdr->h_ack = cpu_to_be64(seq);
+	hdr->h_credit = adv_credits;
+	rds_message_make_checksum(hdr);
+	ic->i_ack_queued = jiffies;
+
+	ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr);
+	if (unlikely(ret)) {
+		/* Failed to send. Release the WR, and
+		 * force another ACK.
+		 */
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+
+		rds_ib_stats_inc(s_ib_ack_send_failure);
+		/* Need to finesse this later. */
+		BUG();
+	} else
+		rds_ib_stats_inc(s_ib_ack_sent);
+}
+
+/*
+ * There are 3 ways of getting acknowledgements to the peer:
+ *  1.	We call rds_ib_attempt_ack from the recv completion handler
+ *	to send an ACK-only frame.
+ *	However, there can be only one such frame in the send queue
+ *	at any time, so we may have to postpone it.
+ *  2.	When another (data) packet is transmitted while there's
+ *	an ACK in the queue, we piggyback the ACK sequence number
+ *	on the data packet.
+ *  3.	If the ACK WR is done sending, we get called from the
+ *	send queue completion handler, and check whether there's
+ *	another ACK pending (postponed because the WR was on the
+ *	queue). If so, we transmit it.
+ *
+ * We maintain 2 variables:
+ *  -	i_ack_flags, which keeps track of whether the ACK WR
+ *	is currently in the send queue or not (IB_ACK_IN_FLIGHT)
+ *  -	i_ack_next, which is the last sequence number we received
+ *
+ * Potentially, send queue and receive queue handlers can run concurrently.
+ *
+ * Reconnecting complicates this picture just slightly. When we
+ * reconnect, we may be seeing duplicate packets. The peer
+ * is retransmitting them, because it hasn't seen an ACK for
+ * them. It is important that we ACK these.
+ *
+ * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with
+ * this flag set *MUST* be acknowledged immediately.
+ */
+
+/*
+ * When we get here, we're called from the recv queue handler.
+ * Check whether we ought to transmit an ACK.
+ */
+void rds_ib_attempt_ack(struct rds_ib_connection *ic)
+{
+	unsigned int adv_credits;
+
+	if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		return;
+
+	if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) {
+		rds_ib_stats_inc(s_ib_ack_send_delayed);
+		return;
+	}
+
+	/* Can we get a send credit? */
+	if (!rds_ib_send_grab_credits(ic, 1, &adv_credits, 0)) {
+		rds_ib_stats_inc(s_ib_tx_throttle);
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		return;
+	}
+
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	rds_ib_send_ack(ic, adv_credits);
+}
+
+/*
+ * We get here from the send completion handler, when the
+ * adapter tells us the ACK frame was sent.
+ */
+void rds_ib_ack_send_complete(struct rds_ib_connection *ic)
+{
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+	rds_ib_attempt_ack(ic);
+}
+
+/*
+ * This is called by the regular xmit code when it wants to piggyback
+ * an ACK on an outgoing frame.
+ */
+u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic)
+{
+	if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		rds_ib_stats_inc(s_ib_ack_send_piggybacked);
+	return rds_ib_get_ack(ic);
+}
+
+/*
+ * It's kind of lame that we're copying from the posted receive pages into
+ * long-lived bitmaps.  We could have posted the bitmaps and rdma written into
+ * them.  But receiving new congestion bitmaps should be a *rare* event, so
+ * hopefully we won't need to invest that complexity in making it more
+ * efficient.  By copying we can share a simpler core with TCP which has to
+ * copy.
+ */
+static void rds_ib_cong_recv(struct rds_connection *conn,
+			      struct rds_ib_incoming *ibinc)
+{
+	struct rds_cong_map *map;
+	unsigned int map_off;
+	unsigned int map_page;
+	struct rds_page_frag *frag;
+	unsigned long frag_off;
+	unsigned long to_copy;
+	unsigned long copied;
+	uint64_t uncongested = 0;
+	void *addr;
+
+	/* catch completely corrupt packets */
+	if (be32_to_cpu(ibinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES)
+		return;
+
+	map = conn->c_fcong;
+	map_page = 0;
+	map_off = 0;
+
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	frag_off = 0;
+
+	copied = 0;
+
+	while (copied < RDS_CONG_MAP_BYTES) {
+		uint64_t *src, *dst;
+		unsigned int k;
+
+		to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off);
+		BUG_ON(to_copy & 7); /* Must be 64bit aligned. */
+
+		addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0);
+
+		src = addr + frag_off;
+		dst = (void *)map->m_page_addrs[map_page] + map_off;
+		for (k = 0; k < to_copy; k += 8) {
+			/* Record ports that became uncongested, ie
+			 * bits that changed from 1 to 0. */
+			uncongested |= ~(*src) & *dst;
+			*dst++ = *src++;
+		}
+		kunmap_atomic(addr, KM_SOFTIRQ0);
+
+		copied += to_copy;
+
+		map_off += to_copy;
+		if (map_off == PAGE_SIZE) {
+			map_off = 0;
+			map_page++;
+		}
+
+		frag_off += to_copy;
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+	}
+
+	/* the congestion map is in little endian order */
+	uncongested = le64_to_cpu(uncongested);
+
+	rds_cong_map_updated(map, uncongested);
+}
+
+/*
+ * Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_ib_ack_state {
+	u64		ack_next;
+	u64		ack_recv;
+	unsigned int	ack_required:1;
+	unsigned int	ack_next_valid:1;
+	unsigned int	ack_recv_valid:1;
+};
+
+static void rds_ib_process_recv(struct rds_connection *conn,
+				struct rds_ib_recv_work *recv, u32 byte_len,
+				struct rds_ib_ack_state *state)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_incoming *ibinc = ic->i_ibinc;
+	struct rds_header *ihdr, *hdr;
+
+	/* XXX shut down the connection if port 0,0 are seen? */
+
+	rdsdebug("ic %p ibinc %p recv %p byte len %u\n", ic, ibinc, recv,
+		 byte_len);
+
+	if (byte_len < sizeof(struct rds_header)) {
+		rds_ib_conn_error(conn, "incoming message "
+		       "from %pI4 didn't include a "
+		       "header, disconnecting and "
+		       "reconnecting\n",
+		       &conn->c_faddr);
+		return;
+	}
+	byte_len -= sizeof(struct rds_header);
+
+	ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs];
+
+	/* Validate the checksum. */
+	if (!rds_message_verify_checksum(ihdr)) {
+		rds_ib_conn_error(conn, "incoming message "
+		       "from %pI4 has corrupted header - "
+		       "forcing a reconnect\n",
+		       &conn->c_faddr);
+		rds_stats_inc(s_recv_drop_bad_checksum);
+		return;
+	}
+
+	/* Process the ACK sequence which comes with every packet */
+	state->ack_recv = be64_to_cpu(ihdr->h_ack);
+	state->ack_recv_valid = 1;
+
+	/* Process the credits update if there was one */
+	if (ihdr->h_credit)
+		rds_ib_send_add_credits(conn, ihdr->h_credit);
+
+	if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) {
+		/* This is an ACK-only packet. It gets special
+		 * treatment here because, historically, ACKs
+		 * were rather special beasts.
+		 */
+		rds_ib_stats_inc(s_ib_ack_received);
+
+		/*
+		 * Usually the frags make their way on to incs and are then freed as
+		 * the inc is freed.  We don't go that route, so we have to drop the
+		 * page ref ourselves.  We can't just leave the page on the recv
+		 * because that confuses the dma mapping of pages and each recv's use
+		 * of a partial page.  We can leave the frag, though, it will be
+		 * reused.
+		 *
+		 * FIXME: Fold this into the code path below.
+		 */
+		rds_ib_frag_drop_page(recv->r_frag);
+		return;
+	}
+
+	/*
+	 * If we don't already have an inc on the connection then this
+	 * fragment has a header and starts a message; copy its header
+	 * into the inc and save the inc so we can hang upcoming fragments
+	 * off its list.
+	 */
+	if (ibinc == NULL) {
+		ibinc = recv->r_ibinc;
+		recv->r_ibinc = NULL;
+		ic->i_ibinc = ibinc;
+
+		hdr = &ibinc->ii_inc.i_hdr;
+		memcpy(hdr, ihdr, sizeof(*hdr));
+		ic->i_recv_data_rem = be32_to_cpu(hdr->h_len);
+
+		rdsdebug("ic %p ibinc %p rem %u flag 0x%x\n", ic, ibinc,
+			 ic->i_recv_data_rem, hdr->h_flags);
+	} else {
+		hdr = &ibinc->ii_inc.i_hdr;
+		/* We can't just use memcmp here; fragments of a
+		 * single message may carry different ACKs */
+		if (hdr->h_sequence != ihdr->h_sequence
+		 || hdr->h_len != ihdr->h_len
+		 || hdr->h_sport != ihdr->h_sport
+		 || hdr->h_dport != ihdr->h_dport) {
+			rds_ib_conn_error(conn,
+				"fragment header mismatch; forcing reconnect\n");
+			return;
+		}
+	}
+
+	list_add_tail(&recv->r_frag->f_item, &ibinc->ii_frags);
+	recv->r_frag = NULL;
+
+	if (ic->i_recv_data_rem > RDS_FRAG_SIZE)
+		ic->i_recv_data_rem -= RDS_FRAG_SIZE;
+	else {
+		ic->i_recv_data_rem = 0;
+		ic->i_ibinc = NULL;
+
+		if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP)
+			rds_ib_cong_recv(conn, ibinc);
+		else {
+			rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr,
+					  &ibinc->ii_inc, GFP_ATOMIC,
+					  KM_SOFTIRQ0);
+			state->ack_next = be64_to_cpu(hdr->h_sequence);
+			state->ack_next_valid = 1;
+		}
+
+		/* Evaluate the ACK_REQUIRED flag *after* we received
+		 * the complete frame, and after bumping the next_rx
+		 * sequence. */
+		if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) {
+			rds_stats_inc(s_recv_ack_required);
+			state->ack_required = 1;
+		}
+
+		rds_inc_put(&ibinc->ii_inc);
+	}
+}
+
+/*
+ * Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.
+ */
+void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_ib_ack_state state = { 0, };
+	struct rds_ib_recv_work *recv;
+
+	rdsdebug("conn %p cq %p\n", conn, cq);
+
+	rds_ib_stats_inc(s_ib_rx_cq_call);
+
+	ib_req_notify_cq(cq, IB_CQ_SOLICITED);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_ib_stats_inc(s_ib_rx_cq_event);
+
+		recv = &ic->i_recvs[rds_ib_ring_oldest(&ic->i_recv_ring)];
+
+		rds_ib_recv_unmap_page(ic, recv);
+
+		/*
+		 * Also process recvs in connecting state because it is possible
+		 * to get a recv completion _before_ the rdmacm ESTABLISHED
+		 * event is processed.
+		 */
+		if (rds_conn_up(conn) || rds_conn_connecting(conn)) {
+			/* We expect errors as the qp is drained during shutdown */
+			if (wc.status == IB_WC_SUCCESS) {
+				rds_ib_process_recv(conn, recv, wc.byte_len, &state);
+			} else {
+				rds_ib_conn_error(conn, "recv completion on "
+				       "%pI4 had status %u, disconnecting and "
+				       "reconnecting\n", &conn->c_faddr,
+				       wc.status);
+			}
+		}
+
+		rds_ib_ring_free(&ic->i_recv_ring, 1);
+	}
+
+	if (state.ack_next_valid)
+		rds_ib_set_ack(ic, state.ack_next, state.ack_required);
+	if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) {
+		rds_send_drop_acked(conn, state.ack_recv, NULL);
+		ic->i_ack_recv = state.ack_recv;
+	}
+	if (rds_conn_up(conn))
+		rds_ib_attempt_ack(ic);
+
+	/* If we ever end up with a really empty receive ring, we're
+	 * in deep trouble, as the sender will definitely see RNR
+	 * timeouts. */
+	if (rds_ib_ring_empty(&ic->i_recv_ring))
+		rds_ib_stats_inc(s_ib_rx_ring_empty);
+
+	/*
+	 * If the ring is running low, then schedule the thread to refill.
+	 */
+	if (rds_ib_ring_low(&ic->i_recv_ring))
+		queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+}
+
+int rds_ib_recv(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	int ret = 0;
+
+	rdsdebug("conn %p\n", conn);
+
+	/*
+	 * If we get a temporary posting failure in this context then
+	 * we're really low and we want the caller to back off for a bit.
+	 */
+	mutex_lock(&ic->i_recv_mutex);
+	if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0))
+		ret = -ENOMEM;
+	else
+		rds_ib_stats_inc(s_ib_rx_refill_from_thread);
+	mutex_unlock(&ic->i_recv_mutex);
+
+	if (rds_conn_up(conn))
+		rds_ib_attempt_ack(ic);
+
+	return ret;
+}
+
+int __init rds_ib_recv_init(void)
+{
+	struct sysinfo si;
+	int ret = -ENOMEM;
+
+	/* Default to one third of total RAM (counted in frags) for recv memory */
+	si_meminfo(&si);
+	rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE;
+
+	rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming",
+					sizeof(struct rds_ib_incoming),
+					0, 0, NULL);
+	if (rds_ib_incoming_slab == NULL)
+		goto out;
+
+	rds_ib_frag_slab = kmem_cache_create("rds_ib_frag",
+					sizeof(struct rds_page_frag),
+					0, 0, NULL);
+	if (rds_ib_frag_slab == NULL)
+		kmem_cache_destroy(rds_ib_incoming_slab);
+	else
+		ret = 0;
+out:
+	return ret;
+}
+
+void rds_ib_recv_exit(void)
+{
+	kmem_cache_destroy(rds_ib_incoming_slab);
+	kmem_cache_destroy(rds_ib_frag_slab);
+}
-- 
1.5.6.3
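
Reader's note on the ACK handling in the patch above: the two flag bits
(IB_ACK_REQUESTED and IB_ACK_IN_FLIGHT) and i_ack_next form a small state
machine. The sketch below is a single-threaded userspace model of it, not
the kernel code. Every name in it (ack_model, model_set_ack and so on) is
hypothetical, and it deliberately leaves out send credits, memory barriers
and the atomic bit operations that the real code depends on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; names are illustrative, not the kernel's. */
struct ack_model {
	bool requested;		/* models IB_ACK_REQUESTED */
	bool in_flight;		/* models IB_ACK_IN_FLIGHT */
	uint64_t next;		/* models i_ack_next */
	unsigned int sent;	/* ACK-only frames posted so far */
};

/* Receive path: remember the newest sequence, optionally request an ACK. */
static void model_set_ack(struct ack_model *m, uint64_t seq, bool required)
{
	m->next = seq;
	if (required)
		m->requested = true;
}

/* Try to post the single ACK frame; returns true if one was posted. */
static bool model_attempt_ack(struct ack_model *m)
{
	if (!m->requested)
		return false;
	if (m->in_flight)
		return false;	/* postponed until the send completes */
	m->in_flight = true;
	m->requested = false;
	m->sent++;
	return true;
}

/* Send completion: release the frame, then flush any postponed ACK. */
static bool model_send_complete(struct ack_model *m)
{
	m->in_flight = false;
	return model_attempt_ack(m);
}

/* Data path: carry the ACK on a data frame, cancelling the ACK-only one. */
static uint64_t model_piggyback(struct ack_model *m)
{
	m->requested = false;
	return m->next;
}
```

In this model a second ACK requested while the first is still in flight is
only posted once model_send_complete() runs, which corresponds to way 3 in
the comment block, and a piggybacked ACK means no ACK-only frame follows.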


From andy.grover at oracle.com  Tue Feb 24 17:30:37 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:37 -0800
Subject: [ofa-general] [PATCH 20/26] RDS: Common RDMA transport code
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-21-git-send-email-andy.grover@oracle.com>

Although most of IB and iWARP are separated from each other,
there is some common code required to handle their shared
CM listen port. This code listens for CM events and then
dispatches the event to the appropriate transport, either
IB or iWARP.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/rdma_transport.c |  214 ++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/rdma_transport.h |   28 ++++++
 2 files changed, 242 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/rdma_transport.c
 create mode 100644 net/rds/rdma_transport.h
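
Reader's note: the dispatch described in the changelog comes down to picking
a transport by device node type and then calling through its function
pointers. The following is a minimal userspace sketch of that pattern only.
The types and names (node_type, transport_model, pick_transport) are
hypothetical stand-ins and are not the kernel's rds_transport or rdma_cm
definitions.

```c
/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum node_type { NODE_IB_CA, NODE_RNIC };

struct transport_model {
	const char *name;
	int (*handle_connect)(int event);	/* models cm_handle_connect */
};

/* Placeholder handlers; each returns a distinct value so calls can be told apart. */
static int ib_handle_connect(int event) { (void)event; return 1; }
static int iw_handle_connect(int event) { (void)event; return 2; }

static const struct transport_model ib_model = { "ib", ib_handle_connect };
static const struct transport_model iw_model = { "iw", iw_handle_connect };

/* An RNIC node is treated as iWARP; anything else falls through to IB. */
static const struct transport_model *pick_transport(enum node_type type)
{
	return type == NODE_RNIC ? &iw_model : &ib_model;
}

/* One shared handler: pick the transport, then dispatch to its hook. */
static int dispatch_connect(enum node_type type, int event)
{
	return pick_transport(type)->handle_connect(event);
}
```

Keeping the selection in one helper means the shared listener never needs
to know which transport it is talking to beyond the node type check.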

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
new file mode 100644
index 0000000..7b19024
--- /dev/null
+++ b/net/rds/rdma_transport.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright (c) 2009 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <rdma/rdma_cm.h>
+
+#include "rdma_transport.h"
+
+static struct rdma_cm_id *rds_iw_listen_id;
+
+int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
+			      struct rdma_cm_event *event)
+{
+	/* this can be null in the listening path */
+	struct rds_connection *conn = cm_id->context;
+	struct rds_transport *trans;
+	int ret = 0;
+
+	rdsdebug("conn %p id %p handling event %u\n", conn, cm_id,
+		 event->event);
+
+	if (cm_id->device->node_type == RDMA_NODE_RNIC)
+		trans = &rds_iw_transport;
+	else
+		trans = &rds_ib_transport;
+
+	/* Prevent shutdown from tearing down the connection
+	 * while we're executing. */
+	if (conn) {
+		mutex_lock(&conn->c_cm_lock);
+
+		/* If the connection is being shut down, bail out
+		 * right away. We return 0 so cm_id doesn't get
+		 * destroyed prematurely */
+		if (rds_conn_state(conn) == RDS_CONN_DISCONNECTING) {
+			/* Reject incoming connections while we're tearing
+			 * down an existing one. */
+			if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
+				ret = 1;
+			goto out;
+		}
+	}
+
+	switch (event->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		ret = trans->cm_handle_connect(cm_id, event);
+		break;
+
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		/* XXX do we need to clean up if this fails? */
+		ret = rdma_resolve_route(cm_id,
+					 RDS_RDMA_RESOLVE_TIMEOUT_MS);
+		break;
+
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		/* XXX worry about racing with listen acceptance */
+		ret = trans->cm_initiate_connect(cm_id);
+		break;
+
+	case RDMA_CM_EVENT_ESTABLISHED:
+		trans->cm_connect_complete(conn, event);
+		break;
+
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+		if (conn)
+			rds_conn_drop(conn);
+		break;
+
+	case RDMA_CM_EVENT_DISCONNECTED:
+		printk(KERN_WARNING "RDS/IW: DISCONNECT event - dropping connection "
+			"%pI4->%pI4\n", &conn->c_laddr,
+			 &conn->c_faddr);
+		rds_conn_drop(conn);
+		break;
+
+	default:
+		/* things like device disconnect? */
+		printk(KERN_ERR "unknown event %u\n", event->event);
+		BUG();
+		break;
+	}
+
+out:
+	if (conn)
+		mutex_unlock(&conn->c_cm_lock);
+
+	rdsdebug("id %p event %u handling ret %d\n", cm_id, event->event, ret);
+
+	return ret;
+}
+
+static int __init rds_rdma_listen_init(void)
+{
+	struct sockaddr_in sin;
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(rds_rdma_cm_event_handler, NULL, RDMA_PS_TCP);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		printk(KERN_ERR "RDS/IW: failed to setup listener, "
+		       "rdma_create_id() returned %d\n", ret);
+		goto out;
+	}
+
+	sin.sin_family = PF_INET;
+	sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY);
+	sin.sin_port = (__force u16)htons(RDS_PORT);
+
+	/*
+	 * XXX I bet this binds the cm_id to a device.  If we want to support
+	 * fail-over we'll have to take this into consideration.
+	 */
+	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	if (ret) {
+		printk(KERN_ERR "RDS/IW: failed to setup listener, "
+		       "rdma_bind_addr() returned %d\n", ret);
+		goto out;
+	}
+
+	ret = rdma_listen(cm_id, 128);
+	if (ret) {
+		printk(KERN_ERR "RDS/IW: failed to setup listener, "
+		       "rdma_listen() returned %d\n", ret);
+		goto out;
+	}
+
+	rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT);
+
+	rds_iw_listen_id = cm_id;
+	cm_id = NULL;
+out:
+	if (cm_id)
+		rdma_destroy_id(cm_id);
+	return ret;
+}
+
+static void rds_rdma_listen_stop(void)
+{
+	if (rds_iw_listen_id) {
+		rdsdebug("cm %p\n", rds_iw_listen_id);
+		rdma_destroy_id(rds_iw_listen_id);
+		rds_iw_listen_id = NULL;
+	}
+}
+
+int __init rds_rdma_init(void)
+{
+	int ret;
+
+	ret = rds_rdma_listen_init();
+	if (ret)
+		goto out;
+
+	ret = rds_iw_init();
+	if (ret)
+		goto err_iw_init;
+
+	ret = rds_ib_init();
+	if (ret)
+		goto err_ib_init;
+
+	goto out;
+
+err_ib_init:
+	rds_iw_exit();
+err_iw_init:
+	rds_rdma_listen_stop();
+out:
+	return ret;
+}
+
+void rds_rdma_exit(void)
+{
+	/* stop listening first to ensure no new connections are attempted */
+	rds_rdma_listen_stop();
+	rds_ib_exit();
+	rds_iw_exit();
+}
+
diff --git a/net/rds/rdma_transport.h b/net/rds/rdma_transport.h
new file mode 100644
index 0000000..2f2c7d9
--- /dev/null
+++ b/net/rds/rdma_transport.h
@@ -0,0 +1,28 @@
+#ifndef _RDMA_TRANSPORT_H
+#define _RDMA_TRANSPORT_H
+
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include "rds.h"
+
+#define RDS_RDMA_RESOLVE_TIMEOUT_MS     5000
+
+int rds_rdma_conn_connect(struct rds_connection *conn);
+int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
+			      struct rdma_cm_event *event);
+
+/* from rdma_transport.c */
+int rds_rdma_init(void);
+void rds_rdma_exit(void);
+
+/* from ib.c */
+extern struct rds_transport rds_ib_transport;
+int rds_ib_init(void);
+void rds_ib_exit(void);
+
+/* from iw.c */
+extern struct rds_transport rds_iw_transport;
+int rds_iw_init(void);
+void rds_iw_exit(void);
+
+#endif
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:35 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:35 -0800
Subject: [ofa-general] [PATCH 18/26] RDS/IB: Stats and sysctls
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-19-git-send-email-andy.grover@oracle.com>

IB-specific stats and sysctls.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/ib_stats.c  |   95 +++++++++++++++++++++++++++++++++++
 net/rds/ib_sysctl.c |  137 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 232 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/ib_stats.c
 create mode 100644 net/rds/ib_sysctl.c
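
Reader's note: rds_ib_stats_info_copy() in this patch totals the per-CPU
counters by treating the statistics struct as a flat array of 64-bit values.
Below is a small userspace sketch of that summing step, assuming a struct
made up only of uint64_t members. The struct, its field names and
stats_sum() are hypothetical, and a plain array stands in for the per-CPU
storage.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical counters; every member must be a uint64_t for this to work. */
struct stats_model {
	uint64_t s_ack_sent;
	uint64_t s_ack_received;
	uint64_t s_rx_ring_empty;
};

/* Sum each counter across all per-CPU copies into *total. */
static void stats_sum(const struct stats_model *per_cpu, size_t ncpus,
		      struct stats_model *total)
{
	uint64_t *sum = (uint64_t *)total;
	size_t nfields = sizeof(*total) / sizeof(uint64_t);
	size_t cpu, i;

	memset(total, 0, sizeof(*total));
	for (cpu = 0; cpu < ncpus; cpu++) {
		/* Walk this CPU's struct as an array of 64-bit counters. */
		const uint64_t *src = (const uint64_t *)&per_cpu[cpu];

		for (i = 0; i < nfields; i++)
			sum[i] += src[i];
	}
}
```

The flat-array walk relies on the struct having no padding and no members
of any other type, so adding a differently sized field would silently break
the totals.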

diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
new file mode 100644
index 0000000..02e3e3d
--- /dev/null
+++ b/net/rds/ib_stats.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+#include "ib.h"
+
+DEFINE_PER_CPU(struct rds_ib_statistics, rds_ib_stats) ____cacheline_aligned;
+
+static char *rds_ib_stat_names[] = {
+	"ib_connect_raced",
+	"ib_listen_closed_stale",
+	"ib_tx_cq_call",
+	"ib_tx_cq_event",
+	"ib_tx_ring_full",
+	"ib_tx_throttle",
+	"ib_tx_sg_mapping_failure",
+	"ib_tx_stalled",
+	"ib_tx_credit_updates",
+	"ib_rx_cq_call",
+	"ib_rx_cq_event",
+	"ib_rx_ring_empty",
+	"ib_rx_refill_from_cq",
+	"ib_rx_refill_from_thread",
+	"ib_rx_alloc_limit",
+	"ib_rx_credit_updates",
+	"ib_ack_sent",
+	"ib_ack_send_failure",
+	"ib_ack_send_delayed",
+	"ib_ack_send_piggybacked",
+	"ib_ack_received",
+	"ib_rdma_mr_alloc",
+	"ib_rdma_mr_free",
+	"ib_rdma_mr_used",
+	"ib_rdma_mr_pool_flush",
+	"ib_rdma_mr_pool_wait",
+	"ib_rdma_mr_pool_depleted",
+};
+
+unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail)
+{
+	struct rds_ib_statistics stats = {0, };
+	uint64_t *src;
+	uint64_t *sum;
+	size_t i;
+	int cpu;
+
+	if (avail < ARRAY_SIZE(rds_ib_stat_names))
+		goto out;
+
+	for_each_online_cpu(cpu) {
+		src = (uint64_t *)&(per_cpu(rds_ib_stats, cpu));
+		sum = (uint64_t *)&stats;
+		for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++)
+			*(sum++) += *(src++);
+	}
+
+	rds_stats_info_copy(iter, (uint64_t *)&stats, rds_ib_stat_names,
+			    ARRAY_SIZE(rds_ib_stat_names));
+out:
+	return ARRAY_SIZE(rds_ib_stat_names);
+}
diff --git a/net/rds/ib_sysctl.c b/net/rds/ib_sysctl.c
new file mode 100644
index 0000000..d87830d
--- /dev/null
+++ b/net/rds/ib_sysctl.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/sysctl.h>
+#include <linux/proc_fs.h>
+
+#include "ib.h"
+
+static struct ctl_table_header *rds_ib_sysctl_hdr;
+
+unsigned long rds_ib_sysctl_max_send_wr = RDS_IB_DEFAULT_SEND_WR;
+unsigned long rds_ib_sysctl_max_recv_wr = RDS_IB_DEFAULT_RECV_WR;
+unsigned long rds_ib_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE;
+static unsigned long rds_ib_sysctl_max_wr_min = 1;
+/* hardware will fail CQ creation long before this */
+static unsigned long rds_ib_sysctl_max_wr_max = (u32)~0;
+
+unsigned long rds_ib_sysctl_max_unsig_wrs = 16;
+static unsigned long rds_ib_sysctl_max_unsig_wr_min = 1;
+static unsigned long rds_ib_sysctl_max_unsig_wr_max = 64;
+
+unsigned long rds_ib_sysctl_max_unsig_bytes = (16 << 20);
+static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1;
+static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL;
+
+unsigned int rds_ib_sysctl_flow_control = 1;
+
+ctl_table rds_ib_sysctl_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_send_wr",
+		.data		= &rds_ib_sysctl_max_send_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_wr_min,
+		.extra2		= &rds_ib_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_wr",
+		.data		= &rds_ib_sysctl_max_recv_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_wr_min,
+		.extra2		= &rds_ib_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_wr",
+		.data		= &rds_ib_sysctl_max_unsig_wrs,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_unsig_wr_min,
+		.extra2		= &rds_ib_sysctl_max_unsig_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_bytes",
+		.data		= &rds_ib_sysctl_max_unsig_bytes,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_unsig_bytes_min,
+		.extra2		= &rds_ib_sysctl_max_unsig_bytes_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_allocation",
+		.data		= &rds_ib_sysctl_max_recv_allocation,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "flow_control",
+		.data		= &rds_ib_sysctl_flow_control,
+		.maxlen		= sizeof(rds_ib_sysctl_flow_control),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{ .ctl_name = 0}
+};
+
+static struct ctl_path rds_ib_sysctl_path[] = {
+	{ .procname = "net", .ctl_name = CTL_NET, },
+	{ .procname = "rds", .ctl_name = CTL_UNNUMBERED, },
+	{ .procname = "ib", .ctl_name = CTL_UNNUMBERED, },
+	{ }
+};
+
+void rds_ib_sysctl_exit(void)
+{
+	if (rds_ib_sysctl_hdr)
+		unregister_sysctl_table(rds_ib_sysctl_hdr);
+}
+
+int __init rds_ib_sysctl_init(void)
+{
+	rds_ib_sysctl_hdr = register_sysctl_paths(rds_ib_sysctl_path, rds_ib_sysctl_table);
+	if (rds_ib_sysctl_hdr == NULL)
+		return -ENOMEM;
+	return 0;
+}
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:38 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:38 -0800
Subject: [ofa-general] [PATCH 21/26] RDS: Documentation
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-22-git-send-email-andy.grover@oracle.com>

This file documents the specifics of the RDS sockets API,
as well as covering some of the details of its internal
implementation.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 Documentation/networking/rds.txt |  356 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/rds.txt

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 0000000..c67077c
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports.  The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE to
+guarantee no dropped packets on Ethernet, it may be possible to use RDS over
+UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ *	Addressing
+        RDS uses IPv4 addresses and 16bit port numbers to identify
+        the end point of a connection. All socket operations that involve
+        passing addresses between kernel and user space generally
+        use a struct sockaddr_in.
+
+        The fact that IPv4 addresses are used does not mean the underlying
+        transport has to be IP-based. In fact, RDS over IB uses a
+        reliable IB connection; the IP address is used exclusively to
+        locate the remote node's GID (by ARPing for the given IP).
+
+        The port space is entirely independent of UDP, TCP or any other
+        protocol.
+
+ *	Socket interface
+        RDS sockets work *mostly* as you would expect from a BSD
+        socket. The next section will cover the details. At any rate,
+        all I/O is performed through the standard BSD socket API.
+        Some additions like zerocopy support are implemented through
+        control messages, while other extensions use the getsockopt/
+        setsockopt calls.
+
+        Sockets must be bound before you can send or receive data.
+        This is needed because binding also selects a transport and
+        attaches it to the socket. Once bound, the transport assignment
+        does not change. RDS will tolerate IPs moving around (e.g. in
+        an active-active HA scenario), but only as long as the address
+        doesn't move to a different transport.
+
+ *	sysctls
+        RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+  AF_RDS, PF_RDS, SOL_RDS
+        These constants haven't been assigned yet, because RDS isn't in
+        mainline yet. Currently, the kernel module assigns some constant
+        and publishes it to user space through two sysctl files
+                /proc/sys/net/rds/pf_rds
+                /proc/sys/net/rds/sol_rds
+
+  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+        This creates a new, unbound RDS socket.
+
+  setsockopt(SOL_SOCKET): send and receive buffer size
+        RDS honors the send and receive buffer size socket options.
+        You are not allowed to queue more than SO_SNDBUF bytes to
+        a socket. A message is queued when sendmsg is called, and
+        it leaves the queue when the remote system acknowledges
+        its arrival.
+
+        The SO_RCVBUF option controls the maximum receive queue length.
+        This is a soft limit rather than a hard limit - RDS will
+        continue to accept and queue incoming messages, even if that
+        takes the queue length over the limit. However, it will also
+        mark the port as "congested" and send a congestion update to
+        the source node. The source node is supposed to throttle any
+        processes sending to this congested port.
+
+  bind(fd, &sockaddr_in, ...)
+        This binds the socket to a local IP address and port, and a
+        transport.
+
+  sendmsg(fd, ...)
+        Sends a message to the indicated recipient. The kernel will
+        transparently establish the underlying reliable connection
+        if it isn't up yet.
+
+        An attempt to send a message that exceeds SO_SNDBUF will
+        return with -EMSGSIZE.
+
+        An attempt to send a message that would take the total number
+        of queued bytes over the SO_SNDBUF threshold will return
+        EAGAIN.
+
+        An attempt to send a message to a destination that is marked
+        as "congested" will return ENOBUFS.
+
+  recvmsg(fd, ...)
+        Receives a message that was queued to this socket. The socket's
+        recv queue accounting is adjusted, and if the queue length
+        drops below SO_RCVBUF, the port is marked uncongested, and
+        a congestion update is sent to all peers.
+
+        Applications can ask the RDS kernel module to receive
+        notifications via control messages (for instance, there is a
+        notification when a congestion update arrives, or when an RDMA
+        operation completes). These notifications are received through
+        the msg.msg_control buffer of struct msghdr. The format of the
+        messages is described in manpages.
+
+  poll(fd)
+        RDS supports the poll interface to allow the application
+        to implement async I/O.
+
+        POLLIN handling is pretty straightforward. When there's an
+        incoming message queued to the socket, or a pending notification,
+        we signal POLLIN.
+
+        POLLOUT is a little harder. Since you can essentially send
+        to any destination, RDS will always signal POLLOUT as long as
+        there's room on the send queue (i.e. the number of bytes queued
+        is less than the sendbuf size).
+
+        However, the kernel will refuse to accept messages to
+        a destination marked congested - in this case you will loop
+        forever if you rely on poll to tell you what to do.
+        This isn't a trivial problem, but applications can deal with
+        this - by using congestion notifications, and by checking for
+        ENOBUFS errors returned by sendmsg.
+
+  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+        This allows the application to discard all messages queued to a
+        specific destination on this particular socket.
+
+        This allows the application to cancel outstanding messages if
+        it detects a timeout. For instance, if it tried to send a message,
+        and the remote host is unreachable, RDS will keep trying forever.
+        The application may decide it's not worth it, and cancel the
+        operation. In this case, it would use RDS_CANCEL_SENT_TO to
+        nuke any pending messages.
+
+
+RDMA for RDS
+============
+
+  see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+  see rds(7) manpage
+
+
+RDS Protocol
+============
+
+  Message header
+
+    The message header is a 'struct rds_header' (see rds.h):
+    Fields:
+      h_sequence:
+          per-packet sequence number
+      h_ack:
+          piggybacked acknowledgment of last packet received
+      h_len:
+          length of data, not including header
+      h_sport:
+          source port
+      h_dport:
+          destination port
+      h_flags:
+          CONG_BITMAP - this is a congestion update bitmap
+          ACK_REQUIRED - receiver must ack this packet
+          RETRANSMITTED - packet has previously been sent
+      h_credit:
+          indicate to other end of connection that
+          it has more credits available (i.e. there is
+          more send room)
+      h_padding[4]:
+          unused, for future use
+      h_csum:
+          header checksum
+      h_exthdr:
+          optional data can be passed here. This is currently used for
+          passing RDMA-related information.
+
+  ACK and retransmit handling
+
+      One might think that with reliable IB connections you wouldn't need
+      to ack messages that have been received.  The problem is that IB
+      hardware generates an ack message before it has DMAed the message
+      into memory.  This creates a potential message loss if the HCA is
+      disabled for any reason between when it sends the ack and when
+      the message is DMAed and processed.  This is only a potential issue
+      if another HCA is available for fail-over.
+
+      Sending an ack immediately would allow the sender to free the sent
+      message from their send queue quickly, but could cause excessive
+      traffic to be used for acks. RDS piggybacks acks on sent data
+      packets.  Ack-only packets are reduced by only allowing one to be
+      in flight at a time, and by the sender only asking for acks when
+      its send buffers start to fill up. All retransmissions are also
+      acked.
+
+  Flow Control
+
+      RDS's IB transport uses a credit-based mechanism to verify that
+      there is space in the peer's receive buffers for more data. This
+      eliminates the need for hardware retries on the connection.
+
+  Congestion
+
+      Messages waiting in the receive queue on the receiving socket
+      are accounted against the socket's SO_RCVBUF option value.  Only
+      the payload bytes in the message are accounted for.  If the
+      number of bytes queued equals or exceeds rcvbuf then the socket
+      is congested.  All sends attempted to this socket's address
+      should block or return -EWOULDBLOCK.
+
+      Applications are expected to be reasonably tuned such that this
+      situation very rarely occurs.  An application encountering this
+      "back-pressure" is considered a bug.
+
+      This is implemented by having each node maintain bitmaps which
+      indicate which ports on bound addresses are congested.  As the
+      bitmap changes it is sent through all the connections which
+      terminate in the local address of the bitmap which changed.
+
+      The bitmaps are allocated as connections are brought up.  This
+      avoids allocation in the interrupt handling path which queues
+      messages on sockets.  The dense bitmaps let transports send the
+      entire bitmap on any bitmap change reasonably efficiently.  This
+      is much easier to implement than some finer-grained
+      communication of per-port congestion.  The sender does a very
+      inexpensive bit test to test if the port it's about to send to
+      is congested or not.
+
+
+RDS Transport Layer
+===================
+
+  As mentioned above, RDS is not IB-specific. Its code is divided
+  into a general RDS layer and a transport layer.
+
+  The general layer handles the socket API, congestion handling,
+  loopback, stats, usermem pinning, and the connection state machine.
+
+  The transport layer handles the details of the transport. The IB
+  transport, for example, handles all the queue pairs, work requests,
+  CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+  struct rds_message
+    aka possibly "rds_outgoing", the generic RDS layer copies data to
+    be sent and sets header fields as needed, based on the socket API.
+    This is then queued for the individual connection and sent by the
+    connection's transport.
+  struct rds_incoming
+    a generic struct referring to incoming data that can be handed from
+    the transport to the general code and queued by the general code
+    while the socket is awoken. It is then passed back to the transport
+    code to handle the actual copy-to-user.
+  struct rds_socket
+    per-socket information
+  struct rds_connection
+    per-connection information
+  struct rds_transport
+    pointers to transport-specific functions
+  struct rds_statistics
+    non-transport-specific statistics
+  struct rds_cong_map
+    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+  ERROR states.
+
+  The first time an attempt is made by an RDS socket to send data to
+  a node, a connection is allocated and connected. That connection is
+  then maintained forever -- if there are transport errors, the
+  connection will be dropped and re-established.
+
+  Dropping a connection while packets are queued will cause queued or
+  partially-sent datagrams to be retransmitted when the connection is
+  re-established.
+
+
+The send path
+=============
+
+  rds_sendmsg()
+    struct rds_message built from incoming data
+    CMSGs parsed (e.g. RDMA ops)
+    transport connection alloced and connected if not already
+    rds_message placed on send queue
+    send worker awoken
+  rds_send_worker()
+    calls rds_send_xmit() until queue is empty
+  rds_send_xmit()
+    transmits congestion map if one is pending
+    may set ACK_REQUIRED
+    calls transport to send either non-RDMA or RDMA message
+    (RDMA ops never retransmitted)
+  rds_ib_xmit()
+    allocs work requests from send ring
+    adds any new send credits available to peer (h_credit)
+    maps the rds_message's sg list
+    piggybacks ack
+    populates work requests
+    post send to connection's queue pair
+
+The recv path
+=============
+
+  rds_ib_recv_cq_comp_handler()
+    looks at write completions
+    unmaps recv buffer from device
+    no errors, call rds_ib_process_recv()
+    refill recv ring
+  rds_ib_process_recv()
+    validate header checksum
+    copy header to rds_ib_incoming struct if start of a new datagram
+    add to ibinc's fraglist
+    if completed datagram:
+      update cong map if datagram was cong update
+      call rds_recv_incoming() otherwise
+      note if ack is required
+  rds_recv_incoming()
+    drop duplicate packets
+    respond to pings
+    find the sock associated with this datagram
+    add to sock queue
+    wake up sock
+    do some congestion calculations
+  rds_recvmsg()
+    copy data into user iovec
+    handle CMSGs
+    return to application
+
+
-- 
1.5.6.3
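As a rough illustration of the sendmsg() rules listed in rds.txt above,
the three documented error cases can be modelled as one pure function.
This is a userspace sketch under stated assumptions, not the kernel
code: struct send_state and send_check() are hypothetical names, the
order in which the checks run is an assumption (rds.txt does not state
one), and the function returns positive errno values where the kernel
would return the negated form.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one socket's send-side state (not a kernel struct). */
struct send_state {
	size_t sndbuf;          /* SO_SNDBUF limit, in bytes */
	size_t queued;          /* bytes queued and not yet acked by the peer */
	bool   dest_congested;  /* congestion bit of the destination port */
};

/*
 * Return 0 if a message of 'len' bytes would be queued, otherwise the
 * errno value that rds.txt associates with the failure.
 */
static inline int send_check(const struct send_state *s, size_t len)
{
	assert(s != NULL);

	/* A single message larger than the send buffer can never fit. */
	if (len > s->sndbuf)
		return EMSGSIZE;

	/* The message fits on its own, but not on top of what is queued. */
	if (s->queued > s->sndbuf || len > s->sndbuf - s->queued)
		return EAGAIN;

	/* There is room locally, but the destination port is congested. */
	if (s->dest_congested)
		return ENOBUFS;

	return 0;
}
```

With sndbuf = 100 and 60 bytes already queued, a 101-byte message maps
to EMSGSIZE, a 41-byte message to EAGAIN, and a 40-byte message is
accepted unless the destination is marked congested, in which case the
result is ENOBUFS.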

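The dense congestion bitmap described in the Congestion section of
rds.txt can be sketched in userspace C as one bit per 16-bit port. The
names below (struct cong_map, cong_set, cong_clear, cong_test) and the
byte-wise layout are illustrative assumptions, not the kernel's
rds_cong_map; ports are taken in host byte order for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CONG_PORTS	65536u			/* one bit per 16-bit port */
#define CONG_MAP_BYTES	(CONG_PORTS / 8)	/* 8192 bytes in total */

/* Hypothetical dense bitmap for one bound address. */
struct cong_map {
	uint8_t bits[CONG_MAP_BYTES];
};

/* Mark a port congested (receive queue at or over its limit). */
static inline void cong_set(struct cong_map *m, uint16_t port)
{
	assert(m != NULL);
	m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port uncongested again. */
static inline void cong_clear(struct cong_map *m, uint16_t port)
{
	assert(m != NULL);
	m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* The inexpensive bit test a sender would do before queueing a message. */
static inline bool cong_test(const struct cong_map *m, uint16_t port)
{
	assert(m != NULL);
	return (m->bits[port >> 3] >> (port & 7)) & 1u;
}
```

A map covering every port is only 8 KB, which is why sending the whole
bitmap on any change stays reasonably cheap compared with a
finer-grained per-port protocol.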

From andy.grover at oracle.com  Tue Feb 24 17:30:39 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:39 -0800
Subject: [ofa-general] [PATCH 22/26] RDS: Kconfig and Makefile
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-23-git-send-email-andy.grover@oracle.com>

Add RDS Kconfig and Makefile, and modify net/'s to add
us to the build.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/Kconfig      |    1 +
 net/Makefile     |    1 +
 net/rds/Kconfig  |   13 +++++++++++++
 net/rds/Makefile |   14 ++++++++++++++
 4 files changed, 29 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/Kconfig
 create mode 100644 net/rds/Makefile

diff --git a/net/Kconfig b/net/Kconfig
index a12bae0..6b39ede 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -171,6 +171,7 @@ endif
 
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
+source "net/rds/Kconfig"
 source "net/tipc/Kconfig"
 source "net/atm/Kconfig"
 source "net/802/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 0fcce89..9e00a55 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -49,6 +49,7 @@ obj-y				+= 8021q/
 endif
 obj-$(CONFIG_IP_DCCP)		+= dccp/
 obj-$(CONFIG_IP_SCTP)		+= sctp/
+obj-$(CONFIG_RDS)		+= rds/
 obj-y				+= wireless/
 obj-$(CONFIG_MAC80211)		+= mac80211/
 obj-$(CONFIG_TIPC)		+= tipc/
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
new file mode 100644
index 0000000..63bd370
--- /dev/null
+++ b/net/rds/Kconfig
@@ -0,0 +1,13 @@
+
+config RDS
+	tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)"
+	depends on INET && INFINIBAND_IPOIB && EXPERIMENTAL
+	---help---
+	  RDS provides reliable, sequenced delivery of datagrams
+	  over Infiniband.
+
+config RDS_DEBUG
+	bool "Debugging messages"
+	depends on RDS
+	default n
+
diff --git a/net/rds/Makefile b/net/rds/Makefile
new file mode 100644
index 0000000..51f2758
--- /dev/null
+++ b/net/rds/Makefile
@@ -0,0 +1,14 @@
+obj-$(CONFIG_RDS) += rds.o
+rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o   \
+			recv.o send.o stats.o sysctl.o threads.o transport.o \
+			loop.o page.o rdma.o \
+			rdma_transport.o \
+			ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
+			ib_sysctl.o ib_rdma.o \
+			iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
+			iw_sysctl.o iw_rdma.o
+
+ifeq ($(CONFIG_RDS_DEBUG), y)
+EXTRA_CFLAGS += -DDEBUG
+endif
+
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:40 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:40 -0800
Subject: [ofa-general] [PATCH 23/26] RDS: Add AF and PF #defines for RDS
	sockets
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-24-git-send-email-andy.grover@oracle.com>

RDS is a reliable datagram protocol used for IPC on Oracle
database clusters. This adds address and protocol family numbers
for it.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 include/linux/socket.h |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 20fc4bb..3cdc041 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -191,7 +191,8 @@ struct ucred {
 #define AF_RXRPC	33	/* RxRPC sockets 		*/
 #define AF_ISDN		34	/* mISDN sockets 		*/
 #define AF_PHONET	35	/* Phonet sockets		*/
-#define AF_MAX		36	/* For now.. */
+#define AF_RDS		36	/* RDS sockets 			*/
+#define AF_MAX		37	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -229,6 +230,7 @@ struct ucred {
 #define PF_RXRPC	AF_RXRPC
 #define PF_ISDN		AF_ISDN
 #define PF_PHONET	AF_PHONET
+#define PF_RDS		AF_RDS
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -298,6 +300,7 @@ struct ucred {
 #define SOL_PPPOL2TP	273
 #define SOL_BLUETOOTH	274
 #define SOL_PNPIPE	275
+#define SOL_RDS		276
 
 /* IPX options */
 #define IPX_TYPE	1
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:42 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:42 -0800
Subject: [ofa-general] [PATCH 25/26] RDS: Add userspace header
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-26-git-send-email-andy.grover@oracle.com>

Applications include this header in order to use RDS sockets.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 include/linux/rds.h |  250 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 250 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/rds.h

diff --git a/include/linux/rds.h b/include/linux/rds.h
new file mode 100644
index 0000000..d91dc91
--- /dev/null
+++ b/include/linux/rds.h
@@ -0,0 +1,250 @@
+/*
+ * Copyright (c) 2008 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _LINUX_RDS_H
+#define _LINUX_RDS_H
+
+#include <linux/types.h>
+
+/* These sparse annotated types shouldn't be in any user
+ * visible header file. We should clean this up rather
+ * than kludging around them. */
+#ifndef __KERNEL__
+#define __be16	u_int16_t
+#define __be32	u_int32_t
+#define __be64	u_int64_t
+#endif
+
+#define RDS_IB_ABI_VERSION		0x301
+
+/*
+ * setsockopt/getsockopt for SOL_RDS
+ */
+#define RDS_CANCEL_SENT_TO      	1
+#define RDS_GET_MR			2
+#define RDS_FREE_MR			3
+/* deprecated: RDS_BARRIER 4 */
+#define RDS_RECVERR			5
+#define RDS_CONG_MONITOR		6
+
+/*
+ * Control message types for SOL_RDS.
+ *
+ * RDS_CMSG_RDMA_ARGS (sendmsg)
+ *	Request an RDMA transfer to/from the specified
+ *	memory ranges.
+ *	The cmsg_data is a struct rds_rdma_args.
+ * RDS_CMSG_RDMA_DEST (recvmsg, sendmsg)
+ *	Kernel informs application about intended
+ *	source/destination of an RDMA transfer
+ * RDS_CMSG_RDMA_MAP (sendmsg)
+ *	Application asks kernel to map the given
+ *	memory range into an IB MR, and send the
+ *	R_Key along in an RDS extension header.
+ *	The cmsg_data is a struct rds_get_mr_args,
+ *	the same as for the GET_MR setsockopt.
+ * RDS_CMSG_RDMA_STATUS (recvmsg)
+ *	Returns the status of a completed RDMA operation.
+ */
+#define RDS_CMSG_RDMA_ARGS		1
+#define RDS_CMSG_RDMA_DEST		2
+#define RDS_CMSG_RDMA_MAP		3
+#define RDS_CMSG_RDMA_STATUS		4
+#define RDS_CMSG_CONG_UPDATE		5
+
+#define RDS_INFO_FIRST			10000
+#define RDS_INFO_COUNTERS		10000
+#define RDS_INFO_CONNECTIONS		10001
+/* 10002 aka RDS_INFO_FLOWS is deprecated */
+#define RDS_INFO_SEND_MESSAGES		10003
+#define RDS_INFO_RETRANS_MESSAGES       10004
+#define RDS_INFO_RECV_MESSAGES          10005
+#define RDS_INFO_SOCKETS                10006
+#define RDS_INFO_TCP_SOCKETS            10007
+#define RDS_INFO_IB_CONNECTIONS		10008
+#define RDS_INFO_CONNECTION_STATS	10009
+#define RDS_INFO_IWARP_CONNECTIONS	10010
+#define RDS_INFO_LAST			10010
+
+struct rds_info_counter {
+	u_int8_t	name[32];
+	u_int64_t	value;
+} __attribute__((packed));
+
+#define RDS_INFO_CONNECTION_FLAG_SENDING	0x01
+#define RDS_INFO_CONNECTION_FLAG_CONNECTING	0x02
+#define RDS_INFO_CONNECTION_FLAG_CONNECTED	0x04
+
+#define TRANSNAMSIZ	16
+
+struct rds_info_connection {
+	u_int64_t	next_tx_seq;
+	u_int64_t	next_rx_seq;
+	__be32		laddr;
+	__be32		faddr;
+	u_int8_t	transport[TRANSNAMSIZ];		/* null term ascii */
+	u_int8_t	flags;
+} __attribute__((packed));
+
+struct rds_info_flow {
+	__be32		laddr;
+	__be32		faddr;
+	u_int32_t	bytes;
+	__be16		lport;
+	__be16		fport;
+} __attribute__((packed));
+
+#define RDS_INFO_MESSAGE_FLAG_ACK               0x01
+#define RDS_INFO_MESSAGE_FLAG_FAST_ACK          0x02
+
+struct rds_info_message {
+	u_int64_t	seq;
+	u_int32_t	len;
+	__be32		laddr;
+	__be32		faddr;
+	__be16		lport;
+	__be16		fport;
+	u_int8_t	flags;
+} __attribute__((packed));
+
+struct rds_info_socket {
+	u_int32_t	sndbuf;
+	__be32		bound_addr;
+	__be32		connected_addr;
+	__be16		bound_port;
+	__be16		connected_port;
+	u_int32_t	rcvbuf;
+	u_int64_t	inum;
+} __attribute__((packed));
+
+#define RDS_IB_GID_LEN	16
+struct rds_info_rdma_connection {
+	__be32		src_addr;
+	__be32		dst_addr;
+	uint8_t		src_gid[RDS_IB_GID_LEN];
+	uint8_t		dst_gid[RDS_IB_GID_LEN];
+
+	uint32_t	max_send_wr;
+	uint32_t	max_recv_wr;
+	uint32_t	max_send_sge;
+	uint32_t	rdma_mr_max;
+	uint32_t	rdma_mr_size;
+};
+
+/*
+ * Congestion monitoring.
+ * Congestion control in RDS happens at the host connection
+ * level by exchanging a bitmap marking congested ports.
+ * By default, a process sleeping in poll() is always woken
+ * up when the congestion map is updated.
+ * With explicit monitoring, an application can have more
+ * fine-grained control.
+ * The application installs a 64bit mask value in the socket,
+ * where each bit corresponds to a group of ports.
+ * When a congestion update arrives, RDS checks the set of
+ * ports that are now uncongested against the bit mask
+ * installed in the socket, and if they overlap, we queue a
+ * cong_notification on the socket.
+ *
+ * To install the congestion monitor bitmask, use RDS_CONG_MONITOR
+ * with the 64bit mask.
+ * Congestion updates are received via RDS_CMSG_CONG_UPDATE
+ * control messages.
+ *
+ * The correspondence between bits and ports is
+ *	1 << (portnum % 64)
+ */
+#define RDS_CONG_MONITOR_SIZE	64
+#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int)(port)) % RDS_CONG_MONITOR_SIZE)
+#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
+
+/*
+ * RDMA related types
+ */
+
+/*
+ * This encapsulates a remote memory location.
+ * In the current implementation, it contains the R_Key
+ * of the remote memory region, and the offset into it
+ * (so that the application does not have to worry about
+ * alignment).
+ */
+typedef u_int64_t	rds_rdma_cookie_t;
+
+struct rds_iovec {
+	u_int64_t	addr;
+	u_int64_t	bytes;
+};
+
+struct rds_get_mr_args {
+	struct rds_iovec vec;
+	u_int64_t	cookie_addr;
+	uint64_t	flags;
+};
+
+struct rds_free_mr_args {
+	rds_rdma_cookie_t cookie;
+	u_int64_t	flags;
+};
+
+struct rds_rdma_args {
+	rds_rdma_cookie_t cookie;
+	struct rds_iovec remote_vec;
+	u_int64_t	local_vec_addr;
+	u_int64_t	nr_local;
+	u_int64_t	flags;
+	u_int64_t	user_token;
+};
+
+struct rds_rdma_notify {
+	u_int64_t	user_token;
+	int32_t		status;
+};
+
+#define RDS_RDMA_SUCCESS	0
+#define RDS_RDMA_REMOTE_ERROR	1
+#define RDS_RDMA_CANCELED	2
+#define RDS_RDMA_DROPPED	3
+#define RDS_RDMA_OTHER_ERROR	4
+
+/*
+ * Common set of flags for all RDMA related structs
+ */
+#define RDS_RDMA_READWRITE	0x0001
+#define RDS_RDMA_FENCE		0x0002	/* use FENCE for immediate send */
+#define RDS_RDMA_INVALIDATE	0x0004	/* invalidate R_Key after freeing MR */
+#define RDS_RDMA_USE_ONCE	0x0008	/* free MR after use */
+#define RDS_RDMA_DONTWAIT	0x0010	/* Don't wait in SET_BARRIER */
+#define RDS_RDMA_NOTIFY_ME	0x0020	/* Notify when operation completes */
+
+#endif /* IB_RDS_H */
-- 
1.5.6.3
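
Note on the congestion monitor macros in ib_rds.h above: a minimal
userspace sketch of how an application might build the 64-bit mask and
test an update against it. The macros are copied from the patch (with
the argument parenthesized); cong_mask_for_ports() and
cong_update_matches() are hypothetical helper names, not part of RDS,
and the setsockopt() call that installs the mask via RDS_CONG_MONITOR
is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define RDS_CONG_MONITOR_SIZE	64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

/* Hypothetical helper: OR together the mask bits for a list of ports. */
static uint64_t cong_mask_for_ports(const uint16_t *ports, size_t n)
{
	uint64_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++)
		mask |= RDS_CONG_MONITOR_MASK(ports[i]);
	return mask;
}

/* Hypothetical helper: would an update for 'port' overlap the installed mask? */
static int cong_update_matches(uint64_t installed, uint16_t port)
{
	return (installed & RDS_CONG_MONITOR_MASK(port)) != 0;
}
```

Because the mapping is portnum % 64, ports 1, 65 and 129 all share
bit 1, so a notification only says that some port in that group became
uncongested.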


From andy.grover at oracle.com  Tue Feb 24 17:30:41 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:41 -0800
Subject: [ofa-general] [PATCH 24/26] RDS: Add MAINTAINERS entry
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-25-git-send-email-andy.grover@oracle.com>

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 MAINTAINERS |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 59fd2d1..fd68b34 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3639,6 +3639,12 @@ M:	florian.fainelli at telecomint.eu
 L:	netdev at vger.kernel.org
 S:	Maintained
 
+RDS - RELIABLE DATAGRAM SOCKETS
+P:	Andy Grover
+M:	andy.grover at oracle.com
+L:	rds-devel at oss.oracle.com
+S:	Supported
+
 READ-COPY UPDATE (RCU)
 P:	Dipankar Sarma
 M:	dipankar at in.ibm.com
-- 
1.5.6.3


From andy.grover at oracle.com  Tue Feb 24 17:30:36 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:36 -0800
Subject: [ofa-general] [PATCH 19/26] RDS: Add iWARP support
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-20-git-send-email-andy.grover@oracle.com>

Support for iWARP NICs is implemented as a separate
RDS transport from IB. The code, however, is very
similar to IB (it was forked, basically), so let's keep
it in one changeset.

The reason for this duplication is that despite its similarity
to IB, there are a number of places where it has different
semantics. iWARP zcopy support is still under development,
and giving it its own sandbox ensures that IB code isn't
disrupted while iWARP changes. Over time these transports
will re-converge.

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/rds/iw.c        |  333 ++++++++++++++++++
 net/rds/iw.h        |  395 +++++++++++++++++++++
 net/rds/iw_cm.c     |  750 +++++++++++++++++++++++++++++++++++++++
 net/rds/iw_rdma.c   |  888 ++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/iw_recv.c   |  869 +++++++++++++++++++++++++++++++++++++++++++++
 net/rds/iw_ring.c   |  169 +++++++++
 net/rds/iw_send.c   |  975 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/iw_stats.c  |   95 +++++
 net/rds/iw_sysctl.c |  137 +++++++
 9 files changed, 4611 insertions(+), 0 deletions(-)
 create mode 100644 net/rds/iw.c
 create mode 100644 net/rds/iw.h
 create mode 100644 net/rds/iw_cm.c
 create mode 100644 net/rds/iw_rdma.c
 create mode 100644 net/rds/iw_recv.c
 create mode 100644 net/rds/iw_ring.c
 create mode 100644 net/rds/iw_send.c
 create mode 100644 net/rds/iw_stats.c
 create mode 100644 net/rds/iw_sysctl.c
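
Note on i_credits in iw.h: the header comment says the send credits and
the posted receive credits share one atomic_t that is updated with
cmpxchg. Below is a userspace sketch of that packing, assuming C11
<stdatomic.h> and a uint32_t standing in for the kernel's atomic_t.
The four macros are copied from iw.h; grab_send_credits() is a
hypothetical name and is not the patch's rds_iw_send_grab_credits().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Low 16 bits: send credits. High 16 bits: posted receive credits. */
#define IB_GET_SEND_CREDITS(v)	((v) & 0xffff)
#define IB_GET_POST_CREDITS(v)	((v) >> 16)
#define IB_SET_SEND_CREDITS(v)	((v) & 0xffff)
#define IB_SET_POST_CREDITS(v)	((v) << 16)

/*
 * Hypothetical helper: take up to 'wanted' send credits and return how
 * many were granted. The posted-credit half is left untouched because
 * we never subtract more than the low half currently holds.
 */
static uint32_t grab_send_credits(_Atomic uint32_t *credits, uint32_t wanted)
{
	uint32_t oldval, newval, avail, got;

	oldval = atomic_load(credits);
	do {
		avail = IB_GET_SEND_CREDITS(oldval);
		got = wanted < avail ? wanted : avail;
		newval = oldval - IB_SET_SEND_CREDITS(got);
		/* On failure, oldval is reloaded and the loop retries. */
	} while (!atomic_compare_exchange_weak(credits, &oldval, newval));

	return got;
}
```

Both counters change in one compare-and-exchange, so a concurrent
update to either half forces a retry instead of being lost.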

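Note on queue sizing in rds_iw_init_qp_attrs(): the QP is sized one
entry larger than the work ring so that the separate ACK work request
always has a slot. A small sketch of that arithmetic follows; struct
qp_sizes and size_queue() are hypothetical names used only for
illustration, and the inputs stand in for rds_iwdev->max_wrs and the
rds_iw_sysctl_max_{send,recv}_wr values.

```c
/* Hypothetical result pair: QP depth and the ring size derived from it. */
struct qp_sizes {
	unsigned int qp_wrs;		/* value for attr->cap.max_*_wr */
	unsigned int ring_entries;	/* value passed to the ring resize */
};

/* Clamp (sysctl + 1) to the device limit, then reserve one WR for the ACK. */
static struct qp_sizes size_queue(unsigned int dev_max_wrs,
				  unsigned long sysctl_max_wr)
{
	struct qp_sizes s;
	unsigned int want = (unsigned int)sysctl_max_wr + 1;

	s.qp_wrs = dev_max_wrs < want ? dev_max_wrs : want;
	s.ring_entries = s.qp_wrs - 1;
	return s;
}
```

With a sysctl value of 256 and a device limit of 16384 this gives a QP
depth of 257 and a ring of 256; if the device only allows 128 WRs, the
ring shrinks to 127.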
diff --git a/net/rds/iw.c b/net/rds/iw.c
new file mode 100644
index 0000000..1b56905
--- /dev/null
+++ b/net/rds/iw.c
@@ -0,0 +1,333 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/if.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/if_arp.h>
+#include <linux/delay.h>
+
+#include "rds.h"
+#include "iw.h"
+
+unsigned int fastreg_pool_size = RDS_FASTREG_POOL_SIZE;
+unsigned int fastreg_message_size = RDS_FASTREG_SIZE + 1; /* +1 allows for unaligned MRs */
+
+module_param(fastreg_pool_size, uint, 0444);
+MODULE_PARM_DESC(fastreg_pool_size, " Max number of fastreg MRs per device");
+module_param(fastreg_message_size, uint, 0444);
+MODULE_PARM_DESC(fastreg_message_size, " Max size of a RDMA transfer (fastreg MRs)");
+
+struct list_head rds_iw_devices;
+
+DEFINE_SPINLOCK(iw_nodev_conns_lock);
+LIST_HEAD(iw_nodev_conns);
+
+void rds_iw_add_one(struct ib_device *device)
+{
+	struct rds_iw_device *rds_iwdev;
+	struct ib_device_attr *dev_attr;
+
+	/* Only handle iwarp devices */
+	if (device->node_type != RDMA_NODE_RNIC)
+		return;
+
+	dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
+	if (!dev_attr)
+		return;
+
+	if (ib_query_device(device, dev_attr)) {
+		rdsdebug("Query device failed for %s\n", device->name);
+		goto free_attr;
+	}
+
+	rds_iwdev = kmalloc(sizeof *rds_iwdev, GFP_KERNEL);
+	if (!rds_iwdev)
+		goto free_attr;
+
+	spin_lock_init(&rds_iwdev->spinlock);
+
+	rds_iwdev->dma_local_lkey = !!(dev_attr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY);
+	rds_iwdev->max_wrs = dev_attr->max_qp_wr;
+	rds_iwdev->max_sge = min(dev_attr->max_sge, RDS_IW_MAX_SGE);
+
+	rds_iwdev->page_shift = max(PAGE_SHIFT, ffs(dev_attr->page_size_cap) - 1);
+
+	rds_iwdev->dev = device;
+	rds_iwdev->pd = ib_alloc_pd(device);
+	if (IS_ERR(rds_iwdev->pd))
+		goto free_dev;
+
+	if (!rds_iwdev->dma_local_lkey) {
+		if (device->node_type != RDMA_NODE_RNIC) {
+			rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd,
+						IB_ACCESS_LOCAL_WRITE);
+		} else {
+			rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd,
+						IB_ACCESS_REMOTE_READ |
+						IB_ACCESS_REMOTE_WRITE |
+						IB_ACCESS_LOCAL_WRITE);
+		}
+		if (IS_ERR(rds_iwdev->mr))
+			goto err_pd;
+	} else
+		rds_iwdev->mr = NULL;
+
+	rds_iwdev->mr_pool = rds_iw_create_mr_pool(rds_iwdev);
+	if (IS_ERR(rds_iwdev->mr_pool)) {
+		rds_iwdev->mr_pool = NULL;
+		goto err_mr;
+	}
+
+	INIT_LIST_HEAD(&rds_iwdev->cm_id_list);
+	INIT_LIST_HEAD(&rds_iwdev->conn_list);
+	list_add_tail(&rds_iwdev->list, &rds_iw_devices);
+
+	ib_set_client_data(device, &rds_iw_client, rds_iwdev);
+
+	goto free_attr;
+
+err_mr:
+	if (rds_iwdev->mr)
+		ib_dereg_mr(rds_iwdev->mr);
+err_pd:
+	ib_dealloc_pd(rds_iwdev->pd);
+free_dev:
+	kfree(rds_iwdev);
+free_attr:
+	kfree(dev_attr);
+}
+
+void rds_iw_remove_one(struct ib_device *device)
+{
+	struct rds_iw_device *rds_iwdev;
+	struct rds_iw_cm_id *i_cm_id, *next;
+
+	rds_iwdev = ib_get_client_data(device, &rds_iw_client);
+	if (!rds_iwdev)
+		return;
+
+	spin_lock_irq(&rds_iwdev->spinlock);
+	list_for_each_entry_safe(i_cm_id, next, &rds_iwdev->cm_id_list, list) {
+		list_del(&i_cm_id->list);
+		kfree(i_cm_id);
+	}
+	spin_unlock_irq(&rds_iwdev->spinlock);
+
+	rds_iw_remove_conns(rds_iwdev);
+
+	if (rds_iwdev->mr_pool)
+		rds_iw_destroy_mr_pool(rds_iwdev->mr_pool);
+
+	if (rds_iwdev->mr)
+		ib_dereg_mr(rds_iwdev->mr);
+
+	while (ib_dealloc_pd(rds_iwdev->pd)) {
+		rdsdebug("Failed to dealloc pd %p\n", rds_iwdev->pd);
+		msleep(1);
+	}
+
+	list_del(&rds_iwdev->list);
+	kfree(rds_iwdev);
+}
+
+struct ib_client rds_iw_client = {
+	.name   = "rds_iw",
+	.add    = rds_iw_add_one,
+	.remove = rds_iw_remove_one
+};
+
+static int rds_iw_conn_info_visitor(struct rds_connection *conn,
+				    void *buffer)
+{
+	struct rds_info_rdma_connection *iinfo = buffer;
+	struct rds_iw_connection *ic;
+
+	/* We will only ever look at iWARP transports */
+	if (conn->c_trans != &rds_iw_transport)
+		return 0;
+
+	iinfo->src_addr = conn->c_laddr;
+	iinfo->dst_addr = conn->c_faddr;
+
+	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
+	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		struct rds_iw_device *rds_iwdev;
+		struct rdma_dev_addr *dev_addr;
+
+		ic = conn->c_transport_data;
+		dev_addr = &ic->i_cm_id->route.addr.dev_addr;
+
+		ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid);
+		ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid);
+
+		rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client);
+		iinfo->max_send_wr = ic->i_send_ring.w_nr;
+		iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
+		iinfo->max_send_sge = rds_iwdev->max_sge;
+		rds_iw_get_mr_info(rds_iwdev, iinfo);
+	}
+	return 1;
+}
+
+static void rds_iw_ic_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	rds_for_each_conn_info(sock, len, iter, lens,
+				rds_iw_conn_info_visitor,
+				sizeof(struct rds_info_rdma_connection));
+}
+
+
+/*
+ * Early RDS/IB was built to only bind to an address if there is an IPoIB
+ * device with that address set.
+ *
+ * If it were me, I'd advocate for something more flexible.  Sending and
+ * receiving should be device-agnostic.  Transports would try and maintain
+ * connections between peers who have messages queued.  Userspace would be
+ * allowed to influence which paths have priority.  We could call userspace
+ * asserting this policy "routing".
+ */
+static int rds_iw_laddr_check(__be32 addr)
+{
+	int ret;
+	struct rdma_cm_id *cm_id;
+	struct sockaddr_in sin;
+
+	/* Create a CMA ID and try to bind it. This catches both
+	 * IB and iWARP capable NICs.
+	 */
+	cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP);
+	if (IS_ERR(cm_id))
+		return PTR_ERR(cm_id);
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = addr;
+
+	/* rdma_bind_addr will only succeed for IB & iWARP devices */
+	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	/* due to this, we will claim to support IB devices unless we
+	   check node_type. */
+	if (ret || !cm_id->device || cm_id->device->node_type != RDMA_NODE_RNIC)
+		ret = -EADDRNOTAVAIL;
+
+	rdsdebug("addr %pI4 ret %d node type %d\n",
+		&addr, ret,
+		cm_id->device ? cm_id->device->node_type : -1);
+
+	rdma_destroy_id(cm_id);
+
+	return ret;
+}
+
+void rds_iw_exit(void)
+{
+	rds_info_deregister_func(RDS_INFO_IWARP_CONNECTIONS, rds_iw_ic_info);
+	rds_iw_remove_nodev_conns();
+	ib_unregister_client(&rds_iw_client);
+	rds_iw_sysctl_exit();
+	rds_iw_recv_exit();
+	rds_trans_unregister(&rds_iw_transport);
+}
+
+struct rds_transport rds_iw_transport = {
+	.laddr_check		= rds_iw_laddr_check,
+	.xmit_complete		= rds_iw_xmit_complete,
+	.xmit			= rds_iw_xmit,
+	.xmit_cong_map		= NULL,
+	.xmit_rdma		= rds_iw_xmit_rdma,
+	.recv			= rds_iw_recv,
+	.conn_alloc		= rds_iw_conn_alloc,
+	.conn_free		= rds_iw_conn_free,
+	.conn_connect		= rds_iw_conn_connect,
+	.conn_shutdown		= rds_iw_conn_shutdown,
+	.inc_copy_to_user	= rds_iw_inc_copy_to_user,
+	.inc_purge		= rds_iw_inc_purge,
+	.inc_free		= rds_iw_inc_free,
+	.cm_initiate_connect	= rds_iw_cm_initiate_connect,
+	.cm_handle_connect	= rds_iw_cm_handle_connect,
+	.cm_connect_complete	= rds_iw_cm_connect_complete,
+	.stats_info_copy	= rds_iw_stats_info_copy,
+	.exit			= rds_iw_exit,
+	.get_mr			= rds_iw_get_mr,
+	.sync_mr		= rds_iw_sync_mr,
+	.free_mr		= rds_iw_free_mr,
+	.flush_mrs		= rds_iw_flush_mrs,
+	.t_owner		= THIS_MODULE,
+	.t_name			= "iwarp",
+	.t_prefer_loopback	= 1,
+};
+
+int __init rds_iw_init(void)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&rds_iw_devices);
+
+	ret = ib_register_client(&rds_iw_client);
+	if (ret)
+		goto out;
+
+	ret = rds_iw_sysctl_init();
+	if (ret)
+		goto out_ibreg;
+
+	ret = rds_iw_recv_init();
+	if (ret)
+		goto out_sysctl;
+
+	ret = rds_trans_register(&rds_iw_transport);
+	if (ret)
+		goto out_recv;
+
+	rds_info_register_func(RDS_INFO_IWARP_CONNECTIONS, rds_iw_ic_info);
+
+	goto out;
+
+out_recv:
+	rds_iw_recv_exit();
+out_sysctl:
+	rds_iw_sysctl_exit();
+out_ibreg:
+	ib_unregister_client(&rds_iw_client);
+out:
+	return ret;
+}
+
+MODULE_LICENSE("GPL");
+
diff --git a/net/rds/iw.h b/net/rds/iw.h
new file mode 100644
index 0000000..0ddda34
--- /dev/null
+++ b/net/rds/iw.h
@@ -0,0 +1,395 @@
+#ifndef _RDS_IW_H
+#define _RDS_IW_H
+
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include "rds.h"
+#include "rdma_transport.h"
+
+#define RDS_FASTREG_SIZE		20
+#define RDS_FASTREG_POOL_SIZE		2048
+
+#define RDS_IW_MAX_SGE			8
+#define RDS_IW_RECV_SGE 		2
+
+#define RDS_IW_DEFAULT_RECV_WR		1024
+#define RDS_IW_DEFAULT_SEND_WR		256
+
+#define RDS_IW_SUPPORTED_PROTOCOLS	0x00000003	/* minor versions supported */
+
+extern struct list_head rds_iw_devices;
+
+/*
+ * IB posts RDS_FRAG_SIZE fragments of pages to the receive queues to
+ * try and minimize the amount of memory tied up in both the device and
+ * socket receive queues.
+ */
+/* page offset of the final full frag that fits in the page */
+#define RDS_PAGE_LAST_OFF (((PAGE_SIZE  / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE)
+struct rds_page_frag {
+	struct list_head	f_item;
+	struct page		*f_page;
+	unsigned long		f_offset;
+	dma_addr_t 		f_mapped;
+};
+
+struct rds_iw_incoming {
+	struct list_head	ii_frags;
+	struct rds_incoming	ii_inc;
+};
+
+struct rds_iw_connect_private {
+	/* Add new fields at the end, and don't permute existing fields. */
+	__be32			dp_saddr;
+	__be32			dp_daddr;
+	u8			dp_protocol_major;
+	u8			dp_protocol_minor;
+	__be16			dp_protocol_minor_mask; /* bitmask */
+	__be32			dp_reserved1;
+	__be64			dp_ack_seq;
+	__be32			dp_credit;		/* non-zero enables flow ctl */
+};
+
+struct rds_iw_scatterlist {
+	struct scatterlist	*list;
+	unsigned int		len;
+	int			dma_len;
+	unsigned int		dma_npages;
+	unsigned int		bytes;
+};
+
+struct rds_iw_mapping {
+	spinlock_t		m_lock;	/* protect the mapping struct */
+	struct list_head	m_list;
+	struct rds_iw_mr	*m_mr;
+	uint32_t		m_rkey;
+	struct rds_iw_scatterlist m_sg;
+};
+
+struct rds_iw_send_work {
+	struct rds_message	*s_rm;
+
+	/* We should really put these into a union: */
+	struct rds_rdma_op	*s_op;
+	struct rds_iw_mapping	*s_mapping;
+	struct ib_mr		*s_mr;
+	struct ib_fast_reg_page_list *s_page_list;
+	unsigned char		s_remap_count;
+
+	struct ib_send_wr	s_wr;
+	struct ib_sge		s_sge[RDS_IW_MAX_SGE];
+	unsigned long		s_queued;
+};
+
+struct rds_iw_recv_work {
+	struct rds_iw_incoming 	*r_iwinc;
+	struct rds_page_frag	*r_frag;
+	struct ib_recv_wr	r_wr;
+	struct ib_sge		r_sge[2];
+};
+
+struct rds_iw_work_ring {
+	u32		w_nr;
+	u32		w_alloc_ptr;
+	u32		w_alloc_ctr;
+	u32		w_free_ptr;
+	atomic_t	w_free_ctr;
+};
+
+struct rds_iw_device;
+
+struct rds_iw_connection {
+
+	struct list_head	iw_node;
+	struct rds_iw_device 	*rds_iwdev;
+	struct rds_connection	*conn;
+
+	/* alphabet soup, IBTA style */
+	struct rdma_cm_id	*i_cm_id;
+	struct ib_pd		*i_pd;
+	struct ib_mr		*i_mr;
+	struct ib_cq		*i_send_cq;
+	struct ib_cq		*i_recv_cq;
+
+	/* tx */
+	struct rds_iw_work_ring	i_send_ring;
+	struct rds_message	*i_rm;
+	struct rds_header	*i_send_hdrs;
+	u64			i_send_hdrs_dma;
+	struct rds_iw_send_work *i_sends;
+
+	/* rx */
+	struct mutex		i_recv_mutex;
+	struct rds_iw_work_ring	i_recv_ring;
+	struct rds_iw_incoming	*i_iwinc;
+	u32			i_recv_data_rem;
+	struct rds_header	*i_recv_hdrs;
+	u64			i_recv_hdrs_dma;
+	struct rds_iw_recv_work *i_recvs;
+	struct rds_page_frag	i_frag;
+	u64			i_ack_recv;	/* last ACK received */
+
+	/* sending acks */
+	unsigned long		i_ack_flags;
+	u64			i_ack_next;	/* next ACK to send */
+	struct rds_header	*i_ack;
+	struct ib_send_wr	i_ack_wr;
+	struct ib_sge		i_ack_sge;
+	u64			i_ack_dma;
+	unsigned long		i_ack_queued;
+
+	/* Flow control related information
+	 *
+	 * Our algorithm uses a pair of variables that we need to access
+	 * atomically - one for the send credits, and one for the posted
+	 * recv credits we need to transfer to the remote.
+	 * Rather than protect them using a slow spinlock, we put both into
+	 * a single atomic_t and update it using cmpxchg.
+	 */
+	atomic_t		i_credits;
+
+	/* Protocol version specific information */
+	unsigned int		i_flowctl:1;	/* enable/disable flow ctl */
+	unsigned int		i_dma_local_lkey:1;
+	unsigned int		i_fastreg_posted:1; /* fastreg posted on this connection */
+	/* Batched completions */
+	unsigned int		i_unsignaled_wrs;
+	long			i_unsignaled_bytes;
+};
+
+/* This assumes that atomic_t is at least 32 bits */
+#define IB_GET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_GET_POST_CREDITS(v)	((v) >> 16)
+#define IB_SET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_SET_POST_CREDITS(v)	((v) << 16)
+
+struct rds_iw_cm_id {
+	struct list_head	list;
+	struct rdma_cm_id	*cm_id;
+};
+
+struct rds_iw_device {
+	struct list_head	list;
+	struct list_head	cm_id_list;
+	struct list_head	conn_list;
+	struct ib_device	*dev;
+	struct ib_pd		*pd;
+	struct ib_mr		*mr;
+	struct rds_iw_mr_pool	*mr_pool;
+	int			page_shift;
+	int			max_sge;
+	unsigned int		max_wrs;
+	unsigned int		dma_local_lkey:1;
+	spinlock_t		spinlock;	/* protect the above */
+};
+
+/* bits for i_ack_flags */
+#define IB_ACK_IN_FLIGHT	0
+#define IB_ACK_REQUESTED	1
+
+/* Magic WR_ID for ACKs */
+#define RDS_IW_ACK_WR_ID	((u64)0xffffffffffffffffULL)
+#define RDS_IW_FAST_REG_WR_ID	((u64)0xefefefefefefefefULL)
+#define RDS_IW_LOCAL_INV_WR_ID	((u64)0xdfdfdfdfdfdfdfdfULL)
+
+struct rds_iw_statistics {
+	uint64_t	s_iw_connect_raced;
+	uint64_t	s_iw_listen_closed_stale;
+	uint64_t	s_iw_tx_cq_call;
+	uint64_t	s_iw_tx_cq_event;
+	uint64_t	s_iw_tx_ring_full;
+	uint64_t	s_iw_tx_throttle;
+	uint64_t	s_iw_tx_sg_mapping_failure;
+	uint64_t	s_iw_tx_stalled;
+	uint64_t	s_iw_tx_credit_updates;
+	uint64_t	s_iw_rx_cq_call;
+	uint64_t	s_iw_rx_cq_event;
+	uint64_t	s_iw_rx_ring_empty;
+	uint64_t	s_iw_rx_refill_from_cq;
+	uint64_t	s_iw_rx_refill_from_thread;
+	uint64_t	s_iw_rx_alloc_limit;
+	uint64_t	s_iw_rx_credit_updates;
+	uint64_t	s_iw_ack_sent;
+	uint64_t	s_iw_ack_send_failure;
+	uint64_t	s_iw_ack_send_delayed;
+	uint64_t	s_iw_ack_send_piggybacked;
+	uint64_t	s_iw_ack_received;
+	uint64_t	s_iw_rdma_mr_alloc;
+	uint64_t	s_iw_rdma_mr_free;
+	uint64_t	s_iw_rdma_mr_used;
+	uint64_t	s_iw_rdma_mr_pool_flush;
+	uint64_t	s_iw_rdma_mr_pool_wait;
+	uint64_t	s_iw_rdma_mr_pool_depleted;
+};
+
+extern struct workqueue_struct *rds_iw_wq;
+
+/*
+ * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h
+ * doesn't define it.
+ */
+static inline void rds_iw_dma_sync_sg_for_cpu(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_cpu(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_cpu	rds_iw_dma_sync_sg_for_cpu
+
+static inline void rds_iw_dma_sync_sg_for_device(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_device(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_device	rds_iw_dma_sync_sg_for_device
+
+static inline u32 rds_iw_local_dma_lkey(struct rds_iw_connection *ic)
+{
+	return ic->i_dma_local_lkey ? ic->i_cm_id->device->local_dma_lkey : ic->i_mr->lkey;
+}
+
+/* iw.c */
+extern struct rds_transport rds_iw_transport;
+extern void rds_iw_add_one(struct ib_device *device);
+extern void rds_iw_remove_one(struct ib_device *device);
+extern struct ib_client rds_iw_client;
+
+extern unsigned int fastreg_pool_size;
+extern unsigned int fastreg_message_size;
+
+extern spinlock_t iw_nodev_conns_lock;
+extern struct list_head iw_nodev_conns;
+
+/* iw_cm.c */
+int rds_iw_conn_alloc(struct rds_connection *conn, gfp_t gfp);
+void rds_iw_conn_free(void *arg);
+int rds_iw_conn_connect(struct rds_connection *conn);
+void rds_iw_conn_shutdown(struct rds_connection *conn);
+void rds_iw_state_change(struct sock *sk);
+int __init rds_iw_listen_init(void);
+void rds_iw_listen_stop(void);
+void __rds_iw_conn_error(struct rds_connection *conn, const char *, ...);
+int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id,
+			     struct rdma_cm_event *event);
+int rds_iw_cm_initiate_connect(struct rdma_cm_id *cm_id);
+void rds_iw_cm_connect_complete(struct rds_connection *conn,
+				struct rdma_cm_event *event);
+
+
+#define rds_iw_conn_error(conn, fmt...) \
+	__rds_iw_conn_error(conn, KERN_WARNING "RDS/IW: " fmt)
+
+/* iw_rdma.c */
+int rds_iw_update_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id);
+int rds_iw_add_conn(struct rds_iw_device *rds_iwdev, struct rds_connection *conn);
+void rds_iw_remove_nodev_conns(void);
+void rds_iw_remove_conns(struct rds_iw_device *rds_iwdev);
+struct rds_iw_mr_pool *rds_iw_create_mr_pool(struct rds_iw_device *);
+void rds_iw_get_mr_info(struct rds_iw_device *rds_iwdev, struct rds_info_rdma_connection *iinfo);
+void rds_iw_destroy_mr_pool(struct rds_iw_mr_pool *);
+void *rds_iw_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret);
+void rds_iw_sync_mr(void *trans_private, int dir);
+void rds_iw_free_mr(void *trans_private, int invalidate);
+void rds_iw_flush_mrs(void);
+void rds_iw_remove_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id);
+
+/* iw_recv.c */
+int __init rds_iw_recv_init(void);
+void rds_iw_recv_exit(void);
+int rds_iw_recv(struct rds_connection *conn);
+int rds_iw_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill);
+void rds_iw_inc_purge(struct rds_incoming *inc);
+void rds_iw_inc_free(struct rds_incoming *inc);
+int rds_iw_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov,
+			     size_t size);
+void rds_iw_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_iw_recv_init_ring(struct rds_iw_connection *ic);
+void rds_iw_recv_clear_ring(struct rds_iw_connection *ic);
+void rds_iw_recv_init_ack(struct rds_iw_connection *ic);
+void rds_iw_attempt_ack(struct rds_iw_connection *ic);
+void rds_iw_ack_send_complete(struct rds_iw_connection *ic);
+u64 rds_iw_piggyb_ack(struct rds_iw_connection *ic);
+
+/* iw_ring.c */
+void rds_iw_ring_init(struct rds_iw_work_ring *ring, u32 nr);
+void rds_iw_ring_resize(struct rds_iw_work_ring *ring, u32 nr);
+u32 rds_iw_ring_alloc(struct rds_iw_work_ring *ring, u32 val, u32 *pos);
+void rds_iw_ring_free(struct rds_iw_work_ring *ring, u32 val);
+void rds_iw_ring_unalloc(struct rds_iw_work_ring *ring, u32 val);
+int rds_iw_ring_empty(struct rds_iw_work_ring *ring);
+int rds_iw_ring_low(struct rds_iw_work_ring *ring);
+u32 rds_iw_ring_oldest(struct rds_iw_work_ring *ring);
+u32 rds_iw_ring_completed(struct rds_iw_work_ring *ring, u32 wr_id, u32 oldest);
+extern wait_queue_head_t rds_iw_ring_empty_wait;
+
+/* iw_send.c */
+void rds_iw_xmit_complete(struct rds_connection *conn);
+int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off);
+void rds_iw_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_iw_send_init_ring(struct rds_iw_connection *ic);
+void rds_iw_send_clear_ring(struct rds_iw_connection *ic);
+int rds_iw_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op);
+void rds_iw_send_add_credits(struct rds_connection *conn, unsigned int credits);
+void rds_iw_advertise_credits(struct rds_connection *conn, unsigned int posted);
+int rds_iw_send_grab_credits(struct rds_iw_connection *ic, u32 wanted,
+			     u32 *adv_credits, int need_posted);
+
+/* iw_stats.c */
+DECLARE_PER_CPU(struct rds_iw_statistics, rds_iw_stats);
+#define rds_iw_stats_inc(member) rds_stats_inc_which(rds_iw_stats, member)
+unsigned int rds_iw_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail);
+
+/* iw_sysctl.c */
+int __init rds_iw_sysctl_init(void);
+void rds_iw_sysctl_exit(void);
+extern unsigned long rds_iw_sysctl_max_send_wr;
+extern unsigned long rds_iw_sysctl_max_recv_wr;
+extern unsigned long rds_iw_sysctl_max_unsig_wrs;
+extern unsigned long rds_iw_sysctl_max_unsig_bytes;
+extern unsigned long rds_iw_sysctl_max_recv_allocation;
+extern unsigned int rds_iw_sysctl_flow_control;
+extern ctl_table rds_iw_sysctl_table[];
+
+/*
+ * Helper functions for getting/setting the header and data SGEs in
+ * RDS packets (not RDMA)
+ */
+static inline struct ib_sge *
+rds_iw_header_sge(struct rds_iw_connection *ic, struct ib_sge *sge)
+{
+	return &sge[0];
+}
+
+static inline struct ib_sge *
+rds_iw_data_sge(struct rds_iw_connection *ic, struct ib_sge *sge)
+{
+	return &sge[1];
+}
+
+static inline void rds_iw_set_64bit(u64 *ptr, u64 val)
+{
+#if BITS_PER_LONG == 64
+	*ptr = val;
+#else
+	set_64bit(ptr, val);
+#endif
+}
+
+#endif
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
new file mode 100644
index 0000000..57ecb3d
--- /dev/null
+++ b/net/rds/iw_cm.c
@@ -0,0 +1,750 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/vmalloc.h>
+
+#include "rds.h"
+#include "iw.h"
+
+/*
+ * Set the selected protocol version
+ */
+static void rds_iw_set_protocol(struct rds_connection *conn, unsigned int version)
+{
+	conn->c_version = version;
+}
+
+/*
+ * Set up flow control
+ */
+static void rds_iw_set_flow_control(struct rds_connection *conn, u32 credits)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	if (rds_iw_sysctl_flow_control && credits != 0) {
+		/* We're doing flow control */
+		ic->i_flowctl = 1;
+		rds_iw_send_add_credits(conn, credits);
+	} else {
+		ic->i_flowctl = 0;
+	}
+}
+
+/*
+ * Connection established.
+ * We get here for both outgoing and incoming connections.
+ */
+void rds_iw_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event)
+{
+	const struct rds_iw_connect_private *dp = NULL;
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rds_iw_device *rds_iwdev;
+	int err;
+
+	if (event->param.conn.private_data_len) {
+		dp = event->param.conn.private_data;
+
+		rds_iw_set_protocol(conn,
+				RDS_PROTOCOL(dp->dp_protocol_major,
+					dp->dp_protocol_minor));
+		rds_iw_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+	}
+
+	/* update ib_device with this local ipaddr & conn */
+	rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client);
+	err = rds_iw_update_cm_id(rds_iwdev, ic->i_cm_id);
+	if (err)
+		printk(KERN_ERR "rds_iw_update_cm_id failed (%d)\n", err);
+	err = rds_iw_add_conn(rds_iwdev, conn);
+	if (err)
+		printk(KERN_ERR "rds_iw_add_conn failed (%d)\n", err);
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp && dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	printk(KERN_NOTICE "RDS/IW: connected to %pI4<->%pI4 version %u.%u%s\n",
+			&conn->c_laddr, &conn->c_faddr,
+			RDS_PROTOCOL_MAJOR(conn->c_version),
+			RDS_PROTOCOL_MINOR(conn->c_version),
+			ic->i_flowctl ? ", flow control" : "");
+
+	rds_connect_complete(conn);
+}
+
+static void rds_iw_cm_fill_conn_param(struct rds_connection *conn,
+			struct rdma_conn_param *conn_param,
+			struct rds_iw_connect_private *dp,
+			u32 protocol_version)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	memset(conn_param, 0, sizeof(struct rdma_conn_param));
+	/* XXX tune these? */
+	conn_param->responder_resources = 1;
+	conn_param->initiator_depth = 1;
+
+	if (dp) {
+		memset(dp, 0, sizeof(*dp));
+		dp->dp_saddr = conn->c_laddr;
+		dp->dp_daddr = conn->c_faddr;
+		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
+		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
+		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IW_SUPPORTED_PROTOCOLS);
+		dp->dp_ack_seq = rds_iw_piggyb_ack(ic);
+
+		/* Advertise flow control */
+		if (ic->i_flowctl) {
+			unsigned int credits;
+
+			credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits));
+			dp->dp_credit = cpu_to_be32(credits);
+			atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits);
+		}
+
+		conn_param->private_data = dp;
+		conn_param->private_data_len = sizeof(*dp);
+	}
+}
+
+static void rds_iw_cq_event_handler(struct ib_event *event, void *data)
+{
+	rdsdebug("event %u data %p\n", event->event, data);
+}
+
+static void rds_iw_qp_event_handler(struct ib_event *event, void *data)
+{
+	struct rds_connection *conn = data;
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event);
+
+	switch (event->event) {
+	case IB_EVENT_COMM_EST:
+		rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST);
+		break;
+	case IB_EVENT_QP_REQ_ERR:
+	case IB_EVENT_QP_FATAL:
+	default:
+		rds_iw_conn_error(conn, "Fatal QP Event %u - connection %pI4->%pI4...reconnecting\n",
+			event->event, &conn->c_laddr,
+			&conn->c_faddr);
+		break;
+	}
+}
+
+/*
+ * Create a QP
+ */
+static int rds_iw_init_qp_attrs(struct ib_qp_init_attr *attr,
+		struct rds_iw_device *rds_iwdev,
+		struct rds_iw_work_ring *send_ring,
+		void (*send_cq_handler)(struct ib_cq *, void *),
+		struct rds_iw_work_ring *recv_ring,
+		void (*recv_cq_handler)(struct ib_cq *, void *),
+		void *context)
+{
+	struct ib_device *dev = rds_iwdev->dev;
+	unsigned int send_size, recv_size;
+	int ret;
+
+	/* The offset of 1 is to accommodate the additional ACK WR. */
+	send_size = min_t(unsigned int, rds_iwdev->max_wrs, rds_iw_sysctl_max_send_wr + 1);
+	recv_size = min_t(unsigned int, rds_iwdev->max_wrs, rds_iw_sysctl_max_recv_wr + 1);
+	rds_iw_ring_resize(send_ring, send_size - 1);
+	rds_iw_ring_resize(recv_ring, recv_size - 1);
+
+	memset(attr, 0, sizeof(*attr));
+	attr->event_handler = rds_iw_qp_event_handler;
+	attr->qp_context = context;
+	attr->cap.max_send_wr = send_size;
+	attr->cap.max_recv_wr = recv_size;
+	attr->cap.max_send_sge = rds_iwdev->max_sge;
+	attr->cap.max_recv_sge = RDS_IW_RECV_SGE;
+	attr->sq_sig_type = IB_SIGNAL_REQ_WR;
+	attr->qp_type = IB_QPT_RC;
+
+	attr->send_cq = ib_create_cq(dev, send_cq_handler,
+				     rds_iw_cq_event_handler,
+				     context, send_size, 0);
+	if (IS_ERR(attr->send_cq)) {
+		ret = PTR_ERR(attr->send_cq);
+		attr->send_cq = NULL;
+		rdsdebug("ib_create_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	attr->recv_cq = ib_create_cq(dev, recv_cq_handler,
+				     rds_iw_cq_event_handler,
+				     context, recv_size, 0);
+	if (IS_ERR(attr->recv_cq)) {
+		ret = PTR_ERR(attr->recv_cq);
+		attr->recv_cq = NULL;
+		rdsdebug("ib_create_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(attr->send_cq, IB_CQ_NEXT_COMP);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(attr->recv_cq, IB_CQ_SOLICITED);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+out:
+	if (ret) {
+		if (attr->send_cq)
+			ib_destroy_cq(attr->send_cq);
+		if (attr->recv_cq)
+			ib_destroy_cq(attr->recv_cq);
+	}
+	return ret;
+}
+
+/*
+ * This needs to be very careful to not leave IS_ERR pointers around for
+ * cleanup to trip over.
+ */
+static int rds_iw_setup_qp(struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct ib_qp_init_attr attr;
+	struct rds_iw_device *rds_iwdev;
+	int ret;
+
+	/* rds_iw_add_one creates a rds_iw_device object per IB device,
+	 * and allocates a protection domain, memory range and MR pool
+	 * for each.  If that fails for any reason, it will not register
+	 * the rds_iwdev at all.
+	 */
+	rds_iwdev = ib_get_client_data(dev, &rds_iw_client);
+	if (rds_iwdev == NULL) {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "RDS/IW: No client_data for device %s\n",
+					dev->name);
+		return -EOPNOTSUPP;
+	}
+
+	/* Protection domain and memory range */
+	ic->i_pd = rds_iwdev->pd;
+	ic->i_mr = rds_iwdev->mr;
+
+	ret = rds_iw_init_qp_attrs(&attr, rds_iwdev,
+			&ic->i_send_ring, rds_iw_send_cq_comp_handler,
+			&ic->i_recv_ring, rds_iw_recv_cq_comp_handler,
+			conn);
+	if (ret < 0)
+		goto out;
+
+	ic->i_send_cq = attr.send_cq;
+	ic->i_recv_cq = attr.recv_cq;
+
+	/*
+	 * XXX this can fail if max_*_wr is too large?  Are we supposed
+	 * to back off until we get a value that the hardware can support?
+	 */
+	ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
+	if (ret) {
+		rdsdebug("rdma_create_qp failed: %d\n", ret);
+		goto out;
+	}
+
+	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_send_hdrs_dma, GFP_KERNEL);
+	if (ic->i_send_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent send failed\n");
+		goto out;
+	}
+
+	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_recv_hdrs_dma, GFP_KERNEL);
+	if (ic->i_recv_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent recv failed\n");
+		goto out;
+	}
+
+	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
+				       &ic->i_ack_dma, GFP_KERNEL);
+	if (ic->i_ack == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent ack failed\n");
+		goto out;
+	}
+
+	ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_iw_send_work));
+	if (ic->i_sends == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("send allocation failed\n");
+		goto out;
+	}
+	rds_iw_send_init_ring(ic);
+
+	ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_iw_recv_work));
+	if (ic->i_recvs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("recv allocation failed\n");
+		goto out;
+	}
+
+	rds_iw_recv_init_ring(ic);
+	rds_iw_recv_init_ack(ic);
+
+	/* Post receive buffers - as a side effect, this will update
+	 * the posted credit count. */
+	rds_iw_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1);
+
+	rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr,
+		 ic->i_send_cq, ic->i_recv_cq);
+
+out:
+	return ret;
+}
+
+static u32 rds_iw_protocol_compatible(const struct rds_iw_connect_private *dp)
+{
+	u16 common;
+	u32 version = 0;
+
+	/* rdma_cm private data is odd - when there is any private data in the
+	 * request, we are handed a fairly large buffer without being told the
+	 * original size. The only way to tell the difference is by looking at
+	 * the contents, which are initialized to zero.
+	 * If the protocol version fields aren't set, this is a connection attempt
+	 * from an older version. This could be 3.0 or 2.0 - we can't tell.
+	 * We really should have changed this for OFED 1.3 :-( */
+	if (dp->dp_protocol_major == 0)
+		return RDS_PROTOCOL_3_0;
+
+	common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IW_SUPPORTED_PROTOCOLS;
+	if (dp->dp_protocol_major == 3 && common) {
+		version = RDS_PROTOCOL_3_0;
+		while ((common >>= 1) != 0)
+			version++;
+	} else if (printk_ratelimit()) {
+		printk(KERN_NOTICE "RDS: Connection from %pI4 using "
+			"incompatible protocol version %u.%u\n",
+			&dp->dp_saddr,
+			dp->dp_protocol_major,
+			dp->dp_protocol_minor);
+	}
+	return version;
+}
+
+int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id,
+				    struct rdma_cm_event *event)
+{
+	const struct rds_iw_connect_private *dp = event->param.conn.private_data;
+	struct rds_iw_connect_private dp_rep;
+	struct rds_connection *conn = NULL;
+	struct rds_iw_connection *ic = NULL;
+	struct rdma_conn_param conn_param;
+	struct rds_iw_device *rds_iwdev;
+	u32 version;
+	int err, destroy = 1;
+
+	/* Check whether the remote protocol version matches ours. */
+	version = rds_iw_protocol_compatible(dp);
+	if (!version)
+		goto out;
+
+	rdsdebug("saddr %pI4 daddr %pI4 RDSv%u.%u\n",
+		 &dp->dp_saddr, &dp->dp_daddr,
+		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version));
+
+	conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_iw_transport,
+			       GFP_KERNEL);
+	if (IS_ERR(conn)) {
+		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
+		conn = NULL;
+		goto out;
+	}
+
+	/*
+	 * The connection request may occur while the
+	 * previous connection exists, e.g. in case of failover.
+	 * But as connections may be initiated simultaneously
+	 * by both hosts, we have a random backoff mechanism -
+	 * see the comment above rds_queue_reconnect()
+	 */
+	mutex_lock(&conn->c_cm_lock);
+	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
+		if (rds_conn_state(conn) == RDS_CONN_UP) {
+			rdsdebug("incoming connect while connection is up\n");
+			rds_conn_drop(conn);
+			rds_iw_stats_inc(s_iw_listen_closed_stale);
+		} else
+		if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
+			/* Wait and see - our connect may still be succeeding */
+			rds_iw_stats_inc(s_iw_connect_raced);
+		}
+		mutex_unlock(&conn->c_cm_lock);
+		goto out;
+	}
+
+	ic = conn->c_transport_data;
+
+	rds_iw_set_protocol(conn, version);
+	rds_iw_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	BUG_ON(cm_id->context);
+	BUG_ON(ic->i_cm_id);
+
+	ic->i_cm_id = cm_id;
+	cm_id->context = conn;
+
+	rds_iwdev = ib_get_client_data(cm_id->device, &rds_iw_client);
+	ic->i_dma_local_lkey = rds_iwdev->dma_local_lkey;
+
+	/* We got halfway through setting up the ib_connection, if we
+	 * fail now, we have to take the long route out of this mess. */
+	destroy = 0;
+
+	err = rds_iw_setup_qp(conn);
+	if (err) {
+		rds_iw_conn_error(conn, "rds_iw_setup_qp failed (%d)\n", err);
+		goto out;
+	}
+
+	rds_iw_cm_fill_conn_param(conn, &conn_param, &dp_rep, version);
+
+	/* rdma_accept() calls rdma_reject() internally if it fails */
+	err = rdma_accept(cm_id, &conn_param);
+	mutex_unlock(&conn->c_cm_lock);
+	if (err) {
+		rds_iw_conn_error(conn, "rdma_accept failed (%d)\n", err);
+		goto out;
+	}
+
+	return 0;
+
+out:
+	rdma_reject(cm_id, NULL, 0);
+	return destroy;
+}
+
+
+int rds_iw_cm_initiate_connect(struct rdma_cm_id *cm_id)
+{
+	struct rds_connection *conn = cm_id->context;
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rdma_conn_param conn_param;
+	struct rds_iw_connect_private dp;
+	int ret;
+
+	/* If the peer doesn't do protocol negotiation, we must
+	 * default to RDSv3.0 */
+	rds_iw_set_protocol(conn, RDS_PROTOCOL_3_0);
+	ic->i_flowctl = rds_iw_sysctl_flow_control;	/* advertise flow control */
+
+	ret = rds_iw_setup_qp(conn);
+	if (ret) {
+		rds_iw_conn_error(conn, "rds_iw_setup_qp failed (%d)\n", ret);
+		goto out;
+	}
+
+	rds_iw_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION);
+
+	ret = rdma_connect(cm_id, &conn_param);
+	if (ret)
+		rds_iw_conn_error(conn, "rdma_connect failed (%d)\n", ret);
+
+out:
+	/* Beware - returning non-zero tells the rdma_cm to destroy
+	 * the cm_id. We should certainly not do it as long as we still
+	 * "own" the cm_id. */
+	if (ret) {
+		struct rds_iw_connection *ic = conn->c_transport_data;
+
+		if (ic->i_cm_id == cm_id)
+			ret = 0;
+	}
+	return ret;
+}
+
+int rds_iw_conn_connect(struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rds_iw_device *rds_iwdev;
+	struct sockaddr_in src, dest;
+	int ret;
+
+	/* XXX I wonder what effect the port space has */
+	/* delegate cm event handler to rdma_transport */
+	ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn,
+				     RDMA_PS_TCP);
+	if (IS_ERR(ic->i_cm_id)) {
+		ret = PTR_ERR(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+		rdsdebug("rdma_create_id() failed: %d\n", ret);
+		goto out;
+	}
+
+	rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn);
+
+	src.sin_family = AF_INET;
+	src.sin_addr.s_addr = (__force u32)conn->c_laddr;
+	src.sin_port = (__force u16)htons(0);
+
+	/* First, bind to the local address and device. */
+	ret = rdma_bind_addr(ic->i_cm_id, (struct sockaddr *) &src);
+	if (ret) {
+		rdsdebug("rdma_bind_addr(%pI4) failed: %d\n",
+				&conn->c_laddr, ret);
+		rdma_destroy_id(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+		goto out;
+	}
+
+	rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client);
+	ic->i_dma_local_lkey = rds_iwdev->dma_local_lkey;
+
+	dest.sin_family = AF_INET;
+	dest.sin_addr.s_addr = (__force u32)conn->c_faddr;
+	dest.sin_port = (__force u16)htons(RDS_PORT);
+
+	ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src,
+				(struct sockaddr *)&dest,
+				RDS_RDMA_RESOLVE_TIMEOUT_MS);
+	if (ret) {
+		rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id,
+			 ret);
+		rdma_destroy_id(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * This is so careful about only cleaning up resources that were built up
+ * so that it can be called at any point during startup.  In fact it
+ * can be called multiple times for a given connection.
+ */
+void rds_iw_conn_shutdown(struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	int err = 0;
+	struct ib_qp_attr qp_attr;
+
+	rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id,
+		 ic->i_pd, ic->i_send_cq, ic->i_recv_cq,
+		 ic->i_cm_id ? ic->i_cm_id->qp : NULL);
+
+	if (ic->i_cm_id) {
+		struct ib_device *dev = ic->i_cm_id->device;
+
+		rdsdebug("disconnecting cm %p\n", ic->i_cm_id);
+		err = rdma_disconnect(ic->i_cm_id);
+		if (err) {
+			/* Actually this may happen quite frequently, when
+			 * an outgoing connect raced with an incoming connect.
+			 */
+			rdsdebug("rds_iw_conn_shutdown: failed to disconnect,"
+				   " cm: %p err %d\n", ic->i_cm_id, err);
+		}
+
+		if (ic->i_cm_id->qp) {
+			qp_attr.qp_state = IB_QPS_ERR;
+			ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE);
+		}
+
+		wait_event(rds_iw_ring_empty_wait,
+			rds_iw_ring_empty(&ic->i_send_ring) &&
+			rds_iw_ring_empty(&ic->i_recv_ring));
+
+		if (ic->i_send_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_send_hdrs,
+					   ic->i_send_hdrs_dma);
+
+		if (ic->i_recv_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_recv_hdrs,
+					   ic->i_recv_hdrs_dma);
+
+		if (ic->i_ack)
+			ib_dma_free_coherent(dev, sizeof(struct rds_header),
+					     ic->i_ack, ic->i_ack_dma);
+
+		if (ic->i_sends)
+			rds_iw_send_clear_ring(ic);
+		if (ic->i_recvs)
+			rds_iw_recv_clear_ring(ic);
+
+		if (ic->i_cm_id->qp)
+			rdma_destroy_qp(ic->i_cm_id);
+		if (ic->i_send_cq)
+			ib_destroy_cq(ic->i_send_cq);
+		if (ic->i_recv_cq)
+			ib_destroy_cq(ic->i_recv_cq);
+
+		/*
+		 * If associated with an rds_iw_device:
+		 * 	Move connection back to the nodev list.
+		 * 	Remove cm_id from the device cm_id list.
+		 */
+		if (ic->rds_iwdev) {
+
+			spin_lock_irq(&ic->rds_iwdev->spinlock);
+			BUG_ON(list_empty(&ic->iw_node));
+			list_del(&ic->iw_node);
+			spin_unlock_irq(&ic->rds_iwdev->spinlock);
+
+			spin_lock_irq(&iw_nodev_conns_lock);
+			list_add_tail(&ic->iw_node, &iw_nodev_conns);
+			spin_unlock_irq(&iw_nodev_conns_lock);
+			rds_iw_remove_cm_id(ic->rds_iwdev, ic->i_cm_id);
+			ic->rds_iwdev = NULL;
+		}
+
+		rdma_destroy_id(ic->i_cm_id);
+
+		ic->i_cm_id = NULL;
+		ic->i_pd = NULL;
+		ic->i_mr = NULL;
+		ic->i_send_cq = NULL;
+		ic->i_recv_cq = NULL;
+		ic->i_send_hdrs = NULL;
+		ic->i_recv_hdrs = NULL;
+		ic->i_ack = NULL;
+	}
+	BUG_ON(ic->rds_iwdev);
+
+	/* Clear pending transmit */
+	if (ic->i_rm) {
+		rds_message_put(ic->i_rm);
+		ic->i_rm = NULL;
+	}
+
+	/* Clear the ACK state */
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+	rds_iw_set_64bit(&ic->i_ack_next, 0);
+	ic->i_ack_recv = 0;
+
+	/* Clear flow control state */
+	ic->i_flowctl = 0;
+	atomic_set(&ic->i_credits, 0);
+
+	rds_iw_ring_init(&ic->i_send_ring, rds_iw_sysctl_max_send_wr);
+	rds_iw_ring_init(&ic->i_recv_ring, rds_iw_sysctl_max_recv_wr);
+
+	if (ic->i_iwinc) {
+		rds_inc_put(&ic->i_iwinc->ii_inc);
+		ic->i_iwinc = NULL;
+	}
+
+	vfree(ic->i_sends);
+	ic->i_sends = NULL;
+	vfree(ic->i_recvs);
+	ic->i_recvs = NULL;
+	rdsdebug("shutdown complete\n");
+}
+
+int rds_iw_conn_alloc(struct rds_connection *conn, gfp_t gfp)
+{
+	struct rds_iw_connection *ic;
+	unsigned long flags;
+
+	/* XXX too lazy? */
+	ic = kzalloc(sizeof(struct rds_iw_connection), GFP_KERNEL);
+	if (ic == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&ic->iw_node);
+	mutex_init(&ic->i_recv_mutex);
+
+	/*
+	 * rds_iw_conn_shutdown() waits for these to be emptied so they
+	 * must be initialized before it can be called.
+	 */
+	rds_iw_ring_init(&ic->i_send_ring, rds_iw_sysctl_max_send_wr);
+	rds_iw_ring_init(&ic->i_recv_ring, rds_iw_sysctl_max_recv_wr);
+
+	ic->conn = conn;
+	conn->c_transport_data = ic;
+
+	spin_lock_irqsave(&iw_nodev_conns_lock, flags);
+	list_add_tail(&ic->iw_node, &iw_nodev_conns);
+	spin_unlock_irqrestore(&iw_nodev_conns_lock, flags);
+
+
+	rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data);
+	return 0;
+}
+
+void rds_iw_conn_free(void *arg)
+{
+	struct rds_iw_connection *ic = arg;
+	rdsdebug("ic %p\n", ic);
+	list_del(&ic->iw_node);
+	kfree(ic);
+}
+
+/*
+ * An error occurred on the connection
+ */
+void
+__rds_iw_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{
+	va_list ap;
+
+	rds_conn_drop(conn);
+
+	va_start(ap, fmt);
+	vprintk(fmt, ap);
+	va_end(ap);
+}
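
For readers following the version negotiation in rds_iw_protocol_compatible()
above, the bitmask walk can be sketched as a standalone userspace function.
This is only an illustration under stated assumptions: the name negotiate(),
the value 0x0300 for RDS 3.0 and the supported-minor mask 0x0003 are assumed
here, since the real definitions live in headers not shown in this part of the
patch, and the be16_to_cpu() conversion of the mask is left out.

```c
#include <stdint.h>

/* Assumed values for illustration only; not taken from this patch. */
#define PROTO_3_0        0x0300u  /* major 3, minor 0 */
#define SUPPORTED_MINORS 0x0003u  /* bit n set => minor version n supported */

/*
 * Hypothetical stand-in for rds_iw_protocol_compatible().
 * minor_mask is already in host byte order.
 * Returns the negotiated version, or 0 if the peer is incompatible.
 */
static uint32_t negotiate(uint8_t major, uint16_t minor_mask)
{
	uint16_t common;
	uint32_t version = 0;

	/* Version fields left at zero: treat the peer as a legacy 3.0 sender. */
	if (major == 0)
		return PROTO_3_0;

	/* Keep only the minor versions both sides support. */
	common = minor_mask & SUPPORTED_MINORS;
	if (major == 3 && common) {
		/* The position of the highest common bit is the minor version. */
		version = PROTO_3_0;
		while ((common >>= 1) != 0)
			version++;
	}
	return version;
}
```

With these assumed values, a peer advertising mask 0x0003 negotiates 0x0301,
while a mask with no overlapping bit yields 0 and the connect is rejected.
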
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
new file mode 100644
index 0000000..1c02a8f
--- /dev/null
+++ b/net/rds/iw_rdma.c
@@ -0,0 +1,888 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "iw.h"
+
+
+/*
+ * This is stored as mr->r_trans_private.
+ */
+struct rds_iw_mr {
+	struct rds_iw_device	*device;
+	struct rds_iw_mr_pool	*pool;
+	struct rdma_cm_id	*cm_id;
+
+	struct ib_mr	*mr;
+	struct ib_fast_reg_page_list *page_list;
+
+	struct rds_iw_mapping	mapping;
+	unsigned char		remap_count;
+};
+
+/*
+ * Our own little MR pool
+ */
+struct rds_iw_mr_pool {
+	struct rds_iw_device	*device;		/* back ptr to the device that owns us */
+
+	struct mutex		flush_lock;		/* serialize fmr invalidate */
+	struct work_struct	flush_worker;		/* flush worker */
+
+	spinlock_t		list_lock;		/* protect variables below */
+	atomic_t		item_count;		/* total # of MRs */
+	atomic_t		dirty_count;		/* # of dirty MRs */
+	struct list_head	dirty_list;		/* dirty mappings */
+	struct list_head	clean_list;		/* unused & unmapped MRs */
+	atomic_t		free_pinned;		/* memory pinned by free MRs */
+	unsigned long		max_message_size;	/* in pages */
+	unsigned long		max_items;
+	unsigned long		max_items_soft;
+	unsigned long		max_free_pinned;
+	int			max_pages;
+};
+
+static int rds_iw_flush_mr_pool(struct rds_iw_mr_pool *pool, int free_all);
+static void rds_iw_mr_pool_flush_worker(struct work_struct *work);
+static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr);
+static int rds_iw_map_fastreg(struct rds_iw_mr_pool *pool,
+			  struct rds_iw_mr *ibmr,
+			  struct scatterlist *sg, unsigned int nents);
+static void rds_iw_free_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr);
+static unsigned int rds_iw_unmap_fastreg_list(struct rds_iw_mr_pool *pool,
+			struct list_head *unmap_list,
+			struct list_head *kill_list);
+static void rds_iw_destroy_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr);
+
+static int rds_iw_get_device(struct rds_sock *rs, struct rds_iw_device **rds_iwdev, struct rdma_cm_id **cm_id)
+{
+	struct rds_iw_device *iwdev;
+	struct rds_iw_cm_id *i_cm_id;
+
+	*rds_iwdev = NULL;
+	*cm_id = NULL;
+
+	list_for_each_entry(iwdev, &rds_iw_devices, list) {
+		spin_lock_irq(&iwdev->spinlock);
+		list_for_each_entry(i_cm_id, &iwdev->cm_id_list, list) {
+			struct sockaddr_in *src_addr, *dst_addr;
+
+			src_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.src_addr;
+			dst_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.dst_addr;
+
+			rdsdebug("local ipaddr = %x port %d, "
+				 "remote ipaddr = %x port %d"
+				 "..looking for %x port %d, "
+				 "remote ipaddr = %x port %d\n",
+				src_addr->sin_addr.s_addr,
+				src_addr->sin_port,
+				dst_addr->sin_addr.s_addr,
+				dst_addr->sin_port,
+				rs->rs_bound_addr,
+				rs->rs_bound_port,
+				rs->rs_conn_addr,
+				rs->rs_conn_port);
+#ifdef WORKING_TUPLE_DETECTION
+			if (src_addr->sin_addr.s_addr == rs->rs_bound_addr &&
+			    src_addr->sin_port == rs->rs_bound_port &&
+			    dst_addr->sin_addr.s_addr == rs->rs_conn_addr &&
+			    dst_addr->sin_port == rs->rs_conn_port) {
+#else
+			/* FIXME - needs to compare the local and remote
+			 * ipaddr/port tuple, but the ipaddr is the only
+			 * available information in the rds_sock (as the rest are
+			 * zero'ed).  It doesn't appear to be properly populated
+			 * during connection setup...
+			 */
+			if (src_addr->sin_addr.s_addr == rs->rs_bound_addr) {
+#endif
+				spin_unlock_irq(&iwdev->spinlock);
+				*rds_iwdev = iwdev;
+				*cm_id = i_cm_id->cm_id;
+				return 0;
+			}
+		}
+		spin_unlock_irq(&iwdev->spinlock);
+	}
+
+	return 1;
+}
+
+static int rds_iw_add_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id)
+{
+	struct rds_iw_cm_id *i_cm_id;
+
+	i_cm_id = kmalloc(sizeof *i_cm_id, GFP_KERNEL);
+	if (!i_cm_id)
+		return -ENOMEM;
+
+	i_cm_id->cm_id = cm_id;
+
+	spin_lock_irq(&rds_iwdev->spinlock);
+	list_add_tail(&i_cm_id->list, &rds_iwdev->cm_id_list);
+	spin_unlock_irq(&rds_iwdev->spinlock);
+
+	return 0;
+}
+
+void rds_iw_remove_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id)
+{
+	struct rds_iw_cm_id *i_cm_id;
+
+	spin_lock_irq(&rds_iwdev->spinlock);
+	list_for_each_entry(i_cm_id, &rds_iwdev->cm_id_list, list) {
+		if (i_cm_id->cm_id == cm_id) {
+			list_del(&i_cm_id->list);
+			kfree(i_cm_id);
+			break;
+		}
+	}
+	spin_unlock_irq(&rds_iwdev->spinlock);
+}
+
+
+int rds_iw_update_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id)
+{
+	struct sockaddr_in *src_addr, *dst_addr;
+	struct rds_iw_device *rds_iwdev_old;
+	struct rds_sock rs;
+	struct rdma_cm_id *pcm_id;
+	int rc;
+
+	src_addr = (struct sockaddr_in *)&cm_id->route.addr.src_addr;
+	dst_addr = (struct sockaddr_in *)&cm_id->route.addr.dst_addr;
+
+	rs.rs_bound_addr = src_addr->sin_addr.s_addr;
+	rs.rs_bound_port = src_addr->sin_port;
+	rs.rs_conn_addr = dst_addr->sin_addr.s_addr;
+	rs.rs_conn_port = dst_addr->sin_port;
+
+	rc = rds_iw_get_device(&rs, &rds_iwdev_old, &pcm_id);
+	if (rc)
+		rds_iw_remove_cm_id(rds_iwdev, cm_id);
+
+	return rds_iw_add_cm_id(rds_iwdev, cm_id);
+}
+
+int rds_iw_add_conn(struct rds_iw_device *rds_iwdev, struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	/* conn was previously on the nodev_conns_list */
+	spin_lock_irq(&iw_nodev_conns_lock);
+	BUG_ON(list_empty(&iw_nodev_conns));
+	BUG_ON(list_empty(&ic->iw_node));
+	list_del(&ic->iw_node);
+	spin_unlock_irq(&iw_nodev_conns_lock);
+
+	spin_lock_irq(&rds_iwdev->spinlock);
+	list_add_tail(&ic->iw_node, &rds_iwdev->conn_list);
+	spin_unlock_irq(&rds_iwdev->spinlock);
+
+	ic->rds_iwdev = rds_iwdev;
+
+	return 0;
+}
+
+void rds_iw_remove_nodev_conns(void)
+{
+	struct rds_iw_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&iw_nodev_conns_lock);
+	list_splice(&iw_nodev_conns, &tmp_list);
+	INIT_LIST_HEAD(&iw_nodev_conns);
+	spin_unlock_irq(&iw_nodev_conns_lock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, iw_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+void rds_iw_remove_conns(struct rds_iw_device *rds_iwdev)
+{
+	struct rds_iw_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&rds_iwdev->spinlock);
+	list_splice(&rds_iwdev->conn_list, &tmp_list);
+	INIT_LIST_HEAD(&rds_iwdev->conn_list);
+	spin_unlock_irq(&rds_iwdev->spinlock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, iw_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+static void rds_iw_set_scatterlist(struct rds_iw_scatterlist *sg,
+		struct scatterlist *list, unsigned int sg_len)
+{
+	sg->list = list;
+	sg->len = sg_len;
+	sg->dma_len = 0;
+	sg->dma_npages = 0;
+	sg->bytes = 0;
+}
+
+static u64 *rds_iw_map_scatterlist(struct rds_iw_device *rds_iwdev,
+			struct rds_iw_scatterlist *sg,
+			unsigned int dma_page_shift)
+{
+	struct ib_device *dev = rds_iwdev->dev;
+	u64 *dma_pages = NULL;
+	u64 dma_mask;
+	unsigned int dma_page_size;
+	int i, j, ret;
+
+	dma_page_size = 1 << dma_page_shift;
+	dma_mask = dma_page_size - 1;
+
+	WARN_ON(sg->dma_len);
+
+	sg->dma_len = ib_dma_map_sg(dev, sg->list, sg->len, DMA_BIDIRECTIONAL);
+	if (unlikely(!sg->dma_len)) {
+		printk(KERN_WARNING "RDS/IW: dma_map_sg failed!\n");
+		return ERR_PTR(-EBUSY);
+	}
+
+	sg->bytes = 0;
+	sg->dma_npages = 0;
+
+	ret = -EINVAL;
+	for (i = 0; i < sg->dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]);
+		u64 end_addr;
+
+		sg->bytes += dma_len;
+
+		end_addr = dma_addr + dma_len;
+		if (dma_addr & dma_mask) {
+			if (i > 0)
+				goto out_unmap;
+			dma_addr &= ~dma_mask;
+		}
+		if (end_addr & dma_mask) {
+			if (i < sg->dma_len - 1)
+				goto out_unmap;
+			end_addr = (end_addr + dma_mask) & ~dma_mask;
+		}
+
+		sg->dma_npages += (end_addr - dma_addr) >> dma_page_shift;
+	}
+
+	/* Now gather the dma addrs into one list */
+	if (sg->dma_npages > fastreg_message_size)
+		goto out_unmap;
+
+	dma_pages = kmalloc(sizeof(u64) * sg->dma_npages, GFP_ATOMIC);
+	if (!dma_pages) {
+		ret = -ENOMEM;
+		goto out_unmap;
+	}
+
+	for (i = j = 0; i < sg->dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]);
+		u64 end_addr;
+
+		end_addr = dma_addr + dma_len;
+		dma_addr &= ~dma_mask;
+		for (; dma_addr < end_addr; dma_addr += dma_page_size)
+			dma_pages[j++] = dma_addr;
+		BUG_ON(j > sg->dma_npages);
+	}
+
+	return dma_pages;
+
+out_unmap:
+	ib_dma_unmap_sg(rds_iwdev->dev, sg->list, sg->len, DMA_BIDIRECTIONAL);
+	sg->dma_len = 0;
+	kfree(dma_pages);
+	return ERR_PTR(ret);
+}
+
+
+struct rds_iw_mr_pool *rds_iw_create_mr_pool(struct rds_iw_device *rds_iwdev)
+{
+	struct rds_iw_mr_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool) {
+		printk(KERN_WARNING "RDS/IW: rds_iw_create_mr_pool alloc error\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	pool->device = rds_iwdev;
+	INIT_LIST_HEAD(&pool->dirty_list);
+	INIT_LIST_HEAD(&pool->clean_list);
+	mutex_init(&pool->flush_lock);
+	spin_lock_init(&pool->list_lock);
+	INIT_WORK(&pool->flush_worker, rds_iw_mr_pool_flush_worker);
+
+	pool->max_message_size = fastreg_message_size;
+	pool->max_items = fastreg_pool_size;
+	pool->max_free_pinned = pool->max_items * pool->max_message_size / 4;
+	pool->max_pages = fastreg_message_size;
+
+	/* We never allow more than max_items MRs to be allocated.
+	 * When we exceed more than max_items_soft, we start freeing
+	 * items more aggressively.
+	 * Make sure that max_items > max_items_soft > max_items / 2
+	 */
+	pool->max_items_soft = pool->max_items * 3 / 4;
+
+	return pool;
+}
+
+void rds_iw_get_mr_info(struct rds_iw_device *rds_iwdev, struct rds_info_rdma_connection *iinfo)
+{
+	struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool;
+
+	iinfo->rdma_mr_max = pool->max_items;
+	iinfo->rdma_mr_size = pool->max_pages;
+}
+
+void rds_iw_destroy_mr_pool(struct rds_iw_mr_pool *pool)
+{
+	flush_workqueue(rds_wq);
+	rds_iw_flush_mr_pool(pool, 1);
+	BUG_ON(atomic_read(&pool->item_count));
+	BUG_ON(atomic_read(&pool->free_pinned));
+	kfree(pool);
+}
+
+static inline struct rds_iw_mr *rds_iw_reuse_fmr(struct rds_iw_mr_pool *pool)
+{
+	struct rds_iw_mr *ibmr = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	if (!list_empty(&pool->clean_list)) {
+		ibmr = list_entry(pool->clean_list.next, struct rds_iw_mr, mapping.m_list);
+		list_del_init(&ibmr->mapping.m_list);
+	}
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	return ibmr;
+}
+
+static struct rds_iw_mr *rds_iw_alloc_mr(struct rds_iw_device *rds_iwdev)
+{
+	struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool;
+	struct rds_iw_mr *ibmr = NULL;
+	int err = 0, iter = 0;
+
+	while (1) {
+		ibmr = rds_iw_reuse_fmr(pool);
+		if (ibmr)
+			return ibmr;
+
+		/* No clean MRs - now we have the choice of either
+		 * allocating a fresh MR up to the limit imposed by the
+		 * driver, or flushing any dirty unused MRs.
+		 * We try to avoid stalling in the send path if possible,
+		 * so we allocate as long as we're allowed to.
+		 *
+		 * We're fussy with enforcing the FMR limit, though. If the driver
+		 * tells us we can't use more than N fmrs, we shouldn't start
+		 * arguing with it */
+		if (atomic_inc_return(&pool->item_count) <= pool->max_items)
+			break;
+
+		atomic_dec(&pool->item_count);
+
+		if (++iter > 2) {
+			rds_iw_stats_inc(s_iw_rdma_mr_pool_depleted);
+			return ERR_PTR(-EAGAIN);
+		}
+
+		/* We do have some empty MRs. Flush them out. */
+		rds_iw_stats_inc(s_iw_rdma_mr_pool_wait);
+		rds_iw_flush_mr_pool(pool, 0);
+	}
+
+	ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
+	if (!ibmr) {
+		err = -ENOMEM;
+		goto out_no_cigar;
+	}
+
+	spin_lock_init(&ibmr->mapping.m_lock);
+	INIT_LIST_HEAD(&ibmr->mapping.m_list);
+	ibmr->mapping.m_mr = ibmr;
+
+	err = rds_iw_init_fastreg(pool, ibmr);
+	if (err)
+		goto out_no_cigar;
+
+	rds_iw_stats_inc(s_iw_rdma_mr_alloc);
+	return ibmr;
+
+out_no_cigar:
+	if (ibmr) {
+		rds_iw_destroy_fastreg(pool, ibmr);
+		kfree(ibmr);
+	}
+	atomic_dec(&pool->item_count);
+	return ERR_PTR(err);
+}
+
+void rds_iw_sync_mr(void *trans_private, int direction)
+{
+	struct rds_iw_mr *ibmr = trans_private;
+	struct rds_iw_device *rds_iwdev = ibmr->device;
+
+	switch (direction) {
+	case DMA_FROM_DEVICE:
+		ib_dma_sync_sg_for_cpu(rds_iwdev->dev, ibmr->mapping.m_sg.list,
+			ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL);
+		break;
+	case DMA_TO_DEVICE:
+		ib_dma_sync_sg_for_device(rds_iwdev->dev, ibmr->mapping.m_sg.list,
+			ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL);
+		break;
+	}
+}
+
+static inline unsigned int rds_iw_flush_goal(struct rds_iw_mr_pool *pool, int free_all)
+{
+	unsigned int item_count;
+
+	item_count = atomic_read(&pool->item_count);
+	if (free_all)
+		return item_count;
+
+	return 0;
+}
+
+/*
+ * Flush our pool of MRs.
+ * At a minimum, all currently unused MRs are unmapped.
+ * If the number of MRs allocated exceeds the limit, we also try
+ * to free as many MRs as needed to get back to this limit.
+ */
+static int rds_iw_flush_mr_pool(struct rds_iw_mr_pool *pool, int free_all)
+{
+	struct rds_iw_mr *ibmr, *next;
+	LIST_HEAD(unmap_list);
+	LIST_HEAD(kill_list);
+	unsigned long flags;
+	unsigned int nfreed = 0, ncleaned = 0, free_goal;
+	int ret = 0;
+
+	rds_iw_stats_inc(s_iw_rdma_mr_pool_flush);
+
+	mutex_lock(&pool->flush_lock);
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	/* Get the list of all mappings to be destroyed */
+	list_splice_init(&pool->dirty_list, &unmap_list);
+	if (free_all)
+		list_splice_init(&pool->clean_list, &kill_list);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	free_goal = rds_iw_flush_goal(pool, free_all);
+
+	/* Batched invalidate of dirty MRs.
+	 * For FMR based MRs, the mappings on the unmap list are
+	 * actually members of an ibmr (ibmr->mapping). They either
+	 * migrate to the kill_list, or have been cleaned and should be
+	 * moved to the clean_list.
+	 * For fastregs, they will be dynamically allocated, and
+	 * will be destroyed by the unmap function.
+	 */
+	if (!list_empty(&unmap_list)) {
+		ncleaned = rds_iw_unmap_fastreg_list(pool, &unmap_list, &kill_list);
+		/* If we've been asked to destroy all MRs, move those
+		 * that were simply cleaned to the kill list */
+		if (free_all)
+			list_splice_init(&unmap_list, &kill_list);
+	}
+
+	/* Destroy any MRs that are past their best before date */
+	list_for_each_entry_safe(ibmr, next, &kill_list, mapping.m_list) {
+		rds_iw_stats_inc(s_iw_rdma_mr_free);
+		list_del(&ibmr->mapping.m_list);
+		rds_iw_destroy_fastreg(pool, ibmr);
+		kfree(ibmr);
+		nfreed++;
+	}
+
+	/* Anything that remains are laundered ibmrs, which we can add
+	 * back to the clean list. */
+	if (!list_empty(&unmap_list)) {
+		spin_lock_irqsave(&pool->list_lock, flags);
+		list_splice(&unmap_list, &pool->clean_list);
+		spin_unlock_irqrestore(&pool->list_lock, flags);
+	}
+
+	atomic_sub(ncleaned, &pool->dirty_count);
+	atomic_sub(nfreed, &pool->item_count);
+
+	mutex_unlock(&pool->flush_lock);
+	return ret;
+}
+
+static void rds_iw_mr_pool_flush_worker(struct work_struct *work)
+{
+	struct rds_iw_mr_pool *pool = container_of(work, struct rds_iw_mr_pool, flush_worker);
+
+	rds_iw_flush_mr_pool(pool, 0);
+}
+
+void rds_iw_free_mr(void *trans_private, int invalidate)
+{
+	struct rds_iw_mr *ibmr = trans_private;
+	struct rds_iw_mr_pool *pool = ibmr->device->mr_pool;
+
+	rdsdebug("RDS/IW: free_mr nents %u\n", ibmr->mapping.m_sg.len);
+	if (!pool)
+		return;
+
+	/* Return it to the pool's free list */
+	rds_iw_free_fastreg(pool, ibmr);
+
+	/* If we've pinned too many pages, request a flush */
+	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned
+	 || atomic_read(&pool->dirty_count) >= pool->max_items / 10)
+		queue_work(rds_wq, &pool->flush_worker);
+
+	if (invalidate) {
+		if (likely(!in_interrupt())) {
+			rds_iw_flush_mr_pool(pool, 0);
+		} else {
+			/* We get here if the user created a MR marked
+			 * as use_once and invalidate at the same time. */
+			queue_work(rds_wq, &pool->flush_worker);
+		}
+	}
+}
+
+void rds_iw_flush_mrs(void)
+{
+	struct rds_iw_device *rds_iwdev;
+
+	list_for_each_entry(rds_iwdev, &rds_iw_devices, list) {
+		struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool;
+
+		if (pool)
+			rds_iw_flush_mr_pool(pool, 0);
+	}
+}
+
+void *rds_iw_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret)
+{
+	struct rds_iw_device *rds_iwdev;
+	struct rds_iw_mr *ibmr = NULL;
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	ret = rds_iw_get_device(rs, &rds_iwdev, &cm_id);
+	if (ret || !cm_id) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	if (!rds_iwdev->mr_pool) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ibmr = rds_iw_alloc_mr(rds_iwdev);
+	if (IS_ERR(ibmr))
+		return ibmr;
+
+	ibmr->cm_id = cm_id;
+	ibmr->device = rds_iwdev;
+
+	ret = rds_iw_map_fastreg(rds_iwdev->mr_pool, ibmr, sg, nents);
+	if (ret == 0)
+		*key_ret = ibmr->mr->rkey;
+	else
+		printk(KERN_WARNING "RDS/IW: failed to map mr (errno=%d)\n", ret);
+
+out:
+	if (ret) {
+		if (ibmr)
+			rds_iw_free_mr(ibmr, 0);
+		ibmr = ERR_PTR(ret);
+	}
+	return ibmr;
+}
+
+/*
+ * iWARP fastreg handling
+ *
+ * The life cycle of a fastreg registration is a bit different from
+ * FMRs.
+ * The idea behind fastreg is to have one MR, to which we bind different
+ * mappings over time. To avoid stalling on the expensive map and invalidate
+ * operations, these operations are pipelined on the same send queue on
+ * which we want to send the message containing the r_key.
+ *
+ * This creates a bit of a problem for us, as we do not have the destination
+ * IP in GET_MR, so the connection must be set up prior to the GET_MR call for
+ * RDMA to be correctly set up.  If a fastreg request is present, rds_iw_xmit
+ * will try to queue a LOCAL_INV (if needed) and a FAST_REG_MR work request
+ * before queuing the SEND. When completions for these arrive, they are
+ * dispatched to the MR, which has a bit set showing that RDMA can be performed.
+ *
+ * There is another interesting aspect that's related to invalidation.
+ * The application can request that a mapping is invalidated in FREE_MR.
+ * The expectation there is that this invalidation step includes ALL
+ * PREVIOUSLY FREED MRs.
+ */
+static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool,
+				struct rds_iw_mr *ibmr)
+{
+	struct rds_iw_device *rds_iwdev = pool->device;
+	struct ib_fast_reg_page_list *page_list = NULL;
+	struct ib_mr *mr;
+	int err;
+
+	mr = ib_alloc_fast_reg_mr(rds_iwdev->pd, pool->max_message_size);
+	if (IS_ERR(mr)) {
+		err = PTR_ERR(mr);
+
+		printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed (err=%d)\n", err);
+		return err;
+	}
+
+	/* FIXME - this is overkill, but mapping->m_sg.dma_len/mapping->m_sg.dma_npages
+	 * is not filled in.
+	 */
+	page_list = ib_alloc_fast_reg_page_list(rds_iwdev->dev, pool->max_message_size);
+	if (IS_ERR(page_list)) {
+		err = PTR_ERR(page_list);
+
+		printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_page_list failed (err=%d)\n", err);
+		ib_dereg_mr(mr);
+		return err;
+	}
+
+	ibmr->page_list = page_list;
+	ibmr->mr = mr;
+	return 0;
+}
+
+static int rds_iw_rdma_build_fastreg(struct rds_iw_mapping *mapping)
+{
+	struct rds_iw_mr *ibmr = mapping->m_mr;
+	struct ib_send_wr f_wr, *failed_wr;
+	int ret;
+
+	/*
+	 * Perform a WR for the fast_reg_mr. Each individual page
+	 * in the sg list is added to the fast reg page list and placed
+	 * inside the fast_reg_mr WR.  The key used is a rolling 8bit
+	 * counter, which should guarantee uniqueness.
+	 */
+	ib_update_fast_reg_key(ibmr->mr, ibmr->remap_count++);
+	mapping->m_rkey = ibmr->mr->rkey;
+
+	memset(&f_wr, 0, sizeof(f_wr));
+	f_wr.wr_id = RDS_IW_FAST_REG_WR_ID;
+	f_wr.opcode = IB_WR_FAST_REG_MR;
+	f_wr.wr.fast_reg.length = mapping->m_sg.bytes;
+	f_wr.wr.fast_reg.rkey = mapping->m_rkey;
+	f_wr.wr.fast_reg.page_list = ibmr->page_list;
+	f_wr.wr.fast_reg.page_list_len = mapping->m_sg.dma_len;
+	f_wr.wr.fast_reg.page_shift = ibmr->device->page_shift;
+	f_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
+				IB_ACCESS_REMOTE_READ |
+				IB_ACCESS_REMOTE_WRITE;
+	f_wr.wr.fast_reg.iova_start = 0;
+	f_wr.send_flags = IB_SEND_SIGNALED;
+
+	failed_wr = &f_wr;
+	ret = ib_post_send(ibmr->cm_id->qp, &f_wr, &failed_wr);
+	BUG_ON(failed_wr != &f_wr);
+	if (ret && printk_ratelimit())
+		printk(KERN_WARNING "RDS/IW: %s:%d ib_post_send returned %d\n",
+			__func__, __LINE__, ret);
+	return ret;
+}
+
+static int rds_iw_rdma_fastreg_inv(struct rds_iw_mr *ibmr)
+{
+	struct ib_send_wr s_wr, *failed_wr;
+	int ret = 0;
+
+	if (!ibmr->cm_id->qp || !ibmr->mr)
+		goto out;
+
+	memset(&s_wr, 0, sizeof(s_wr));
+	s_wr.wr_id = RDS_IW_LOCAL_INV_WR_ID;
+	s_wr.opcode = IB_WR_LOCAL_INV;
+	s_wr.ex.invalidate_rkey = ibmr->mr->rkey;
+	s_wr.send_flags = IB_SEND_SIGNALED;
+
+	failed_wr = &s_wr;
+	ret = ib_post_send(ibmr->cm_id->qp, &s_wr, &failed_wr);
+	if (ret && printk_ratelimit()) {
+		printk(KERN_WARNING "RDS/IW: %s:%d ib_post_send returned %d\n",
+			__func__, __LINE__, ret);
+		goto out;
+	}
+out:
+	return ret;
+}
+
+static int rds_iw_map_fastreg(struct rds_iw_mr_pool *pool,
+			struct rds_iw_mr *ibmr,
+			struct scatterlist *sg,
+			unsigned int sg_len)
+{
+	struct rds_iw_device *rds_iwdev = pool->device;
+	struct rds_iw_mapping *mapping = &ibmr->mapping;
+	u64 *dma_pages;
+	int i, ret = 0;
+
+	rds_iw_set_scatterlist(&mapping->m_sg, sg, sg_len);
+
+	dma_pages = rds_iw_map_scatterlist(rds_iwdev,
+				&mapping->m_sg,
+				rds_iwdev->page_shift);
+	if (IS_ERR(dma_pages)) {
+		ret = PTR_ERR(dma_pages);
+		dma_pages = NULL;
+		goto out;
+	}
+
+	if (mapping->m_sg.dma_len > pool->max_message_size) {
+		ret = -EMSGSIZE;
+		goto out;
+	}
+
+	for (i = 0; i < mapping->m_sg.dma_npages; ++i)
+		ibmr->page_list->page_list[i] = dma_pages[i];
+
+	ret = rds_iw_rdma_build_fastreg(mapping);
+	if (ret)
+		goto out;
+
+	rds_iw_stats_inc(s_iw_rdma_mr_used);
+
+out:
+	kfree(dma_pages);
+
+	return ret;
+}
+
+/*
+ * "Free" a fastreg MR.
+ */
+static void rds_iw_free_fastreg(struct rds_iw_mr_pool *pool,
+		struct rds_iw_mr *ibmr)
+{
+	unsigned long flags;
+	int ret;
+
+	if (!ibmr->mapping.m_sg.dma_len)
+		return;
+
+	ret = rds_iw_rdma_fastreg_inv(ibmr);
+	if (ret)
+		return;
+
+	/* The LOCAL_INV post did not fail; put the MR on the dirty list. */
+	spin_lock_irqsave(&pool->list_lock, flags);
+
+	list_add_tail(&ibmr->mapping.m_list, &pool->dirty_list);
+	atomic_add(ibmr->mapping.m_sg.len, &pool->free_pinned);
+	atomic_inc(&pool->dirty_count);
+
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+}
+
+static unsigned int rds_iw_unmap_fastreg_list(struct rds_iw_mr_pool *pool,
+				struct list_head *unmap_list,
+				struct list_head *kill_list)
+{
+	struct rds_iw_mapping *mapping, *next;
+	unsigned int ncleaned = 0;
+	LIST_HEAD(laundered);
+
+	/* Batched invalidation of fastreg MRs.
+	 * Why do we do it this way, even though we could pipeline unmap
+	 * and remap? The reason is the application semantics - when the
+	 * application requests an invalidation of MRs, it expects all
+	 * previously released R_Keys to become invalid.
+	 *
+	 * If we implement MR reuse naively, we risk memory corruption
+	 * (this has actually been observed). So the default behavior
+	 * requires that a MR goes through an explicit unmap operation before
+	 * we can reuse it again.
+	 *
+	 * We could probably improve on this a little, by allowing immediate
+	 * reuse of a MR on the same socket (e.g. you could add a small
+	 * cache of unused MRs to struct rds_sock - GET_MR could grab one
+	 * of these without requiring an explicit invalidate).
+	 */
+	while (!list_empty(unmap_list)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&pool->list_lock, flags);
+		list_for_each_entry_safe(mapping, next, unmap_list, m_list) {
+			list_move(&mapping->m_list, &laundered);
+			ncleaned++;
+		}
+		spin_unlock_irqrestore(&pool->list_lock, flags);
+	}
+
+	/* Move all laundered mappings back to the unmap list.
+	 * We do not kill any WRs right now - it doesn't seem the
+	 * fastreg API has a max_remap limit. */
+	list_splice_init(&laundered, unmap_list);
+
+	return ncleaned;
+}
+
+static void rds_iw_destroy_fastreg(struct rds_iw_mr_pool *pool,
+		struct rds_iw_mr *ibmr)
+{
+	if (ibmr->page_list)
+		ib_free_fast_reg_page_list(ibmr->page_list);
+	if (ibmr->mr)
+		ib_dereg_mr(ibmr->mr);
+}
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
new file mode 100644
index 0000000..a1931f0
--- /dev/null
+++ b/net/rds/iw_recv.c
@@ -0,0 +1,869 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/dma-mapping.h>
+#include <rdma/rdma_cm.h>
+
+#include "rds.h"
+#include "iw.h"
+
+static struct kmem_cache *rds_iw_incoming_slab;
+static struct kmem_cache *rds_iw_frag_slab;
+static atomic_t	rds_iw_allocation = ATOMIC_INIT(0);
+
+static void rds_iw_frag_drop_page(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	__free_page(frag->f_page);
+	frag->f_page = NULL;
+}
+
+static void rds_iw_frag_free(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	BUG_ON(frag->f_page != NULL);
+	kmem_cache_free(rds_iw_frag_slab, frag);
+}
+
+/*
+ * We map a page at a time.  Its fragments are posted in order.  This
+ * is called in fragment order as the fragments get send completion events.
+ * Only the last frag in the page performs the unmapping.
+ *
+ * It's OK for ring cleanup to call this in whatever order it likes because
+ * DMA is not in flight and so we can unmap while other ring entries still
+ * hold page references in their frags.
+ */
+static void rds_iw_recv_unmap_page(struct rds_iw_connection *ic,
+				   struct rds_iw_recv_work *recv)
+{
+	struct rds_page_frag *frag = recv->r_frag;
+
+	rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page);
+	if (frag->f_mapped)
+		ib_dma_unmap_page(ic->i_cm_id->device,
+			       frag->f_mapped,
+			       RDS_FRAG_SIZE, DMA_FROM_DEVICE);
+	frag->f_mapped = 0;
+}
+
+void rds_iw_recv_init_ring(struct rds_iw_connection *ic)
+{
+	struct rds_iw_recv_work *recv;
+	u32 i;
+
+	for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) {
+		struct ib_sge *sge;
+
+		recv->r_iwinc = NULL;
+		recv->r_frag = NULL;
+
+		recv->r_wr.next = NULL;
+		recv->r_wr.wr_id = i;
+		recv->r_wr.sg_list = recv->r_sge;
+		recv->r_wr.num_sge = RDS_IW_RECV_SGE;
+
+		sge = rds_iw_data_sge(ic, recv->r_sge);
+		sge->addr = 0;
+		sge->length = RDS_FRAG_SIZE;
+		sge->lkey = 0;
+
+		sge = rds_iw_header_sge(ic, recv->r_sge);
+		sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = 0;
+	}
+}
+
+static void rds_iw_recv_clear_one(struct rds_iw_connection *ic,
+				  struct rds_iw_recv_work *recv)
+{
+	if (recv->r_iwinc) {
+		rds_inc_put(&recv->r_iwinc->ii_inc);
+		recv->r_iwinc = NULL;
+	}
+	if (recv->r_frag) {
+		rds_iw_recv_unmap_page(ic, recv);
+		if (recv->r_frag->f_page)
+			rds_iw_frag_drop_page(recv->r_frag);
+		rds_iw_frag_free(recv->r_frag);
+		recv->r_frag = NULL;
+	}
+}
+
+void rds_iw_recv_clear_ring(struct rds_iw_connection *ic)
+{
+	u32 i;
+
+	for (i = 0; i < ic->i_recv_ring.w_nr; i++)
+		rds_iw_recv_clear_one(ic, &ic->i_recvs[i]);
+
+	if (ic->i_frag.f_page)
+		rds_iw_frag_drop_page(&ic->i_frag);
+}
+
+static int rds_iw_recv_refill_one(struct rds_connection *conn,
+				  struct rds_iw_recv_work *recv,
+				  gfp_t kptr_gfp, gfp_t page_gfp)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	dma_addr_t dma_addr;
+	struct ib_sge *sge;
+	int ret = -ENOMEM;
+
+	if (recv->r_iwinc == NULL) {
+		if (atomic_read(&rds_iw_allocation) >= rds_iw_sysctl_max_recv_allocation) {
+			rds_iw_stats_inc(s_iw_rx_alloc_limit);
+			goto out;
+		}
+		recv->r_iwinc = kmem_cache_alloc(rds_iw_incoming_slab,
+						 kptr_gfp);
+		if (recv->r_iwinc == NULL)
+			goto out;
+		atomic_inc(&rds_iw_allocation);
+		INIT_LIST_HEAD(&recv->r_iwinc->ii_frags);
+		rds_inc_init(&recv->r_iwinc->ii_inc, conn, conn->c_faddr);
+	}
+
+	if (recv->r_frag == NULL) {
+		recv->r_frag = kmem_cache_alloc(rds_iw_frag_slab, kptr_gfp);
+		if (recv->r_frag == NULL)
+			goto out;
+		INIT_LIST_HEAD(&recv->r_frag->f_item);
+		recv->r_frag->f_page = NULL;
+	}
+
+	if (ic->i_frag.f_page == NULL) {
+		ic->i_frag.f_page = alloc_page(page_gfp);
+		if (ic->i_frag.f_page == NULL)
+			goto out;
+		ic->i_frag.f_offset = 0;
+	}
+
+	dma_addr = ib_dma_map_page(ic->i_cm_id->device,
+				  ic->i_frag.f_page,
+				  ic->i_frag.f_offset,
+				  RDS_FRAG_SIZE,
+				  DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr))
+		goto out;
+
+	/*
+	 * Once we get the RDS_PAGE_LAST_OFF frag then rds_iw_recv_unmap_page()
+	 * must be called on this recv.  This happens as completions hit
+	 * in order or on connection shutdown.
+	 */
+	recv->r_frag->f_page = ic->i_frag.f_page;
+	recv->r_frag->f_offset = ic->i_frag.f_offset;
+	recv->r_frag->f_mapped = dma_addr;
+
+	sge = rds_iw_data_sge(ic, recv->r_sge);
+	sge->addr = dma_addr;
+	sge->length = RDS_FRAG_SIZE;
+
+	sge = rds_iw_header_sge(ic, recv->r_sge);
+	sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header);
+	sge->length = sizeof(struct rds_header);
+
+	get_page(recv->r_frag->f_page);
+
+	if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) {
+		ic->i_frag.f_offset += RDS_FRAG_SIZE;
+	} else {
+		put_page(ic->i_frag.f_page);
+		ic->i_frag.f_page = NULL;
+		ic->i_frag.f_offset = 0;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+/*
+ * This tries to allocate and post unused work requests after making sure that
+ * they have all the allocations they need to queue received fragments into
+ * sockets.  The i_recv_mutex is held here so that ring_alloc and _unalloc
+ * pairs don't go unmatched.
+ *
+ * -1 is returned if posting fails due to temporary resource exhaustion.
+ */
+int rds_iw_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rds_iw_recv_work *recv;
+	struct ib_recv_wr *failed_wr;
+	unsigned int posted = 0;
+	int ret = 0;
+	u32 pos;
+
+	while ((prefill || rds_conn_up(conn))
+			&& rds_iw_ring_alloc(&ic->i_recv_ring, 1, &pos)) {
+		if (pos >= ic->i_recv_ring.w_nr) {
+			printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n",
+					pos);
+			ret = -EINVAL;
+			break;
+		}
+
+		recv = &ic->i_recvs[pos];
+		ret = rds_iw_recv_refill_one(conn, recv, kptr_gfp, page_gfp);
+		if (ret) {
+			ret = -1;
+			break;
+		}
+
+		/* XXX when can this fail? */
+		ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+		rdsdebug("recv %p iwinc %p page %p addr %lu ret %d\n", recv,
+			 recv->r_iwinc, recv->r_frag->f_page,
+			 (unsigned long) recv->r_frag->f_mapped, ret);
+		if (ret) {
+			rds_iw_conn_error(conn, "recv post on "
+			       "%pI4 returned %d, disconnecting and "
+			       "reconnecting\n", &conn->c_faddr,
+			       ret);
+			ret = -1;
+			break;
+		}
+
+		posted++;
+	}
+
+	/* We're doing flow control - update the window. */
+	if (ic->i_flowctl && posted)
+		rds_iw_advertise_credits(conn, posted);
+
+	if (ret)
+		rds_iw_ring_unalloc(&ic->i_recv_ring, 1);
+	return ret;
+}
+
+void rds_iw_inc_purge(struct rds_incoming *inc)
+{
+	struct rds_iw_incoming *iwinc;
+	struct rds_page_frag *frag;
+	struct rds_page_frag *pos;
+
+	iwinc = container_of(inc, struct rds_iw_incoming, ii_inc);
+	rdsdebug("purging iwinc %p inc %p\n", iwinc, inc);
+
+	list_for_each_entry_safe(frag, pos, &iwinc->ii_frags, f_item) {
+		list_del_init(&frag->f_item);
+		rds_iw_frag_drop_page(frag);
+		rds_iw_frag_free(frag);
+	}
+}
+
+void rds_iw_inc_free(struct rds_incoming *inc)
+{
+	struct rds_iw_incoming *iwinc;
+
+	iwinc = container_of(inc, struct rds_iw_incoming, ii_inc);
+
+	rds_iw_inc_purge(inc);
+	rdsdebug("freeing iwinc %p inc %p\n", iwinc, inc);
+	BUG_ON(!list_empty(&iwinc->ii_frags));
+	kmem_cache_free(rds_iw_incoming_slab, iwinc);
+	atomic_dec(&rds_iw_allocation);
+	BUG_ON(atomic_read(&rds_iw_allocation) < 0);
+}
+
+int rds_iw_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov,
+			    size_t size)
+{
+	struct rds_iw_incoming *iwinc;
+	struct rds_page_frag *frag;
+	struct iovec *iov = first_iov;
+	unsigned long to_copy;
+	unsigned long frag_off = 0;
+	unsigned long iov_off = 0;
+	int copied = 0;
+	int ret;
+	u32 len;
+
+	iwinc = container_of(inc, struct rds_iw_incoming, ii_inc);
+	frag = list_entry(iwinc->ii_frags.next, struct rds_page_frag, f_item);
+	len = be32_to_cpu(inc->i_hdr.h_len);
+
+	while (copied < size && copied < len) {
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off);
+		to_copy = min_t(size_t, to_copy, size - copied);
+		to_copy = min_t(unsigned long, to_copy, len - copied);
+
+		rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag "
+			 "[%p, %lu] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 frag->f_page, frag->f_offset, frag_off);
+
+		/* XXX needs + offset for multiple recvs per page */
+		ret = rds_page_copy_to_user(frag->f_page,
+					    frag->f_offset + frag_off,
+					    iov->iov_base + iov_off,
+					    to_copy);
+		if (ret) {
+			copied = ret;
+			break;
+		}
+
+		iov_off += to_copy;
+		frag_off += to_copy;
+		copied += to_copy;
+	}
+
+	return copied;
+}
+
+/* ic starts out kzalloc()ed */
+void rds_iw_recv_init_ack(struct rds_iw_connection *ic)
+{
+	struct ib_send_wr *wr = &ic->i_ack_wr;
+	struct ib_sge *sge = &ic->i_ack_sge;
+
+	sge->addr = ic->i_ack_dma;
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = rds_iw_local_dma_lkey(ic);
+
+	wr->sg_list = sge;
+	wr->num_sge = 1;
+	wr->opcode = IB_WR_SEND;
+	wr->wr_id = RDS_IW_ACK_WR_ID;
+	wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+}
+
+/*
+ * You'd think that with reliable IB connections you wouldn't need to ack
+ * messages that have been received.  The problem is that IB hardware generates
+ * an ack message before it has DMAed the message into memory.  This creates a
+ * potential message loss if the HCA is disabled for any reason between when it
+ * sends the ack and before the message is DMAed and processed.  This is only a
+ * potential issue if another HCA is available for fail-over.
+ *
+ * When the remote host receives our ack they'll free the sent message from
+ * their send queue.  To decrease the latency of this we always send an ack
+ * immediately after we've received messages.
+ *
+ * For simplicity, we only have one ack in flight at a time.  This puts
+ * pressure on senders to have deep enough send queues to absorb the latency of
+ * a single ack frame being in flight.  This might not be good enough.
+ *
+ * This is implemented by having a long-lived send_wr and sge which point to a
+ * statically allocated ack frame.  This ack wr does not fall under the ring
+ * accounting that the tx and rx wrs do.  The QP attribute specifically makes
+ * room for it beyond the ring size.  Send completion notices its special
+ * wr_id and avoids working with the ring in that case.
+ */
+static void rds_iw_set_ack(struct rds_iw_connection *ic, u64 seq,
+				int ack_required)
+{
+	rds_iw_set_64bit(&ic->i_ack_next, seq);
+	if (ack_required) {
+		smp_mb__before_clear_bit();
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	}
+}
+
+static u64 rds_iw_get_ack(struct rds_iw_connection *ic)
+{
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	smp_mb__after_clear_bit();
+
+	return ic->i_ack_next;
+}
+
+static void rds_iw_send_ack(struct rds_iw_connection *ic, unsigned int adv_credits)
+{
+	struct rds_header *hdr = ic->i_ack;
+	struct ib_send_wr *failed_wr;
+	u64 seq;
+	int ret;
+
+	seq = rds_iw_get_ack(ic);
+
+	rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq);
+	rds_message_populate_header(hdr, 0, 0, 0);
+	hdr->h_ack = cpu_to_be64(seq);
+	hdr->h_credit = adv_credits;
+	rds_message_make_checksum(hdr);
+	ic->i_ack_queued = jiffies;
+
+	ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr);
+	if (unlikely(ret)) {
+		/* Failed to send. Release the WR, and
+		 * force another ACK.
+		 */
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+
+		rds_iw_stats_inc(s_iw_ack_send_failure);
+		/* Need to finesse this later. */
+		BUG();
+	} else
+		rds_iw_stats_inc(s_iw_ack_sent);
+}
+
+/*
+ * There are 3 ways of getting acknowledgements to the peer:
+ *  1.	We call rds_iw_attempt_ack from the recv completion handler
+ *	to send an ACK-only frame.
+ *	However, there can be only one such frame in the send queue
+ *	at any time, so we may have to postpone it.
+ *  2.	When another (data) packet is transmitted while there's
+ *	an ACK in the queue, we piggyback the ACK sequence number
+ *	on the data packet.
+ *  3.	If the ACK WR is done sending, we get called from the
+ *	send queue completion handler, and check whether there's
+ *	another ACK pending (postponed because the WR was on the
+ *	queue). If so, we transmit it.
+ *
+ * We maintain 2 variables:
+ *  -	i_ack_flags, which keeps track of whether the ACK WR
+ *	is currently in the send queue or not (IB_ACK_IN_FLIGHT)
+ *  -	i_ack_next, which is the last sequence number we received
+ *
+ * Potentially, send queue and receive queue handlers can run concurrently.
+ *
+ * Reconnecting complicates this picture just slightly. When we
+ * reconnect, we may be seeing duplicate packets. The peer
+ * is retransmitting them, because it hasn't seen an ACK for
+ * them. It is important that we ACK these.
+ *
+ * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with
+ * this flag set *MUST* be acknowledged immediately.
+ */
+
+/*
+ * When we get here, we're called from the recv queue handler.
+ * Check whether we ought to transmit an ACK.
+ */
+void rds_iw_attempt_ack(struct rds_iw_connection *ic)
+{
+	unsigned int adv_credits;
+
+	if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		return;
+
+	if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) {
+		rds_iw_stats_inc(s_iw_ack_send_delayed);
+		return;
+	}
+
+	/* Can we get a send credit? */
+	if (!rds_iw_send_grab_credits(ic, 1, &adv_credits, 0)) {
+		rds_iw_stats_inc(s_iw_tx_throttle);
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		return;
+	}
+
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	rds_iw_send_ack(ic, adv_credits);
+}
+
+/*
+ * We get here from the send completion handler, when the
+ * adapter tells us the ACK frame was sent.
+ */
+void rds_iw_ack_send_complete(struct rds_iw_connection *ic)
+{
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+	rds_iw_attempt_ack(ic);
+}
+
+/*
+ * This is called by the regular xmit code when it wants to piggyback
+ * an ACK on an outgoing frame.
+ */
+u64 rds_iw_piggyb_ack(struct rds_iw_connection *ic)
+{
+	if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		rds_iw_stats_inc(s_iw_ack_send_piggybacked);
+	return rds_iw_get_ack(ic);
+}
+
+/*
+ * It's kind of lame that we're copying from the posted receive pages into
+ * long-lived bitmaps.  We could have posted the bitmaps and rdma written into
+ * them.  But receiving new congestion bitmaps should be a *rare* event, so
+ * hopefully we won't need to invest that complexity in making it more
+ * efficient.  By copying we can share a simpler core with TCP which has to
+ * copy.
+ */
+static void rds_iw_cong_recv(struct rds_connection *conn,
+			      struct rds_iw_incoming *iwinc)
+{
+	struct rds_cong_map *map;
+	unsigned int map_off;
+	unsigned int map_page;
+	struct rds_page_frag *frag;
+	unsigned long frag_off;
+	unsigned long to_copy;
+	unsigned long copied;
+	uint64_t uncongested = 0;
+	void *addr;
+
+	/* catch completely corrupt packets */
+	if (be32_to_cpu(iwinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES)
+		return;
+
+	map = conn->c_fcong;
+	map_page = 0;
+	map_off = 0;
+
+	frag = list_entry(iwinc->ii_frags.next, struct rds_page_frag, f_item);
+	frag_off = 0;
+
+	copied = 0;
+
+	while (copied < RDS_CONG_MAP_BYTES) {
+		uint64_t *src, *dst;
+		unsigned int k;
+
+		to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off);
+		BUG_ON(to_copy & 7); /* Must be 64bit aligned. */
+
+		addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0);
+
+		src = addr + frag_off;
+		dst = (void *)map->m_page_addrs[map_page] + map_off;
+		for (k = 0; k < to_copy; k += 8) {
+			/* Record ports that became uncongested, ie
+			 * bits that changed from 0 to 1. */
+			uncongested |= ~(*src) & *dst;
+			*dst++ = *src++;
+		}
+		kunmap_atomic(addr, KM_SOFTIRQ0);
+
+		copied += to_copy;
+
+		map_off += to_copy;
+		if (map_off == PAGE_SIZE) {
+			map_off = 0;
+			map_page++;
+		}
+
+		frag_off += to_copy;
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+	}
+
+	/* the congestion map is in little endian order */
+	uncongested = le64_to_cpu(uncongested);
+
+	rds_cong_map_updated(map, uncongested);
+}
+
+/*
+ * Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_iw_ack_state {
+	u64		ack_next;
+	u64		ack_recv;
+	unsigned int	ack_required:1;
+	unsigned int	ack_next_valid:1;
+	unsigned int	ack_recv_valid:1;
+};
+
+static void rds_iw_process_recv(struct rds_connection *conn,
+				struct rds_iw_recv_work *recv, u32 byte_len,
+				struct rds_iw_ack_state *state)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rds_iw_incoming *iwinc = ic->i_iwinc;
+	struct rds_header *ihdr, *hdr;
+
+	/* XXX shut down the connection if port 0,0 are seen? */
+
+	rdsdebug("ic %p iwinc %p recv %p byte len %u\n", ic, iwinc, recv,
+		 byte_len);
+
+	if (byte_len < sizeof(struct rds_header)) {
+		rds_iw_conn_error(conn, "incoming message "
+		       "from %pI4 didn't include a "
+		       "header, disconnecting and "
+		       "reconnecting\n",
+		       &conn->c_faddr);
+		return;
+	}
+	byte_len -= sizeof(struct rds_header);
+
+	ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs];
+
+	/* Validate the checksum. */
+	if (!rds_message_verify_checksum(ihdr)) {
+		rds_iw_conn_error(conn, "incoming message "
+		       "from %pI4 has corrupted header - "
+		       "forcing a reconnect\n",
+		       &conn->c_faddr);
+		rds_stats_inc(s_recv_drop_bad_checksum);
+		return;
+	}
+
+	/* Process the ACK sequence which comes with every packet */
+	state->ack_recv = be64_to_cpu(ihdr->h_ack);
+	state->ack_recv_valid = 1;
+
+	/* Process the credits update if there was one */
+	if (ihdr->h_credit)
+		rds_iw_send_add_credits(conn, ihdr->h_credit);
+
+	if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) {
+		/* This is an ACK-only packet. It gets special
+		 * treatment here because, historically, ACKs
+		 * were rather special beasts.
+		 */
+		rds_iw_stats_inc(s_iw_ack_received);
+
+		/*
+		 * Usually the frags make their way on to incs and are then freed as
+		 * the inc is freed.  We don't go that route, so we have to drop the
+		 * page ref ourselves.  We can't just leave the page on the recv
+		 * because that confuses the dma mapping of pages and each recv's use
+		 * of a partial page.  We can leave the frag, though, it will be
+		 * reused.
+		 *
+		 * FIXME: Fold this into the code path below.
+		 */
+		rds_iw_frag_drop_page(recv->r_frag);
+		return;
+	}
+
+	/*
+	 * If we don't already have an inc on the connection then this
+	 * fragment has a header and starts a message. Copy its header
+	 * into the inc and save the inc so we can hang upcoming fragments
+	 * off its list.
+	 */
+	if (iwinc == NULL) {
+		iwinc = recv->r_iwinc;
+		recv->r_iwinc = NULL;
+		ic->i_iwinc = iwinc;
+
+		hdr = &iwinc->ii_inc.i_hdr;
+		memcpy(hdr, ihdr, sizeof(*hdr));
+		ic->i_recv_data_rem = be32_to_cpu(hdr->h_len);
+
+		rdsdebug("ic %p iwinc %p rem %u flag 0x%x\n", ic, iwinc,
+			 ic->i_recv_data_rem, hdr->h_flags);
+	} else {
+		hdr = &iwinc->ii_inc.i_hdr;
+		/* We can't just use memcmp here; fragments of a
+		 * single message may carry different ACKs */
+		if (hdr->h_sequence != ihdr->h_sequence
+		 || hdr->h_len != ihdr->h_len
+		 || hdr->h_sport != ihdr->h_sport
+		 || hdr->h_dport != ihdr->h_dport) {
+			rds_iw_conn_error(conn,
+				"fragment header mismatch; forcing reconnect\n");
+			return;
+		}
+	}
+
+	list_add_tail(&recv->r_frag->f_item, &iwinc->ii_frags);
+	recv->r_frag = NULL;
+
+	if (ic->i_recv_data_rem > RDS_FRAG_SIZE)
+		ic->i_recv_data_rem -= RDS_FRAG_SIZE;
+	else {
+		ic->i_recv_data_rem = 0;
+		ic->i_iwinc = NULL;
+
+		if (iwinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP)
+			rds_iw_cong_recv(conn, iwinc);
+		else {
+			rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr,
+					  &iwinc->ii_inc, GFP_ATOMIC,
+					  KM_SOFTIRQ0);
+			state->ack_next = be64_to_cpu(hdr->h_sequence);
+			state->ack_next_valid = 1;
+		}
+
+		/* Evaluate the ACK_REQUIRED flag *after* we received
+		 * the complete frame, and after bumping the next_rx
+		 * sequence. */
+		if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) {
+			rds_stats_inc(s_recv_ack_required);
+			state->ack_required = 1;
+		}
+
+		rds_inc_put(&iwinc->ii_inc);
+	}
+}
+
+/*
+ * Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.
+ */
+void rds_iw_recv_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_iw_ack_state state = { 0, };
+	struct rds_iw_recv_work *recv;
+
+	rdsdebug("conn %p cq %p\n", conn, cq);
+
+	rds_iw_stats_inc(s_iw_rx_cq_call);
+
+	ib_req_notify_cq(cq, IB_CQ_SOLICITED);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_iw_stats_inc(s_iw_rx_cq_event);
+
+		recv = &ic->i_recvs[rds_iw_ring_oldest(&ic->i_recv_ring)];
+
+		rds_iw_recv_unmap_page(ic, recv);
+
+		/*
+		 * Also process recvs in connecting state because it is possible
+		 * to get a recv completion _before_ the rdmacm ESTABLISHED
+		 * event is processed.
+		 */
+		if (rds_conn_up(conn) || rds_conn_connecting(conn)) {
+			/* We expect errors as the qp is drained during shutdown */
+			if (wc.status == IB_WC_SUCCESS) {
+				rds_iw_process_recv(conn, recv, wc.byte_len, &state);
+			} else {
+				rds_iw_conn_error(conn, "recv completion on "
+				       "%pI4 had status %u, disconnecting and "
+				       "reconnecting\n", &conn->c_faddr,
+				       wc.status);
+			}
+		}
+
+		rds_iw_ring_free(&ic->i_recv_ring, 1);
+	}
+
+	if (state.ack_next_valid)
+		rds_iw_set_ack(ic, state.ack_next, state.ack_required);
+	if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) {
+		rds_send_drop_acked(conn, state.ack_recv, NULL);
+		ic->i_ack_recv = state.ack_recv;
+	}
+	if (rds_conn_up(conn))
+		rds_iw_attempt_ack(ic);
+
+	/* If we ever end up with a really empty receive ring, we're
+	 * in deep trouble, as the sender will definitely see RNR
+	 * timeouts. */
+	if (rds_iw_ring_empty(&ic->i_recv_ring))
+		rds_iw_stats_inc(s_iw_rx_ring_empty);
+
+	/*
+	 * If the ring is running low, then schedule the thread to refill.
+	 */
+	if (rds_iw_ring_low(&ic->i_recv_ring))
+		queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+}
+
+int rds_iw_recv(struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	int ret = 0;
+
+	rdsdebug("conn %p\n", conn);
+
+	/*
+	 * If we get a temporary posting failure in this context then
+	 * we're really low and we want the caller to back off for a bit.
+	 */
+	mutex_lock(&ic->i_recv_mutex);
+	if (rds_iw_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0))
+		ret = -ENOMEM;
+	else
+		rds_iw_stats_inc(s_iw_rx_refill_from_thread);
+	mutex_unlock(&ic->i_recv_mutex);
+
+	if (rds_conn_up(conn))
+		rds_iw_attempt_ack(ic);
+
+	return ret;
+}
+
+int __init rds_iw_recv_init(void)
+{
+	struct sysinfo si;
+	int ret = -ENOMEM;
+
+	/* Default to one third of total RAM for recv memory */
+	si_meminfo(&si);
+	rds_iw_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE;
+
+	rds_iw_incoming_slab = kmem_cache_create("rds_iw_incoming",
+					sizeof(struct rds_iw_incoming),
+					0, 0, NULL);
+	if (rds_iw_incoming_slab == NULL)
+		goto out;
+
+	rds_iw_frag_slab = kmem_cache_create("rds_iw_frag",
+					sizeof(struct rds_page_frag),
+					0, 0, NULL);
+	if (rds_iw_frag_slab == NULL)
+		kmem_cache_destroy(rds_iw_incoming_slab);
+	else
+		ret = 0;
+out:
+	return ret;
+}
+
+void rds_iw_recv_exit(void)
+{
+	kmem_cache_destroy(rds_iw_incoming_slab);
+	kmem_cache_destroy(rds_iw_frag_slab);
+}
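Note on the fragment accounting in rds_iw_process_recv() above: the
remaining-length counter is seeded from h_len by the first fragment of a
message and then reduced by RDS_FRAG_SIZE per fragment until the final
fragment arrives. The following is only a user-space sketch of that
countdown, not the kernel code. The names reasm_state and reasm_fragment
are hypothetical, and FRAG_SIZE is assumed to be 4096 purely for
illustration; the patch takes RDS_FRAG_SIZE from a header that is not
part of this hunk.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_SIZE 4096u	/* assumption; stands in for RDS_FRAG_SIZE */

/* Hypothetical stand-in for the i_iwinc / i_recv_data_rem pair. */
struct reasm_state {
	uint32_t rem;		/* payload bytes still expected */
	int in_progress;	/* non-zero while a message is being assembled */
};

/*
 * Account for one received fragment.  hdr_len is the message length
 * carried in the fragment header; it is only consulted for the first
 * fragment of a message.  Returns 1 when this fragment completes the
 * message, 0 otherwise.
 */
static int reasm_fragment(struct reasm_state *st, uint32_t hdr_len)
{
	if (!st->in_progress) {
		st->rem = hdr_len;
		st->in_progress = 1;
	}

	if (st->rem > FRAG_SIZE) {
		st->rem -= FRAG_SIZE;
		return 0;
	}

	/* Last (possibly short, possibly empty) fragment. */
	st->rem = 0;
	st->in_progress = 0;
	return 1;
}
```

Under these assumptions a 10000-byte message takes three fragments, and
a zero-length message completes on its first and only fragment.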
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
new file mode 100644
index 0000000..d422d4b
--- /dev/null
+++ b/net/rds/iw_ring.c
@@ -0,0 +1,169 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "iw.h"
+
+/*
+ * Locking for IB rings.
+ * We assume that allocation is always protected by a mutex
+ * in the caller (this is a valid assumption for the current
+ * implementation).
+ *
+ * Freeing always happens in an interrupt, and hence only
+ * races with allocations, but not with other free()s.
+ *
+ * The interaction between allocation and freeing is that
+ * the alloc code has to determine the number of free entries.
+ * To this end, we maintain two counters; an allocation counter
+ * and a free counter. Both are allowed to run freely, and wrap
+ * around.
+ * The number of used entries is always (alloc_ctr - free_ctr) mod 2^32.
+ *
+ * The current implementation makes free_ctr atomic. When the
+ * caller finds an allocation fails, it should set an "alloc fail"
+ * bit and retry the allocation. The "alloc fail" bit essentially tells
+ * the CQ completion handlers to wake it up after freeing some
+ * more entries.
+ */
+
+/*
+ * This only happens on shutdown.
+ */
+DECLARE_WAIT_QUEUE_HEAD(rds_iw_ring_empty_wait);
+
+void rds_iw_ring_init(struct rds_iw_work_ring *ring, u32 nr)
+{
+	memset(ring, 0, sizeof(*ring));
+	ring->w_nr = nr;
+	rdsdebug("ring %p nr %u\n", ring, ring->w_nr);
+}
+
+static inline u32 __rds_iw_ring_used(struct rds_iw_work_ring *ring)
+{
+	u32 diff;
+
+	/* This assumes that atomic_t has at least as many bits as u32 */
+	diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr);
+	BUG_ON(diff > ring->w_nr);
+
+	return diff;
+}
+
+void rds_iw_ring_resize(struct rds_iw_work_ring *ring, u32 nr)
+{
+	/* We only ever get called from the connection setup code,
+	 * prior to creating the QP. */
+	BUG_ON(__rds_iw_ring_used(ring));
+	ring->w_nr = nr;
+}
+
+static int __rds_iw_ring_empty(struct rds_iw_work_ring *ring)
+{
+	return __rds_iw_ring_used(ring) == 0;
+}
+
+u32 rds_iw_ring_alloc(struct rds_iw_work_ring *ring, u32 val, u32 *pos)
+{
+	u32 ret = 0, avail;
+
+	avail = ring->w_nr - __rds_iw_ring_used(ring);
+
+	rdsdebug("ring %p val %u next %u free %u\n", ring, val,
+		 ring->w_alloc_ptr, avail);
+
+	if (val && avail) {
+		ret = min(val, avail);
+		*pos = ring->w_alloc_ptr;
+
+		ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr;
+		ring->w_alloc_ctr += ret;
+	}
+
+	return ret;
+}
+
+void rds_iw_ring_free(struct rds_iw_work_ring *ring, u32 val)
+{
+	ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr;
+	atomic_add(val, &ring->w_free_ctr);
+
+	if (__rds_iw_ring_empty(ring) &&
+	    waitqueue_active(&rds_iw_ring_empty_wait))
+		wake_up(&rds_iw_ring_empty_wait);
+}
+
+void rds_iw_ring_unalloc(struct rds_iw_work_ring *ring, u32 val)
+{
+	ring->w_alloc_ptr = (ring->w_alloc_ptr - val) % ring->w_nr;
+	ring->w_alloc_ctr -= val;
+}
+
+int rds_iw_ring_empty(struct rds_iw_work_ring *ring)
+{
+	return __rds_iw_ring_empty(ring);
+}
+
+int rds_iw_ring_low(struct rds_iw_work_ring *ring)
+{
+	return __rds_iw_ring_used(ring) <= (ring->w_nr >> 2);
+}
+
+
+/*
+ * returns the oldest alloced ring entry.  This will be the next one
+ * freed.  This can't be called if there are none allocated.
+ */
+u32 rds_iw_ring_oldest(struct rds_iw_work_ring *ring)
+{
+	return ring->w_free_ptr;
+}
+
+/*
+ * returns the number of completed work requests.
+ */
+
+u32 rds_iw_ring_completed(struct rds_iw_work_ring *ring, u32 wr_id, u32 oldest)
+{
+	u32 ret;
+
+	if (oldest <= (unsigned long long)wr_id)
+		ret = (unsigned long long)wr_id - oldest + 1;
+	else
+		ret = ring->w_nr - oldest + (unsigned long long)wr_id + 1;
+
+	rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret,
+		 wr_id, oldest);
+	return ret;
+}
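Note on the counter scheme in iw_ring.c above: because both counters
are free-running 32-bit values, the unsigned difference alloc_ctr -
free_ctr still gives the number of entries in use after either counter
has wrapped. The following user-space sketch mirrors that arithmetic so
it can be tried outside the kernel. It is an illustration under stated
assumptions, not the patch's code: struct ring and the ring_* helpers
are hypothetical names, free_ctr is a plain integer rather than an
atomic_t, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct rds_iw_work_ring. */
struct ring {
	uint32_t nr;		/* number of slots in the ring */
	uint32_t alloc_ptr;	/* next slot to hand out */
	uint32_t alloc_ctr;	/* free-running count of allocated slots */
	uint32_t free_ptr;	/* oldest allocated slot */
	uint32_t free_ctr;	/* free-running count of freed slots */
};

static uint32_t ring_used(const struct ring *r)
{
	/* Unsigned subtraction stays correct across counter wrap. */
	uint32_t diff = r->alloc_ctr - r->free_ctr;

	assert(diff <= r->nr);
	return diff;
}

/* Allocate up to val slots; returns how many were granted. */
static uint32_t ring_alloc(struct ring *r, uint32_t val, uint32_t *pos)
{
	uint32_t avail = r->nr - ring_used(r);
	uint32_t ret = 0;

	if (val && avail) {
		ret = val < avail ? val : avail;
		*pos = r->alloc_ptr;
		r->alloc_ptr = (r->alloc_ptr + ret) % r->nr;
		r->alloc_ctr += ret;
	}
	return ret;
}

static void ring_free(struct ring *r, uint32_t val)
{
	r->free_ptr = (r->free_ptr + val) % r->nr;
	r->free_ctr += val;
}

/* Slots completed when wr_id finishes and oldest is the ring tail. */
static uint32_t ring_completed(const struct ring *r, uint32_t wr_id,
			       uint32_t oldest)
{
	if (oldest <= wr_id)
		return wr_id - oldest + 1;
	return r->nr - oldest + wr_id + 1;
}
```

Starting both counters just below 2^32 is a convenient way to exercise
the wrap: an allocation of five slots pushes alloc_ctr past zero, and
ring_used() still reports five.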
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
new file mode 100644
index 0000000..22dd38f
--- /dev/null
+++ b/net/rds/iw_send.c
@@ -0,0 +1,975 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/device.h>
+#include <linux/dmapool.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "iw.h"
+
+static void rds_iw_send_rdma_complete(struct rds_message *rm,
+				      int wc_status)
+{
+	int notify_status;
+
+	switch (wc_status) {
+	case IB_WC_WR_FLUSH_ERR:
+		return;
+
+	case IB_WC_SUCCESS:
+		notify_status = RDS_RDMA_SUCCESS;
+		break;
+
+	case IB_WC_REM_ACCESS_ERR:
+		notify_status = RDS_RDMA_REMOTE_ERROR;
+		break;
+
+	default:
+		notify_status = RDS_RDMA_OTHER_ERROR;
+		break;
+	}
+	rds_rdma_send_complete(rm, notify_status);
+}
+
+static void rds_iw_send_unmap_rdma(struct rds_iw_connection *ic,
+				   struct rds_rdma_op *op)
+{
+	if (op->r_mapped) {
+		ib_dma_unmap_sg(ic->i_cm_id->device,
+			op->r_sg, op->r_nents,
+			op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		op->r_mapped = 0;
+	}
+}
+
+static void rds_iw_send_unmap_rm(struct rds_iw_connection *ic,
+			  struct rds_iw_send_work *send,
+			  int wc_status)
+{
+	struct rds_message *rm = send->s_rm;
+
+	rdsdebug("ic %p send %p rm %p\n", ic, send, rm);
+
+	ib_dma_unmap_sg(ic->i_cm_id->device,
+		     rm->m_sg, rm->m_nents,
+		     DMA_TO_DEVICE);
+
+	if (rm->m_rdma_op != NULL) {
+		rds_iw_send_unmap_rdma(ic, rm->m_rdma_op);
+
+		/* If the user asked for a completion notification on this
+		 * message, we can implement three different semantics:
+		 *  1.	Notify when we received the ACK on the RDS message
+		 *	that was queued with the RDMA. This provides reliable
+		 *	notification of RDMA status at the expense of a one-way
+		 *	packet delay.
+		 *  2.	Notify when the IB stack gives us the completion event for
+		 *	the RDMA operation.
+		 *  3.	Notify when the IB stack gives us the completion event for
+		 *	the accompanying RDS messages.
+		 * Here, we implement approach #3. To implement approach #2,
+		 * call rds_rdma_send_complete from the cq_handler. To implement #1,
+		 * don't call rds_rdma_send_complete at all, and fall back to the notify
+		 * handling in the ACK processing code.
+		 *
+		 * Note: There's no need to explicitly sync any RDMA buffers using
+		 * ib_dma_sync_sg_for_cpu - the completion for the RDMA
+		 * operation itself unmapped the RDMA buffers, which takes care
+		 * of synching.
+		 */
+		rds_iw_send_rdma_complete(rm, wc_status);
+
+		if (rm->m_rdma_op->r_write)
+			rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes);
+		else
+			rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes);
+	}
+
+	/* If anyone waited for this message to get flushed out, wake
+	 * them up now */
+	rds_message_unmapped(rm);
+
+	rds_message_put(rm);
+	send->s_rm = NULL;
+}
+
+void rds_iw_send_init_ring(struct rds_iw_connection *ic)
+{
+	struct rds_iw_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		struct ib_sge *sge;
+
+		send->s_rm = NULL;
+		send->s_op = NULL;
+		send->s_mapping = NULL;
+
+		send->s_wr.next = NULL;
+		send->s_wr.wr_id = i;
+		send->s_wr.sg_list = send->s_sge;
+		send->s_wr.num_sge = 1;
+		send->s_wr.opcode = IB_WR_SEND;
+		send->s_wr.send_flags = 0;
+		send->s_wr.ex.imm_data = 0;
+
+		sge = rds_iw_data_sge(ic, send->s_sge);
+		sge->lkey = 0;
+
+		sge = rds_iw_header_sge(ic, send->s_sge);
+		sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = 0;
+
+		send->s_mr = ib_alloc_fast_reg_mr(ic->i_pd, fastreg_message_size);
+		if (IS_ERR(send->s_mr)) {
+			printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed\n");
+			break;
+		}
+
+		send->s_page_list = ib_alloc_fast_reg_page_list(
+			ic->i_cm_id->device, fastreg_message_size);
+		if (IS_ERR(send->s_page_list)) {
+			printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_page_list failed\n");
+			break;
+		}
+	}
+}
+
+void rds_iw_send_clear_ring(struct rds_iw_connection *ic)
+{
+	struct rds_iw_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		BUG_ON(!send->s_mr);
+		ib_dereg_mr(send->s_mr);
+		BUG_ON(!send->s_page_list);
+		ib_free_fast_reg_page_list(send->s_page_list);
+		if (send->s_wr.opcode == 0xdead)
+			continue;
+		if (send->s_rm)
+			rds_iw_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR);
+		if (send->s_op)
+			rds_iw_send_unmap_rdma(ic, send->s_op);
+	}
+}
+
+/*
+ * The _oldest/_free ring operations here race cleanly with the alloc/unalloc
+ * operations performed in the send path.  As the sender allocs and potentially
+ * unallocs the next free entry in the ring it doesn't alter which is
+ * the next to be freed, which is what this is concerned with.
+ */
+void rds_iw_send_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_iw_send_work *send;
+	u32 completed;
+	u32 oldest;
+	u32 i;
+	int ret;
+
+	rdsdebug("cq %p conn %p\n", cq, conn);
+	rds_iw_stats_inc(s_iw_tx_cq_call);
+	ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	if (ret)
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_iw_stats_inc(s_iw_tx_cq_event);
+
+		if (wc.status != IB_WC_SUCCESS) {
+			printk(KERN_ERR "WC Error:  status = %d opcode = %d\n", wc.status, wc.opcode);
+			break;
+		}
+
+		if (wc.opcode == IB_WC_LOCAL_INV && wc.wr_id == RDS_IW_LOCAL_INV_WR_ID) {
+			ic->i_fastreg_posted = 0;
+			continue;
+		}
+
+		if (wc.opcode == IB_WC_FAST_REG_MR && wc.wr_id == RDS_IW_FAST_REG_WR_ID) {
+			ic->i_fastreg_posted = 1;
+			continue;
+		}
+
+		if (wc.wr_id == RDS_IW_ACK_WR_ID) {
+			if (ic->i_ack_queued + HZ/2 < jiffies)
+				rds_iw_stats_inc(s_iw_tx_stalled);
+			rds_iw_ack_send_complete(ic);
+			continue;
+		}
+
+		oldest = rds_iw_ring_oldest(&ic->i_send_ring);
+
+		completed = rds_iw_ring_completed(&ic->i_send_ring, wc.wr_id, oldest);
+
+		for (i = 0; i < completed; i++) {
+			send = &ic->i_sends[oldest];
+
+			/* In the error case, wc.opcode sometimes contains garbage */
+			switch (send->s_wr.opcode) {
+			case IB_WR_SEND:
+				if (send->s_rm)
+					rds_iw_send_unmap_rm(ic, send, wc.status);
+				break;
+			case IB_WR_FAST_REG_MR:
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_READ:
+			case IB_WR_RDMA_READ_WITH_INV:
+				/* Nothing to be done - the SG list will be unmapped
+				 * when the SEND completes. */
+				break;
+			default:
+				if (printk_ratelimit())
+					printk(KERN_NOTICE
+						"RDS/IW: %s: unexpected opcode 0x%x in WR!\n",
+						__func__, send->s_wr.opcode);
+				break;
+			}
+
+			send->s_wr.opcode = 0xdead;
+			send->s_wr.num_sge = 1;
+			if (send->s_queued + HZ/2 < jiffies)
+				rds_iw_stats_inc(s_iw_tx_stalled);
+
+			/* If an RDMA operation produced an error, signal this right
+			 * away. If we don't, the subsequent SEND that goes with this
+			 * RDMA will be flushed with IB_WC_WR_FLUSH_ERR, and the
+			 * application will never learn that the RDMA failed. */
+			if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) {
+				struct rds_message *rm;
+
+				rm = rds_send_get_message(conn, send->s_op);
+				if (rm)
+					rds_iw_send_rdma_complete(rm, wc.status);
+			}
+
+			oldest = (oldest + 1) % ic->i_send_ring.w_nr;
+		}
+
+		rds_iw_ring_free(&ic->i_send_ring, completed);
+
+		if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)
+		 || test_bit(0, &conn->c_map_queued))
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+		/* We expect errors as the qp is drained during shutdown */
+		if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) {
+			rds_iw_conn_error(conn,
+				"send completion on %pI4 "
+				"had status %u, disconnecting and reconnecting\n",
+				&conn->c_faddr, wc.status);
+		}
+	}
+}
+
+/*
+ * This is the main function for allocating credits when sending
+ * messages.
+ *
+ * Conceptually, we have two counters:
+ *  -	send credits: this tells us how many WRs we're allowed
+ *	to submit without overrunning the receiver's queue. For
+ *	each SEND WR we post, we decrement this by one.
+ *
+ *  -	posted credits: this tells us how many WRs we recently
+ *	posted to the receive queue. This value is transferred
+ *	to the peer as a "credit update" in a RDS header field.
+ *	Every time we transmit credits to the peer, we subtract
+ *	the amount of transferred credits from this counter.
+ *
+ * It is essential that we avoid situations where both sides have
+ * exhausted their send credits, and are unable to send new credits
+ * to the peer. We achieve this by requiring that we send at least
+ * one credit update to the peer before exhausting our credits.
+ * When new credits arrive, we subtract one credit that is withheld
+ * until we've posted new buffers and are ready to transmit these
+ * credits (see rds_iw_send_add_credits below).
+ *
+ * The RDS send code is essentially single-threaded; rds_send_xmit
+ * grabs c_send_lock to ensure exclusive access to the send ring.
+ * However, the ACK sending code is independent and can race with
+ * message SENDs.
+ *
+ * In the send path, we need to update the counters for send credits
+ * and the counter of posted buffers atomically - when we use the
+ * last available credit, we cannot allow another thread to race us
+ * and grab the posted credits counter.  Hence, we have to use a
+ * spinlock to protect the credit counter, or use atomics.
+ *
+ * Spinlocks shared between the send and the receive path are bad,
+ * because they create unnecessary delays. An early implementation
+ * using a spinlock showed a 5% degradation in throughput at some
+ * loads.
+ *
+ * This implementation avoids spinlocks completely, putting both
+ * counters into a single atomic, and updating that atomic using
+ * atomic_add (in the receive path, when receiving fresh credits),
+ * and using atomic_cmpxchg when updating the two counters.
+ */
+int rds_iw_send_grab_credits(struct rds_iw_connection *ic,
+			     u32 wanted, u32 *adv_credits, int need_posted)
+{
+	unsigned int avail, posted, got = 0, advertise;
+	long oldval, newval;
+
+	*adv_credits = 0;
+	if (!ic->i_flowctl)
+		return wanted;
+
+try_again:
+	advertise = 0;
+	oldval = newval = atomic_read(&ic->i_credits);
+	posted = IB_GET_POST_CREDITS(oldval);
+	avail = IB_GET_SEND_CREDITS(oldval);
+
+	rdsdebug("rds_iw_send_grab_credits(%u): credits=%u posted=%u\n",
+			wanted, avail, posted);
+
+	/* The last credit must be used to send a credit update. */
+	if (avail && !posted)
+		avail--;
+
+	if (avail < wanted) {
+		struct rds_connection *conn = ic->i_cm_id->context;
+
+		/* Oops, there aren't that many credits left! */
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		got = avail;
+	} else {
+		/* Sometimes you get what you want, lalala. */
+		got = wanted;
+	}
+	newval -= IB_SET_SEND_CREDITS(got);
+
+	/*
+	 * If need_posted is non-zero, then the caller wants
+	 * the posted credits regardless of whether any send
+	 * credits are available.
+	 */
+	if (posted && (got || need_posted)) {
+		advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT);
+		newval -= IB_SET_POST_CREDITS(advertise);
+	}
+
+	/* Finally bill everything */
+	if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval)
+		goto try_again;
+
+	*adv_credits = advertise;
+	return got;
+}
+
+void rds_iw_send_add_credits(struct rds_connection *conn, unsigned int credits)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	if (credits == 0)
+		return;
+
+	rdsdebug("rds_iw_send_add_credits(%u): current=%u%s\n",
+			credits,
+			IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)),
+			test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : "");
+
+	atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits);
+	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+	WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384);
+
+	rds_iw_stats_inc(s_iw_rx_credit_updates);
+}
+
+void rds_iw_advertise_credits(struct rds_connection *conn, unsigned int posted)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	if (posted == 0)
+		return;
+
+	atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits);
+
+	/* Decide whether to send an update to the peer now.
+	 * If we would send a credit update for every single buffer we
+	 * post, we would end up with an ACK storm (ACK arrives,
+	 * consumes buffer, we refill the ring, send ACK to remote
+	 * advertising the newly posted buffer... ad infinitum).
+	 *
+	 * Performance pretty much depends on how often we send
+	 * credit updates - too frequent updates mean lots of ACKs.
+	 * Too infrequent updates, and the peer will run out of
+	 * credits and has to throttle.
+	 * For the time being, 16 seems to be a good compromise.
+	 */
+	if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16)
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+}
+
+static inline void
+rds_iw_xmit_populate_wr(struct rds_iw_connection *ic,
+		struct rds_iw_send_work *send, unsigned int pos,
+		unsigned long buffer, unsigned int length,
+		int send_flags)
+{
+	struct ib_sge *sge;
+
+	WARN_ON(pos != send - ic->i_sends);
+
+	send->s_wr.send_flags = send_flags;
+	send->s_wr.opcode = IB_WR_SEND;
+	send->s_wr.num_sge = 2;
+	send->s_wr.next = NULL;
+	send->s_queued = jiffies;
+	send->s_op = NULL;
+
+	if (length != 0) {
+		sge = rds_iw_data_sge(ic, send->s_sge);
+		sge->addr = buffer;
+		sge->length = length;
+		sge->lkey = rds_iw_local_dma_lkey(ic);
+
+		sge = rds_iw_header_sge(ic, send->s_sge);
+	} else {
+		/* We're sending a packet with no payload. There is only
+		 * one SGE */
+		send->s_wr.num_sge = 1;
+		sge = &send->s_sge[0];
+	}
+
+	sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header));
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = rds_iw_local_dma_lkey(ic);
+}
+
+/*
+ * This can be called multiple times for a given message.  The first time
+ * we see a message we map its scatterlist into the IB device so that
+ * we can provide that mapped address to the IB scatter gather entries
+ * in the IB work requests.  We translate the scatterlist into a series
+ * of work requests that fragment the message.  These work requests complete
+ * in order so we pass ownership of the message to the completion handler
+ * once we send the final fragment.
+ *
+ * The RDS core uses the c_send_lock to only enter this function once
+ * per connection.  This makes sure that the tx ring alloc/unalloc pairs
+ * don't get out of sync and confuse the ring.
+ */
+int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct rds_iw_send_work *send = NULL;
+	struct rds_iw_send_work *first;
+	struct rds_iw_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct scatterlist *scat;
+	u32 pos;
+	u32 i;
+	u32 work_alloc;
+	u32 credit_alloc;
+	u32 posted;
+	u32 adv_credits = 0;
+	int send_flags = 0;
+	int sent;
+	int ret;
+	int flow_controlled = 0;
+
+	BUG_ON(off % RDS_FRAG_SIZE);
+	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
+
+	/* Fastreg support */
+	if (rds_rdma_cookie_key(rm->m_rdma_cookie)
+	 && !ic->i_fastreg_posted) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	/* FIXME we may overallocate here */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0)
+		i = 1;
+	else
+		i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE);
+
+	work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc == 0) {
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		rds_iw_stats_inc(s_iw_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	credit_alloc = work_alloc;
+	if (ic->i_flowctl) {
+		credit_alloc = rds_iw_send_grab_credits(ic, work_alloc, &posted, 0);
+		adv_credits += posted;
+		if (credit_alloc < work_alloc) {
+			rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc);
+			work_alloc = credit_alloc;
+			flow_controlled++;
+		}
+		if (work_alloc == 0) {
+			rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+			rds_iw_stats_inc(s_iw_tx_throttle);
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/* map the message the first time we see it */
+	if (ic->i_rm == NULL) {
+		/*
+		printk(KERN_NOTICE "rds_iw_xmit prep msg dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(rm->m_inc.i_hdr.h_dport),
+				rm->m_inc.i_hdr.h_flags,
+				be32_to_cpu(rm->m_inc.i_hdr.h_len));
+		   */
+		if (rm->m_nents) {
+			rm->m_count = ib_dma_map_sg(dev,
+					 rm->m_sg, rm->m_nents, DMA_TO_DEVICE);
+			rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count);
+			if (rm->m_count == 0) {
+				rds_iw_stats_inc(s_iw_tx_sg_mapping_failure);
+				rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+				ret = -ENOMEM; /* XXX ? */
+				goto out;
+			}
+		} else {
+			rm->m_count = 0;
+		}
+
+		ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs;
+		ic->i_unsignaled_bytes = rds_iw_sysctl_max_unsig_bytes;
+		rds_message_addref(rm);
+		ic->i_rm = rm;
+
+		/* Finalize the header */
+		if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED;
+		if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED;
+
+		/* If it has an RDMA op, tell the peer we did it. This is
+		 * used by the peer to release use-once RDMA MRs. */
+		if (rm->m_rdma_op) {
+			struct rds_ext_header_rdma ext_hdr;
+
+			ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key);
+			rds_message_add_extension(&rm->m_inc.i_hdr,
+					RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr));
+		}
+		if (rm->m_rdma_cookie) {
+			rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr,
+					rds_rdma_cookie_key(rm->m_rdma_cookie),
+					rds_rdma_cookie_offset(rm->m_rdma_cookie));
+		}
+
+		/* Note - rds_iw_piggyb_ack clears the ACK_REQUIRED bit, so
+		 * we should not do this unless we have a chance of at least
+		 * sticking the header into the send ring. Which is why we
+		 * should call rds_iw_ring_alloc first. */
+		rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_iw_piggyb_ack(ic));
+		rds_message_make_checksum(&rm->m_inc.i_hdr);
+
+		/*
+		 * Update adv_credits since we reset the ACK_REQUIRED bit.
+		 */
+		rds_iw_send_grab_credits(ic, 0, &posted, 1);
+		adv_credits += posted;
+		BUG_ON(adv_credits > 255);
+	} else if (ic->i_rm != rm)
+		BUG();
+
+	send = &ic->i_sends[pos];
+	first = send;
+	prev = NULL;
+	scat = &rm->m_sg[sg];
+	sent = 0;
+	i = 0;
+
+	/* Sometimes you want to put a fence between an RDMA
+	 * READ and the following SEND.
+	 * We could either do this all the time
+	 * or when requested by the user. Right now, we let
+	 * the application choose.
+	 */
+	if (rm->m_rdma_op && rm->m_rdma_op->r_fence)
+		send_flags = IB_SEND_FENCE;
+
+	/*
+	 * We could be copying the header into the unused tail of the page.
+	 * That would need to be changed in the future when those pages might
+	 * be mapped userspace pages or page cache pages.  So instead we always
+	 * use a second sge and our long-lived ring of mapped headers.  We send
+	 * the header after the data so that the data payload can be aligned on
+	 * the receiver.
+	 */
+
+	/* handle a 0-len message */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) {
+		rds_iw_xmit_populate_wr(ic, send, pos, 0, 0, send_flags);
+		goto add_header;
+	}
+
+	/* if there's data reference it with a chain of work reqs */
+	for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) {
+		unsigned int len;
+
+		send = &ic->i_sends[pos];
+
+		len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off);
+		rds_iw_xmit_populate_wr(ic, send, pos,
+				ib_sg_dma_address(dev, scat) + off, len,
+				send_flags);
+
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time
+		 * on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		ic->i_unsignaled_bytes -= len;
+		if (ic->i_unsignaled_bytes <= 0) {
+			ic->i_unsignaled_bytes = rds_iw_sysctl_max_unsig_bytes;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		/*
+		 * Always signal the last one if we're stopping due to flow control.
+		 */
+		if (flow_controlled && i == (work_alloc-1))
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			 &send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		sent += len;
+		off += len;
+		if (off == ib_sg_dma_len(dev, scat)) {
+			scat++;
+			off = 0;
+		}
+
+add_header:
+		/* Tack on the header after the data. The header SGE should already
+		 * have been set up to point to the right header buffer. */
+		memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header));
+
+		if (0) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(hdr->h_dport),
+				hdr->h_flags,
+				be32_to_cpu(hdr->h_len));
+		}
+		if (adv_credits) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			/* add credit and redo the header checksum */
+			hdr->h_credit = adv_credits;
+			rds_message_make_checksum(hdr);
+			adv_credits = 0;
+			rds_iw_stats_inc(s_iw_tx_credit_updates);
+		}
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+		prev = send;
+
+		pos = (pos + 1) % ic->i_send_ring.w_nr;
+	}
+
+	/* Account the RDS header in the number of bytes we sent, but just once.
+	 * The caller has no concept of fragmentation. */
+	if (hdr_off == 0)
+		sent += sizeof(struct rds_header);
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &rm->m_sg[rm->m_count]) {
+		prev->s_rm = ic->i_rm;
+		prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		ic->i_rm = NULL;
+	}
+
+	if (i < work_alloc) {
+		rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+	if (ic->i_flowctl && i < credit_alloc)
+		rds_iw_send_add_credits(conn, credit_alloc - i);
+
+	/* XXX need to worry about failed_wr and partial sends. */
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IW: ib_post_send to %pI4 "
+		       "returned %d\n", &conn->c_faddr, ret);
+		rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+		if (prev->s_rm) {
+			ic->i_rm = prev->s_rm;
+			prev->s_rm = NULL;
+		}
+		goto out;
+	}
+
+	ret = sent;
+out:
+	BUG_ON(adv_credits);
+	return ret;
+}
+
+static void rds_iw_build_send_fastreg(struct rds_iw_device *rds_iwdev, struct rds_iw_connection *ic, struct rds_iw_send_work *send, int nent, int len, u64 sg_addr)
+{
+	BUG_ON(nent > send->s_page_list->max_page_list_len);
+	/*
+	 * Perform a WR for the fast_reg_mr. Each individual page
+	 * in the sg list is added to the fast reg page list and placed
+	 * inside the fast_reg_mr WR.
+	 */
+	send->s_wr.opcode = IB_WR_FAST_REG_MR;
+	send->s_wr.wr.fast_reg.length = len;
+	send->s_wr.wr.fast_reg.rkey = send->s_mr->rkey;
+	send->s_wr.wr.fast_reg.page_list = send->s_page_list;
+	send->s_wr.wr.fast_reg.page_list_len = nent;
+	send->s_wr.wr.fast_reg.page_shift = rds_iwdev->page_shift;
+	send->s_wr.wr.fast_reg.access_flags = IB_ACCESS_REMOTE_WRITE;
+	send->s_wr.wr.fast_reg.iova_start = sg_addr;
+
+	ib_update_fast_reg_key(send->s_mr, send->s_remap_count++);
+}
+
+int rds_iw_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+	struct rds_iw_send_work *send = NULL;
+	struct rds_iw_send_work *first;
+	struct rds_iw_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct rds_iw_device *rds_iwdev;
+	struct scatterlist *scat;
+	unsigned long len;
+	u64 remote_addr = op->r_remote_addr;
+	u32 pos, fr_pos;
+	u32 work_alloc;
+	u32 i;
+	u32 j;
+	int sent;
+	int ret;
+	int num_sge;
+
+	rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client);
+
+	/* map the message the first time we see it */
+	if (!op->r_mapped) {
+		op->r_count = ib_dma_map_sg(ic->i_cm_id->device,
+					op->r_sg, op->r_nents, (op->r_write) ?
+					DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count);
+		if (op->r_count == 0) {
+			rds_iw_stats_inc(s_iw_tx_sg_mapping_failure);
+			ret = -ENOMEM; /* XXX ? */
+			goto out;
+		}
+
+		op->r_mapped = 1;
+	}
+
+	if (!op->r_write) {
+		/* Alloc space on the send queue for the fastreg */
+		work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, 1, &fr_pos);
+		if (work_alloc != 1) {
+			rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+			rds_iw_stats_inc(s_iw_tx_ring_full);
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/*
+	 * Instead of knowing how to return a partial rdma read/write we insist that there
+	 * be enough work requests to send the entire message.
+	 */
+	i = ceil(op->r_count, rds_iwdev->max_sge);
+
+	work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc != i) {
+		rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+		rds_iw_stats_inc(s_iw_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	send = &ic->i_sends[pos];
+	if (!op->r_write) {
+		first = prev = &ic->i_sends[fr_pos];
+	} else {
+		first = send;
+		prev = NULL;
+	}
+	scat = &op->r_sg[0];
+	sent = 0;
+	num_sge = op->r_count;
+
+	for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) {
+		send->s_wr.send_flags = 0;
+		send->s_queued = jiffies;
+
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags = IB_SEND_SIGNALED;
+		}
+
+		/* To avoid the need to have the plumbing to invalidate the fastreg_mr used
+		 * for local access after RDS is finished with it, using
+		 * IB_WR_RDMA_READ_WITH_INV will invalidate it after the read has completed.
+		 */
+		if (op->r_write)
+			send->s_wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			send->s_wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+
+		send->s_wr.wr.rdma.remote_addr = remote_addr;
+		send->s_wr.wr.rdma.rkey = op->r_key;
+		send->s_op = op;
+
+		if (num_sge > rds_iwdev->max_sge) {
+			send->s_wr.num_sge = rds_iwdev->max_sge;
+			num_sge -= rds_iwdev->max_sge;
+		} else
+			send->s_wr.num_sge = num_sge;
+
+		send->s_wr.next = NULL;
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+
+		for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) {
+			len = ib_sg_dma_len(ic->i_cm_id->device, scat);
+
+			if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV)
+				send->s_page_list->page_list[j] = ib_sg_dma_address(ic->i_cm_id->device, scat);
+			else {
+				send->s_sge[j].addr = ib_sg_dma_address(ic->i_cm_id->device, scat);
+				send->s_sge[j].length = len;
+				send->s_sge[j].lkey = rds_iw_local_dma_lkey(ic);
+			}
+
+			sent += len;
+			rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr);
+			remote_addr += len;
+
+			scat++;
+		}
+
+		if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV) {
+			send->s_wr.num_sge = 1;
+			send->s_sge[0].addr = conn->c_xmit_rm->m_rs->rs_user_addr;
+			send->s_sge[0].length = conn->c_xmit_rm->m_rs->rs_user_bytes;
+			send->s_sge[0].lkey = ic->i_sends[fr_pos].s_mr->lkey;
+		}
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			&send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		prev = send;
+		if (++send == &ic->i_sends[ic->i_send_ring.w_nr])
+			send = ic->i_sends;
+	}
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &op->r_sg[op->r_count])
+		first->s_wr.send_flags = IB_SEND_SIGNALED;
+
+	if (i < work_alloc) {
+		rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+
+	/* On iWARP, local memory access by a remote system (ie, RDMA Read) is not
+	 * recommended.  Putting the lkey on the wire is a security hole, as it can
+	 * allow for memory access to all of memory on the remote system.  Some
+	 * adapters do not allow using the lkey for this at all.  To bypass this use a
+	 * fastreg_mr (or possibly a dma_mr)
+	 */
+	if (!op->r_write) {
+		rds_iw_build_send_fastreg(rds_iwdev, ic, &ic->i_sends[fr_pos],
+			op->r_count, sent, conn->c_xmit_rm->m_rs->rs_user_addr);
+		work_alloc++;
+	}
+
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IW: rdma ib_post_send to %pI4 "
+		       "returned %d\n", &conn->c_faddr, ret);
+		rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc);
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+void rds_iw_xmit_complete(struct rds_connection *conn)
+{
+	struct rds_iw_connection *ic = conn->c_transport_data;
+
+	/* We may have a pending ACK or window update we were unable
+	 * to send previously (due to flow control). Try again. */
+	rds_iw_attempt_ack(ic);
+}
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
new file mode 100644
index 0000000..ccc7e8f
--- /dev/null
+++ b/net/rds/iw_stats.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+#include "iw.h"
+
+DEFINE_PER_CPU(struct rds_iw_statistics, rds_iw_stats) ____cacheline_aligned;
+
+static char *rds_iw_stat_names[] = {
+	"iw_connect_raced",
+	"iw_listen_closed_stale",
+	"iw_tx_cq_call",
+	"iw_tx_cq_event",
+	"iw_tx_ring_full",
+	"iw_tx_throttle",
+	"iw_tx_sg_mapping_failure",
+	"iw_tx_stalled",
+	"iw_tx_credit_updates",
+	"iw_rx_cq_call",
+	"iw_rx_cq_event",
+	"iw_rx_ring_empty",
+	"iw_rx_refill_from_cq",
+	"iw_rx_refill_from_thread",
+	"iw_rx_alloc_limit",
+	"iw_rx_credit_updates",
+	"iw_ack_sent",
+	"iw_ack_send_failure",
+	"iw_ack_send_delayed",
+	"iw_ack_send_piggybacked",
+	"iw_ack_received",
+	"iw_rdma_mr_alloc",
+	"iw_rdma_mr_free",
+	"iw_rdma_mr_used",
+	"iw_rdma_mr_pool_flush",
+	"iw_rdma_mr_pool_wait",
+	"iw_rdma_mr_pool_depleted",
+};
+
+unsigned int rds_iw_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail)
+{
+	struct rds_iw_statistics stats = {0, };
+	uint64_t *src;
+	uint64_t *sum;
+	size_t i;
+	int cpu;
+
+	if (avail < ARRAY_SIZE(rds_iw_stat_names))
+		goto out;
+
+	for_each_online_cpu(cpu) {
+		src = (uint64_t *)&(per_cpu(rds_iw_stats, cpu));
+		sum = (uint64_t *)&stats;
+		for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++)
+			*(sum++) += *(src++);
+	}
+
+	rds_stats_info_copy(iter, (uint64_t *)&stats, rds_iw_stat_names,
+			    ARRAY_SIZE(rds_iw_stat_names));
+out:
+	return ARRAY_SIZE(rds_iw_stat_names);
+}
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
new file mode 100644
index 0000000..9590678
--- /dev/null
+++ b/net/rds/iw_sysctl.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/sysctl.h>
+#include <linux/proc_fs.h>
+
+#include "iw.h"
+
+static struct ctl_table_header *rds_iw_sysctl_hdr;
+
+unsigned long rds_iw_sysctl_max_send_wr = RDS_IW_DEFAULT_SEND_WR;
+unsigned long rds_iw_sysctl_max_recv_wr = RDS_IW_DEFAULT_RECV_WR;
+unsigned long rds_iw_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE;
+static unsigned long rds_iw_sysctl_max_wr_min = 1;
+/* hardware will fail CQ creation long before this */
+static unsigned long rds_iw_sysctl_max_wr_max = (u32)~0;
+
+unsigned long rds_iw_sysctl_max_unsig_wrs = 16;
+static unsigned long rds_iw_sysctl_max_unsig_wr_min = 1;
+static unsigned long rds_iw_sysctl_max_unsig_wr_max = 64;
+
+unsigned long rds_iw_sysctl_max_unsig_bytes = (16 << 20);
+static unsigned long rds_iw_sysctl_max_unsig_bytes_min = 1;
+static unsigned long rds_iw_sysctl_max_unsig_bytes_max = ~0UL;
+
+unsigned int rds_iw_sysctl_flow_control = 1;
+
+ctl_table rds_iw_sysctl_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_send_wr",
+		.data		= &rds_iw_sysctl_max_send_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_iw_sysctl_max_wr_min,
+		.extra2		= &rds_iw_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_wr",
+		.data		= &rds_iw_sysctl_max_recv_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_iw_sysctl_max_wr_min,
+		.extra2		= &rds_iw_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_wr",
+		.data		= &rds_iw_sysctl_max_unsig_wrs,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_iw_sysctl_max_unsig_wr_min,
+		.extra2		= &rds_iw_sysctl_max_unsig_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_bytes",
+		.data		= &rds_iw_sysctl_max_unsig_bytes,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_iw_sysctl_max_unsig_bytes_min,
+		.extra2		= &rds_iw_sysctl_max_unsig_bytes_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_allocation",
+		.data		= &rds_iw_sysctl_max_recv_allocation,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "flow_control",
+		.data		= &rds_iw_sysctl_flow_control,
+		.maxlen		= sizeof(rds_iw_sysctl_flow_control),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{ .ctl_name = 0}
+};
+
+static struct ctl_path rds_iw_sysctl_path[] = {
+	{ .procname = "net", .ctl_name = CTL_NET, },
+	{ .procname = "rds", .ctl_name = CTL_UNNUMBERED, },
+	{ .procname = "iw", .ctl_name = CTL_UNNUMBERED, },
+	{ }
+};
+
+void rds_iw_sysctl_exit(void)
+{
+	if (rds_iw_sysctl_hdr)
+		unregister_sysctl_table(rds_iw_sysctl_hdr);
+}
+
+int __init rds_iw_sysctl_init(void)
+{
+	rds_iw_sysctl_hdr = register_sysctl_paths(rds_iw_sysctl_path, rds_iw_sysctl_table);
+	if (rds_iw_sysctl_hdr == NULL)
+		return -ENOMEM;
+	return 0;
+}
-- 
1.5.6.3
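
The send paths in the patch above request a completion only once per batch of work requests, bounded by both a WR count and a byte count. As a rough user-space sketch of that counter logic (the struct and function names here are hypothetical and not part of the patch; the extra rules that force a signal on the last WR of a message or under flow control are left out):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-connection counters used in the patch. */
struct signal_state {
	int unsignaled_wrs;     /* WRs left before the next forced signal */
	long unsignaled_bytes;  /* bytes left before the next forced signal */
	int max_unsig_wrs;      /* reload value, cf. rds_iw_sysctl_max_unsig_wrs */
	long max_unsig_bytes;   /* reload value, cf. rds_iw_sysctl_max_unsig_bytes */
};

/* Returns true when a work request of `len` bytes should ask for a completion. */
bool should_signal(struct signal_state *st, long len)
{
	bool signal = false;

	assert(len >= 0);

	/* WR budget: post-decrement, so the reload happens on the WR after zero. */
	if (st->unsignaled_wrs-- == 0) {
		st->unsignaled_wrs = st->max_unsig_wrs;
		signal = true;
	}

	/* Byte budget: reload once the outstanding unsignaled bytes run out. */
	st->unsignaled_bytes -= len;
	if (st->unsignaled_bytes <= 0) {
		st->unsignaled_bytes = st->max_unsig_bytes;
		signal = true;
	}

	return signal;
}

/* Counts how many of `n` equally sized WRs would be signaled. */
int count_signaled(struct signal_state *st, int n, long len)
{
	int signaled = 0;

	for (int i = 0; i < n; i++)
		if (should_signal(st, len))
			signaled++;
	return signaled;
}
```

With a WR budget of 2 and small messages, this signals every third WR; with a large WR budget, the byte budget decides instead.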


From andy.grover at oracle.com  Tue Feb 24 17:30:43 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Tue, 24 Feb 2009 17:30:43 -0800
Subject: [ofa-general] [PATCH 26/26] RDS: Add RDS to AF key strings
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <1235525443-9007-27-git-send-email-andy.grover@oracle.com>

Signed-off-by: Andy Grover <andy.grover at oracle.com>
---
 net/core/sock.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 8ee734e..7c6d089 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -155,7 +155,7 @@ static const char *af_family_key_strings[AF_MAX+1] = {
   "sk_lock-27"       , "sk_lock-28"          , "sk_lock-AF_CAN"      ,
   "sk_lock-AF_TIPC"  , "sk_lock-AF_BLUETOOTH", "sk_lock-IUCV"        ,
   "sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN"     , "sk_lock-AF_PHONET"   ,
-  "sk_lock-AF_MAX"
+  "sk_lock-AF_RDS"   , "sk_lock-AF_MAX"
 };
 static const char *af_family_slock_key_strings[AF_MAX+1] = {
   "slock-AF_UNSPEC", "slock-AF_UNIX"     , "slock-AF_INET"     ,
@@ -170,7 +170,7 @@ static const char *af_family_slock_key_strings[AF_MAX+1] = {
   "slock-27"       , "slock-28"          , "slock-AF_CAN"      ,
   "slock-AF_TIPC"  , "slock-AF_BLUETOOTH", "slock-AF_IUCV"     ,
   "slock-AF_RXRPC" , "slock-AF_ISDN"     , "slock-AF_PHONET"   ,
-  "slock-AF_MAX"
+  "slock-AF_RDS"   , "slock-AF_MAX"
 };
 static const char *af_family_clock_key_strings[AF_MAX+1] = {
   "clock-AF_UNSPEC", "clock-AF_UNIX"     , "clock-AF_INET"     ,
@@ -185,7 +185,7 @@ static const char *af_family_clock_key_strings[AF_MAX+1] = {
   "clock-27"       , "clock-28"          , "clock-AF_CAN"      ,
   "clock-AF_TIPC"  , "clock-AF_BLUETOOTH", "clock-AF_IUCV"     ,
   "clock-AF_RXRPC" , "clock-AF_ISDN"     , "clock-AF_PHONET"   ,
-  "clock-AF_MAX"
+  "clock-AF_RDS"   , "clock-AF_MAX"
 };
 
 /*
-- 
1.5.6.3


From phillipwils at gmail.com  Tue Feb 24 21:51:31 2009
From: phillipwils at gmail.com (Phillip Wilson)
Date: Tue, 24 Feb 2009 21:51:31 -0800
Subject: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory region)
	function call fails under load when using the mlx4 driver
Message-ID: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com>

The “ibv_reg_mr()” function call fails with the HCA (DID=0x634A) that uses
the mlx4_0 driver when the system is under load (memory and CPU).  The system
usually has over 500MB of system memory when the “ibv_reg_mr()” call fails.


If I run only the HCA (DID=0x6278) that uses the mthca0 driver, together with
the other tools that generate stress, the “ibv_reg_mr()” call always passes.
If I run only the HCA (DID=0x634A), together with the other tools that
generate stress, the “ibv_reg_mr()” call always fails; it usually takes less
than 30 minutes for the failure to occur.


The maximum number of memory regions requested at one time is up to 8 (32MB)
with two HCA dual port cards and the maximum size for a memory region is 1
MB.


(i.e. ctx->mr = ibv_reg_mr(ctx->pd,
                           buffer,  /* malloc 4MB buffer per process */
                           size,    /* 2 bytes to 1MB */
                           IBV_ACCESS_LOCAL_WRITE);
)


I modified the ibv_rc_pingpong test to use a parent-child paradigm instead
of the current client/server approach for my environment.  The code forks a
parent process and a child process per port, which serves the same purpose as
the current client/server approach.  The code also forks a process to run on
an HCA.  Basically, the same code is executed on each HCA except for the user
libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and the firmware on
each HCA.


Since the code in the two user libraries is very similar, I suspect the
issue is in the kernel code or HCA firmware.


Does anyone know which kernel patch, somewhere from kernel 2.6.24 through
2.6.28, fixes this issue?  Has anyone else seen this issue?


System Information:

The system has 4GB of memory.

uname -a
Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64 unknown

OFED 1.2.5

lspci -d 15b3:
0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)

lspci -d 15b3: -n
0000:10:00.0 0c06: 15b3:6278 (rev 20)
0000:c3:00.0 0c06: 15b3:634a (rev a0)

ibv_devinfo -v
hca_id: mlx4_0
        fw_ver:                         2.5.000

hca_id: mthca0
        fw_ver:                         4.8.930

From davem at davemloft.net  Tue Feb 24 23:26:48 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 24 Feb 2009 23:26:48 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 23/26] RDS: Add AF and PF #defines for RDS
	sockets
In-Reply-To: <1235525443-9007-24-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<1235525443-9007-24-git-send-email-andy.grover@oracle.com>
Message-ID: <20090224.232648.84203571.davem@davemloft.net>

From: Andy Grover <andy.grover at oracle.com>
Date: Tue, 24 Feb 2009 17:30:40 -0800

> @@ -191,7 +191,8 @@ struct ucred {
>  #define AF_RXRPC	33	/* RxRPC sockets 		*/
>  #define AF_ISDN		34	/* mISDN sockets 		*/
>  #define AF_PHONET	35	/* Phonet sockets		*/
> -#define AF_MAX		36	/* For now.. */
> +#define AF_RDS		36	/* RDS sockets 			*/
> +#define AF_MAX		37	/* For now.. */
>  
>  /* Protocol families, same as address families. */
>  #define PF_UNSPEC	AF_UNSPEC

Pick an unused number, you don't have to increment AF_MAX
to allocate a value.

And I don't want to hear any whining about how you've
used this value of 36 internally for a long time or
anything like that.
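
To illustrate the point with a toy table: the key-string arrays touched by patch 26/26 are sized AF_MAX+1 and already carry numeric placeholders such as "sk_lock-27" and "sk_lock-28", so a new family can take a free slot without moving AF_MAX. The sketch below uses made-up TOY_* numbers and names throughout; it is not the real socket.h layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy family numbers (hypothetical); slot 3 was previously unused. */
#define TOY_AF_UNSPEC 0
#define TOY_AF_UNIX   1
#define TOY_AF_INET   2
#define TOY_AF_RDS    3	/* new family takes the free slot */
#define TOY_AF_PHONET 4
#define TOY_AF_MAX    5	/* unchanged by adding TOY_AF_RDS */

static const char *toy_key_strings[TOY_AF_MAX + 1] = {
	[TOY_AF_UNSPEC] = "sk_lock-AF_UNSPEC",
	[TOY_AF_UNIX]   = "sk_lock-AF_UNIX",
	[TOY_AF_INET]   = "sk_lock-AF_INET",
	[TOY_AF_RDS]    = "sk_lock-AF_RDS",	/* replaces a placeholder like "sk_lock-3" */
	[TOY_AF_PHONET] = "sk_lock-AF_PHONET",
	[TOY_AF_MAX]    = "sk_lock-AF_MAX",
};

/* Number of entries in the table; stays TOY_AF_MAX + 1. */
size_t toy_table_len(void)
{
	return sizeof(toy_key_strings) / sizeof(toy_key_strings[0]);
}

/* Returns the family number whose key string matches `name`, or -1. */
int toy_lookup(const char *name)
{
	assert(name != NULL);

	for (size_t i = 0; i < toy_table_len(); i++)
		if (strcmp(toy_key_strings[i], name) == 0)
			return (int)i;
	return -1;
}
```

Filling the hole changes one table entry and one #define; the table length and every other family number stay as they were.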


From davem at davemloft.net  Tue Feb 24 23:28:14 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 24 Feb 2009 23:28:14 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <20090224.232814.227017310.davem@davemloft.net>

From: Andy Grover <andy.grover at oracle.com>
Date: Tue, 24 Feb 2009 17:30:17 -0800

> This patchset against net-next adds support for RDS sockets. RDS is an
> Oracle-originated protocol used to send IPC datagrams (up to 1MB)
> reliably, and is used currently in Oracle RAC and Exadata products. 
> 
> I've addressed all the issues from comments on take 1. (thanks!) This patchset
> squashes the changes into the original changeset, but I've also included
> a tree where the un-squashed changes since last time may be reviewed:
> git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6.git
> rds-broken-out-fixes
> 
> Major changes since last time include moving to net/rds, and the
> additional inclusion of iwarp transport support.

This makes RDMA too much of a first-class citizen in the networking
stack.  That's a blocker for me.

Furthermore, the port you've chosen for the protocol is arbitrary, not
properly allocated with the appropriate standards committee, and
therefore could conflict with something other people are using.

I'm rejecting these patches, sorry.


From dotanba at gmail.com  Tue Feb 24 23:50:54 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Wed, 25 Feb 2009 09:50:54 +0200
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory
	region) function call fails under load when using the mlx4 driver
In-Reply-To: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com>
References: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com>
Message-ID: <2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com>

Do you execute your program under the root user or under any other user?
(maybe you fail because of the ulimit value of memory which can be pinned)


Dotan
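
Memory registered with ibv_reg_mr() is pinned, so for an unprivileged process it counts against RLIMIT_MEMLOCK (the "locked memory" line of ulimit -a). A minimal sketch of checking that limit from inside the test program, using only POSIX getrlimit(); the helper name memlock_allows is an assumption made up for this example:

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Returns 1 if `bytes` fit under the soft locked-memory limit,
 * 0 if they do not, and -1 if the limit could not be read.
 * Hypothetical helper, not part of libibverbs.
 */
int memlock_allows(unsigned long long bytes)
{
	struct rlimit rl;

	/* The cast below assumes rlim_t is no wider than unsigned long long. */
	assert(sizeof(rl.rlim_cur) <= sizeof(unsigned long long));

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return bytes <= (unsigned long long)rl.rlim_cur ? 1 : 0;
}
```

Calling memlock_allows() with the total size about to be registered, just before ibv_reg_mr(), would show whether the limit is a plausible cause. It only compares against the limit itself and does not account for memory the process has already locked.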

On Wed, Feb 25, 2009 at 7:51 AM, Phillip Wilson <phillipwils at gmail.com> wrote:
> The “ibv_reg_mr()” function call fails with HCA (DID=0x634A) that uses the
> mlx4_0 driver when the system is under load (memory and cpu).  The system
> usually has over 500MB of system memory when “ibv_reg_mr()” call fails.
>
>
>
> If I only run one HCA with (DID=0x6278) that uses the mthca0 driver with the
> other tools to generate stress the “ibv_reg_mr()” call always passes.  If I
> only run the HCA with (DID=0x634A) with the other tools to generate stress
> the “ibv_reg_mr()” call will always fails; it usually takes less than 30
> minutes for the failure to occur.
>
>
>
>
>
> The maximum number of memory regions requested at one time is up to 8 (32MB)
> with two HCA dual port cards and the maximum size for a memory region is 1
> MB.
>
>
>
> (i.e. ctx->mr = ibv_reg_mr(ctx->pd,
>                            buffer,  /* malloc 4MB buffer per process */
>                            size,    /* 2 bytes to 1MB */
>                            IBV_ACCESS_LOCAL_WRITE);
> )
>
>
>
> I modified the ibv_rc_pingpong test to use the parent-child paradigm instead
> of the current client/server approach for my environment.  The code forks a
> parent process and a child process per port which serves the same purpose as
> the current client/server approach.  The code also forks a process to run on
> a HCA.  Basically, the same code is executed on each HCA except for the user
> libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and firmware on each
> HCA.
>
>
>
> Since the code in the user libraries is very similar to each other, I
> suspect the issue is in the kernel code or HCA firmware.
>
>
>
> Does any one know what kernel patch fixes this issue starting from kernel
> 2.6.24 through 2.6.28?  Has anyone else seen this issue?
>
>
>
> System Information:
>
>
>
> The system has 4GB of memory.
>
>
>
> uname -a
>
> Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64
> unknown
>
>
>
> OFED 1.2.5
>
>
>
> lspci -d 15b3:
>
>
>
> 0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
> (Tavor compatibility mode) (rev 20)
>
> 0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
>
>
>
> lspci -d 15b3: -n
>
> 0000:10:00.0 0c06: 15b3:6278 (rev 20)
>
> 0000:c3:00.0 0c06: 15b3:634a (rev a0)
>
>
>
> ibv_devinfo -v
>
> hca_id: mlx4_0
>
>         fw_ver:                         2.5.000
>
>
>
> hca_id: mthca0
>
>         fw_ver:                         4.8.930
>


From ogerlitz at voltaire.com  Wed Feb 25 00:04:09 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 25 Feb 2009 10:04:09 +0200
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <20090224.232814.227017310.davem@davemloft.net>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
Message-ID: <49A4FB79.1090809@voltaire.com>

David Miller wrote:
>> Major changes since last time include moving to net/rds, and the additional inclusion of iwarp transport support.
>>     
> This makes RDMA too much of a first-class citizen in the networking stack.  That's a blocker for me.
>   
Hi Dave

Can you elaborate a bit further? I wasn't sure whether your comment
was related to the inclusion of iwarp or to something else.

Or.


From davem at davemloft.net  Wed Feb 25 00:06:28 2009
From: davem at davemloft.net (David Miller)
Date: Wed, 25 Feb 2009 00:06:28 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <49A4FB79.1090809@voltaire.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
	<49A4FB79.1090809@voltaire.com>
Message-ID: <20090225.000628.108688119.davem@davemloft.net>

From: Or Gerlitz <ogerlitz at voltaire.com>
Date: Wed, 25 Feb 2009 10:04:09 +0200

> Can you elaborate a bit further? I wasn't sure whether your
> comment was related to the inclusion of iwarp or to something else.

It's making real sockets, using the real networking stack,
using up real IP port/address pairs recognized by the rest
of the real networking stack, and doing RDMA over that
connection.

That's not allowed.

We always said that if these RDMA things are in the tree,
they should use their own IP addresses, ones that are not
visible to the real Linux networking stack.


From phillipwils at gmail.com  Wed Feb 25 00:29:34 2009
From: phillipwils at gmail.com (Phillip Wilson)
Date: Wed, 25 Feb 2009 00:29:34 -0800
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory
	region) function call fails under load when using the mlx4 driver
In-Reply-To: <2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com>
References: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com>
	<2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com>
Message-ID: <6e4f44220902250029r65ba36d8me002e916f638d443@mail.gmail.com>

All programs are executed as the root user.

ulimit -a

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
coredump(blocks)     0
memory(kbytes)       unlimited
locked memory(kbytes) unlimited
process              8063
nofiles              1048576
vmemory(kbytes)      unlimited
locks                unlimited


On Tue, Feb 24, 2009 at 11:50 PM, Dotan Barak <dotanba at gmail.com> wrote:

> Do you execute your program under the root user or under any other user?
> (maybe you fail because of the ulimit value of memory which can be pinned)
>
>
> Dotan
>
> On Wed, Feb 25, 2009 at 7:51 AM, Phillip Wilson <phillipwils at gmail.com>
> wrote:
> > The “ibv_reg_mr()” function call fails with HCA (DID=0x634A) that uses
> the
> > mlx4_0 driver when the system is under load (memory and cpu).  The system
> > usually has over 500MB of system memory when “ibv_reg_mr()” call fails.
> >
> >
> >
> > If I only run one HCA with (DID=0x6278) that uses the mthca0 driver with
> the
> > other tools to generate stress the “ibv_reg_mr()” call always passes.  If
> I
> > only run the HCA with (DID=0x634A) with the other tools to generate
> stress
> > the “ibv_reg_mr()” call will always fails; it usually takes less than 30
> > minutes for the failure to occur.
> >
> >
> >
> >
> >
> > The maximum number of memory regions requested at one time is up to 8
> (32MB)
> > with two HCA dual port cards and the maximum size for a memory region is
> 1
> > MB.
> >
> >
> >
> > (i.e. ctx->mr = ibv_reg_mr(ctx->pd,
> >                            buffer,  /* malloc 4MB buffer per process */
> >                            size,    /* 2 bytes to 1MB */
> >                            IBV_ACCESS_LOCAL_WRITE);
> > )
> >
> >
> >
> > I modified the ibv_rc_pingpong test to use the parent-child paradigm
> instead
> > of the current client/server approach for my environment.  The code forks
> a
> > parent process and a child process per port which serves the same purpose
> as
> > the current client/server approach.  The code also forks a process to run
> on
> > a HCA.  Basically, the same code is executed on each HCA except for the
> user
> > libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and firmware on
> each
> > HCA.
> >
> >
> >
> > Since the code in the user libraries is very similar to each other, I
> > suspect the issue is in the kernel code or HCA firmware.
> >
> >
> >
> > Does any one know what kernel patch fixes this issue starting from kernel
> > 2.6.24 through 2.6.28?  Has anyone else seen this issue?
> >
> >
> >
> > System Information:
> >
> >
> >
> > The system has 4GB of memory.
> >
> >
> >
> > uname -a
> >
> > Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64
> > unknown
> >
> >
> >
> > OFED 1.2.5
> >
> >
> >
> > lspci -d 15b3:
> >
> >
> >
> > 0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
> >
> > 0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
> >
> >
> >
> > lspci -d 15b3: -n
> >
> > 0000:10:00.0 0c06: 15b3:6278 (rev 20)
> >
> > 0000:c3:00.0 0c06: 15b3:634a (rev a0)
> >
> >
> >
> > ibv_devinfo -v
> >
> > hca_id: mlx4_0
> >
> >         fw_ver:                         2.5.000
> >
> >
> >
> > hca_id: mthca0
> >
> >         fw_ver:                         4.8.930
> >
>

From ogerlitz at Voltaire.com  Wed Feb 25 01:46:00 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 25 Feb 2009 11:46:00 +0200
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <20090225.000628.108688119.davem@davemloft.net>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>	<20090224.232814.227017310.davem@davemloft.net>	<49A4FB79.1090809@voltaire.com>
	<20090225.000628.108688119.davem@davemloft.net>
Message-ID: <49A51358.4080408@Voltaire.com>

David Miller wrote:
> It's making real sockets, using the real networking stack,
> using up real IP port/address pairs recognized by the rest
> of the real networking stack, and doing RDMA over that connection.

The only use the RDMA stack makes of the network stack (in its RDMA connection manager) on behalf of protocols such as RDS is for address resolution. This practice of sending ARPs has been supported by the mainline kernel for a long time and is also common among other technologies and drivers.

Or.


From vlad at lists.openfabrics.org  Wed Feb 25 03:28:20 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 25 Feb 2009 03:28:20 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090225-0200 daily build status
Message-ID: <20090225112820.B22F9E60FC3@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From kliteyn at dev.mellanox.co.il  Wed Feb 25 04:25:08 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 25 Feb 2009 14:25:08 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp
 for the newly discovered port of the known node
In-Reply-To: <20090224143706.GO7641@sashak.voltaire.com>
References: <499AB068.2020205@dev.mellanox.co.il>
	<20090218181955.GX5910@sashak.voltaire.com>
	<499C7E2D.8050301@dev.mellanox.co.il>
	<20090224143706.GO7641@sashak.voltaire.com>
Message-ID: <49A538A4.3070008@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 23:31 Wed 18 Feb     , Yevgeny Kliteynik wrote:
> 
> [snip...]
>> Good point.
>> I'll repost the patch when we finish discussing it.
> 
> Let's go this way now. Please resend the patch.

Will do.

> After looking more closely at the scenario with the SwitchInfo/PortInfo race, I'm
> thinking about two optimizations there:
> 
> 1. Initialize all switch ports (and not only local and port 0) right on
> first NodeInfo receiving (via osm_node_new()) - this makes your patch
> unnecessary, but it is a bigger change which will definitely require some
> heavy testing, so it is fine IMO to do it subsequently.
>
> 2. Request PortInfo for all switch ports right on first NodeInfo
> receiving (not wait for SwitchInfo), just in parallel with SwitchInfo
> request. This should simplify subnet discovery flow and speed it up.
> And also this will require some heavy testing...
> 
> What do you think about (1) and (2)? Do you see any disadvantages?

I don't see any.

The first option looks shorter and somewhat safer, but the second
option might speed up the discovery a little bit. I'm OK with both options.
In any case, this will have to be seriously tested.

-- Yevgeny

> Sasha
> 


From cameron at harr.org  Wed Feb 25 08:31:18 2009
From: cameron at harr.org (Cameron Harr)
Date: Wed, 25 Feb 2009 09:31:18 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A4812A.8050202@harr.org>
References: <48E386F6.5040502@fusionio.com>
	<48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org>
	<48ED3489.4030905@harr.org>
	<48F79CF8.3010905@vlnb.net>
	<48FE6C84.7030300@harr.org>
	<48FEDA26.4080304@vlnb.net>
	<48FF2D1A.8000101@harr.org>
	<48FF5F42.2050902@vlnb.net>
	<48FF60D3.9020809@harr.org>
	<4901F14C.6000006@harr.org>
	<490210EE.2070000@vlnb.net>
	<49022553.1020804@harr.org>
	<490B45ED.3020203@vlnb.net>
	<4910A622.4050906@harr.org>
	<4911D827.10705@vlnb.net>
	<49121715.4040804@harr.org>
	<4912C684.5000505@vlnb.net>
	<491307C7.50008@harr.org>
	<49131A85.2010102@vlnb.net>
	<49189567.1010804@harr.org>
	<49258122.6040808@vlnb.net>
	<496687DA.6010707@harr.org>
	<496B98DF.4050305@vlnb.net>
	<496BD8CA.7050503@harr.org>
	<496C81E3.2050105@vlnb.net>
	<496CC493.3040207@harr.org>
	<496CD883.8040906@vlnb.net>
	<496CDFE0.2030601@harr.org>
	<4970F014.2030101@vlnb.net>
	<4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net>
	<49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net>
	<49A4812A.8050202@harr.org>
Message-ID: <49A57256.2000005@harr.org>

Cameron Harr wrote:
> Vladislav Bolkhovitin wrote:
>>>>> I ran each test 3 times and took the averages. In order to get a 
>>>>> quick look at performance per run, I added a column in the summary 
>>>>> that sums the IOPs for each test with SRPT thread enabled and then 
>>>>> not enabled. Test 4 seems to give the best results. Here's a brief 
>>>>> summary of that summary with just SRPT thread=0:
>>>>>
>>>>> Baseline: 356226.39
>>>>> Test 1:   371217.6533
>>>>> Test 2:   370553.78
>>>>> Test 3:   373295.2033
>>>>> Test 4:   399385.2233
>>>>> Test 5:   393204.5833
>>>> Linux CPU scheduler does really impressive job!
>>>>
>>>> Interesting, will something change with:
>>>>
>>>> 1. The latest SVN. It has some changes, which might make a difference.
>>> Sorry for the delay.
>>> This is with SVN rev 673. I don't hit the high I hit before, but at 
>>> a 1.8% difference (with test 4), it's statistically noise.
>>>
>>> Test 1: 390631.5133
>>> Test 2: 386125.4133
>>> Test 3: 356268.0267
>>> Test 4: 392237.7867
>>> Test 5: 390012.1467 
> I just ran again, this time with rev 680 and am a little concerned to 
> see the drop in performance. I verified that debug is not on. I'll try 
> to start another run on 680 to see if I get similar results.
>
> Test 1: 368342.41
> Test 2: 366787.2067
> Test 3: 345334.68
> Test 4: 372684.58
> Test 5: 372184.8333
I re-compiled and re-ran the tests and numbers are a little better but 
performance still seems to have gone down from 673:
Test 1: 373751.66
Test 2: 371242.6067
Test 3: 347988.1467
Test 4: 378247.31
Test 5: 375616.53
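
To put a number on the drop, here is a minimal sketch of the percent-change
arithmetic, using the Test 4 and Test 3 averages quoted above for rev 673 and
for the second rev 680 run. The helper name pct_change is made up for this
illustration and is not part of any test harness used in this thread.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative values mean `after` is lower than `before`. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);	/* avoid dividing by zero */
	return (after - before) / before * 100.0;
}
```

Called as pct_change(392237.7867, 378247.31), this gives a drop of about
3.6% for Test 4 between rev 673 and the re-run on rev 680.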


From sashak at voltaire.com  Wed Feb 25 09:53:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Feb 2009 19:53:00 +0200
Subject: [ofa-general] [PATCH] opensm/lid_mgr: fix duplicated lid assignment
Message-ID: <20090225175300.GD11192@sashak.voltaire.com>


When OpenSM is running with the '-r' option (reassign lids), it cleans up
all internal free lid ranges and the used_lids db, but not the guid2lid db.
Then, when assigning new lids to ports that are not present in the guid2lid
db, LidMgr ignores the fact that some port may already have the same
lid assigned. As a result, we get a subnet with duplicated lids.

The proposed fix is to reassign all lids unconditionally (ignoring
existing guid2lid db and port's current lid value) if '-r' is specified.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_lid_mgr.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index b74aba5..ec7fd86 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -773,6 +773,10 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 	    !osm_switch_sp0_is_lmc_capable(p_port->p_node->sw, p_mgr->p_subn))
 		num_lids = 1;
 
+	if (p_mgr->p_subn->first_time_master_sweep == TRUE &&
+	    p_mgr->p_subn->opt.reassign_lids == TRUE)
+		goto AssignLid;
+
 	/* if the port matches the guid2lid */
 	if (!osm_db_guid2lid_get(p_mgr->p_g2l, guid, &min_lid, &max_lid)) {
 		*p_min_lid = min_lid;
@@ -804,9 +808,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 
 	/* we want to ignore the discovered lid if we are also on first sweep of
 	   reassign lids flow */
-	if (min_lid &&
-	    !((p_mgr->p_subn->first_time_master_sweep == TRUE) &&
-	      (p_mgr->p_subn->opt.reassign_lids == TRUE))) {
+	if (min_lid) {
 		/* make sure lid is valid */
 		if ((num_lids == 1) || ((min_lid & lmc_mask) == min_lid)) {
 			/* is it free */
@@ -831,6 +833,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 				guid, min_lid, min_lid + num_lids - 1);
 	}
 
+AssignLid:
 	/* first cleanup the existing discovered lid range */
 	__osm_lid_mgr_cleanup_discovered_port_lid_range(p_mgr, p_port);
 
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Wed Feb 25 10:02:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Feb 2009 20:02:26 +0200
Subject: [ofa-general] [PATCH] opensm/lid_mgr: simplify lmc_mask
	initialization
Message-ID: <20090225180226.GE11192@sashak.voltaire.com>


The expression '~((1 << lmc) - 1)' already has the value 0xffff when lmc = 0
(once stored in the 16-bit lmc_mask), so we don't need to set it up as:

	if (lmc)
		lmc_mask = ~((1 << lmc) - 1);
	else
		lmc_mask = 0xffff;
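
A minimal standalone sketch of why the special case is redundant, assuming
the mask is stored in a 16-bit unsigned variable. The helper name
lmc_mask_for is hypothetical and does not exist in OpenSM.

```c
#include <assert.h>
#include <stdint.h>

/* Mask that clears the low `lmc` bits of a 16-bit LID.
 * For lmc == 0 the inner expression is ~0u, which truncates to 0xffff. */
static uint16_t lmc_mask_for(uint8_t lmc)
{
	assert(lmc <= 7);	/* LMC is a 3-bit field */
	return (uint16_t)~((1u << lmc) - 1);
}
```

For example, lmc_mask_for(0) is 0xffff and lmc_mask_for(3) is 0xfff8, so the
same expression covers both branches of the old if/else.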

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_lid_mgr.c |   24 ++++++------------------
 1 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index ec7fd86..ce02b4c 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -146,10 +146,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr)
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
-	if (p_mgr->p_subn->opt.lmc)
-		lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
-	else
-		lmc_mask = 0xffff;
+	lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
 
 	cl_qlist_init(&guids);
 
@@ -327,10 +324,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
-	if (p_mgr->p_subn->opt.lmc)
-		lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
-	else
-		lmc_mask = 0xffff;
+	lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
 
 	/* if we came out of standby we need to discard any previous guid2lid
 	   info we might have.
@@ -667,10 +661,7 @@ __osm_lid_mgr_find_free_lid_range(IN osm_lid_mgr_t * const p_mgr,
 		p_mgr->p_subn->opt.lmc, num_lids);
 
 	lmc_num_lids = (1 << p_mgr->p_subn->opt.lmc);
-	if (p_mgr->p_subn->opt.lmc)
-		lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
-	else
-		lmc_mask = 0xffff;
+	lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
 
 	/*
 	   Search the list of free lid ranges for a range which is big enough
@@ -760,11 +751,6 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
-	if (p_mgr->p_subn->opt.lmc)
-		lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
-	else
-		lmc_mask = 0xffff;
-
 	/* get the lid from the guid2lid */
 	guid = cl_ntoh64(osm_port_get_guid(p_port));
 
@@ -777,6 +763,8 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 	    p_mgr->p_subn->opt.reassign_lids == TRUE)
 		goto AssignLid;
 
+	lmc_mask = ~(num_lids - 1);
+
 	/* if the port matches the guid2lid */
 	if (!osm_db_guid2lid_get(p_mgr->p_g2l, guid, &min_lid, &max_lid)) {
 		*p_min_lid = min_lid;
@@ -810,7 +798,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr,
 	   reassign lids flow */
 	if (min_lid) {
 		/* make sure lid is valid */
-		if ((num_lids == 1) || ((min_lid & lmc_mask) == min_lid)) {
+		if ((min_lid & lmc_mask) == min_lid) {
 			/* is it free */
 			if (__osm_lid_mgr_is_range_not_persistent
 			    (p_mgr, min_lid, num_lids)) {
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Wed Feb 25 10:02:53 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Feb 2009 20:02:53 +0200
Subject: [ofa-general] [PATCH] opensm/sweep: add log message before lid
	assignment
Message-ID: <20090225180253.GF11192@sashak.voltaire.com>


Improve logging: add a log message (msg box) between the pkey tables and
QoS parameters setup and the lid manager processing.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_state_mgr.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 0a27044..a1efd1a 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1247,6 +1247,9 @@ _repeat_discovery:
 	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
 		return;
 
+	OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
+			"PKEY and QOS setup completed - STARTING SM LID CONFIG");
+
 	osm_lid_mgr_process_sm(&sm->lid_mgr);
 	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
 		return;
-- 
1.6.1.2.319.gbd9e


From rdreier at cisco.com  Wed Feb 25 10:16:18 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Feb 2009 10:16:18 -0800
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <20090225.000628.108688119.davem@davemloft.net> (David Miller's
	message of "Wed, 25 Feb 2009 00:06:28 -0800 (PST)")
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
	<49A4FB79.1090809@voltaire.com>
	<20090225.000628.108688119.davem@davemloft.net>
Message-ID: <adaeixmtma5.fsf@cisco.com>

 > It's making real sockets, using the real networking stack,
 > using up real IP port/address pairs recognized by the rest
 > of the real networking stack, and doing RDMA over that
 > connection.
 > 
 > That's not allowed.
 > 
 > We always said that if these RDMA things are in the tree,
 > they should use their own IP addresses that are not
 > visible to the real Linux networking stack.

How is what the RDS code is doing any different than what the (upstream)
NFS/RDMA and iSER code does?  It uses the same rdma_xxx() interfaces for
handling connections.

 - R.


From andy.grover at gmail.com  Wed Feb 25 10:43:27 2009
From: andy.grover at gmail.com (Andrew Grover)
Date: Wed, 25 Feb 2009 10:43:27 -0800
Subject: [ofa-general] ***SPAM*** Re: [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <20090224.232814.227017310.davem@davemloft.net>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
Message-ID: <c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>

On Tue, Feb 24, 2009 at 11:28 PM, David Miller <davem at davemloft.net> wrote:
> This makes RDMA too much of a first-class citizen in the networking
> stack.  That's a blocker for me.

RDS is not an RDMA protocol, it is a protocol that supports RDMA. RDS
is not an IB protocol, it is a protocol that supports IB transport.

RDS's reliable-datagram socket implementation has a modular interface
to the transport (e.g. tcp, udp, or ib) and works fine over transports
that do not support RDMA. (Most users also do not use RDMA.)

OK so we have:

1) RDS socket code
  must go in net/rds, it's socket code
2) RDS core rdma support
  move to drivers/infiniband?
3) RDS IB/iwarp transport
  keep the non-RDMA support in net/rds or move to d/i? It's not RDMA it's IB
4) IB/iwarp transport's rdma support
  move to d/i
5) RDS TCP transport (impl. but not incl. in patchset)
  net/rds
6) RDS UDP/DCB transport (not impl. yet)
  net/rds

Does this look right? Right now it sounds like you're saying 1, 5, and
6 go in net/rds, 2-4 go in drivers/infiniband.

I'd personally prefer to not split it up, or to split it on the
natural core/transport boundary, but I can make it work whatever you
decide. :-)

> Furthermore the port you've chosen for the protocol is arbitrary, not
> properly allocated with the appropriate standards committee, and
> therefore could conflict with something other people are using.

I'm sure allocating the port won't be too big an issue.

Regards -- Andy


From purdy at sgi.com  Wed Feb 25 13:09:32 2009
From: purdy at sgi.com (Dale Purdy)
Date: Wed, 25 Feb 2009 15:09:32 -0600
Subject: [ofa-general] [PATCH] opensm: Implement weighted routing
Message-ID: <20090225210932.GA6098@sgi.com>


Implement a weighted routing scheme for fine tuning the lid matrix for
routing engines that use the lid matrix.  An optional file containing
one switch GUID, port, and weighting factor combination per line can be
supplied to override the default hop weight factor of 1 for each switch
output port when computing the lid matrix.  This allows one to alter the
min hop paths for things like routes to I/O.
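
As a rough illustration of the effect (not the code in the patch below), here
is a standalone sketch of how a per-port weighting factor changes which port
has the least cost to a destination. The names NO_PATH, weighted_hops and
best_port are made up for this sketch, and the saturation on overflow is an
assumption of the sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define NO_PATH 0xFF	/* "unreachable" sentinel; a stand-in chosen for this sketch */

/* Cost of reaching a destination through one port: the neighbor's best
 * cost plus this port's weighting factor.  Saturates just below NO_PATH
 * so a large weight cannot wrap around to a small cost. */
static uint8_t weighted_hops(uint8_t neighbor_hops, uint8_t hop_wf)
{
	unsigned sum;

	assert(hop_wf >= 1);	/* the default weight is 1 */
	if (neighbor_hops == NO_PATH)
		return NO_PATH;
	sum = (unsigned)neighbor_hops + hop_wf;
	return sum >= NO_PATH ? NO_PATH - 1 : (uint8_t)sum;
}

/* Index of the port with the lowest weighted cost, or -1 if no port
 * reaches the destination. */
static int best_port(const uint8_t *neighbor_hops, const uint8_t *hop_wf,
		     int num_ports)
{
	int i, best = -1;
	uint8_t best_cost = NO_PATH;

	for (i = 0; i < num_ports; i++) {
		uint8_t cost = weighted_hops(neighbor_hops[i], hop_wf[i]);

		if (cost < best_cost) {
			best_cost = cost;
			best = i;
		}
	}
	return best;
}
```

With neighbor costs {2, 1, NO_PATH} and all weights at 1, the second port
wins; raising that port's weight to 5 moves the choice to the first port,
which is the kind of steering the weights file is meant to allow.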

Signed-off-by: Dale Purdy <purdy at sgi.com>
---
 opensm/include/opensm/osm_port.h   |    4 ++
 opensm/include/opensm/osm_subnet.h |    1 +
 opensm/man/opensm.8.in             |    7 +++
 opensm/opensm/main.c               |   13 +++++-
 opensm/opensm/osm_subnet.c         |    7 +++
 opensm/opensm/osm_ucast_mgr.c      |   82 ++++++++++++++++++++++++++++++++++--
 6 files changed, 109 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index 3dda541..ae54c9f 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -115,6 +115,7 @@ typedef struct osm_physp {
 	osm_pkey_tbl_t pkeys;
 	ib_vl_arb_table_t vl_arb[4];
 	cl_ptr_vector_t slvl_by_port;
+	uint8_t hop_wf;
 } osm_physp_t;
 /*
 * FIELDS
@@ -171,6 +172,9 @@ typedef struct osm_physp {
 *		Switches have an entry for every other input port (inc SMA=0).
 *		On CAs only one per port.
 *
+*	hop_wf
+*		Hop weighting factor to be used in the routing.
+*
 * SEE ALSO
 *	Port
 *********/
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 2dfccda..6353d22 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -181,6 +181,7 @@ typedef struct osm_subn_opt {
 	char *console;
 	uint16_t console_port;
 	char *port_prof_ignore_file;
+	char *hop_weights_file;
 	boolean_t port_profile_switch_nodes;
 	boolean_t sweep_on_trap;
 	char *routing_engine_names;
diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 7690980..c77ecab 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -31,6 +31,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 [\-console [off | local | socket | loopback]]
 [\-console-port <port>]
 [\-i(gnore-guids) <equalize-ignore-guids-file>]
+[\-w | \-\-hop_weights_file <path to file>]
 [\-f <log file path> | \-\-log_file <log file path> ]
 [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)]
 [\-P(config) <partition config file> ]
@@ -233,6 +234,12 @@ This option provides the means to define a set of ports
 (by node guid and port number) that will be ignored by the link load
 equalization algorithm.
 .TP
+\fB\-w\fR, \fB\-\-hop_weights_file\fR <path to file>
+This option provides weighting factors per port representing a hop
+cost in computing the lid matrix.  The file consists of lines
+containing a switch GUID, output port, and weighting factor.  Any port
+not listed in the file defaults to a weighting factor of 1.
+.TP
 \fB\-x\fR, \fB\-\-honor_guid2lid\fR
 This option forces OpenSM to honor the guid2lid file,
 when it comes out of Standby state, if such file exists
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 47fd658..f145dab 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -255,6 +255,10 @@ static void show_usage(void)
 	       "          This option provides the means to define a set of ports\n"
 	       "          (by guid) that will be ignored by the link load\n"
 	       "          equalization algorithm.\n\n");
+	printf("--hop_weights_file, -w <path to file>\n"
+	       "          This option provides the means to define a weighting\n"
+	       "          factor per port for customizing the least weight\n"
+	       "          hops for the routing.\n\n");
 	printf("--honor_guid2lid, -x\n"
 	       "          This option forces OpenSM to honor the guid2lid file,\n"
 	       "          when it comes out of Standby state, if such file exists\n"
@@ -524,7 +528,7 @@ int main(int argc, char *argv[])
 	char *conf_template = NULL, *config_file = NULL;
 	uint32_t val;
 	const char *const short_option =
-	    "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:";
+	    "F:c:i:w:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:";
 
 	/*
 	   In the array below, the 2nd parameter specifies the number
@@ -540,6 +544,7 @@ int main(int argc, char *argv[])
 		{"debug", 1, NULL, 'd'},
 		{"guid", 1, NULL, 'g'},
 		{"ignore_guids", 1, NULL, 'i'},
+		{"hop_weights_file", 1, NULL, 'w'},
 		{"lmc", 1, NULL, 'l'},
 		{"sweep", 1, NULL, 's'},
 		{"timeout", 1, NULL, 't'},
@@ -664,6 +669,12 @@ int main(int argc, char *argv[])
 			       opt.port_prof_ignore_file);
 			break;
 
+		case 'w':
+			opt.hop_weights_file = optarg;
+			printf(" Hop Weights File = %s\n",
+			       opt.hop_weights_file);
+			break;
+
 		case 'g':
 			/*
 			   Specifies port guid with which to bind.
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index b3100a4..26e4481 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -322,6 +322,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_parse_uint32, NULL, 1 },
 	{ "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_parse_boolean, NULL, 1 },
 	{ "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_parse_charp, NULL, 0 },
+	{ "hop_weights_file", OPT_OFFSET(hop_weights_file), opts_parse_charp, NULL, 0 },
 	{ "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_parse_boolean, NULL, 1 },
 	{ "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_parse_boolean, NULL, 1 },
 	{ "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, NULL, 0 },
@@ -727,6 +728,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->qos_policy_file = strdup(OSM_DEFAULT_QOS_POLICY_FILE);
 	p_opt->accum_log_file = TRUE;
 	p_opt->port_prof_ignore_file = NULL;
+	p_opt->hop_weights_file = NULL;
 	p_opt->port_profile_switch_nodes = FALSE;
 	p_opt->sweep_on_trap = TRUE;
 	p_opt->use_ucast_cache = FALSE;
@@ -1359,6 +1361,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		p_opts->port_prof_ignore_file : null_str);
 
 	fprintf(out,
+ 		"# The file holding routing weighting factors per output port\n"
+ 		"hop_weights_file %s\n\n",
+ 		p_opts->hop_weights_file ? p_opts->hop_weights_file : null_str);
+ 
+ 	fprintf(out,
 		"# Routing engine\n"
 		"# Multiple routing engines can be specified separated by\n"
 		"# commas so that specific ordering of routing algorithms will\n"
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index e404c91..81c3604 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -125,11 +125,11 @@ __osm_ucast_mgr_process_hop_0_1(IN cl_map_item_t * const p_map_item,
 
 		if (p_remote_node && p_remote_node->sw &&
 		    (p_remote_node != p_sw->p_node)) {
+			osm_physp_t *p = osm_node_get_physp_ptr(p_sw->p_node, i);
+
 			remote_lid = osm_node_get_base_lid(p_remote_node, 0);
 			remote_lid = cl_ntoh16(remote_lid);
-			osm_switch_set_hops(p_sw, remote_lid, i, 1);
-			osm_switch_set_hops(p_remote_node->sw, lid, remote_port,
-					    1);
+			osm_switch_set_hops(p_sw, remote_lid, i, p->hop_wf);
 		}
 	}
 }
@@ -146,6 +146,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr,
 	osm_switch_t *p_sw, *p_next_sw;
 	uint16_t lid_ho;
 	uint8_t hops;
+	osm_physp_t *p;
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
@@ -156,6 +157,8 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr,
 		cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_node)),
 		port_num, remote_port_num);
 
+	p = osm_node_get_physp_ptr(p_this_sw->p_node, port_num);
+
 	p_next_sw = (osm_switch_t *) cl_qmap_head(&p_mgr->p_subn->sw_guid_tbl);
 	while (p_next_sw !=
 	       (osm_switch_t *) cl_qmap_end(&p_mgr->p_subn->sw_guid_tbl)) {
@@ -166,7 +169,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr,
 		hops = osm_switch_get_least_hops(p_remote_sw, lid_ho);
 		if (hops == OSM_NO_PATH)
 			continue;
-		hops++;
+		hops += p->hop_wf;
 		if (hops <
 		    osm_switch_get_hop_count(p_this_sw, lid_ho, port_num)) {
 			if (osm_switch_set_hops
@@ -573,6 +576,61 @@ __osm_ucast_mgr_process_neighbors(IN cl_map_item_t * const p_map_item,
 
 /**********************************************************************
  **********************************************************************/
+static int set_hop_wf(void *ctx, uint64_t guid, char *p)
+{
+	osm_ucast_mgr_t *m = ctx;
+	osm_node_t *node = osm_get_node_by_guid(m->p_subn, cl_hton64(guid));
+	osm_physp_t *physp;
+	unsigned port, hop_wf;
+	char *e;
+
+	if (!node || !node->sw) {
+		OSM_LOG(m->p_log, OSM_LOG_DEBUG,
+			"switch with guid 0x%016" PRIx64 " is not found\n",
+			guid);
+		return 0;
+	}
+
+	if (!p || !*p || !(port = strtoul(p, &e, 0)) || (p == e) ||
+	    port >= node->sw->num_ports) {
+		OSM_LOG(m->p_log, OSM_LOG_DEBUG,
+			"bad port specified for guid 0x%016" PRIx64 "\n", guid);
+		return 0;
+	}
+
+	p = e + 1;
+
+	if (!*p || !(hop_wf = strtoul(p, &e, 0)) || (p == e) ||
+		(hop_wf >= 0x100)) {
+		OSM_LOG(m->p_log, OSM_LOG_DEBUG,
+			"bad hop weight factor specified for guid 0x%016" PRIx64 " port %u\n",
+			guid, port);
+		return 0;
+	}
+
+	physp = osm_node_get_physp_ptr(node, port);
+	if (!physp)
+		return 0;
+
+	physp->hop_wf = hop_wf;
+
+	return 0;
+}
+
+static void set_default_hop_wf(cl_map_item_t * const p_map_item, void *ctx)
+{
+	osm_switch_t *sw = (osm_switch_t *)p_map_item;
+	int i;
+
+	for (i = 1; i < sw->num_ports; i++) {
+		osm_physp_t *p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p)
+			p->hop_wf = 1;
+	}
+}
+
+/**********************************************************************
+ **********************************************************************/
 int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr)
 {
 	uint32_t i;
@@ -585,6 +643,22 @@ int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr)
 		"Starting switches' Min Hop Table Assignment\n");
 
 	/*
+	   Set up the weighting factors for the routing.
+	*/
+	cl_qmap_apply_func(p_sw_guid_tbl, set_default_hop_wf, NULL);
+	if (p_mgr->p_subn->opt.hop_weights_file) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Fetching hop weight factor file \'%s\'\n",
+			p_mgr->p_subn->opt.hop_weights_file);
+		if (parse_node_map(p_mgr->p_subn->opt.hop_weights_file,
+				   set_hop_wf, p_mgr)) {
+			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : cannot "
+				"parse hop_weights_file \'%s\'\n",
+				p_mgr->p_subn->opt.hop_weights_file);
+		}
+	}
+
+	/*
 	   Set the switch matrices for each switch's own port 0 LID(s)
 	   then set the lid matrices for the each switch's leaf nodes.
 	 */
-- 
1.5.6.5


From davem at davemloft.net  Wed Feb 25 13:45:12 2009
From: davem at davemloft.net (David Miller)
Date: Wed, 25 Feb 2009 13:45:12 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
	<c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>
Message-ID: <20090225.134512.90879325.davem@davemloft.net>

From: Andrew Grover <andy.grover at gmail.com>
Date: Wed, 25 Feb 2009 10:43:27 -0800

> RDS's reliable-datagram socket implementation has a modular interface
> to the transport (e.g. tcp, udp, or ib) and works fine over transports
> that do not support RDMA. (Most users also do not use RDMA.)

Ok, let me look over the patches again.


From ralph.campbell at qlogic.com  Wed Feb 25 16:36:03 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 25 Feb 2009 16:36:03 -0800
Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in
	local_completions()
Message-ID: <1235608563.3948.199.camel@chromite.mv.qlogic.com>

handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
on the mad_agent_priv->local_work work queue with
local->mad_priv == NULL if device->process_mad() returns
IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
(!ib_response_mad(&mad_priv->mad.mad) ||
 !mad_agent_priv->agent.recv_handler).

In this case, local_completions() will be called with
local->mad_priv == NULL. The code does check for this
case and skips calling recv_mad_agent->agent.recv_handler()
but recv == 0 so kmem_cache_free() is called with a
NULL pointer.

Also, since recv isn't reinitialized each time through the loop,
it can cause a memory leak if recv should have been zero.
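
The intended per-iteration behaviour can be modeled outside the kernel with
a small sketch. The struct and function names here (work_item, count_frees)
are hypothetical stand-ins, not the ib_mad types, and the model only counts
the frees the loop itself would perform.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one queued local completion. */
struct work_item {
	int *payload;		/* may be NULL, like local->mad_priv */
	int has_receiver;	/* nonzero if a receive agent exists */
};

/* Count how many payloads the loop itself must free. */
static int count_frees(const struct work_item *items, size_t n)
{
	int frees = 0;

	for (size_t i = 0; i < n; i++) {
		int recv = 1;	/* reset every iteration: nothing to free yet */

		if (items[i].payload && !items[i].has_receiver)
			recv = 0;	/* payload exists but nobody will consume it */

		if (!recv)
			frees++;	/* only reached with a non-NULL payload */
	}
	return frees;
}
```

In this model a NULL payload never reaches the free path, and a stale flag
from an earlier iteration cannot suppress a free that a later item needs.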

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..8388e5e 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work)
 	struct ib_mad_local_private *local;
 	struct ib_mad_agent_private *recv_mad_agent;
 	unsigned long flags;
-	int recv = 0;
+	int recv;
 	struct ib_wc wc;
 	struct ib_mad_send_wc mad_send_wc;
 
@@ -2370,14 +2370,15 @@ static void local_completions(struct work_struct *work)
 				   completion_list);
 		list_del(&local->completion_list);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+		recv = 1;
 		if (local->mad_priv) {
 			recv_mad_agent = local->recv_mad_agent;
 			if (!recv_mad_agent) {
 				printk(KERN_ERR PFX "No receive MAD agent for local completion\n");
+				recv = 0;
 				goto local_send_completion;
 			}
 
-			recv = 1;
 			/*
 			 * Defined behavior is to complete response
 			 * before request


From rdreier at cisco.com  Wed Feb 25 16:53:30 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Feb 2009 16:53:30 -0800
Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in
	local_completions()
In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com> (Ralph
	Campbell's message of "Wed, 25 Feb 2009 16:36:03 -0800")
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
Message-ID: <aday6vuqar9.fsf@cisco.com>

This looks fine to me.  Hal and/or Sean, any comment?


From rdreier at cisco.com  Wed Feb 25 16:53:58 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Feb 2009 16:53:58 -0800
Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in
	local_completions()
In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com> (Ralph
	Campbell's message of "Wed, 25 Feb 2009 16:36:03 -0800")
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
Message-ID: <adatz6iqaqh.fsf@cisco.com>

By the way, I didn't pay close attention to the previous discussion
about this.  Did you and Hal reach agreement about the approach?

 - R.


From ralph.campbell at qlogic.com  Wed Feb 25 17:03:40 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 25 Feb 2009 17:03:40 -0800
Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in
	local_completions()
In-Reply-To: <adatz6iqaqh.fsf@cisco.com>
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
	<adatz6iqaqh.fsf@cisco.com>
Message-ID: <1235610220.3948.206.camel@chromite.mv.qlogic.com>

On Wed, 2009-02-25 at 16:53 -0800, Roland Dreier wrote:
> By the way, I didn't pay close attention to the previous discussion
> about this.  Did you and Hal reach agreement about the approach?
> 
>  - R.

The earlier patch I posted wasn't correct. I was looking for
comments about how kmem_cache_free() is called when
recv_mad_agent->agent.recv_handler() is called.
Hal didn't answer directly so I checked the code and I see that
the receive handler is responsible for calling ib_free_recv_mad()
which does the work.

Hal just wanted me to test it, which I did.


From sean.hefty at intel.com  Wed Feb 25 17:22:26 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Feb 2009 17:22:26 -0800
Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference
	in	local_completions()
In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
Message-ID: <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com>

>handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
>on the mad_agent_priv->local_work work queue with
>local->mad_priv == NULL if device->process_mad() returns
>IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
>(!ib_response_mad(&mad_priv->mad.mad) ||
> !mad_agent_priv->agent.recv_handler).
>
>In this case, local_completions() will be called with
>local->mad_priv == NULL. The code does check for this
>case and skips calling recv_mad_agent->agent.recv_handler()
>but recv == 0 so kmem_cache_free() is called with a
>NULL pointer.
>
>Also, since recv isn't reinitialized each time through the loop,
>it can cause a memory leak if recv should have been zero.
>
>Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
>
>diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
>index 5c54fc2..8388e5e 100644
>--- a/drivers/infiniband/core/mad.c
>+++ b/drivers/infiniband/core/mad.c
>@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work)
> 	struct ib_mad_local_private *local;
> 	struct ib_mad_agent_private *recv_mad_agent;
> 	unsigned long flags;
>-	int recv = 0;
>+	int recv;

With this change, I think it would be better to rename the 'recv' flag.  The
logic itself looks correct to me.

- Sean


From ralph.campbell at qlogic.com  Wed Feb 25 17:43:58 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 25 Feb 2009 17:43:58 -0800
Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in
	local_completions()
In-Reply-To: <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com>
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
	<0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com>
Message-ID: <1235612638.3948.211.camel@chromite.mv.qlogic.com>

On Wed, 2009-02-25 at 17:22 -0800, Sean Hefty wrote:
> >handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
> >on the mad_agent_priv->local_work work queue with
> >local->mad_priv == NULL if device->process_mad() returns
> >IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
> >(!ib_response_mad(&mad_priv->mad.mad) ||
> > !mad_agent_priv->agent.recv_handler).
> >
> >In this case, local_completions() will be called with
> >local->mad_priv == NULL. The code does check for this
> >case and skips calling recv_mad_agent->agent.recv_handler()
> >but recv == 0 so kmem_cache_free() is called with a
> >NULL pointer.
> >
> >Also, since recv isn't reinitialized each time through the loop,
> >it can cause a memory leak if recv should have been zero.
> >
> >Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
> >
> >diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> >index 5c54fc2..8388e5e 100644
> >--- a/drivers/infiniband/core/mad.c
> >+++ b/drivers/infiniband/core/mad.c
> >@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work)
> > 	struct ib_mad_local_private *local;
> > 	struct ib_mad_agent_private *recv_mad_agent;
> > 	unsigned long flags;
> >-	int recv = 0;
> >+	int recv;
> 
> With this change, I think it would be better to rename the 'recv' flag.  The
> logic itself looks correct to me.
> 
> - Sean

OK, how about "free" or "free_mad"?


From keshetti.mahesh at gmail.com  Wed Feb 25 20:51:43 2009
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 26 Feb 2009 10:21:43 +0530
Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: Implement weighted
	routing
Message-ID: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com>

Hello Dale Purdy,

I have a requirement where I have to set some hop's weight
factor to zero. Is this supported by your patch?
I implemented something similar before, but it led to
loops in the routing table. Does your patch take care of that?

-Mahesh


From sashak at voltaire.com  Wed Feb 25 21:10:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 07:10:12 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <f0e08f230902180720w25f74a8cs8c659757f331d425@mail.gmail.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
	<20090218003355.GX7189@sashak.voltaire.com>
	<f0e08f230902180720w25f74a8cs8c659757f331d425@mail.gmail.com>
Message-ID: <20090226051012.GH11192@sashak.voltaire.com>

Hi Hal,

On 10:20 Wed 18 Feb     , Hal Rosenstock wrote:
> On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 18:21 Tue 17 Feb     , Hal Rosenstock wrote:
> >> >
> >> > For utilities which run once through I think the old functions work just
> >> > fine.
> >>
> >> Well, sort of... Aren't mad_portid "collisions" possible when multiple
> >> programs are run concurrently ?
> >
> > No.
> 
> With the old API, mad_portid can be overwritten by another process or
> thread. Another thread is not an expected use case but it is possible.

Yes, but you asked about "collisions" between different programs
(processes) running concurrently.

Sasha


From sashak at voltaire.com  Wed Feb 25 21:18:36 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 07:18:36 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_inform.c: Fix sense of zero GID
	compare in __match_inf_rec
In-Reply-To: <20090218151015.GA6482@comcast.net>
References: <20090218151015.GA6482@comcast.net>
Message-ID: <20090226051836.GJ11192@sashak.voltaire.com>

On 10:10 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 21:19:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 07:19:12 +0200
Subject: [ofa-general] Re: [PATCH] management/libibmad.txt: Remove
	madrpc_lock/unlock
In-Reply-To: <20090218152728.GA8489@comcast.net>
References: <20090218152728.GA8489@comcast.net>
Message-ID: <20090226051912.GK11192@sashak.voltaire.com>

On 10:27 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 21:51:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 07:51:24 +0200
Subject: [ofa-general] Re: [PATCH] opensm/man/opensm.8.in: Indicate
	ROUTER_EXP deprecated
In-Reply-To: <20090218152913.GC8489@comcast.net>
References: <20090218152913.GC8489@comcast.net>
Message-ID: <20090226055117.GM11192@sashak.voltaire.com>

On 10:29 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 21:58:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 07:58:42 +0200
Subject: [ofa-general] Re: opensm/osm_console.c: Improve perfmgr
	print_counters error message
In-Reply-To: <20090218153227.GF8489@comcast.net>
References: <20090218153227.GF8489@comcast.net>
Message-ID: <20090226055842.GN11192@sashak.voltaire.com>

On 10:32 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 22:01:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:01:32 +0200
Subject: [ofa-general] Re: [PATCH] infiniband-diags/smpdump.c: Fix usage
	examples
In-Reply-To: <20090218155537.GA8762@comcast.net>
References: <20090218155537.GA8762@comcast.net>
Message-ID: <20090226060132.GO11192@sashak.voltaire.com>

On 10:55 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 22:03:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:03:47 +0200
Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/smpdump.c: Release umad
	resources on exit
In-Reply-To: <20090218171932.GA15139@comcast.net>
References: <20090218171932.GA15139@comcast.net>
Message-ID: <20090226060347.GP11192@sashak.voltaire.com>

On 12:19 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 22:15:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:15:51 +0200
Subject: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr
	print_counters for better nodenames
In-Reply-To: <20090219130653.GA29318@comcast.net>
References: <20090219130653.GA29318@comcast.net>
Message-ID: <20090226061551.GQ11192@sashak.voltaire.com>

On 08:06 Thu 19 Feb     , Hal Rosenstock wrote:
> 
> nodenames can have spaces in them
> Also, no need for next_token being inlined
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied with changes noted below. Thanks.

[snip...]

> diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
> index 3babe3a..8766f93 100644
> --- a/opensm/opensm/osm_perfmgr.c
> +++ b/opensm/opensm/osm_perfmgr.c
> @@ -1304,9 +1304,9 @@ void
>  osm_perfmgr_print_counters(osm_perfmgr_t *pm, char *nodename, FILE *fp)
>  {
>  	uint64_t guid = strtoull(nodename, NULL, 0);
> -	if (guid == 0 && errno == EINVAL)
> +	if (guid == 0 && errno)	// name
>  		perfmgr_db_print_by_name(pm->db, nodename, fp);
> -	else
> +	else		// guid
>  		perfmgr_db_print_by_guid(pm->db, guid, fp);

Such comments are not really helpful - it is pretty clear from the code
(the flow itself and the function names too) what is going on there, so
I'm removing them.

And in general I think it is better to use C-style comments, /* ... */,
in C code rather than C++-style // ... comments.

Sasha
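
A side note on the name-versus-GUID test in the hunk quoted above: it
relies on errno after strtoull(), but errno is only meaningful if it was
cleared before the call, and the C standard does not require strtoull()
to set it when no digits are converted. A more defensive check can look
at the end pointer instead. The following is only a sketch under those
assumptions; parse_guid() is a hypothetical helper, not part of opensm.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 and stores the value in *guid when the
 * whole string is a number (decimal, octal or 0x-prefixed hex), and 0
 * when it should be treated as a node name instead.
 */
static int parse_guid(const char *s, uint64_t *guid)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && guid != NULL);

	errno = 0;			/* stale errno must not influence the result */
	v = strtoull(s, &end, 0);

	/* No digits consumed, trailing text, or overflow: not a GUID. */
	if (end == s || *end != '\0' || errno == ERANGE)
		return 0;

	*guid = (uint64_t)v;
	return 1;
}
```

With such a helper the caller could print by GUID when it returns 1 and
fall back to the by-name lookup otherwise, which also covers node names
that contain spaces or start with a digit.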


From sashak at voltaire.com  Wed Feb 25 22:24:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:24:45 +0200
Subject: [ofa-general] Re: [PATCHv2] opensm/man/opensm.8.in: Indicate
	ROUTER_EXP obsoleted
In-Reply-To: <20090219184415.GA29943@comcast.net>
References: <20090219184415.GA29943@comcast.net>
Message-ID: <20090226062445.GR11192@sashak.voltaire.com>

On 13:44 Thu 19 Feb     , Hal Rosenstock wrote:
> 
> Pointed out by Rolf
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 22:27:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:27:46 +0200
Subject: [ofa-general] Re: [PATCH] libibmad/fields.c: Dump LIDs as unsigned
	decimal
In-Reply-To: <20090220215845.GA7360@comcast.net>
References: <20090220215845.GA7360@comcast.net>
Message-ID: <20090226062746.GS11192@sashak.voltaire.com>

On 16:58 Fri 20 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed Feb 25 22:28:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 08:28:48 +0200
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Convert more
	LID prints to unsigned decimal
In-Reply-To: <20090220215938.GB7360@comcast.net>
References: <20090220215938.GB7360@comcast.net>
Message-ID: <20090226062848.GT11192@sashak.voltaire.com>

On 16:59 Fri 20 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From klakshman03 at hotmail.com  Wed Feb 25 22:38:42 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Thu, 26 Feb 2009 12:08:42 +0530
Subject: [ofa-general] ***SPAM*** Problem in IB network without Switch
Message-ID: <BAY101-W416E0676BEEE94FD609CE5B8AD0@phx.gbl>


 Hi All

  I have been trying to enable IPoIB communication between two machines. The machines are connected back-to-back with an InfiniBand cable since I don't have an IB switch. The drivers have been installed and IP has been configured on both machines. The subnet manager (opensmd) is running on one machine.

The problem is that no communication happens over IB. The "ibstatus" output shows the port in the "Down" state on both machines. I am unable to figure out what the problem is.

Operating System  :  Rocks 5.0 (RHEL 5.0)
OFED : Cisco OFED roll 5.0 (OFED 1.3)
HCA cards : Mellanox SDR

Please check the following commands output.

[root at mattool ~]# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c1
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)
Infiniband device 'mthca0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c2
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)

[root at mattool ~]# /etc/init.d/openibd status

  HCA driver loaded
Configured devices:
ib0 ib1
Currently active devices:
ib0
ib1
The following OFED modules are loaded:
  rdma_ucm
  qlgc_vnic
  ib_sdp
  rdma_cm
  ib_addr
  ib_ipoib
  ib_ipath
  mlx4_core
  mlx4_ib
  ib_mthca
  ib_uverbs
  ib_umad
  ib_sa
  ib_cm
  ib_mad
  ib_core
  iw_cxgb3
[root at mattool ~]# 


 Thanks 

Laxman


From sean.hefty at intel.com  Wed Feb 25 22:43:35 2009
From: sean.hefty at intel.com (Hefty, Sean)
Date: Wed, 25 Feb 2009 22:43:35 -0800
Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in
	local_completions()
In-Reply-To: <1235612638.3948.211.camel@chromite.mv.qlogic.com>
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
	<0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com>
	<1235612638.3948.211.camel@chromite.mv.qlogic.com>
Message-ID: <CF9C39F99A89134C9CF9C4CCB68B8DDF2857B182@orsmsx501.amr.corp.intel.com>

>OK, how about "free" or "free_mad"?

Sure - free_mad sounds good to me.
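
For readers following the thread, the flag handling under discussion can
be sketched in user space roughly as below. This is not the kernel code:
struct local_item, struct drain_stats and drain_completions() are
hypothetical stand-ins, malloc()/free() replace the MAD cache, and the
flag uses the free_mad name with the opposite sense of the old recv flag
(set means "the loop must free the buffer"). The two points it tries to
show are that the flag is reset on every pass through the loop and that
a NULL mad_priv is never handed to the free routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_mad_local_private. */
struct local_item {
	void *mad_priv;		/* NULL in the case described in the patch */
	int has_recv_agent;	/* models local->recv_mad_agent != NULL */
};

struct drain_stats {
	int handed_to_handler;	/* receive handler took ownership of the buffer */
	int freed_by_loop;	/* the loop released the buffer itself */
};

static struct drain_stats drain_completions(struct local_item *items, size_t n)
{
	struct drain_stats st = { 0, 0 };
	size_t i;

	assert(items != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Reset on every pass so state from the previous item cannot leak in. */
		int free_mad = 0;

		if (items[i].mad_priv) {
			if (items[i].has_recv_agent) {
				/* Models recv_handler(): the handler frees the buffer. */
				free(items[i].mad_priv);
				st.handed_to_handler++;
			} else {
				/* Nobody will consume the buffer, so the loop must free it. */
				free_mad = 1;
			}
		}

		if (free_mad) {
			free(items[i].mad_priv);
			st.freed_by_loop++;
		}
		items[i].mad_priv = NULL;
	}
	return st;
}
```

An item with a buffer and an agent is counted as handed over, an item
with a buffer and no agent is freed by the loop, and an item with no
buffer touches neither path.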


From sashak at voltaire.com  Wed Feb 25 23:06:29 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 09:06:29 +0200
Subject: [ofa-general] Re: [PATCH] Add pkey table support to
	osm_get_all_port_attrs
In-Reply-To: <20090218153016.GD8489@comcast.net>
References: <20090218153016.GD8489@comcast.net>
Message-ID: <20090226070629.GU11192@sashak.voltaire.com>

Hi Hal,

On 10:30 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Only supported in osm_vendor_ibumad.c (separate patch for other
> vendor layers)
> Also, update applications using this (osmtest, opensm)
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
>  opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++-----
>  opensm/opensm/main.c                 |    6 ++++++
>  opensm/osmtest/main.c                |   11 +++++++++++
>  opensm/osmtest/osmtest.c             |    7 +++++++
>  4 files changed, 43 insertions(+), 5 deletions(-)
> 
> diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
> index 734a860..861bfbe 100644
> --- a/opensm/libvendor/osm_vendor_ibumad.c
> +++ b/opensm/libvendor/osm_vendor_ibumad.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>  	umad_ca_t ca;
>  	ib_port_attr_t *attr = p_attr_array;
>  	unsigned done = 0;
> -	int r, i, j;
> +	int r, i, j, k;
>  
>  	OSM_LOG_ENTER(p_vend->p_log);
>  
>  	CL_ASSERT(p_vend && p_num_ports);
>  
> +	r = 0;
>  	if (!*p_num_ports) {
>  		r = IB_INVALID_PARAMETER;
>  		OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: "
> @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>  	}
>  
>  	for (i = 0; i < p_vend->ca_count && !done; i++) {
> -		/*
> -		 * For each CA, retrieve the port guids
> -		 */
> +		/* For each CA, retrieve the port attributes */
>  		if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) {
>  			if (ca.node_type < 1 || ca.node_type > 3)
>  				continue;
> @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>  				attr->port_num = ca.ports[j]->portnum;
>  				attr->sm_lid = ca.ports[j]->sm_lid;
>  				attr->link_state = ca.ports[j]->state;
> +				attr->num_pkeys = ca.ports[j]->pkeys_size;
> +				if (attr->num_pkeys && attr->p_pkey_table) {
> +					if (attr->num_pkeys < ca.ports[j]->pkeys_size) {

You are doing:

	attr->num_pkeys = ca.ports[j]->pkeys_size;

just two lines above, so this check will always be false.

> +						r = IB_INSUFFICIENT_MEMORY;
> +						OSM_LOG(p_vend->p_log,
> +							OSM_LOG_ERROR,
> +							"ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
> +							j,
> +							ca.ports[j]->pkeys_size);

Also, should it be an error? Maybe it is enough just to fill the
requested pkey entries?

> +						goto Exit;
> +					}
> +					for (k = 0; k < attr->num_pkeys; k++)
> +						attr->p_pkey_table[k] =
> +							cl_hton16(ca.ports[j]->pkeys[k]);
> +				}
>  				attr++;
>  				if (attr - p_attr_array > *p_num_ports) {
>  					done = 1;
> @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>  	}
>  
>  	*p_num_ports = attr - p_attr_array;
> -	r = 0;
>  
>  Exit:
>  	OSM_LOG_EXIT(p_vend->p_log);
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 73a6274..503d7fa 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid)
>  	uint32_t i, choice = 0;
>  	ib_api_status_t status;
>  
> +	for (i = 0; i < num_ports; i++) {
> +		attr_array[i].num_pkeys = 0;
> +		attr_array[i].p_pkey_table = NULL;
> +	}
> +

Here and below. Just

	memset(attr_array, 0, sizeof(attr_array));

would be enough.

Sasha

>  	/* Call the transport layer for a list of local port GUID values */
>  	status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array,
>  					      &num_ports);
> diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
> index b360af6..83c1e13 100644
> --- a/opensm/osmtest/main.c
> +++ b/opensm/osmtest/main.c
> @@ -1,6 +1,7 @@
>  /*
>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt)
>  	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>  	int i;
>  
> +	for (i = 0; i < num_ports; i++) {
> +		attr_array[i].num_pkeys = 0;
> +		attr_array[i].p_pkey_table = NULL;
> +	}
> +
>  	/*
>  	   Call the transport layer for a list of local port
>  	   GUID values.
> @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid)
>  	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>  	int i;
>  
> +	for (i = 0; i < num_ports; i++) {
> +		attr_array[i].num_pkeys = 0;
> +		attr_array[i].p_pkey_table = NULL;
> +	}
> +
>  	/*
>  	   Call the transport layer for a list of local port
>  	   GUID values.
> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
> index a7b343f..986a8d2 100644
> --- a/opensm/osmtest/osmtest.c
> +++ b/opensm/osmtest/osmtest.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt,
>  	ib_api_status_t status;
>  	uint32_t num_ports = MAX_LOCAL_IBPORTS;
>  	ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
> +	int i;
>  
>  	OSM_LOG_ENTER(&p_osmt->log);
>  
> +	for (i = 0; i < num_ports; i++) {
> +		attr_array[i].num_pkeys = 0;
> +		attr_array[i].p_pkey_table = NULL;
> +	}
> +
>  	/*
>  	 * Call the transport layer for a list of local port
>  	 * GUID values.
> -- 
> 1.5.6.4
> 
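
The review comments above can be illustrated with a small sketch. It
assumes, and this is only an assumption, that the caller passes the
capacity of its table in num_pkeys, so the capacity has to be read
before the field is overwritten; it then copies only as many entries as
fit instead of failing. struct port_attr, port_attr_array_clear() and
copy_pkeys() are hypothetical names, and the byte-order conversion
(cl_hton16() in the patch) is left out to keep the example
self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced stand-in for ib_port_attr_t. */
struct port_attr {
	uint16_t num_pkeys;	/* in: table capacity, out: entries written */
	uint16_t *p_pkey_table;	/* caller-owned table, or NULL if not wanted */
};

/* Zero the whole array in one call, as suggested in the review. */
static void port_attr_array_clear(struct port_attr *attrs, size_t count)
{
	memset(attrs, 0, count * sizeof(*attrs));
}

/*
 * Copy at most the caller's capacity and return how many pkeys the port
 * really has, so the caller can detect truncation by comparing the
 * return value with attr->num_pkeys.
 */
static uint16_t copy_pkeys(struct port_attr *attr, const uint16_t *src,
			   uint16_t available)
{
	uint16_t capacity, n, k;

	assert(attr != NULL);

	capacity = attr->num_pkeys;	/* read before it is overwritten */
	n = capacity < available ? capacity : available;
	if (!attr->p_pkey_table)
		n = 0;			/* caller did not ask for pkeys */

	for (k = 0; k < n; k++)
		attr->p_pkey_table[k] = src[k];

	attr->num_pkeys = n;
	return available;
}
```

Clearing with memset() yields NULL table pointers on the platforms
OpenSM targets; whether truncation should be silent or logged is a
design choice this sketch leaves to the caller.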


From sashak at voltaire.com  Wed Feb 25 23:10:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 09:10:59 +0200
Subject: [ofa-general] Re: [PATCH] Add pkey table support to
	osm_get_all_port_attrs
In-Reply-To: <20090218153016.GD8489@comcast.net>
References: <20090218153016.GD8489@comcast.net>
Message-ID: <20090226071059.GV11192@sashak.voltaire.com>

On 10:30 Wed 18 Feb     , Hal Rosenstock wrote:
> 
> Only supported in osm_vendor_ibumad.c (separate patch for other
> vendor layers)
> Also, update applications using this (osmtest, opensm)

It looks like ibutils (ibis) requires the same fix (attr_array
initialization) too.

Sasha



From keshetti.mahesh at gmail.com  Thu Feb 26 00:31:07 2009
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 26 Feb 2009 14:01:07 +0530
Subject: [ofa-general] ***SPAM*** Re: Problem in IB network without Switch
Message-ID: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>

Hi,

> phys state:      2: Polling

On both machines the physical state is 'Polling', i.e. the physical
link between the two ports has not come up. Check the cabling first.
Only after it becomes

phys state:      5: LinkUp

will you be able to run any IB traffic on this interface.

-Mahesh
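
If this check needs to be scripted, the state strings shown by ibstatus
have the form "<number>: <name>", and the same text is normally
available under /sys/class/infiniband/<device>/ports/<port>/phys_state.
The sketch below only parses such a string; parse_port_state() and
link_is_physically_up() are hypothetical helpers, and the value 5 for
LinkUp is taken from the output quoted above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a state string such as "2: Polling" and return the leading
 * number, or -1 if the string does not start with "<number>:".
 */
static int parse_port_state(const char *line)
{
	int value;

	assert(line != NULL);

	if (sscanf(line, "%d:", &value) != 1)
		return -1;
	return value;
}

/* Non-zero only when the physical state is 5 (LinkUp). */
static int link_is_physically_up(const char *phys_state_line)
{
	return parse_port_state(phys_state_line) == 5;
}
```

For the report in this thread, link_is_physically_up("2: Polling") is
false on both hosts, so there is no point in debugging IPoIB or the
subnet manager until the cable or port problem is resolved.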


From ogerlitz at voltaire.com  Thu Feb 26 00:57:45 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Feb 2009 10:57:45 +0200 (IST)
Subject: [ofa-general] [PATCH] ib/iser: remove hard setting of mtu
Message-ID: <Pine.LNX.4.64.0902261056440.26368@zuben.voltaire.com>

Remove the hard-coded 1K limit on the IB MTU used by iser's RC queue pair. The
limit was a workaround for inter-op issues with an old iser target that is no
longer in use.

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: linus-linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linus-linux-2.6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linus-linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -401,13 +401,6 @@ static void iser_route_handler(struct rd
 	if (ret)
 		goto failure;

-	iser_dbg("path.mtu is %d setting it to %d\n",
-		 cma_id->route.path_rec->mtu, IB_MTU_1024);
-
-	/* we must set the MTU to 1024 as this is what the target is assuming */
-	if (cma_id->route.path_rec->mtu > IB_MTU_1024)
-		cma_id->route.path_rec->mtu = IB_MTU_1024;
-
 	memset(&conn_param, 0, sizeof conn_param);
 	conn_param.responder_resources = 4;
 	conn_param.initiator_depth     = 1;


From jackm at dev.mellanox.co.il  Thu Feb 26 01:26:59 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Feb 2009 11:26:59 +0200
Subject: [ofa-general] ***SPAM*** Problem in IB network without Switch
In-Reply-To: <BAY101-W416E0676BEEE94FD609CE5B8AD0@phx.gbl>
References: <BAY101-W416E0676BEEE94FD609CE5B8AD0@phx.gbl>
Message-ID: <200902261126.59440.jackm@dev.mellanox.co.il>

"DOWN" means that you do not have a physical link between the ports.  Check your cables -- they may be bad, or badly inserted.

- Jack 


On Thursday 26 February 2009 08:38, lakshmana swamy wrote:
> 
>  Hi All
> 
>   I have been trying to enable IPoIB communication between two machines. The machines are connected back-to-back with an InfiniBand cable, since I don't have an IB switch. The drivers are installed and IP is configured on both machines. The subnet manager (opensmd) is running on one machine.
> 
> The problem is that no communication happens over IB. The "ibstatus" output shows the ports in the "Down" state on both machines. I am unable to figure out where the problem is.
> 
> Operating System  :  Rocks 5.0 (RHEL 5.0)
> OFED : Cisco OFED roll 5.0 (OFED 1.3)
> HCA cards : Mellanox SDR
> 
> Please check the following commands output.
> 
> [root at mattool ~]# ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c1
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      2: Polling
>         rate:            2.5 Gb/sec (1X)
> Infiniband device 'mthca0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c2
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      2: Polling
>         rate:            2.5 Gb/sec (1X)
> 
> [root at mattool ~]# /etc/init.d/openibd status
> 
>   HCA driver loaded
> Configured devices:
> ib0 ib1
> Currently active devices:
> ib0
> ib1
> The following OFED modules are loaded:
>   rdma_ucm
>   qlgc_vnic
>   ib_sdp
>   rdma_cm
>   ib_addr
>   ib_ipoib
>   ib_ipath
>   mlx4_core
>   mlx4_ib
>   ib_mthca
>   ib_uverbs
>   ib_umad
>   ib_sa
>   ib_cm
>   ib_mad
>   ib_core
>   iw_cxgb3
> [root at mattool ~]# 
> 
> 
>  Thanks 
> 
> Laxman
> 


From sashak at voltaire.com  Thu Feb 26 02:05:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 12:05:21 +0200
Subject: [ofa-general] Re: [PATCH 1/6] [ib-diag] ibnetdiscover: add support
	for WinOF
In-Reply-To: <16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com>
Message-ID: <20090226100521.GA11192@sashak.voltaire.com>

Hi Sean,

On 17:46 Wed 18 Feb     , Sean Hefty wrote:
> Mainly fixing datatypes to avoid type mismatches.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
> Also attaching patch in case my mailer wraps the lines.
> 
>  infiniband-diags/src/grouping.c      |   28 ++++++++++++++--------------
>  infiniband-diags/src/ibnetdiscover.c |    8 ++++----
>  2 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
> index 0ea139f..0266af4 100644
> --- a/infiniband-diags/src/grouping.c
> +++ b/infiniband-diags/src/grouping.c
> @@ -265,20 +265,20 @@ int is_chassis_switch(Node *node)
>  }
>  
>  /* these structs help find Line (Anafa) slot number while using spine portnum */
> -int line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
> -int anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
> -int line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
> -int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
> +char line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
> +char anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
> +char line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
> +char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
>  
>  /* IPR FCR modules connectivity while using sFB4 port as reference */
> -int ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
> +char ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
>  
>  /* these structs help find Spine (Anafa) slot number while using spine portnum */
> -int spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> -int spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> -/*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 };
> */
> +char spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +char spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +/* reference                       { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */
>  
>  static void get_sfb_slot(Node *node, Port *lineport)
>  {
> @@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport)
>  static void get_router_slot(Node *node, Port *spineport)
>  {
>  	ChassisRecord *ch = node->chrecord;
> -	int guessnum = 0;
> +	uint64_t guessnum = 0;
>  
>  	if (!ch) {
>  		if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
> @@ -460,7 +460,7 @@ static void insert_line_router(Node *node, ChassisList *chassislist)
>  		return;		/* already filled slot */
>  
>  	chassislist->linenode[i] = node;
> -	node->chrecord->chassisnum = chassislist->chassisnum;
> +	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;
>  }
>  
>  static void insert_spine(Node *node, ChassisList *chassislist)
> @@ -471,7 +471,7 @@ static void insert_spine(Node *node, ChassisList *chassislist)
>  		return;		/* already filled slot */
>  
>  	chassislist->spinenode[i] = node;
> -	node->chrecord->chassisnum = chassislist->chassisnum;
> +	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;

Wouldn't it be better to fix the data definitions and so minimize this
and similar casts? For instance, could a slightly modified patch like the
one below compile cleanly in the WinOF environment (I cannot test it, sorry)?

Sasha
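
To illustrate the suggestion: when both sides of an assignment are declared with the same type, no cast is needed at the point of use. The following is a minimal sketch with simplified stand-in structures. The "_sketch" names are made up here, and the assumption that the record field is an unsigned char is inferred only from the cast in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for ChassisList: only the field of interest. */
struct chassis_list_sketch {
	unsigned char chassisnum;	/* declared as int before the change */
};

/* Simplified stand-in for ChassisRecord (field type is an assumption). */
struct chassis_record_sketch {
	unsigned char chassisnum;
};

/* Both fields share one type, so the assignment compiles without a cast. */
void set_chassisnum(struct chassis_record_sketch *rec,
		    const struct chassis_list_sketch *list)
{
	assert(rec != NULL && list != NULL);
	rec->chassisnum = list->chassisnum;
}
```

The trade-off is that the range limit moves into the declaration, so any code that stores a value into the list field has to respect the narrower type.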


>From 8e8556ba011dab628723736aa32191f54cca4cb5 Mon Sep 17 00:00:00 2001
From: Sean Hefty <sean.hefty at intel.com>
Date: Wed, 18 Feb 2009 17:46:05 -0800
Subject: [PATCH] ibnetdiscover: add support for WinOF

Mainly fixing datatypes to avoid type mismatches.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/include/grouping.h  |    6 +++---
 infiniband-diags/src/grouping.c      |   22 +++++++++++-----------
 infiniband-diags/src/ibnetdiscover.c |    8 ++++----
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h
index e54efef..811e372 100644
--- a/infiniband-diags/include/grouping.h
+++ b/infiniband-diags/include/grouping.h
@@ -48,9 +48,9 @@ typedef struct AllChassisList AllChassisList;
 struct ChassisList {
 	ChassisList *next;
 	uint64_t chassisguid;
-	int chassisnum;
-	int chassistype;
-	int nodecount;		/* used for grouping by SystemImageGUID */
+	unsigned char chassisnum;
+	unsigned char chassistype;
+	unsigned int nodecount;	  /* used for grouping by SystemImageGUID */
 	Node *spinenode[SPINES_MAX_NUM + 1];
 	Node *linenode[LINES_MAX_NUM + 1];
 };
diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
index 0ea139f..048efc7 100644
--- a/infiniband-diags/src/grouping.c
+++ b/infiniband-diags/src/grouping.c
@@ -265,20 +265,20 @@ int is_chassis_switch(Node *node)
 }
 
 /* these structs help find Line (Anafa) slot number while using spine portnum */
-int line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
-int anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
-int line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
-int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
+char line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
+char anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
+char line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
+char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
 
 /* IPR FCR modules connectivity while using sFB4 port as reference */
-int ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
+char ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
 
 /* these structs help find Spine (Anafa) slot number while using spine portnum */
-int spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-/*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
+char spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+/* reference                       { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */
 
 static void get_sfb_slot(Node *node, Port *lineport)
 {
@@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport)
 static void get_router_slot(Node *node, Port *spineport)
 {
 	ChassisRecord *ch = node->chrecord;
-	int guessnum = 0;
+	uint64_t guessnum = 0;
 
 	if (!ch) {
 		if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 948a79d..6946fd7 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -47,7 +47,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibnetdiscover.h"
 #include "grouping.h"
@@ -215,7 +215,7 @@ extend_dpath(ib_dr_path_t *path, int nextport)
 	++path->cnt;
 	if (path->cnt > maxhops_discovered)
 		maxhops_discovered = path->cnt;
-	path->p[path->cnt] = nextport;
+	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
 }
 
@@ -515,7 +515,7 @@ out_ids(Node *node, int group, char *chname)
 }
 
 uint64_t
-out_chassis(int chassisnum)
+out_chassis(unsigned char chassisnum)
 {
 	uint64_t guid;
 
@@ -967,7 +967,7 @@ int main(int argc, char **argv)
 		{ "Router_list", 'R', 0, NULL, "list of connected routers" },
 		{ "node-name-map", 1, 1, "<file>", "node name map file" },
 		{ "ports", 'p', 0, NULL, "obtain a ports report" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "[topology-file]";
 
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Thu Feb 26 02:11:44 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 12:11:44 +0200
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
Message-ID: <20090226101144.GB11192@sashak.voltaire.com>

On 17:46 Wed 18 Feb     , Sean Hefty wrote:
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
> 
>  infiniband-diags/src/ibroute.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
> index 144d1b2..d1049ad 100644
> --- a/infiniband-diags/src/ibroute.c
> +++ b/infiniband-diags/src/ibroute.c
> @@ -45,7 +45,7 @@
>  
>  #include <infiniband/umad.h>
>  #include <infiniband/mad.h>
> -#include <infiniband/complib/cl_nodenamemap.h>
> +#include <complib/cl_nodenamemap.h>
>  
>  #include "ibdiag_common.h"
>  
> @@ -327,7 +327,7 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
>  
>  		for (;i < e; i++) {
>  			unsigned outport = lft[i % IB_SMP_DATA_SIZE];
> -			unsigned valid = (outport <= nports);
> +			unsigned valid = (outport <= (unsigned) nports);

Similar question.

Sasha
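
The difference between casting at the comparison and declaring the count as unsigned can be shown in isolation. This is a small sketch with hypothetical function names, not code from ibroute.c. It shows that the cast silences the warning but also lets a negative count through as a very large unsigned value.

```c
#include <assert.h>

/* Variant 1: the count stays signed and is cast where it is compared. */
unsigned valid_with_cast(unsigned outport, int nports)
{
	return outport <= (unsigned) nports;
}

/* Variant 2: the count is unsigned from the start, so no cast is needed. */
unsigned valid_unsigned(unsigned outport, unsigned nports)
{
	return outport <= nports;
}
```

With variant 1, a call such as valid_with_cast(200, -1) reports the port as valid, because -1 converts to the largest unsigned value. With variant 2 the caller cannot hold a negative count in the first place.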


>From 7127f00d9020b261819d2205557646016fdd6b36 Mon Sep 17 00:00:00 2001
From: Sean Hefty <sean.hefty at intel.com>
Date: Wed, 18 Feb 2009 17:46:38 -0800
Subject: [PATCH] ibroute: add support for WinOF

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/src/ibroute.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 144d1b2..235d122 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -45,7 +45,7 @@
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -54,7 +54,7 @@ static int brief, dump_all, multicast;
 /*******************************************/
 
 char *
-check_switch(ib_portid_t *portid, int *nports, uint64_t *guid,
+check_switch(ib_portid_t *portid, unsigned int *nports, uint64_t *guid,
 	     uint8_t *sw, char *nd)
 {
 	uint8_t ni[IB_SMP_DATA_SIZE] = {0};
@@ -289,7 +289,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 	uint8_t sw[IB_SMP_DATA_SIZE];
 	char str[200], *s;
 	uint64_t nodeguid;
-	int block, i, e, nports, top;
+	int block, i, e, top;
+	unsigned nports;
 	int n = 0, startblock, endblock;
 
 	if ((s = check_switch(portid, &nports, &nodeguid, sw, nd)))
@@ -370,7 +371,7 @@ int main(int argc, char **argv)
 		{ "all", 'a', 0, NULL, "show all lids, even invalid entries" },
 		{ "no_dests", 'n', 0, NULL, "do not try to resolve destinations" },
 		{ "Multicast", 'M', 0, NULL, "show multicast forwarding tables" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "[<dest dr_path|lid|guid> [<startlid> [<endlid>]]]";
 	const char *usage_examples[] = {
-- 
1.6.1.2.319.gbd9e


From jackm at dev.mellanox.co.il  Thu Feb 26 02:38:26 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Feb 2009 12:38:26 +0200
Subject: [ofa-general] [PATCH] mlx4_core: Add device IDs for MT25458 10GigE
	devices
Message-ID: <200902261238.26437.jackm@dev.mellanox.co.il>

Add device IDs for the Mellanox MT25458
ConnectX EN 10GBASE-T 10GigE Ethernet devices.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 6ef2490..84db33b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -1230,6 +1230,8 @@ static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIe gen2 */
 	{ PCI_VDEVICE(MELLANOX, 0x6368) }, /* MT25408 "Hermon" EN 10GigE */
 	{ PCI_VDEVICE(MELLANOX, 0x6750) }, /* MT25408 "Hermon" EN 10GigE PCIe gen2 */
+	{ PCI_VDEVICE(MELLANOX, 0x6372) }, /* MT25458 ConnectX EN 10GBASE-T 10GigE */
+	{ PCI_VDEVICE(MELLANOX, 0x675a) }, /* MT25458 ConnectX EN 10GBASE-T+Gen2 10GigE */
 	{ 0, }
 };
 

From klakshman03 at hotmail.com  Thu Feb 26 02:59:57 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Thu, 26 Feb 2009 16:29:57 +0530
Subject: [ofa-general] ***SPAM*** RE: Problem in IB network without Switch
In-Reply-To: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
Message-ID: <BAY101-W33CDAC8F19E0F0E12CBE24B8AD0@phx.gbl>


 Hi Jack and Mahesh

 Thank you for your response.

I have changed the HCA card as well as the IB cables, but it made no difference.


 How can I perform diagnostics? Please help me out.

Thank you

 Laxman


> Date: Thu, 26 Feb 2009 14:01:07 +0530
> Subject: Re: Problem in IB network without Switch
> From: keshetti.mahesh at gmail.com
> To: klakshman03 at hotmail.com
> CC: general at lists.openfabrics.org
> 
> Hi,
> 
> > phys state:      2: Polling
> 
> On both machines the physical state is 'Polling', i.e. there is no
> working physical link between the two ports. Check the physical
> connection first. Only after the state becomes
> 
> phys state:      5: LinkUp
> 
> you will be able to enable any IB communication on this interface.
> 
> -Mahesh


From vlad at lists.openfabrics.org  Thu Feb 26 03:18:58 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 26 Feb 2009 03:18:58 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090226-0200 daily build status
Message-ID: <20090226111858.D3C13E60CB0@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From jackm at dev.mellanox.co.il  Thu Feb 26 03:35:59 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Feb 2009 13:35:59 +0200
Subject: [ofa-general] Re: Problem in IB network without Switch
In-Reply-To: <BAY101-W33CDAC8F19E0F0E12CBE24B8AD0@phx.gbl>
References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
	<BAY101-W33CDAC8F19E0F0E12CBE24B8AD0@phx.gbl>
Message-ID: <200902261335.59927.jackm@dev.mellanox.co.il>

On Thursday 26 February 2009 12:59, lakshmana swamy wrote:

Please send me the output of console command:  ibstat
Maybe you have old FW.

- Jack

> 
>  Hi Jack and Mahesh
> 
>  Thank you for your response.
> 
> I have changed the HCA card as well as the IB cables, but it made no difference.
> 
> 
>  How can I perform diagnostics? Please help me out.
> 
> Thank you
> 
>  Laxman
> 
> 
> 
> > Date: Thu, 26 Feb 2009 14:01:07 +0530
> > Subject: Re: Problem in IB network without Switch
> > From: keshetti.mahesh at gmail.com
> > To: klakshman03 at hotmail.com
> > CC: general at lists.openfabrics.org
> > 
> > Hi,
> > 
> > > phys state:      2: Polling
> > 
> > On both machines the physical state is 'Polling', i.e. there is no
> > working physical link between the two ports. Check the physical
> > connection first. Only after the state becomes
> > 
> > phys state:      5: LinkUp
> > 
> > you will be able to enable any IB communication on this interface.
> > 
> > -Mahesh
> 


From klakshman03 at hotmail.com  Thu Feb 26 03:53:30 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Thu, 26 Feb 2009 17:23:30 +0530
Subject: [ofa-general] ***SPAM*** RE: Problem in IB network without Switch
In-Reply-To: <200902261335.59927.jackm@dev.mellanox.co.il>
References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
	<BAY101-W33CDAC8F19E0F0E12CBE24B8AD0@phx.gbl>
	<200902261335.59927.jackm@dev.mellanox.co.il>
Message-ID: <BAY101-W3170561A2230014F98DBC1B8AD0@phx.gbl>


 Hi Jack,

Please find below the hca_self_test output from mattool and the ibstat output from both nodes.

[root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed 

---- Performing InfiniBand HCA Self Test ----
Number of HCAs Detected ................ 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... OFED-1.3 1.3-2.6.18_53.1.14.el5 
Host Driver RPM Check .................. PASS
HCA Type of HCA #0 ..................... Cougar
/opt/ofed/extras/hca_self_test.ofed: line 227: [: =: unary operator expected
HCA Firmware Check ..................... FAIL
    REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917)
Host Driver Initialization ............. PASS
Number of HCA Ports Active ............. 0
Port State of Port #0 on HCA #0 ........ DOWN
Port State of Port #1 on HCA #0 ........ DOWN
Error Counter Check on HCA #0 .......... PASS
Kernel Syslog Check .................... PASS
Node GUID .............................. 00:02:c9:01:08:cd:13:c0
------------------ DONE ---------------------

[root at mattool ~]# 

************ IBSTAT output ******************


[root at mattool ~]# ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.1.0
        Hardware version: a1
        Node GUID: 0x0002c90108cd13c0
        System image GUID: 0x0002c90108cd13c0
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90108cd13c1
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a6a
                Port GUID: 0x0002c90108cd13c2
[root at mattool ~]# 

[root at compute-0-0 ~]# ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.0.0
        Hardware version: a1
        Node GUID: 0x0002c9020000114c
        System image GUID: 0x0002c9020000114f
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00110a68
                Port GUID: 0x0002c9020000114d
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00110a68
                Port GUID: 0x0002c9020000114e
[root at compute-0-0 ~]# 

  Thanking you

  laxman


> From: jackm at dev.mellanox.co.il
> To: klakshman03 at hotmail.com
> Subject: Re: Problem in IB network without Switch
> Date: Thu, 26 Feb 2009 13:35:59 +0200
> CC: keshetti.mahesh at gmail.com; general at lists.openfabrics.org
> 
> On Thursday 26 February 2009 12:59, lakshmana swamy wrote:
> 
> Please send me the output of console command:  ibstat
> Maybe you have old FW.
> 
> - Jack
> 
> > 
> >  Hi Jack and Mahesh
> > 
> >  Thank you for your response.
> > 
> > I have changed the HCA card as well as the IB cables, but it made no difference.
> > 
> > 
> >  How can I perform diagnostics? Please help me out.
> > 
> > Thank you
> > 
> >  Laxman
> > 
> > 
> > 
> > > Date: Thu, 26 Feb 2009 14:01:07 +0530
> > > Subject: Re: Problem in IB network without Switch
> > > From: keshetti.mahesh at gmail.com
> > > To: klakshman03 at hotmail.com
> > > CC: general at lists.openfabrics.org
> > > 
> > > Hi,
> > > 
> > > > phys state:      2: Polling
> > > 
> > > On both machines the physical state is 'Polling', i.e. there is no
> > > working physical link between the two ports. Check the physical
> > > connection first. Only after the state becomes
> > > 
> > > phys state:      5: LinkUp
> > > 
> > > you will be able to enable any IB communication on this interface.
> > > 
> > > -Mahesh
> > 

From hal.rosenstock at gmail.com  Thu Feb 26 04:03:02 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 07:03:02 -0500
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to 
	osm_get_all_port_attrs
In-Reply-To: <20090226070629.GU11192@sashak.voltaire.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
Message-ID: <f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>

Sasha,

On Thu, Feb 26, 2009 at 2:06 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 10:30 Wed 18 Feb     , Hal Rosenstock wrote:
>>
>> Only supported in osm_vendor_ibumad.c (separate patch for other
>> vendor layers)
>> Also, update applications using this (osmtest, opensm)
>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> ---
>>  opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++-----
>>  opensm/opensm/main.c                 |    6 ++++++
>>  opensm/osmtest/main.c                |   11 +++++++++++
>>  opensm/osmtest/osmtest.c             |    7 +++++++
>>  4 files changed, 43 insertions(+), 5 deletions(-)
>>
>> diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
>> index 734a860..861bfbe 100644
>> --- a/opensm/libvendor/osm_vendor_ibumad.c
>> +++ b/opensm/libvendor/osm_vendor_ibumad.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       umad_ca_t ca;
>>       ib_port_attr_t *attr = p_attr_array;
>>       unsigned done = 0;
>> -     int r, i, j;
>> +     int r, i, j, k;
>>
>>       OSM_LOG_ENTER(p_vend->p_log);
>>
>>       CL_ASSERT(p_vend && p_num_ports);
>>
>> +     r = 0;
>>       if (!*p_num_ports) {
>>               r = IB_INVALID_PARAMETER;
>>               OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: "
>> @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       }
>>
>>       for (i = 0; i < p_vend->ca_count && !done; i++) {
>> -             /*
>> -              * For each CA, retrieve the port guids
>> -              */
>> +             /* For each CA, retrieve the port attributes */
>>               if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) {
>>                       if (ca.node_type < 1 || ca.node_type > 3)
>>                               continue;
>> @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>                               attr->port_num = ca.ports[j]->portnum;
>>                               attr->sm_lid = ca.ports[j]->sm_lid;
>>                               attr->link_state = ca.ports[j]->state;
>> +                             attr->num_pkeys = ca.ports[j]->pkeys_size;
>> +                             if (attr->num_pkeys && attr->p_pkey_table) {
>> +                                     if (attr->num_pkeys < ca.ports[j]->pkeys_size) {
>
> You are doing:
>
>        attr->num_pkeys = ca.ports[j]->pkeys_size;
>
> , just two lines above, so this check will be always false.

Oops; I'll fix in the next version.
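
A minimal sketch of one way to make the size check meaningful, assuming
the caller passes the capacity of its table in num_pkeys: save that
capacity before the field is overwritten with the port's pkey count. The
struct and function names are hypothetical stand-ins, not the OpenSM API,
and the cl_hton16() byte-order conversion from the patch is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the pkey part of ib_port_attr_t. */
struct pkey_attr {
	uint16_t num_pkeys;     /* in: table capacity; out: pkeys on the port */
	uint16_t *p_pkey_table; /* caller-provided table, or NULL to skip */
};

/*
 * Copy a port's pkeys into the caller's table.
 * Returns 0 on success, -1 if the table is too small. In both cases
 * num_pkeys is set to the number of pkeys the port really has.
 */
int copy_pkeys(struct pkey_attr *attr, const uint16_t *port_pkeys,
	       uint16_t port_pkeys_size)
{
	uint16_t capacity, k;

	assert(attr != NULL);

	/* Save the capacity before num_pkeys is overwritten. */
	capacity = attr->num_pkeys;
	attr->num_pkeys = port_pkeys_size;

	if (!capacity || !attr->p_pkey_table)
		return 0;	/* caller did not ask for pkeys */

	if (capacity < port_pkeys_size)
		return -1;	/* nothing copied, required size reported */

	for (k = 0; k < port_pkeys_size; k++)
		attr->p_pkey_table[k] = port_pkeys[k];
	return 0;
}
```

Because num_pkeys always ends up holding the real count, a caller that
gets -1 back knows how large a table to provide before retrying.
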

>> +                                             r = IB_INSUFFICIENT_MEMORY;
>> +                                             OSM_LOG(p_vend->p_log,
>> +                                                     OSM_LOG_ERROR,
>> +                                                     "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
>> +                                                     j,
>> +                                                     ca.ports[j]->pkeys_size);
>
> Also, should it be an error? Maybe it is enough just to fill the requested
> pkey entries?

I agree that being more forgiving is better, but then how would the
caller know that the pkeys are being truncated?

Also, it seems to be the style of the API (it is what is done for ports):
one can't request an individual port, only all ports.

>> +                                             goto Exit;
>> +                                     }
>> +                                     for (k = 0; k < attr->num_pkeys; k++)
>> +                                             attr->p_pkey_table[k] =
>> +                                                     cl_hton16(ca.ports[j]->pkeys[k]);
>> +                             }
>>                               attr++;
>>                               if (attr - p_attr_array > *p_num_ports) {
>>                                       done = 1;
>> @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       }
>>
>>       *p_num_ports = attr - p_attr_array;
>> -     r = 0;
>>
>>  Exit:
>>       OSM_LOG_EXIT(p_vend->p_log);
>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>> index 73a6274..503d7fa 100644
>> --- a/opensm/opensm/main.c
>> +++ b/opensm/opensm/main.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid)
>>       uint32_t i, choice = 0;
>>       ib_api_status_t status;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>
> Here and below. Just
>
>        memset(attr_array, 0, sizeof(attr_array));
>
> would be enough.

Sure; next version.
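
A minimal sketch of the memset() approach, using a hypothetical cut-down
struct in place of ib_port_attr_t. One caveat worth noting: sizeof(attr_array)
gives the full array size only where attr_array is a real array in scope; in
a helper that receives a pointer, the element count has to be passed in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 4	/* hypothetical stand-in for MAX_LOCAL_IBPORTS */

/* Hypothetical, cut-down version of ib_port_attr_t. */
struct port_attr {
	uint64_t port_guid;
	uint16_t num_pkeys;
	uint16_t *p_pkey_table;
};

/* Zero a whole attribute array; count is the number of elements. */
void clear_port_attrs(struct port_attr *attrs, size_t count)
{
	assert(attrs != NULL);
	memset(attrs, 0, count * sizeof(*attrs));
}
```

This clears num_pkeys and p_pkey_table along with every other field, on the
assumption (true on the platforms OpenSM normally runs on) that an
all-zero-bits pointer is a null pointer.
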

-- Hal

> Sasha
>
>>       /* Call the transport layer for a list of local port GUID values */
>>       status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array,
>>                                             &num_ports);
>> diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
>> index b360af6..83c1e13 100644
>> --- a/opensm/osmtest/main.c
>> +++ b/opensm/osmtest/main.c
>> @@ -1,6 +1,7 @@
>>  /*
>>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt)
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>>       int i;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>          Call the transport layer for a list of local port
>>          GUID values.
>> @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid)
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>>       int i;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>          Call the transport layer for a list of local port
>>          GUID values.
>> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
>> index a7b343f..986a8d2 100644
>> --- a/opensm/osmtest/osmtest.c
>> +++ b/opensm/osmtest/osmtest.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt,
>>       ib_api_status_t status;
>>       uint32_t num_ports = MAX_LOCAL_IBPORTS;
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>> +     int i;
>>
>>       OSM_LOG_ENTER(&p_osmt->log);
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>        * Call the transport layer for a list of local port
>>        * GUID values.
>> --
>> 1.5.6.4
>>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From hal.rosenstock at gmail.com  Thu Feb 26 04:03:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 07:03:12 -0500
Subject: Re: [ofa-general] Re: [PATCH] Add pkey table support to
	osm_get_all_port_attrs
In-Reply-To: <20090226071059.GV11192@sashak.voltaire.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226071059.GV11192@sashak.voltaire.com>
Message-ID: <f0e08f230902260403tc20661fmdcba5156dc40fe90@mail.gmail.com>

On Thu, Feb 26, 2009 at 2:10 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 10:30 Wed 18 Feb     , Hal Rosenstock wrote:
>>
>> Only supported in osm_vendor_ibumad.c (separate patch for other
>> vendor layers)
>> Also, update applications using this (osmtest, opensm)
>
> It looks like ibutils (ibis) requires the same fix (attr_array
> initialization) too.

Yes, I'm aware but didn't want to send those until these were accepted.

-- Hal

>
> Sasha
>
>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> ---
>>  opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++-----
>>  opensm/opensm/main.c                 |    6 ++++++
>>  opensm/osmtest/main.c                |   11 +++++++++++
>>  opensm/osmtest/osmtest.c             |    7 +++++++
>>  4 files changed, 43 insertions(+), 5 deletions(-)
>>
>> diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
>> index 734a860..861bfbe 100644
>> --- a/opensm/libvendor/osm_vendor_ibumad.c
>> +++ b/opensm/libvendor/osm_vendor_ibumad.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       umad_ca_t ca;
>>       ib_port_attr_t *attr = p_attr_array;
>>       unsigned done = 0;
>> -     int r, i, j;
>> +     int r, i, j, k;
>>
>>       OSM_LOG_ENTER(p_vend->p_log);
>>
>>       CL_ASSERT(p_vend && p_num_ports);
>>
>> +     r = 0;
>>       if (!*p_num_ports) {
>>               r = IB_INVALID_PARAMETER;
>>               OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: "
>> @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       }
>>
>>       for (i = 0; i < p_vend->ca_count && !done; i++) {
>> -             /*
>> -              * For each CA, retrieve the port guids
>> -              */
>> +             /* For each CA, retrieve the port attributes */
>>               if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) {
>>                       if (ca.node_type < 1 || ca.node_type > 3)
>>                               continue;
>> @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>                               attr->port_num = ca.ports[j]->portnum;
>>                               attr->sm_lid = ca.ports[j]->sm_lid;
>>                               attr->link_state = ca.ports[j]->state;
>> +                             attr->num_pkeys = ca.ports[j]->pkeys_size;
>> +                             if (attr->num_pkeys && attr->p_pkey_table) {
>> +                                     if (attr->num_pkeys < ca.ports[j]->pkeys_size) {
>> +                                             r = IB_INSUFFICIENT_MEMORY;
>> +                                             OSM_LOG(p_vend->p_log,
>> +                                                     OSM_LOG_ERROR,
>> +                                                     "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
>> +                                                     j,
>> +                                                     ca.ports[j]->pkeys_size);
>> +                                             goto Exit;
>> +                                     }
>> +                                     for (k = 0; k < attr->num_pkeys; k++)
>> +                                             attr->p_pkey_table[k] =
>> +                                                     cl_hton16(ca.ports[j]->pkeys[k]);
>> +                             }
>>                               attr++;
>>                               if (attr - p_attr_array > *p_num_ports) {
>>                                       done = 1;
>> @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
>>       }
>>
>>       *p_num_ports = attr - p_attr_array;
>> -     r = 0;
>>
>>  Exit:
>>       OSM_LOG_EXIT(p_vend->p_log);
>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>> index 73a6274..503d7fa 100644
>> --- a/opensm/opensm/main.c
>> +++ b/opensm/opensm/main.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid)
>>       uint32_t i, choice = 0;
>>       ib_api_status_t status;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /* Call the transport layer for a list of local port GUID values */
>>       status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array,
>>                                             &num_ports);
>> diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
>> index b360af6..83c1e13 100644
>> --- a/opensm/osmtest/main.c
>> +++ b/opensm/osmtest/main.c
>> @@ -1,6 +1,7 @@
>>  /*
>>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt)
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>>       int i;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>          Call the transport layer for a list of local port
>>          GUID values.
>> @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid)
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>>       int i;
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>          Call the transport layer for a list of local port
>>          GUID values.
>> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
>> index a7b343f..986a8d2 100644
>> --- a/opensm/osmtest/osmtest.c
>> +++ b/opensm/osmtest/osmtest.c
>> @@ -2,6 +2,7 @@
>>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This software is available to you under a choice of one of two
>>   * licenses.  You may choose to be licensed under the terms of the GNU
>> @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt,
>>       ib_api_status_t status;
>>       uint32_t num_ports = MAX_LOCAL_IBPORTS;
>>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
>> +     int i;
>>
>>       OSM_LOG_ENTER(&p_osmt->log);
>>
>> +     for (i = 0; i < num_ports; i++) {
>> +             attr_array[i].num_pkeys = 0;
>> +             attr_array[i].p_pkey_table = NULL;
>> +     }
>> +
>>       /*
>>        * Call the transport layer for a list of local port
>>        * GUID values.
>> --
>> 1.5.6.4
>>


From hal.rosenstock at gmail.com  Thu Feb 26 04:03:32 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 07:03:32 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/console: Enhance
	perfmgr print_counters for better nodenames
In-Reply-To: <20090226061551.GQ11192@sashak.voltaire.com>
References: <20090219130653.GA29318@comcast.net>
	<20090226061551.GQ11192@sashak.voltaire.com>
Message-ID: <f0e08f230902260403o2e266802t43fb893f0dd6ade0@mail.gmail.com>

On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

[snip...]

> And in general I think it is better to use C-style comments - /* ... */,
> in C code and not C++-style // ... .

Is this going to be enforced uniformly across OpenSM ?

-- Hal

> Sasha


From hal.rosenstock at gmail.com  Thu Feb 26 04:04:39 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 07:04:39 -0500
Subject: Re: [ofa-general] [PATCH] libibmad: remove
	functions which use pthread
In-Reply-To: <20090226051012.GH11192@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com>
	<20081231170413.GD21950@sashak.voltaire.com>
	<f0e08f230902160652t44e13ce7tc15bec3c34dd626b@mail.gmail.com>
	<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov>
	<f0e08f230902171312u74d5effew4b35253faa7b5c4b@mail.gmail.com>
	<20090217142859.9e7a7e22.weiny2@llnl.gov>
	<f0e08f230902171521w25c3da8ft62fc05206800f49b@mail.gmail.com>
	<20090218003355.GX7189@sashak.voltaire.com>
	<f0e08f230902180720w25f74a8cs8c659757f331d425@mail.gmail.com>
	<20090226051012.GH11192@sashak.voltaire.com>
Message-ID: <f0e08f230902260404w2484edefl35c4227fd03504bb@mail.gmail.com>

Sasha,

On Thu, Feb 26, 2009 at 12:10 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 10:20 Wed 18 Feb     , Hal Rosenstock wrote:
>> On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> > On 18:21 Tue 17 Feb     , Hal Rosenstock wrote:
>> >> >
>> >> > For utilities which run once through I think the old functions work just
>> >> > fine.
>> >>
>> >> Well, sort of... Aren't mad_portid "collisions" possible when multiple
>> >> programs are run concurrently ?
>> >
>> > No.
>>
>> With the old API, mad_portid can be overwritten by another process or
>> thread. Another thread is not an expected use case but it is possible.
>
> Yes, but you asked about "collisions" between different programs
> (processes) run.

Another language issue.

-- Hal

>
> Sasha
>


From hal.rosenstock at gmail.com  Thu Feb 26 04:22:40 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 07:22:40 -0500
Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference 
	in local_completions()
In-Reply-To: <aday6vuqar9.fsf@cisco.com>
References: <1235608563.3948.199.camel@chromite.mv.qlogic.com>
	<aday6vuqar9.fsf@cisco.com>
Message-ID: <f0e08f230902260422t36ebaf04waa297f04d7437e9a@mail.gmail.com>

On Wed, Feb 25, 2009 at 7:53 PM, Roland Dreier <rdreier at cisco.com> wrote:
> This looks fine to me.  Hal and/or Sean, any comment?

This looks right to me too.

-- Hal


From kliteyn at dev.mellanox.co.il  Thu Feb 26 04:22:14 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 26 Feb 2009 14:22:14 +0200
Subject: [ofa-general] [PATCH v2] opensm/osm_node_info_rcv.c: create physp
 for the newly discovered port of the known node
Message-ID: <49A68976.6000404@dev.mellanox.co.il>

Hi Sasha,

[v2: adding CL_ASSERT() and changing comments]

This patch fixes bugzilla issue #1515.

The bug was discovered and analyzed by Line Holen.

Topology:
                 |---------------|
                 |      SW2      |
                 |---------------|
                   |x |y    |z |v
              |----|  |     |  |----|
              |       |     |       |
              |  |----|     |----|  |
              |  |               |  |
             a| b|              c| d|
      |---------------|     |---------------|
      |       SW1     |     |     SW3       |
      |---------------|     |---------------|
          |                             |
          |                             |
       HCA with SM                      HCA

During the discovery:

SM sends NodeInfo request to SW1
SM sends NodeInfo request to SW2 through link a->x
SM discovers new node SW2:
  - updates DR to SW2 to go through link a->x
  - creates physp x
SM sends NodeInfo request to SW2 through link b->y
SM discovers a known node SW2
  - DOES NOT create physp y
  - updates DR to SW2 to go through link b->y

From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
port y any more, leaving it uninitialized (no physp object for this port).

The fix is to create a physp for the newly discovered port of the known
switch node, the same way as is done for HCAs.
I also added one log message for the case that showed the problem - when
one of the link sides is uninitialized (no valid ports check). Perhaps
this log message should be an error message instead?

Debugged-by: Line Holen <Line.Holen at Sun.COM>
Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

---
 opensm/opensm/osm_node_info_rcv.c |   35 ++++++++++++++++++++++++++---------
 1 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index c52c0d5..4d3724c 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -154,18 +154,17 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
 		goto _exit;
 	}

-	/*
-	   We have seen this neighbor node before, but we might
-	   not have seen this port on the neighbor node before.
-	   We should not set links to an uninitialized port on the
-	   neighbor, so check validity up front.  If it's not
-	   valid, do nothing, since we'll see this link again
-	   when we probe the neighbor.
-	 */
+	/* When setting the link, ports on both
+	   sides of the link should be initialized */
 	if (!osm_node_link_has_valid_ports(p_node, port_num,
 					   p_neighbor_node,
-					   p_ni_context->port_num))
+					   p_ni_context->port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
+		CL_ASSERT(0);
 		goto _exit;
+	}

 	if (osm_node_link_exists(p_node, port_num,
 				 p_neighbor_node, p_ni_context->port_num)) {
@@ -537,8 +536,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 				     IN osm_node_t * const p_node,
 				     IN const osm_madw_t * const p_madw)
 {
+
+	ib_smp_t *p_smp;
+	ib_node_info_t *p_ni;
+	uint8_t port_num;
+
 	OSM_LOG_ENTER(sm->p_log);

+	p_smp = osm_madw_get_smp_ptr(p_madw);
+	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	port_num = ib_node_info_get_local_port_num(p_ni);
+
+	if (!osm_node_get_physp_ptr(p_node, port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Creating physp for node GUID:0x%"
+			PRIx64 ", port %u\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)),
+			port_num);
+		osm_node_init_physp(p_node, p_madw);
+	}
+
 	/*
 	   If this switch has already been probed during this sweep,
 	   then don't bother reprobing it.
-- 
1.5.1.4


From ogerlitz at voltaire.com  Thu Feb 26 04:38:09 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Feb 2009 14:38:09 +0200 (IST)
Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL /
	PortRcvDataSL support
Message-ID: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>

libibmad implementation of PortXmtDataSL (IBA A13.6.5) / PortRcvDataSL
(IBA A13.6.6) reading and resetting

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: management/libibmad/include/infiniband/mad.h
===================================================================
--- management.orig/libibmad/include/infiniband/mad.h
+++ management/libibmad/include/infiniband/mad.h
@@ -153,7 +153,8 @@ enum GSI_ATTR_ID {
 	IB_GSI_PORT_SAMPLES_RESULT = 0x11,
 	IB_GSI_PORT_COUNTERS = 0x12,
 	IB_GSI_PORT_COUNTERS_EXT = 0x1D,
-
+	IB_GSI_PORT_XMIT_DATA_SL = 0x36,
+	IB_GSI_PORT_RCV_DATA_SL  = 0x37,
 	IB_GSI_ATTR_LAST
 };

@@ -421,6 +422,28 @@ enum MAD_FIELDS {
 	IB_PC_XMT_WAIT_F,
 	IB_PC_LAST_F,

+	IB_PC_XMT_DATA_SL_FIRST_F,
+	IB_PC_XMT_DATA_SL0_F = IB_PC_XMT_DATA_SL_FIRST_F,
+	IB_PC_XMT_DATA_SL1_F,
+	IB_PC_XMT_DATA_SL2_F,
+	IB_PC_XMT_DATA_SL3_F,
+	IB_PC_XMT_DATA_SL4_F,
+	IB_PC_XMT_DATA_SL5_F,
+	IB_PC_XMT_DATA_SL6_F,
+	IB_PC_XMT_DATA_SL7_F,
+	IB_PC_XMT_DATA_SL_LAST_F,
+
+	IB_PC_RCV_DATA_SL_FIRST_F,
+	IB_PC_RCV_DATA_SL0_F = IB_PC_RCV_DATA_SL_FIRST_F,
+	IB_PC_RCV_DATA_SL1_F,
+	IB_PC_RCV_DATA_SL2_F,
+	IB_PC_RCV_DATA_SL3_F,
+	IB_PC_RCV_DATA_SL4_F,
+	IB_PC_RCV_DATA_SL5_F,
+	IB_PC_RCV_DATA_SL6_F,
+	IB_PC_RCV_DATA_SL7_F,
+	IB_PC_RCV_DATA_SL_LAST_F,
+
 	/*
 	 * SMInfo
 	 */
@@ -793,6 +816,16 @@ MAD_EXPORT uint8_t *port_performance_ext
 MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest,
 					       int port, unsigned mask,
 					       unsigned timeout);
+MAD_EXPORT uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest,
+					       int port, unsigned timeout);
+MAD_EXPORT uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest,
+					       int port, unsigned timeout);
+MAD_EXPORT uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest,
+					       int port, unsigned mask,
+					       unsigned timeout);
+MAD_EXPORT uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest,
+					       int port, unsigned mask,
+					       unsigned timeout);
 MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
 					       int port, unsigned timeout);
 MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
@@ -830,7 +863,8 @@ MAD_EXPORT ib_mad_dump_fn
     mad_dump_mtu, mad_dump_vlcap, mad_dump_opervls,
     mad_dump_node_type, mad_dump_sltovl, mad_dump_vlarbitration,
     mad_dump_nodedesc, mad_dump_nodeinfo, mad_dump_portinfo,
-    mad_dump_switchinfo, mad_dump_perfcounters, mad_dump_perfcounters_ext;
+    mad_dump_switchinfo, mad_dump_perfcounters, mad_dump_perfcounters_ext,
+    mad_dump_perfcounters_xmt_sl, mad_dump_perfcounters_rcv_sl;

 extern int ibdebug;

Index: management/libibmad/src/fields.c
===================================================================
--- management.orig/libibmad/src/fields.c
+++ management/libibmad/src/fields.c
@@ -262,6 +262,26 @@ static const ib_field_t ib_mad_f[] = {
 	{320, 32, "XmtWait", mad_dump_uint},
 	{0, 0},			/* IB_PC_LAST_F */

+	{32,  32, "XmtDataSL0", mad_dump_uint},
+	{64,  32, "XmtDataSL1", mad_dump_uint},
+	{96,  32, "XmtDataSL2", mad_dump_uint},
+	{128, 32, "XmtDataSL3", mad_dump_uint},
+	{160, 32, "XmtDataSL4", mad_dump_uint},
+	{192, 32, "XmtDataSL5", mad_dump_uint},
+	{224, 32, "XmtDataSL6", mad_dump_uint},
+	{256, 32, "XmtDataSL7", mad_dump_uint},
+	{0, 0},			/* IB_PC_XMT_DATA_SL_LAST_F */
+
+	{32,  32, "RcvDataSL0", mad_dump_uint},
+	{64,  32, "RcvDataSL1", mad_dump_uint},
+	{96,  32, "RcvDataSL2", mad_dump_uint},
+	{128, 32, "RcvDataSL3", mad_dump_uint},
+	{160, 32, "RcvDataSL4", mad_dump_uint},
+	{192, 32, "RcvDataSL5", mad_dump_uint},
+	{224, 32, "RcvDataSL6", mad_dump_uint},
+	{256, 32, "RcvDataSL7", mad_dump_uint},
+	{0, 0},			/* IB_PC_RCV_DATA_SL_LAST_F */
+
 	/*
 	 * SMInfo
 	 */
Index: management/libibmad/src/gs.c
===================================================================
--- management.orig/libibmad/src/gs.c
+++ management/libibmad/src/gs.c
@@ -193,6 +193,18 @@ uint8_t *port_performance_ext_query(void
 	return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_COUNTERS_EXT);
 }

+uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest, int port,
+					unsigned timeout)
+{
+	return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_XMIT_DATA_SL);
+}
+
+uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest, int port,
+					unsigned timeout)
+{
+	return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_RCV_DATA_SL);
+}
+
 uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned mask,
 					unsigned timeout, const void *srcport)
@@ -208,6 +220,20 @@ uint8_t *port_performance_ext_reset(void
 				 IB_GSI_PORT_COUNTERS_EXT);
 }

+uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest, int port,
+				    unsigned mask, unsigned timeout)
+{
+	return performance_reset(rcvbuf, dest, port, mask, timeout,
+				 IB_GSI_PORT_XMIT_DATA_SL);
+}
+
+uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest, int port,
+				    unsigned mask, unsigned timeout)
+{
+	return performance_reset(rcvbuf, dest, port, mask, timeout,
+				 IB_GSI_PORT_RCV_DATA_SL);
+}
+
 uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
 					int port, unsigned timeout,
 					const void *srcport)
Index: management/libibmad/src/libibmad.map
===================================================================
--- management.orig/libibmad/src/libibmad.map
+++ management/libibmad/src/libibmad.map
@@ -22,6 +22,8 @@ IBMAD_1.3 {
 		mad_dump_opervls;
 		mad_dump_perfcounters;
 		mad_dump_perfcounters_ext;
+		mad_dump_perfcounters_xmt_sl;
+		mad_dump_perfcounters_rcv_sl;
 		mad_dump_physportstate;
 		mad_dump_portcapmask;
 		mad_dump_portinfo;
@@ -45,6 +47,10 @@ IBMAD_1.3 {
 		port_performance_reset;
 		port_performance_ext_query;
 		port_performance_ext_reset;
+		port_performance_xmt_sl_query;
+		port_performance_rcv_sl_query;
+		port_performance_xmt_sl_reset;
+		port_performance_rcv_sl_reset;
 		port_samples_control_query;
 		port_samples_result_query;
 		mad_build_pkt;
Index: management/libibmad/src/dump.c
===================================================================
--- management.orig/libibmad/src/dump.c
+++ management/libibmad/src/dump.c
@@ -699,6 +699,16 @@ void mad_dump_perfcounters_ext(char *buf
 	_dump_fields(buf, bufsz, val, IB_PC_EXT_FIRST_F, IB_PC_EXT_LAST_F);
 }

+void mad_dump_perfcounters_xmt_sl(char *buf, int bufsz, void *val, int valsz)
+{
+	_dump_fields(buf, bufsz, val, IB_PC_XMT_DATA_SL_FIRST_F, IB_PC_XMT_DATA_SL_LAST_F);
+}
+
+void mad_dump_perfcounters_rcv_sl(char *buf, int bufsz, void *val, int valsz)
+{
+	_dump_fields(buf, bufsz, val, IB_PC_RCV_DATA_SL_FIRST_F, IB_PC_RCV_DATA_SL_LAST_F);
+}
+
 void xdump(FILE * file, char *msg, void *p, int size)
 {
 #define HEX(x)  ((x) < 10 ? '0' + (x) : 'a' + ((x) -10))
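
The bit offsets in the fields.c hunk appear intended to start at bit 32 and
advance by 32 bits per SL. A minimal sketch of that arithmetic follows; the
helper names are hypothetical (not libibmad functions), and it assumes the
counters sit in the attribute buffer in network byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Bit offset of the SLn counter: first counter at bit 32, 32 bits each. */
unsigned sl_counter_bit_offset(unsigned sl)
{
	assert(sl < 8);
	return 32 + 32 * sl;
}

/* Read the SLn counter from a raw attribute buffer (network byte order). */
uint32_t sl_counter_get(const uint8_t *buf, unsigned sl)
{
	const uint8_t *p = buf + sl_counter_bit_offset(sl) / 8;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Under this pattern SL5 lands at bit 192 (byte 24), between SL4 at 160 and
SL6 at 224.
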


From ogerlitz at voltaire.com  Thu Feb 26 04:39:50 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Feb 2009 14:39:50 +0200 (IST)
Subject: [ofa-general] [PATCH 2/2] perfquery: add PortXmtDataSL/PortRcvDataSL
 read and reset
In-Reply-To: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
Message-ID: <Pine.LNX.4.64.0902261438200.29061@zuben.voltaire.com>

perfquery PortXmtDataSL/PortRcvDataSL (IBA A13.6.5/6) support

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: management/infiniband-diags/src/perfquery.c
===================================================================
--- management.orig/infiniband-diags/src/perfquery.c
+++ management/infiniband-diags/src/perfquery.c
@@ -307,7 +307,50 @@ static void reset_counters(int extended,
 	}
 }

-static int reset, reset_only, all_ports, loop_ports, port, extended;
+static int reset, reset_only, all_ports, loop_ports, port, extended, xmt_sl, rcv_sl;
+
+void xmt_sl_query(ib_portid_t *portid, int port, int mask)
+{
+	char buf[1024];
+
+	if (reset_only) {
+		if (!port_performance_xmt_sl_reset(pc, portid, port, mask, ibd_timeout))
+			IBERROR("perfslreset");
+		return;
+	}
+
+	if (!port_performance_xmt_sl_query(pc, portid, port, ibd_timeout))
+		IBERROR("perfslquery");
+
+	mad_dump_perfcounters_xmt_sl(buf, sizeof buf, pc, sizeof pc);
+	printf("# Port counters: %s port %d\n%s", portid2str(portid), port, buf);
+
+	if(reset)
+		if (!port_performance_xmt_sl_reset(pc, portid, port, mask, ibd_timeout))
+			IBERROR("perfslreset");
+}
+
+void rcv_sl_query(ib_portid_t *portid, int port, int mask)
+{
+	char buf[1024];
+
+	if (reset_only) {
+		if (!port_performance_rcv_sl_reset(pc, portid, port, mask, ibd_timeout))
+			IBERROR("perfslreset");
+		return;
+	}
+
+	if (!port_performance_rcv_sl_query(pc, portid, port, ibd_timeout))
+		IBERROR("perfslquery");
+
+	mad_dump_perfcounters_rcv_sl(buf, sizeof buf, pc, sizeof pc);
+	printf("# Port counters: %s port %d\n%s", portid2str(portid), port, buf);
+
+	if(reset)
+		if (!port_performance_rcv_sl_reset(pc, portid, port, mask, ibd_timeout))
+			IBERROR("perfslreset");
+}
+

 static int process_opt(void *context, int ch, char *optarg)
 {
@@ -315,6 +358,12 @@ static int process_opt(void *context, in
 	case 'x':
 		extended = 1;
 		break;
+	case 's':
+		xmt_sl = 1;
+		break;
+	case 'S':
+		rcv_sl = 1;
+		break;
 	case 'a':
 		all_ports++;
 		port = ALL_PORTS;
@@ -349,6 +398,8 @@ int main(int argc, char **argv)

 	const struct ibdiag_opt opts[] = {
 		{ "extended", 'x', 0, NULL, "show extended port counters" },
+		{ "xmtsl", 's', 0, NULL, "show Xmt SL port counters" },
+		{ "rcvsl", 'S', 0, NULL, "show Rcv SL port counters" },
 		{ "all_ports", 'a', 0, NULL, "show aggregated counters" },
 		{ "loop_ports", 'l', 0, NULL, "iterate through each port" },
 		{ "reset_after_read", 'r', 0, NULL, "reset counters after read" },
@@ -405,6 +456,16 @@ int main(int argc, char **argv)
 			all_ports_loop = 1;
 	}

+	if (xmt_sl) {
+		xmt_sl_query(&portid, port, mask);
+		exit(0);
+	}
+
+	if (rcv_sl) {
+		rcv_sl_query(&portid, port, mask);
+		exit(0);
+	}
+
 	if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) {
 		if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0)
 			IBERROR("smp query nodeinfo failed");


From ogerlitz at voltaire.com  Thu Feb 26 04:41:40 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Feb 2009 14:41:40 +0200 (IST)
Subject: [ofa-general] Re: [PATCH 2/2] perfquery: add
 PortXmtDataSL/PortRcvDataSL read and reset
In-Reply-To: <Pine.LNX.4.64.0902261438200.29061@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
	<Pine.LNX.4.64.0902261438200.29061@zuben.voltaire.com>
Message-ID: <Pine.LNX.4.64.0902261441010.29110@zuben.voltaire.com>

Hi Sasha,

For some reason the Xmt SL help is printed twice, any idea why?

Or.

./infiniband-diags/src/perfquery -h

Usage: ./infiniband-diags/src/perfquery [options]  [<lid|guid> [[port] [reset_mask]]]

Options:
  --extended, -x          show extended port counters
  --xmtsl, -s             show Xmt SL port counters
  --rcvsl, -S             show Rcv SL port counters
  --all_ports, -a         show aggregated counters
  --loop_ports, -l        iterate through each port
  --reset_after_read, -r  reset counters after read
  --Reset_only, -R        only reset counters
  --Ca, -C <ca>           Ca name to use
  --Port, -P <port>       Ca port number to use
  --Lid, -L               use LID address argument
  --Guid, -G              use GUID address argument
  --timeout, -t <ms>      timeout in ms
  --xmtsl, -s             show Xmt SL port counters
  --errors, -e            show send and receive errors
  --verbose, -v           increase verbosity level
  --debug, -d             raise debug level
  --usage, -u             usage message
  --help, -h              help message
  --version, -V           show version


From hal.rosenstock at gmail.com  Thu Feb 26 06:06:16 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 09:06:16 -0500
Subject: Re: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL
	/ PortRcvDataSL support
In-Reply-To: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
Message-ID: <f0e08f230902260606s499f4717ua9b664f1e5c0a4a1@mail.gmail.com>

On Thu, Feb 26, 2009 at 7:38 AM, Or Gerlitz <ogerlitz at voltaire.com> wrote:
> libibmad implementation of PortXmtDataSL (IBA A13.6.5) / PortRcvDataSL
> (IBA A13.6.6) reading and resetting
>
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>
> Index: management/libibmad/include/infiniband/mad.h
> ===================================================================
> --- management.orig/libibmad/include/infiniband/mad.h
> +++ management/libibmad/include/infiniband/mad.h
> @@ -153,7 +153,8 @@ enum GSI_ATTR_ID {
>        IB_GSI_PORT_SAMPLES_RESULT = 0x11,
>        IB_GSI_PORT_COUNTERS = 0x12,
>        IB_GSI_PORT_COUNTERS_EXT = 0x1D,
> -
> +       IB_GSI_PORT_XMIT_DATA_SL = 0x36,
> +       IB_GSI_PORT_RCV_DATA_SL  = 0x37,
>        IB_GSI_ATTR_LAST
>  };
>
> @@ -421,6 +422,28 @@ enum MAD_FIELDS {
>        IB_PC_XMT_WAIT_F,
>        IB_PC_LAST_F,
>
> +       IB_PC_XMT_DATA_SL_FIRST_F,
> +       IB_PC_XMT_DATA_SL0_F = IB_PC_XMT_DATA_SL_FIRST_F,
> +       IB_PC_XMT_DATA_SL1_F,
> +       IB_PC_XMT_DATA_SL2_F,
> +       IB_PC_XMT_DATA_SL3_F,
> +       IB_PC_XMT_DATA_SL4_F,
> +       IB_PC_XMT_DATA_SL5_F,
> +       IB_PC_XMT_DATA_SL6_F,
> +       IB_PC_XMT_DATA_SL7_F,
> +       IB_PC_XMT_DATA_SL_LAST_F,
> +
> +       IB_PC_RCV_DATA_SL_FIRST_F,
> +       IB_PC_RCV_DATA_SL0_F = IB_PC_RCV_DATA_SL_FIRST_F,
> +       IB_PC_RCV_DATA_SL1_F,
> +       IB_PC_RCV_DATA_SL2_F,
> +       IB_PC_RCV_DATA_SL3_F,
> +       IB_PC_RCV_DATA_SL4_F,
> +       IB_PC_RCV_DATA_SL5_F,
> +       IB_PC_RCV_DATA_SL6_F,
> +       IB_PC_RCV_DATA_SL7_F,
> +       IB_PC_RCV_DATA_SL_LAST_F,
> +

Any reason to restrict this to SL0-7 rather than the complete SL range ?

-- Hal

[snip...]


From purdy at sgi.com  Thu Feb 26 06:31:27 2009
From: purdy at sgi.com (Dale Purdy)
Date: Thu, 26 Feb 2009 08:31:27 -0600
Subject: [ofa-general] Re: [PATCH] opensm: Implement weighted routing
In-Reply-To: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com>
References: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com>
Message-ID: <20090226143127.GA28285@sgi.com>

On Thu, Feb 26, 2009 at 10:21:43AM +0530, Keshetti Mahesh wrote:
> Hello Dale Purdy,
> 
> I have a requirement to set some hop's weight factor to zero.
> Is this supported by your patch?
> I implemented something similar before, but it led to
> loops in the routing table. Does your patch take care of that?
> 
> -Mahesh

No, the accepted values for the hop weight are 1 - 0xff.  I suppose
one could allow a value of zero though.  Or one could raise the weight
factor for the other ports on the switch to a large value so that the
one you are trying to force traffic through is highly favored in
comparison.  Whenever you are manipulating the hop weight factors, you
better know what you are doing since it alters the behavior of the
routing engines and could then induce credit loops.  In our case we
are using this to separate MPI traffic from I/O traffic and at the
same time eliminate credit loops.

-- 
Dale


From ogerlitz at voltaire.com  Thu Feb 26 06:54:59 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Feb 2009 16:54:59 +0200
Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL /
	PortRcvDataSL support
In-Reply-To: <f0e08f230902260606s499f4717ua9b664f1e5c0a4a1@mail.gmail.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
	<f0e08f230902260606s499f4717ua9b664f1e5c0a4a1@mail.gmail.com>
Message-ID: <49A6AD43.4000706@voltaire.com>

Hal Rosenstock wrote:
> Any reason to restrict this to SL0-7 rather than the complete SL range?
>   
Not really, I can fix that.

Or.
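
A note on the SL range question above: an InfiniBand Service Level is a
4-bit value, so the complete range is SL0 through SL15. The following is
only a minimal sketch of how a contiguous run of per-SL fields could be
indexed. The base value 100, NUM_SL and the helper xmt_data_sl_field() are
assumptions made for illustration and are not taken from the patch.

```c
/* Hypothetical field numbering; the real values come from enum MAD_FIELDS. */
enum {
	NUM_SL = 16,			/* SL is a 4-bit value: SL0..SL15 */
	XMT_DATA_SL_FIRST_F = 100,	/* arbitrary base, for illustration only */
	XMT_DATA_SL_LAST_F = XMT_DATA_SL_FIRST_F + NUM_SL
};

/* Map an SL number to its field ID, or return -1 if the SL is out of range. */
static int xmt_data_sl_field(unsigned sl)
{
	if (sl >= NUM_SL)
		return -1;
	return XMT_DATA_SL_FIRST_F + (int)sl;
}
```

Keeping the per-SL fields contiguous means a dump routine can loop from the
first field to the last one instead of naming each SL explicitly.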


From dorfman.eli at gmail.com  Thu Feb 26 07:32:40 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 26 Feb 2009 17:32:40 +0200
Subject: [ofa-general] [PATCH 1/2] include/opensm/osm_opensm.h add
 setup function to routing engine.
Message-ID: <49A6B618.1090300@gmail.com>

 add setup function to routing engine.
 call it only when we want to use this routing engine.
 this will save allocation for routing algorithms that are not used.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/include/opensm/osm_opensm.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h
index c121be4..6191530 100644
--- a/opensm/include/opensm/osm_opensm.h
+++ b/opensm/include/opensm/osm_opensm.h
@@ -122,6 +122,8 @@ typedef enum _osm_routing_engine_type {
 struct osm_routing_engine {
 	const char *name;
 	void *context;
+	int initialized;
+	int (*setup) (void *re, void *p_osm);
 	int (*build_lid_matrices) (void *context);
 	int (*ucast_build_fwd_tables) (void *context);
 	void (*ucast_dump_tables) (void *context);
-- 
1.5.5


From dorfman.eli at gmail.com  Thu Feb 26 07:36:11 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 26 Feb 2009 17:36:11 +0200
Subject: [ofa-general] [PATCH 2/2] opensm: setup routing engine when in use
 and delete when fail
In-Reply-To: <49A6B618.1090300@gmail.com>
References: <49A6B618.1090300@gmail.com>
Message-ID: <49A6B6EB.80700@gmail.com>

 setup routing engine when in use and delete when fail
 setup routing engine before use.
 delete resources when routing algorithm fails
 this will save allocation for routing algorithms that are not used.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_opensm.c    |   20 ++++++--------------
 opensm/opensm/osm_ucast_mgr.c |   34 +++++++++++++++++++++++++++++++++-
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c
index 7de2e5b..a2620d5 100644
--- a/opensm/opensm/osm_opensm.c
+++ b/opensm/opensm/osm_opensm.c
@@ -169,21 +169,14 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name)
 			memset(re, 0, sizeof(struct osm_routing_engine));
 
 			re->name = m->name;
-			if (m->setup(re, osm)) {
-				OSM_LOG(&osm->log, OSM_LOG_VERBOSE,
-					"setup of routing"
-					" engine \'%s\' failed\n", name);
-				return;
-			}
-			OSM_LOG(&osm->log, OSM_LOG_DEBUG,
-				"\'%s\' routing engine set up\n", re->name);
+			re->setup = m->setup;
 			append_routing_engine(osm, re);
 			return;
 		}
 	}
 
 	OSM_LOG(&osm->log, OSM_LOG_ERROR,
-		"cannot find or setup routing engine \'%s\'", name);
+		"cannot find or setup routing engine \'%s\'\n", name);
 }
 
 static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names)
@@ -224,18 +217,17 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm)
 
 /**********************************************************************
  **********************************************************************/
-static void destroy_routing_engines(osm_opensm_t *osm)
+static void destroy_routing_engines(struct osm_routing_engine **re)
 {
 	struct osm_routing_engine *r, *next;
 
-	next = osm->routing_engine_list;
+	next = *re;
 	while (next) {
 		r = next;
 		next = r->next;
-		if (r->delete)
-			r->delete(r->context);
 		free(r);
 	}
+	*re = NULL;
 }
 
 /**********************************************************************
@@ -289,7 +281,7 @@ void osm_opensm_destroy(IN osm_opensm_t * const p_osm)
 
 	/* do the destruction in reverse order as init */
 	destroy_plugins(p_osm);
-	destroy_routing_engines(p_osm);
+	destroy_routing_engines(&p_osm->routing_engine_list);
 	osm_sa_destroy(&p_osm->sa);
 	osm_sm_destroy(&p_osm->sm);
 #ifdef ENABLE_OSM_PERF_MGR
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index e404c91..7175926 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -886,7 +886,6 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 
 	p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl;
 	p_osm = p_mgr->p_subn->p_osm;
-	p_routing_eng = p_osm->routing_engine_list;
 
 	CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock);
 
@@ -897,10 +896,30 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 	    ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0)
 		goto Exit;
 
+	/* update the entry in active list */
+
 	p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE;
+	p_routing_eng = p_osm->routing_engine_list;
 	while (p_routing_eng) {
+		if (!p_routing_eng->initialized && 
+			p_routing_eng->setup(p_routing_eng, p_osm)) {
+			OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE,
+				"setup of routing engine \'%s\' failed\n", 
+					p_routing_eng->name);
+			p_routing_eng = p_routing_eng->next;
+			continue;
+		}
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"\'%s\' routing engine set up\n", p_routing_eng->name);
+		p_routing_eng->initialized = 1;
+
 		if (!ucast_mgr_route(p_routing_eng, p_osm))
 			break;
+
+		/* delete unused routing engine */
+		if (p_routing_eng->delete)
+			p_routing_eng->delete(p_routing_eng->context);
+
 		p_routing_eng = p_routing_eng->next;
 	}
 
@@ -911,6 +930,19 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 		p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP;
 	}
 
+	/* if for some reason different routing engine is used */
+	/* cleanup unused routing engine */
+	p_routing_eng = p_osm->routing_engine_list;
+	while (p_routing_eng) {
+		if (p_routing_eng->initialized &&
+			p_osm->routing_engine_used != 
+				osm_routing_engine_type(p_routing_eng->name) &&
+			p_routing_eng->delete) 
+			p_routing_eng->delete(p_routing_eng->context);
+
+		p_routing_eng = p_routing_eng->next;
+	}
+
 	OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
 		"%s tables configured on all switches\n",
 		osm_routing_engine_type_str(p_osm->routing_engine_used));
-- 
1.5.5
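
The idea behind the patch above (set up a routing engine only when it is
about to be tried, then fall through to the next one on failure) can be
sketched in isolation. This is not the OpenSM code: struct engine,
pick_engine() and the stub callbacks are hypothetical names, and only the
"initialized" and "setup" fields mirror what the patch adds.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down engine record for illustration. */
struct engine {
	const char *name;
	int initialized;
	int (*setup)(struct engine *e);	/* returns 0 on success */
	int (*route)(struct engine *e);	/* returns 0 on success */
	struct engine *next;
};

/* Stub callbacks so the sketch can be exercised without a fabric. */
static int setup_calls;
static int setup_ok(struct engine *e) { (void)e; setup_calls++; return 0; }
static int route_ok(struct engine *e) { (void)e; return 0; }
static int route_fail(struct engine *e) { (void)e; return -1; }

/*
 * Walk the list, setting up each engine only when it is about to be
 * tried, and return the first one that routes successfully (NULL if
 * none does). Engines further down the list are never set up.
 */
static struct engine *pick_engine(struct engine *list)
{
	struct engine *e;

	for (e = list; e; e = e->next) {
		if (!e->initialized) {
			if (e->setup(e))
				continue;	/* setup failed, try the next engine */
			e->initialized = 1;
		}
		if (!e->route(e))
			return e;
	}
	return NULL;
}
```

With two engines in the list and the first one routing successfully, the
second engine's setup callback is never invoked, which is the allocation
saving the commit message describes.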


From dorfman.eli at gmail.com  Thu Feb 26 07:43:31 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 26 Feb 2009 17:43:31 +0200
Subject: [ofa-general] [PATCH 1/2] include/opensm/osm_opensm.h
	support routing engine update
Message-ID: <49A6B8A3.2020703@gmail.com>

 support routing engine update.
 add prev routing engine list.
 save active routing engine list as prev routing engine list.
 this is used to cleanup used routing engine allocation if needed
 and only after new routing engine was configured.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/include/opensm/osm_opensm.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h
index 6191530..c8b91a0 100644
--- a/opensm/include/opensm/osm_opensm.h
+++ b/opensm/include/opensm/osm_opensm.h
@@ -185,6 +185,7 @@ typedef struct osm_opensm {
 	cl_dispatcher_t disp;
 	cl_plock_t lock;
 	struct osm_routing_engine *routing_engine_list;
+	struct osm_routing_engine *prev_routing_engine_list;
 	osm_routing_engine_type_t routing_engine_used;
 	osm_stats_t stats;
 	osm_console_t console;
@@ -525,5 +526,7 @@ extern volatile unsigned int osm_exit_flag;
 *  Set to one to cause all threads to leave
 *********/
 
+void update_routing_engines(osm_opensm_t *osm, const char *engine_names);
+
 END_C_DECLS
 #endif				/* _OSM_OPENSM_H_ */
-- 
1.5.5


From dorfman.eli at gmail.com  Thu Feb 26 07:49:02 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 26 Feb 2009 17:49:02 +0200
Subject: [ofa-general] [PATCH 2/2] opensm routing engine update
In-Reply-To: <49A6B8A3.2020703@gmail.com>
References: <49A6B8A3.2020703@gmail.com>
Message-ID: <49A6B9EE.7000008@gmail.com>

 support routing engine update.
 save active routing engine list as prev routing engine list.
 this is used to cleanup used routing engine allocation if needed
 and only after new routing engine was configured.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_opensm.c    |    9 +++++++++
 opensm/opensm/osm_subnet.c    |   10 +++++++++-
 opensm/opensm/osm_ucast_mgr.c |   22 +++++++++++++++++++++-
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c
index a2620d5..6ab28be 100644
--- a/opensm/opensm/osm_opensm.c
+++ b/opensm/opensm/osm_opensm.c
@@ -230,6 +230,15 @@ static void destroy_routing_engines(struct osm_routing_engine **re)
 	*re = NULL;
 }
 
+void update_routing_engines(osm_opensm_t *osm, const char *engine_names)
+{
+	/* cleanup prev routing engine list and replace with current list */
+	destroy_routing_engines(&osm->prev_routing_engine_list);
+	osm->prev_routing_engine_list = osm->routing_engine_list;
+	osm->routing_engine_list = NULL;
+	setup_routing_engines(osm, engine_names);
+}
+
 /**********************************************************************
  **********************************************************************/
 static void destroy_plugins(osm_opensm_t *osm)
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index b3100a4..1ba5c91 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -151,6 +151,14 @@ static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val)
 	osm_set_sm_priority(p_sm, sm_priority);
 }
 
+static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val)
+{
+	osm_opensm_t *p_osm = p_subn->p_osm;
+	char *engines = (char *) p_val;
+
+	update_routing_engines(p_osm, engines);
+}
+
 static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key,
 			     IN char *p_val_str, void *p_v1, void *p_v2,
 			     void (*pfn)(osm_subn_t *, void *))
@@ -324,7 +332,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_parse_charp, NULL, 0 },
 	{ "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_parse_boolean, NULL, 1 },
 	{ "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_parse_boolean, NULL, 1 },
-	{ "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, NULL, 0 },
+	{ "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, opts_setup_routing_engine, 1 },
 	{ "connect_roots", OPT_OFFSET(connect_roots), opts_parse_boolean, NULL, 1 },
 	{ "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_parse_boolean, NULL, 1 },
 	{ "log_file", OPT_OFFSET(log_file), opts_parse_charp, NULL, 0 },
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 7175926..cda9f34 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -879,7 +879,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t *osm)
 int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 {
 	osm_opensm_t *p_osm;
-	struct osm_routing_engine *p_routing_eng;
+	struct osm_routing_engine *p_routing_eng, *r;
 	cl_qmap_t *p_sw_guid_tbl;
 
 	OSM_LOG_ENTER(p_mgr->p_log);
@@ -896,6 +896,26 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 	    ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0)
 		goto Exit;
 
+	/* find used routing engine in previous list */
+	r = p_osm->prev_routing_engine_list;
+	while (r) {
+		if (p_osm->routing_engine_used == 
+			osm_routing_engine_type(r->name))
+		{
+			p_routing_eng = p_osm->routing_engine_list;
+			while (p_routing_eng) {
+				if (p_osm->routing_engine_used == 
+					osm_routing_engine_type(p_routing_eng->name)) {
+					memcpy(p_routing_eng, r, sizeof(*p_routing_eng));
+					break;
+				}
+				p_routing_eng = p_routing_eng->next;
+			}
+			break;
+		}
+		r = r->next;
+	}
+
 	/* update the entry in active list */
 
 	p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE;
-- 
1.5.5


From ramachandra.kuchimanchi at qlogic.com  Thu Feb 26 08:38:38 2009
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 26 Feb 2009 22:08:38 +0530 (IST)
Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation
Message-ID: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>

Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.
In the 8-bit field of the RMPP header, the first 5 bits
are RRespTime and next 3 bits are RMPPFlags. Hence to retain
the first 5 bits, the mask should be 0xF8 instead of 0xF1.

Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
---

diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 5f6c40f..1a0f409 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -290,7 +290,7 @@ static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime)
  */
 static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags)
 {
-	rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) |
+	rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF8) |
 				     (flags & 0x7);
 }
 

From jackm at dev.mellanox.co.il  Thu Feb 26 08:49:40 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Feb 2009 18:49:40 +0200
Subject: [ofa-general] Re: Problem in IB network without Switch
In-Reply-To: <BAY101-W3170561A2230014F98DBC1B8AD0@phx.gbl>
References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
	<200902261335.59927.jackm@dev.mellanox.co.il>
	<BAY101-W3170561A2230014F98DBC1B8AD0@phx.gbl>
Message-ID: <200902261849.40448.jackm@dev.mellanox.co.il>

You are running VERY old firmware (from 2004), and moreover, on one host
you have 3.0.0, and on the other 3.1.0.

You need to upgrade your firmware.
Contact your Mellanox FAE (support engineer) for instructions.

- Jack

>  Hi Jack,
> 
> Please find the output of ibstat on both nodes.
> 
> [root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed 
> HCA Firmware Check ..................... FAIL
>     REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917)
> Host Driver Initialization ............. PASS
> 
> [root at mattool ~]# 
> 
> ************ IBSTAT output ******************
> 
> 
> [root at mattool ~]# ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.1.0

> [root at compute-0-0 ~]# ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.0.0


From cameron at harr.org  Thu Feb 26 09:19:39 2009
From: cameron at harr.org (Cameron Harr)
Date: Thu, 26 Feb 2009 10:19:39 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A57256.2000005@harr.org>
References: <48E386F6.5040502@fusionio.com> <48ECEA4D.7080504@harr.org>
	<48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net>
	<48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net>
	<48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net>
	<48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org>
	<490210EE.2070000@vlnb.net> <49022553.1020804@harr.org>
	<490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org>
	<4911D827.10705@vlnb.net> <49121715.4040804@harr.org>
	<4912C684.5000505@vlnb.net> <491307C7.50008@harr.org>
	<49131A85.2010102@vlnb.net> <49189567.1010804@harr.org>
	<49258122.6040808@vlnb.net> <496687DA.6010707@harr.org>
	<496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org>
	<496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org>
	<496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org>
	<4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org>
	<49A57256.2000005@harr.org>
Message-ID: <49A6CF2B.4010002@harr.org>

Cameron Harr wrote:
> Cameron Harr wrote:
> I re-compiled and re-ran the tests and numbers are a little better but 
> performance still seems to have gone down from 673:
> Test 1:373751.66
> Test 2:371242.6067
> Test 3:347988.1467
> Test 4:378247.31
> Test 5:375616.53
I was curious and did a regression test with 673 and those numbers are 
now even worse, so I'll presume there is an issue on my system and not 
the SCST code:
Test 1:365204.3067
Test 2:364152.2067
Test 3:340665.7633
Test 4:369916.8133
Test 5:369093.5833


From hal.rosenstock at gmail.com  Thu Feb 26 10:02:53 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 13:02:53 -0500
Subject: [ofa-general] Re: [ewg] [PATCH] ib_mad: Fix RMPP header RRespTime
	manipulation
In-Reply-To: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
Message-ID: <f0e08f230902261002r4ec2763bvf7ac9992e21605@mail.gmail.com>

On Thu, Feb 26, 2009 at 11:38 AM, Ramachandra K
<ramachandra.kuchimanchi at qlogic.com> wrote:
> Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.
> In the 8-bit field of the RMPP header, the first 5 bits
> are RRespTime and next 3 bits are RMPPFlags. Hence to retain
> the first 5 bits, the mask should be 0xF8 instead of 0xF1.
>
> Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
> ---
>
> diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
> index 5f6c40f..1a0f409 100644
> --- a/include/rdma/ib_mad.h
> +++ b/include/rdma/ib_mad.h
> @@ -290,7 +290,7 @@ static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime)
>  */
>  static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags)
>  {
> -       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) |
> +       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF8) |

Looks right to me. Sean ?

-- Hal

>                                     (flags & 0x7);
>  }


From sean.hefty at intel.com  Thu Feb 26 10:07:36 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 10:07:36 -0800
Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime
	manipulation
In-Reply-To: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
Message-ID: <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com>

>Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.
>In the 8-bit field of the RMPP header, the first 5 bits
>are RRespTime and next 3 bits are RMPPFlags. Hence to retain
>the first 5 bits, the mask should be 0xF8 instead of 0xF1.
>
>Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>

Good catch.

Acked-by: Sean Hefty <sean.hefty at intel.com>
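
For readers following the mask arithmetic, here is a minimal standalone
sketch of the difference between the two masks. It does not use the kernel
header: set_flags_old() and set_flags_fixed() are hypothetical helpers that
operate on a bare byte laid out as the patch description states (upper 5
bits RRespTime, lower 3 bits RMPPFlags).

```c
#include <stdint.h>

/*
 * Old mask: 0xF1 is 1111 0001. It clears bit 3, which is the lowest
 * RRespTime bit, and keeps bit 0, which is a stale flag bit.
 */
static uint8_t set_flags_old(uint8_t rtime_flags, uint8_t flags)
{
	return (uint8_t)((rtime_flags & 0xF1) | (flags & 0x7));
}

/*
 * Fixed mask: 0xF8 is 1111 1000. It keeps all five RRespTime bits and
 * clears all three flag bits before the new flags are OR-ed in.
 */
static uint8_t set_flags_fixed(uint8_t rtime_flags, uint8_t flags)
{
	return (uint8_t)((rtime_flags & 0xF8) | (flags & 0x7));
}
```

Starting from a byte of 0xFF (RRespTime 31, all flags set) and writing
flags 0x2, the fixed version yields 0xFA (RRespTime 31, flags 2), while the
old version yields 0xF3 (RRespTime 30, flags 3).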


From sean.hefty at intel.com  Thu Feb 26 10:13:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 10:13:00 -0800
Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL
	/	PortRcvDataSL support
In-Reply-To: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
Message-ID: <3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com>

>+MAD_EXPORT uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t *
>dest,
>+                                              int port, unsigned timeout);
>+MAD_EXPORT uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t *
>dest,
>+                                              int port, unsigned timeout);
>+MAD_EXPORT uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t *
>dest,
>+                                              int port, unsigned mask,
>+                                              unsigned timeout);
>+MAD_EXPORT uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t *
>dest,
>+                                              int port, unsigned mask,
>+                                              unsigned timeout);
> MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *
>dest,
>                                               int port, unsigned timeout);
> MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *
>dest,

{snip}

>+uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest, int
>port,
>+                                       unsigned timeout)
>+{
>+       return pma_query(rcvbuf, dest, port, timeout,
>IB_GSI_PORT_XMIT_DATA_SL);
>+}
>+
>+uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest, int
>port,
>+                                       unsigned timeout)
>+{
>+       return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_RCV_DATA_SL);
>+}
>+
> uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>                                        int port, unsigned mask,
>                                        unsigned timeout, const void *srcport)
>@@ -208,6 +220,20 @@ uint8_t *port_performance_ext_reset(void
>                                 IB_GSI_PORT_COUNTERS_EXT);
> }
>
>+uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest, int
>port,
>+                                   unsigned mask, unsigned timeout)
>+{
>+       return performance_reset(rcvbuf, dest, port, mask, timeout,
>+                                IB_GSI_PORT_XMIT_DATA_SL);
>+}
>+
>+uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest, int
>port,
>+                                   unsigned mask, unsigned timeout)
>+{
>+       return performance_reset(rcvbuf, dest, port, mask, timeout,
>+                                IB_GSI_PORT_RCV_DATA_SL);
>+}
>+

Rather than continue to add more and more interfaces to the library, can we just
export a couple of more generic calls?

- Sean
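
One possible shape for such a generic call is sketched below, purely as an
illustration of the suggestion. pma_query_attr() and fake_pma_query() are
hypothetical names and not part of libibmad; the destination port ID
argument is omitted to keep the sketch self-contained; and the stub only
records the attribute ID instead of sending a MAD. The two attribute IDs
are the values given in the quoted patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Attribute IDs as given in the quoted patch. */
enum {
	IB_GSI_PORT_XMIT_DATA_SL = 0x36,
	IB_GSI_PORT_RCV_DATA_SL = 0x37
};

/*
 * Hypothetical stand-in for the library's internal query routine.
 * It only writes the attribute ID into the buffer.
 */
static uint8_t *fake_pma_query(void *rcvbuf, int port, unsigned timeout,
			       unsigned id)
{
	uint8_t *buf = rcvbuf;

	(void)port;
	(void)timeout;
	buf[0] = (uint8_t)id;
	return buf;
}

/*
 * Hypothetical generic entry point: a single exported call where the
 * caller selects the attribute, instead of one wrapper per attribute.
 */
static uint8_t *pma_query_attr(void *rcvbuf, int port, unsigned timeout,
			       unsigned attr_id)
{
	if (!rcvbuf)
		return NULL;
	return fake_pma_query(rcvbuf, port, timeout, attr_id);
}
```

Under this shape, supporting a new counter attribute would need a new
attribute ID and dump routine but no new exported query function.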


From sean.hefty at intel.com  Thu Feb 26 10:19:44 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 10:19:44 -0800
Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <20090226101144.GB11192@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
Message-ID: <5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com>

I'll give your 2 patches a try.  Is there a way that you can enable type
mismatch warnings?  If so, that would probably show the same issues.

- Sean


From Bill.Boas at openfabrics.org  Thu Feb 26 10:06:18 2009
From: Bill.Boas at openfabrics.org (Bill Boas)
Date: Thu, 26 Feb 2009 10:06:18 -0800
Subject: [ofa-general] Sun is asking for a response on Bug No.s 1522, 1523
In-Reply-To: <897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM>
References: <1235592409.22158.80.camel@pc.interlinx.bc.ca>
	<897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM>
Message-ID: <E70D6AD7A0514A038D7DB2DC630E0EB9@BillGWAYLAPTOP>

Dear OFA Maintainers:

 
Sun has contacted me about these bugs and is asking for priority action to
get fixes for them. See the thread below.

 
I'm sending this email to 3 OFA lists because Sun having to contact the
Alliance this way raises many questions about how a vendor like Sun or a
customer like, say, a Wall St bank gets bugs fixed and where do we, the OFA,
publish this information?

 
And if we agree that an OFA maintainer is the right person to fix the bug
how does that maintainer learn that and how do they respond?

 
Thank you for responding to this email and to the Sun team using OFED in
their products.

 
Bill. 

 
Bill Boas

Executive Director and Vice Chair

OpenFabrics Alliance

510-375-8840

Bill.Boas at openfabrics.org

www.openfabrics.org

 
  _____  

From: Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM] 
Sent: Wednesday, February 25, 2009 12:20 PM
To: Bill Boas
Cc: Peter Jones; Brian J. Murrell; Makia Minich
Subject: Fwd: OFED 1.4 for NHM chips

 
Hi Bill,

I found the bug numbers for the Lustre build issues against OFED 1.4.0 that
I mentioned yesterday.

Is there any way to bump up the priority on these? This is blocking our
ability to deliver our new Vayu (Nehalem+QDR) hardware to our customers,
since Mellanox says they'll only support OFED 1.4 with QDR hardware.

 
Thanks,

Bryon

 
Begin forwarded message:


From: "Brian J. Murrell" <Brian.Murrell at Sun.COM>

Date: February 25, 2009 1:06:49 PM MST

To: Bryon Neitzel <Bryon.Neitzel at Sun.COM>

Cc: Peter Jones <Peter.A.Jones at Sun.COM>

Subject: Re: OFED 1.4 for NHM chips

 
On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote:


Hi Brian,   what is the website where these OFED bugs were filed?    


https://bugs.openfabrics.org/show_bug.cgi?id=1522
https://bugs.openfabrics.org/show_bug.cgi?id=1523


Bill Boas tried to find any OFED bugs opened by Sun yesterday,


Yeah.  I opened them a number of days ago.

Not sure if it matters to anyone, but IMHO, 1523 is the correct future
direction and fixing that would implicitly fix 1522.  1523 basically
describes separating the "technology preview" that caused this breakage
out into its own independent module so that it does not pollute the
OFED core.  As such, it might be resisted as it's a "deeper cut" type of
fix, but one that removes the sick part rather than trying to continue
to bandage it.

b.

 

From sashak at voltaire.com  Thu Feb 26 11:06:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 21:06:39 +0200
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com>
Message-ID: <20090226190639.GI14238@sashak.voltaire.com>

On 10:19 Thu 26 Feb     , Sean Hefty wrote:
> I'll give your 2 patches a try.  Is there a way that you can enable type
> mismatch warnings?  If so, that would probably show the same issues.

Actually yes, I can use -Wsign-compare -Wconversion (although that produces
many more warnings :)).
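
As an illustration of the kind of mismatch -Wsign-compare reports, here is a minimal sketch. The function names are hypothetical and not taken from infiniband-diags; the point is only that comparing an int with an unsigned value converts the int to unsigned first.

```c
/*
 * Hypothetical helpers, for illustration only.
 * naive_less() compiles silently by default; with -Wsign-compare
 * gcc warns about the mixed signed/unsigned comparison.
 */
static int naive_less(int a, unsigned b)
{
	return a < b;	/* a is converted to unsigned before comparing */
}

/* Handles the negative case explicitly, then compares like types. */
static int safe_less(int a, unsigned b)
{
	return a < 0 || (unsigned) a < b;
}
```

With a = -1 and b = 1 the naive version reports "not less", because -1 becomes a very large unsigned value.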

Sasha


From robert.j.woodruff at intel.com  Thu Feb 26 11:05:49 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 26 Feb 2009 11:05:49 -0800
Subject: [ofa-general]
	RE: Sun is asking for a response on Bug No.s 1522, 1523
In-Reply-To: <E70D6AD7A0514A038D7DB2DC630E0EB9@BillGWAYLAPTOP>
References: <1235592409.22158.80.camel@pc.interlinx.bc.ca>
	<897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM>
	<E70D6AD7A0514A038D7DB2DC630E0EB9@BillGWAYLAPTOP>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com>

Bill, since the bug was submitted as P3 and normal,
rather than P1 as critical or blocker, it was probably overlooked as something
that had to be fixed in 1.4. I changed the bug to a blocker P1 bug for OFED 1.4.1,
so it should now show up on the bug tracking list that gets reviewed in the EWG.

In addition to entering the bug into bugzilla, people may also want to send
an email to the maintainer for blocker bugs, so that they know it needs quick
attention. It is assigned to Jeff Becker, the NFS/RDMA maintainer.
I know that there were some issues with backports for NFS/RDMA in 1.4 and
believe he is trying to get these fixed for OFED 1.4.1.

woody


________________________________
From: Bill Boas [mailto:Bill.Boas at openfabrics.org]
Sent: Thursday, February 26, 2009 10:06 AM
To: general at lists.openfabrics.org; ewg at lists.openfabrics.org; wwg at lists.openfabrics.org
Cc: 'Peter Jones'; 'Brian J. Murrell'; 'Makia Minich'; Bryon.Neitzel at Sun.COM
Subject: Sun is asking for a response on Bug No.s 1522, 1523

Dear OFA Maintainers:

Sun has contacted me about these bugs and is asking for priority action to get fixes for them. See the thread below.

I'm sending this email to 3 OFA lists because Sun having to contact the Alliance this way raises many questions about how a vendor like Sun or a customer like, say, a Wall St bank gets bugs fixed and where do we, the OFA, publish this information?

And if we agree that an OFA maintainer is the right person to fix the bug how does that maintainer learn that and how do they respond?

Thank you for responding to this email and to the Sun team using OFED in their products.

Bill.


Bill Boas

Executive Director and Vice Chair

OpenFabrics Alliance

510-375-8840

Bill.Boas at openfabrics.org<mailto:Bill.Boas at openfabrics.org>

www.openfabrics.org

________________________________
From: Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM]
Sent: Wednesday, February 25, 2009 12:20 PM
To: Bill Boas
Cc: Peter Jones; Brian J. Murrell; Makia Minich
Subject: Fwd: OFED 1.4 for NHM chips

Hi Bill,
I found the bug numbers for the Lustre build issues against OFED 1.4.0 that I mentioned yesterday.
Is there any way to bump up the priority on these? This is blocking our ability to deliver our new Vayu (Nehalem+QDR) hardware to our customers, since Mellanox says they'll only support OFED 1.4 with QDR hardware.

Thanks,
Bryon


Begin forwarded message:


From: "Brian J. Murrell" <Brian.Murrell at Sun.COM<mailto:Brian.Murrell at Sun.COM>>
Date: February 25, 2009 1:06:49 PM MST
To: Bryon Neitzel <Bryon.Neitzel at Sun.COM<mailto:Bryon.Neitzel at Sun.COM>>
Cc: Peter Jones <Peter.A.Jones at Sun.COM<mailto:Peter.A.Jones at Sun.COM>>
Subject: Re: OFED 1.4 for NHM chips

On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote:

Hi Brian,   what is the website where these OFED bugs were filed?

https://bugs.openfabrics.org/show_bug.cgi?id=1522
https://bugs.openfabrics.org/show_bug.cgi?id=1523


Bill Boas tried to find any OFED bugs opened by Sun yesterday,

Yeah.  I opened them a number of days ago.

Not sure if it matters to anyone, but IMHO, 1523 is the correct future
direction and fixing that would implicitly fix 1522.  1523 basically
describes separating the "technology preview" that caused this breakage
out into its own independent module so that it does not pollute the
OFED core.  As such, it might be resisted as it's a "deeper cut" type of
fix, but one that removes the sick part rather than trying to continue
to bandage it.

b.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090226/b7cd45a2/attachment.html>

From ramachandra.kuchimanchi at qlogic.com  Thu Feb 26 11:09:27 2009
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Fri, 27 Feb 2009 00:39:27 +0530
Subject: ***SPAM*** RE: [ofa-general] [PATCH] ib_mad: Fix RMPP header
	RRespTime manipulation
In-Reply-To: <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com>
References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
	<2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com>
Message-ID: <71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com>

On Thu, Feb 26, 2009 at 11:37 PM, Sean Hefty <sean.hefty at intel.com> wrote:
>>Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime.
>>In the 8-bit field of the RMPP header, the first 5 bits
>>are RRespTime and next 3 bits are RMPPFlags. Hence to retain
>>the first 5 bits, the mask should be 0xF8 instead of 0xF1.
>>
>>Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
>
> Good catch.
>
> Acked-by: Sean Hefty <sean.hefty at intel.com>
>

Just to add some more information -
drivers/infiniband/core/mad_rmpp.c:ack_recv()--->format_ack() calls
ib_set_rmpp_flags() and due to the incorrect ANDing with 0xF1,
RRespTime got changed incorrectly and RMPP
Acks sent back always had a RRespTime of 0x1E (30) which caused the
other end to consider the time outs to be
approximately 4297 seconds (i.e. in the order of 4*2^30) instead of
the usual ~4 seconds (order of 4*2^20).
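
The effect of the two masks can be sketched in a few lines. This is not the kernel code: set_flags() and get_resptime() are hypothetical stand-ins, and the starting RRespTime of 0x1F is an assumption chosen because it reproduces the 0x1E value described above.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the flag update: RRespTime occupies the
 * upper 5 bits of the byte, RMPPFlags the lower 3. keep_mask selects
 * which of the old bits survive the update.
 */
static uint8_t set_flags(uint8_t rtime_flags, uint8_t flags, uint8_t keep_mask)
{
	return (uint8_t) ((rtime_flags & keep_mask) | (flags & 0x07));
}

/* Extract the 5-bit RRespTime from the combined byte. */
static uint8_t get_resptime(uint8_t rtime_flags)
{
	return (uint8_t) (rtime_flags >> 3);
}
```

Starting from 0xF8 (RRespTime 0x1F, no flags), keep_mask 0xF1 clears bit 3 and leaves RRespTime at 0x1E, while keep_mask 0xF8 preserves 0x1F.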

Regards,
Ram


From vst at vlnb.net  Thu Feb 26 11:49:51 2009
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Thu, 26 Feb 2009 22:49:51 +0300
Subject: [Scst-devel]	[ofa-general]	SRP/mlx4	interrupts	throttling	performance
In-Reply-To: <49A6CF2B.4010002@harr.org>
References: <48E386F6.5040502@fusionio.com>	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vlnb.net>	<4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net>	<49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net>	<49A4812A.8050202@harr.org>
	<49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
Message-ID: <49A6F25F.8060306@vlnb.net>

Cameron Harr, on 02/26/2009 08:19 PM wrote:
> Cameron Harr wrote:
>> Cameron Harr wrote:
>> I re-compiled and re-ran the tests and numbers are a little better but 
>> performance still seems to have gone down from 673:
>> Test 1:373751.66
>> Test 2:371242.6067
>> Test 3:347988.1467
>> Test 4:378247.31
>> Test 5:375616.53
> I was curious and did a regression test with 673 and those numbers are 
> now even worse, so I'll presume there is an issue on my system and not 
> the SCST code:
> Test 1:365204.3067
> Test 2:364152.2067
> Test 3:340665.7633
> Test 4:369916.8133
> Test 5:369093.5833

It's known that any OS, including Linux, gets "tired" under load as
time since boot grows, which leads to worse performance. I guess you
may be seeing that effect.

Check with r634. r635 has a change related to cache locality in data
structures, which was intended to improve performance a bit, but might
make it worse instead.

Vlad


From sean.hefty at intel.com  Thu Feb 26 12:07:58 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 12:07:58 -0800
Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <20090226101144.GB11192@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
Message-ID: <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>

Both of your patches (ibnetdiscover and ibroute) build on winof.  I replaced my
2 patches with yours, updated to the latest codebase, and pushed everything:

git://git.openfabrics.org/~shefty/ib-mgmt.git master

Were there changes to the other patches that you wanted (including saquery,
which wasn't part of the numbered series)?

- Sean


From sashak at voltaire.com  Thu Feb 26 13:02:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 23:02:19 +0200
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
Message-ID: <20090226210211.GK14238@sashak.voltaire.com>

On 12:07 Thu 26 Feb     , Sean Hefty wrote:
> Both of your patches (ibnetdiscover and ibroute) build on winof.  I replaced my
> 2 patches with yours, updated to the latest codebase, and pushed everything:
> 
> git://git.openfabrics.org/~shefty/ib-mgmt.git master
> 
> Were there changes to the other patches that you wanted (including saquery,
> which wasn't part of the numbered series)?

Thanks. I applied everything except ibsysstat.c and saquery.c. Wanted to
clarify some things there:

> From 49f28a63589be21dd7218922ed9d0b2b719a92c2 Mon Sep 17 00:00:00 2001
> From: Sean Hefty <sean.hefty at intel.com>
> Date: Thu, 26 Feb 2009 10:12:07 -0800
> Subject: [PATCH 1/2] [ib-diag] ibsysstat: add support for WinOF
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
>  infiniband-diags/src/ibsysstat.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
> index cc1418d..b9f2f85 100644
> --- a/infiniband-diags/src/ibsysstat.c
> +++ b/infiniband-diags/src/ibsysstat.c
> @@ -183,7 +183,7 @@ static char *ibsystat_serv(void)
>  
>  		DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod);
>  
> -		size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS,
> +		size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS,

What is the reason for such void * to char * casting?

>  				sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS);
>  
>  		if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0)
> @@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr)
>  {
>  	ib_rpc_t rpc = { 0 };
>  	int fd, agent, timeout, len;
> -	void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
> +	void *data = (char *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;

Ditto.

>  
>  	DEBUG("Sysstat ping..");
>  
> @@ -318,7 +318,7 @@ int main(int argc, char **argv)
>  	const struct ibdiag_opt opts[] = {
>  		{ "oui", 'o', 1, NULL, "use specified OUI number" },
>  		{ "Server", 'S', 0, NULL, "start in server mode" },
> -		{ }
> +		{ 0 }
>  	};
>  	char usage_args[] = "<dest lid|guid> [<op>]";
>  
> -- 
> 1.6.1.2.319.gbd9e
> 
> 
> From 1b9685769339891670df6d9af66e9933794be8a0 Mon Sep 17 00:00:00 2001
> From: Sean Hefty <sean.hefty at intel.com>
> Date: Thu, 26 Feb 2009 10:12:29 -0800
> Subject: [PATCH 2/2] [ib-diag] saquery: add support for WinOF
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> ---
>  infiniband-diags/src/saquery.c |   80 ++++++++++++++++++++++------------------
>  1 files changed, 44 insertions(+), 36 deletions(-)
> 
> diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
> index bcd1f61..4a5cfb8 100644
> --- a/infiniband-diags/src/saquery.c
> +++ b/infiniband-diags/src/saquery.c
> @@ -37,20 +37,25 @@
>   *
>   */
>  
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif /* HAVE_CONFIG_H */
> +
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <arpa/inet.h>
>  #include <ctype.h>
>  #include <string.h>
>  #include <errno.h>
> +#include <assert.h>
>  
>  #define _GNU_SOURCE
>  #include <getopt.h>
>  
>  #include <infiniband/umad.h>
>  #include <infiniband/mad.h>
> -#include <infiniband/iba/ib_types.h>
> -#include <infiniband/complib/cl_nodenamemap.h>
> +#include <iba/ib_types.h>
> +#include <complib/cl_nodenamemap.h>
>  
>  #include "ibdiag_common.h"
>  
> @@ -170,7 +175,7 @@ recv_mad:
>  	if (ibdebug > 1)
>  		xdump(stdout, "SA Response:\n", mad, len);
>  
> -	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
> +	method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F);
>  	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
>  	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
>  	result.p_result_madw = mad;
> @@ -189,12 +194,12 @@ recv_mad:
>  static void *get_query_rec(void *mad, unsigned i)
>  {
>  	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
> -	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
> +	return (char *) mad + IB_SA_DATA_OFFS + i * (offset << 3);

Ditto.

>  }
>  
>  static unsigned valid_gid(ib_gid_t *gid)
>  {
> -	ib_gid_t zero_gid = { };
> +	ib_gid_t zero_gid = { 0 };
>  	return memcmp(&zero_gid, gid, sizeof(*gid));
>  }
>  
> @@ -442,7 +447,7 @@ static void dump_multicast_member_record(void *data)
>  	char gid_str2[INET6_ADDRSTRLEN];
>  	ib_member_rec_t *p_mcmr = data;
>  	uint16_t mlid = cl_ntoh16(p_mcmr->mlid);
> -	int i = 0;
> +	unsigned i = 0;
>  	char *node_name = "<unknown>";
>  
>  	/* go through the node records searching for a port guid which matches
> @@ -758,7 +763,7 @@ static void dump_one_mft_record(void *data)
>  
>  static void dump_results(struct query_res *r, void (*dump_func) (void *))
>  {
> -	int i;
> +	unsigned i;
>  	for (i = 0; i < r->result_cnt; i++) {
>  		void *data = get_query_rec(r->p_result_madw, i);
>  		dump_func(data);
> @@ -768,7 +773,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *))
>  static void return_mad(void)
>  {
>  	if (result.p_result_madw) {
> -		free(result.p_result_madw - umad_size());
> +		free((char *) result.p_result_madw - umad_size());

Ditto.

>  		result.p_result_madw = NULL;
>  	}
>  }
> @@ -839,7 +844,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid)
>  {
>  	ib_node_record_t *node_record = NULL;
>  	ib_node_info_t *p_ni = NULL;
> -	int i = 0, ret;
> +	unsigned i;
> +	int ret;
>  
>  	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
>  	if (ret)
> @@ -869,7 +875,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name)
>  	if (isalpha(name[0]))
>  		assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS);
>  	else
> -		rc_lid = atoi(name);
> +		rc_lid = (uint16_t) atoi(name);
>  	if (rc_lid == 0)
>  		fprintf(stderr, "Failed to find lid for \"%s\"\n", name);
>  	return rc_lid;
> @@ -917,8 +923,8 @@ static int parse_lid_and_ports(bind_handle_t h,
>  
>  #define cl_hton8(x) (x)
>  #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \
> -	if (val > comp_with) { \
> -		target = cl_hton##size(val); \
> +	if ((uint##size##_t) val > (uint##size##_t) comp_with) { \
> +		target = cl_hton##size((uint##size##_t) val); \
>  		comp_mask |= IB_##name##_COMPMASK_##mask; \
>  	}
>  
> @@ -951,7 +957,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask)
>  
>  static int print_node_records(bind_handle_t h)
>  {
> -	int i = 0, ret;
> +	unsigned i;
> +	int ret;
>  
>  	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
>  	if (ret)
> @@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h,
>  	CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID);
>  	CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT);
>  	CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL);
> -	pr.hop_flow_raw |= cl_hton32(flow << 8);
> +	pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8);

Why is this cast needed? This should be a uint32_t to uint32_t
assignment, no?

>  	CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS);
>  	CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE);
>  	CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR, NUMBPATH);
> @@ -1089,7 +1096,7 @@ static int print_multicast_member_records(bind_handle_t h)
>  
>  return_mc:
>  	if (mc_group_result.p_result_madw)
> -		free(mc_group_result.p_result_madw - umad_size());
> +		free((char *) mc_group_result.p_result_madw - umad_size());

void * to char * casting again.

>  
>  	return ret;
>  }
> @@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
>  	memset(&pktr, 0, sizeof(pktr));
>  	CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID);
>  	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
> -	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
> +	CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK);

This fix is unrelated to porting, right?

The rest looks fine to me.

Sasha

>  
>  	return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0,
>  					comp_mask, &pktr, smkey,
> @@ -1503,13 +1510,13 @@ static int process_opt(void *context, int ch, char *optarg)
>  		query_type = IB_SA_ATTR_LINKRECORD;
>  		break;
>  	case 5:
> -		p->slid = strtoul(optarg, NULL, 0);
> +		p->slid = (uint16_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 6:
> -		p->dlid = strtoul(optarg, NULL, 0);
> +		p->dlid = (uint16_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 7:
> -		p->mlid = strtoul(optarg, NULL, 0);
> +		p->mlid = (uint16_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 14:
>  		if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0)
> @@ -1534,7 +1541,7 @@ static int process_opt(void *context, int ch, char *optarg)
>  		p->numb_path = strtoul(optarg, NULL, 0);
>  		break;
>  	case 18:
> -		p->pkey = strtoul(optarg, NULL, 0);
> +		p->pkey = (uint16_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'Q':
>  		p->qos_class = strtoul(optarg, NULL, 0);
> @@ -1543,19 +1550,19 @@ static int process_opt(void *context, int ch, char *optarg)
>  		p->sl = strtoul(optarg, NULL, 0);
>  		break;
>  	case 'M':
> -		p->mtu = strtoul(optarg, NULL, 0);
> +		p->mtu = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'R':
> -		p->rate = strtoul(optarg, NULL, 0);
> +		p->rate = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 20:
> -		p->pkt_life = strtoul(optarg, NULL, 0);
> +		p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'q':
>  		p->qkey = strtoul(optarg, NULL, 0);
>  		break;
>  	case 'T':
> -		p->tclass = strtoul(optarg, NULL, 0);
> +		p->tclass = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'F':
>  		p->flow_label = strtoul(optarg, NULL, 0);
> @@ -1564,10 +1571,10 @@ static int process_opt(void *context, int ch, char *optarg)
>  		p->hop_limit = strtoul(optarg, NULL, 0);
>  		break;
>  	case 21:
> -		p->scope = strtoul(optarg, NULL, 0);
> +		p->scope = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'J':
> -		p->join_state = strtoul(optarg, NULL, 0);
> +		p->join_state = (uint8_t) strtoul(optarg, NULL, 0);
>  		break;
>  	case 'X':
>  		p->proxy_join = strtoul(optarg, NULL, 0);
> @@ -1582,14 +1589,7 @@ int main(int argc, char **argv)
>  {
>  	char usage_args[1024];
>  	bind_handle_t h;
> -	struct query_params params = {
> -		.hop_limit = -1,
> -		.reversible = -1,
> -		.numb_path = -1,
> -		.qos_class = -1,
> -		.sl = -1,
> -		.proxy_join = -1,
> -	};
> +	struct query_params params;
>  	const struct query_cmd *q;
>  	ib_api_status_t status;
>  	int n;
> @@ -1643,9 +1643,17 @@ int main(int argc, char **argv)
>  		{ "scope", 21, 1, NULL, "Scope (MCMemberRecord)" },
>  		{ "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" },
>  		{ "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" },
> -		{}
> +		{ 0 }
>  	};
>  
> +	memset(&params, 0, sizeof params);
> +	params.hop_limit = -1;
> +	params.reversible = -1;
> +	params.numb_path = -1;
> +	params.qos_class = -1;
> +	params.sl = -1;
> +	params.proxy_join = -1;
> +
>  	n = sprintf(usage_args, "[query-name] [<name> | <lid> | <guid>]\n"
>  		    "\nSupported query names (and aliases):\n");
>  	for (q = query_cmds; q->name; q++) {
> @@ -1680,7 +1688,7 @@ int main(int argc, char **argv)
>  
>  	if (argc) {
>  		if (node_print_desc == NAME_OF_LID) {
> -			requested_lid = strtoul(argv[0], NULL, 0);
> +			requested_lid = (uint16_t) strtoul(argv[0], NULL, 0);
>  			requested_lid_flag++;
>  		} else if (node_print_desc == NAME_OF_GUID) {
>  			requested_guid = strtoul(argv[0], NULL, 0);
> -- 
> 1.6.1.2.319.gbd9e
> 


From Jeffrey.C.Becker at nasa.gov  Thu Feb 26 13:12:38 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Thu, 26 Feb 2009 13:12:38 -0800
Subject: [ofa-general]
	Re: Sun is asking for a response on Bug No.s 1522, 1523
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com>
References: <1235592409.22158.80.camel@pc.interlinx.bc.ca>
	<897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM>
	<E70D6AD7A0514A038D7DB2DC630E0EB9@BillGWAYLAPTOP>
	<382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com>
Message-ID: <49A705C6.3090605@nasa.gov>

Hi

Woodruff, Robert J wrote:
> Bill, since the bug was submitted as P3 and normal,
> rather than P1 as critical or blocker, it was probably overlooked as
> something
> that had to be fixed in 1.4. I changed the bug to a blocker P1 bug for
> OFED 1.4.1,
> and thus should show up on the bug tracking list that gets reviewed in
> the EWG.
>  
> In addition to entering the bug into bugzilla, people may also want
> to send
> an email to the maintainer for blocker bugs, so that they know it needs
> quick
> attention. It is assigned to Jeff Becker, the NFS/RDMA maintainer.
> I know that there were some issues with backports for NFS/RDMA in 1.4 and
> believe he is trying to get these fixed for OFED 1.4.1.

I'm currently finishing up the SLES11 backports for OFED, and I'll work
on this when I'm done. Thanks.

-jeff

>  
> woody
>  
>
> ------------------------------------------------------------------------
> *From:* Bill Boas [mailto:Bill.Boas at openfabrics.org]
> *Sent:* Thursday, February 26, 2009 10:06 AM
> *To:* general at lists.openfabrics.org; ewg at lists.openfabrics.org;
> wwg at lists.openfabrics.org
> *Cc:* 'Peter Jones'; 'Brian J. Murrell'; 'Makia Minich';
> Bryon.Neitzel at Sun.COM
> *Subject:* Sun is asking for a response on Bug No.s 1522, 1523
>
> Dear OFA Maintainers:
>
>  
>
> Sun has contacted me about these bugs and is asking for priority
> action to get fixes for them. See the thread below.
>
>  
>
> I’m sending this email to 3 OFA lists because Sun having to contact
> the Alliance this way raises many questions about how a vendor like
> Sun or a customer like, say, a Wall St bank gets bugs fixed and where
> do we, the OFA, publish this information?
>
>  
>
> And if we agree that an OFA maintainer is the right person to fix the
> bug how does that maintainer learn that and how do they respond?
>
>  
>
> Thank you for responding to this email and to the Sun team using OFED
> in their products.
>
>  
>
> Bill.
>
>  
>
> Bill Boas
>
> Executive Director and Vice Chair
>
> OpenFabrics Alliance
>
> 510-375-8840
>
> Bill.Boas at openfabrics.org <mailto:Bill.Boas at openfabrics.org>
>
> www.openfabrics.org
>
>  
>
> ------------------------------------------------------------------------
>
> *From:* Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM]
> *Sent:* Wednesday, February 25, 2009 12:20 PM
> *To:* Bill Boas
> *Cc:* Peter Jones; Brian J. Murrell; Makia Minich
> *Subject:* Fwd: OFED 1.4 for NHM chips
>
>  
>
> Hi Bill,
>
> I found the bug numbers for the Lustre build issues against OFED 1.4.0
> that I mentioned yesterday.
>
> Is there any way to bump up the priority on these? This is blocking
> our ability to deliver our new Vayu (Nehalem+QDR) hardware to our
> customers, since Mellanox says they'll only support OFED 1.4 with QDR
> hardware.
>
>  
>
> Thanks,
>
> Bryon
>
>  
>
>  
>
> Begin forwarded message:
>
>
>
> *From: *"Brian J. Murrell" <Brian.Murrell at Sun.COM
> <mailto:Brian.Murrell at Sun.COM>>
>
> *Date: *February 25, 2009 1:06:49 PM MST
>
> *To: *Bryon Neitzel <Bryon.Neitzel at Sun.COM <mailto:Bryon.Neitzel at Sun.COM>>
>
> *Cc: *Peter Jones <Peter.A.Jones at Sun.COM <mailto:Peter.A.Jones at Sun.COM>>
>
> *Subject: **Re: OFED 1.4 for NHM chips*
>
>  
>
> On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote:
>
> Hi Brian,   what is the website where these OFED bugs were filed?    
>
>
> https://bugs.openfabrics.org/show_bug.cgi?id=1522
> https://bugs.openfabrics.org/show_bug.cgi?id=1523
>
>
> Bill Boas tried to find any OFED bugs opened by Sun yesterday,
>
>
> Yeah.  I opened them a number of days ago.
>
> Not sure if it matters to anyone, but IMHO, 1523 is the correct future
> direction and fixing that would implicitly fix 1522.  1523 basically
> describes separating the "technology preview" that caused this breakage
> out into its own independent module so that it does not pollute the
> OFED core.  As such, it might be resisted as it's a "deeper cut" type of
> fix, but one that removes the sick part rather than trying to continue
> to bandage it.
>
> b.
>
>  
>


From sashak at voltaire.com  Thu Feb 26 13:25:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 23:25:38 +0200
Subject: [ofa-general] Re: [PATCH] Add pkey table support to
	osm_get_all_port_attrs
In-Reply-To: <f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
Message-ID: <20090226212538.GL14238@sashak.voltaire.com>

On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
> >> +					r = IB_INSUFFICIENT_MEMORY;
> >> +					OSM_LOG(p_vend->p_log,
> >> +						OSM_LOG_ERROR,
> >> +						"ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
> >> +						j,
> >> +						ca.ports[j]->pkeys_size);
> >
> > Also, should it be an error? Maybe it is enough just to fill the requested
> > pkey entries?
> 
> I agree that being more forgiving is better but then how would it be
> known if the pkeys are being truncated ?

You could return the real pkeys_size value, with the table filled up to
the provided size.

Otherwise (in case of just an error), how could a user know which pkey
size to provide?
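
A minimal sketch of that idea, under stated assumptions: copy_pkeys() and pkeys_truncated() are hypothetical helpers, not part of the OpenSM vendor API, and the pkey table is modelled as a plain uint16_t array. The function fills as many entries as fit and returns the number the port really has, so the caller can detect truncation and retry with a larger buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the OpenSM API: copy as many pkeys as fit
 * into the caller's buffer and return the real table size of the port.
 */
static size_t copy_pkeys(const uint16_t *port_pkeys, size_t port_count,
			 uint16_t *out, size_t out_capacity)
{
	size_t n = port_count < out_capacity ? port_count : out_capacity;

	if (n > 0)
		memcpy(out, port_pkeys, n * sizeof(*out));
	return port_count;
}

/* Nonzero when the returned count exceeds what the caller provided. */
static int pkeys_truncated(size_t returned_count, size_t out_capacity)
{
	return returned_count > out_capacity;
}
```

The caller has to compare the returned count against the capacity it passed in, which is the same convention snprintf() uses for its return value.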

Sasha


From sashak at voltaire.com  Thu Feb 26 13:32:07 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Feb 2009 23:32:07 +0200
Subject: [ofa-general] Re: [PATCH] Add pkey table support to
	osm_get_all_port_attrs
In-Reply-To: <f0e08f230902260403tc20661fmdcba5156dc40fe90@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226071059.GV11192@sashak.voltaire.com>
	<f0e08f230902260403tc20661fmdcba5156dc40fe90@mail.gmail.com>
Message-ID: <20090226213200.GM14238@sashak.voltaire.com>

On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
> On Thu, Feb 26, 2009 at 2:10 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 10:30 Wed 18 Feb     , Hal Rosenstock wrote:
> >>
> >> Only supported in osm_vendor_ibumad.c (separate patch for other
> >> vendor layers)
> >> Also, update applications using this (osmtest, opensm)
> >
> > It looks like ibutils (ibis) requires the same fix (attr_array
> > initialization) too.
> 
> Yes, I'm aware but didn't want to send those until these were accepted.

attr_array initialization doesn't hurt by itself, so to avoid having a
broken version it would be better to apply it before the actual change.

Sasha


From jgunthorpe at obsidianresearch.com  Thu Feb 26 13:30:33 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 26 Feb 2009 14:30:33 -0700
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support
	for WinOF
In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
Message-ID: <20090226213033.GG5127@obsidianresearch.com>

On Thu, Feb 26, 2009 at 11:02:19PM +0200, Sasha Khapyorsky wrote:

> > -		size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS,
> > +		size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS,
> 
> What is the reason for such void * to char * casting?

Math on void* pointers is a gcc extension, I'm surprised you don't get
warnings on linux - it is worth figuring out how to turn those on..

Sean: For this purpose casting to (char *) is somewhat sketchy, it
should be (uint8_t *).. char should only ever be used for strings due
to possible troubles with environments using 16 bit chars for wide
character support.
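
For reference, a minimal sketch of the portable form. byte_offset() is a hypothetical helper, not something in infiniband-diags; it only shows the cast to a byte pointer before the addition. On the gcc side, -Wpointer-arith is the option that warns about arithmetic on void * pointers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: arithmetic directly on void * relies on a gcc
 * extension, so convert to a byte pointer first and let the result
 * convert back to void * on return.
 */
static void *byte_offset(void *base, size_t offset)
{
	return (uint8_t *) base + offset;
}
```

A call such as byte_offset(mad, IB_VENDOR_RANGE2_DATA_OFFS) would then replace the open-coded mad + IB_VENDOR_RANGE2_DATA_OFFS.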

Jason


From sean.hefty at intel.com  Thu Feb 26 13:39:45 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 13:39:45 -0800
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support	for
	WinOF
In-Reply-To: <20090226213033.GG5127@obsidianresearch.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<20090226213033.GG5127@obsidianresearch.com>
Message-ID: <FCEC1A1AFD8F49078D1B188C7EAA5949@amr.corp.intel.com>

>Sean: For this purpose casting to (char *) is somewhat sketchy, it
>should be (uint8_t *).. char should only ever be used for strings due
>to possible troubles with environments using 16 bit chars for wide
>character support.

I'm not aware of any environments that define char as anything other than a
byte, but I can change this.


From hal.rosenstock at gmail.com  Thu Feb 26 13:43:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Feb 2009 16:43:08 -0500
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to 
	osm_get_all_port_attrs
In-Reply-To: <20090226212538.GL14238@sashak.voltaire.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<20090226212538.GL14238@sashak.voltaire.com>
Message-ID: <f0e08f230902261343k66e24b00t1bf5b4c228c13f53@mail.gmail.com>

On Thu, Feb 26, 2009 at 4:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
>> >> +					r = IB_INSUFFICIENT_MEMORY;
>> >> +					OSM_LOG(p_vend->p_log,
>> >> +						OSM_LOG_ERROR,
>> >> +						"ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
>> >> +						j,
>> >> +						ca.ports[j]->pkeys_size);
>> >
>> > Also should it be an error? May be it is just enough to fill requested
>> > pkey entries?
>>
>> I agree that being more forgiving is better but then how would it be
>> known if the pkeys are being truncated ?
>
> You could return a real pkeys_size value with table filled up to
> provided size.
>
> Otherwise (in case of just an error) how an user could know which pkey
> size to provide?

The problem with that is that the user needs to remember how many he
asked for originally. Not hard but just a detail that I expect will
get lost.
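
One way to reconcile both points might be the convention snprintf() uses:
copy as many entries as fit and always report the full count, so the caller
detects truncation by comparing the result with the capacity it already
holds. A minimal sketch follows; copy_pkeys() and its parameters are
hypothetical names for illustration, not part of the OpenSM vendor API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not an OpenSM function.
 * Copies at most 'capacity' pkeys into 'dst' and returns the number of
 * pkeys the port actually has, so truncation shows up as
 * (return value > capacity), much like snprintf().
 */
static size_t copy_pkeys(uint16_t *dst, size_t capacity,
			 const uint16_t *src, size_t available)
{
	size_t n = available < capacity ? available : capacity;

	if (n)
		memcpy(dst, src, n * sizeof(*dst));
	return available;
}
```

A caller that passed room for 2 entries and gets 4 back knows the table was
truncated and how much space a retry needs.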

-- Hal

> Sasha
>


From sean.hefty at intel.com  Thu Feb 26 13:45:36 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 13:45:36 -0800
Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for
	WinOF
In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
Message-ID: <9E758F38400F48348241220A53295572@amr.corp.intel.com>

>> @@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd
>*q, bind_handle_t h,
>>  	CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID);
>>  	CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT);
>>  	CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL);
>> -	pr.hop_flow_raw |= cl_hton32(flow << 8);
>> +	pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8);
>
>Why this casting is needed? This should be uint32_t to uint32_t
>assignment, no?

Hmm... the cast shouldn't be needed.

>> @@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct
>query_cmd *q,
>>  	memset(&pktr, 0, sizeof(pktr));
>>  	CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID);
>>  	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
>> -	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
>> +	CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK);
>
>This fix is unrelated to porting, right?

Somewhat - this is a real fix, but without it, there's a build error assigning a
uint16 to an 8-bit port_num.

I'll remove the cast above and change the (char *) casts to (uint8_t *) casts
instead.

- Sean


From jgunthorpe at obsidianresearch.com  Thu Feb 26 14:00:13 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 26 Feb 2009 15:00:13 -0700
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support
	for WinOF
In-Reply-To: <FCEC1A1AFD8F49078D1B188C7EAA5949@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<20090226213033.GG5127@obsidianresearch.com>
	<FCEC1A1AFD8F49078D1B188C7EAA5949@amr.corp.intel.com>
Message-ID: <20090226220013.GA16941@obsidianresearch.com>

On Thu, Feb 26, 2009 at 01:39:45PM -0800, Sean Hefty wrote:
> >Sean: For this purpose casting to (char *) is somewhat sketchy, it
> >should be (uint8_t *).. char should only ever be used for strings due
> >to possible troubles with environments using 16 bit chars for wide
> >character support.
> 
> I'm not aware of any environments that define char as anything other than a
> byte, but I can change this.

There are some screwy embedded compilers that do this, not the target
platform for OFA, but if you are improving portability, may as well do
it right, once and for all...

It is good portability practice in general to never use char for
non-string objects because the signedness and width is undefined by
the language, and at least signedness varies by CPU and environment in
the real world.

This is why C99 introduced fixed width types and types like uint8_t
and uintptr_t, because the actual language provides no other
guaranteed type to use :(
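
As a small illustration of the signedness point (a sketch only; the helper
names are made up for this example), reading the same byte through char and
through uint8_t can give different results depending on the platform:

```c
#include <stdint.h>

/* Read byte 'i' of a buffer through a plain char pointer. Whether a
 * stored 0xff comes back as 255 or as -1 depends on whether char is
 * signed on the target, which the C standard leaves to the
 * implementation. */
static int byte_via_char(const void *buf, unsigned i)
{
	return ((const char *) buf)[i];
}

/* Same read through uint8_t: always in the range 0..255 wherever
 * uint8_t exists. */
static int byte_via_u8(const void *buf, unsigned i)
{
	return ((const uint8_t *) buf)[i];
}
```

Values below 0x80 read the same either way; only bytes with the top bit set
expose the difference.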

Jason


From ralph.campbell at qlogic.com  Thu Feb 26 14:39:28 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 26 Feb 2009 14:39:28 -0800
Subject: [ofa-general] [PATCH v2] IB/core: fix null pointer dereference in
	local_completions()
Message-ID: <1235687968.3948.218.camel@chromite.mv.qlogic.com>

IB/core: fix null pointer dereference in local_completions()

handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local
on the mad_agent_priv->local_work work queue with
local->mad_priv == NULL if device->process_mad() returns
IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and
(!ib_response_mad(&mad_priv->mad.mad) ||
 !mad_agent_priv->agent.recv_handler).

In this case, local_completions() will be called with
local->mad_priv == NULL. The code does check for this
case and skips calling recv_mad_agent->agent.recv_handler()
but recv == 0 so kmem_cache_free() is called with a
NULL pointer.

Also, since recv isn't reinitialized each time through the loop,
it can cause a memory leak if recv should have been zero.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..735ad4e 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work)
 	struct ib_mad_local_private *local;
 	struct ib_mad_agent_private *recv_mad_agent;
 	unsigned long flags;
-	int recv = 0;
+	int free_mad;
 	struct ib_wc wc;
 	struct ib_mad_send_wc mad_send_wc;
 
@@ -2370,14 +2370,15 @@ static void local_completions(struct work_struct *work)
 				   completion_list);
 		list_del(&local->completion_list);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+		free_mad = 0;
 		if (local->mad_priv) {
 			recv_mad_agent = local->recv_mad_agent;
 			if (!recv_mad_agent) {
 				printk(KERN_ERR PFX "No receive MAD agent for local completion\n");
+				free_mad = 1;
 				goto local_send_completion;
 			}
 
-			recv = 1;
 			/*
 			 * Defined behavior is to complete response
 			 * before request
@@ -2422,7 +2423,7 @@ local_send_completion:
 
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		atomic_dec(&mad_agent_priv->refcount);
-		if (!recv)
+		if (free_mad)
 			kmem_cache_free(ib_mad_cache, local->mad_priv);
 		kfree(local);
 	}
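
The pattern behind the fix, resetting the flag on every pass and setting it
only when there is something to free, can be sketched outside the kernel.
This is a toy model under assumed names (struct item, count_frees), not the
mad.c code itself:

```c
#include <stddef.h>

/* Toy model of the loop: each work item may or may not carry a buffer,
 * and the buffer is freed here only when it exists and no receiver took
 * ownership of it. */
struct item {
	void *buf;		/* may be NULL */
	int has_receiver;	/* receiver takes ownership of buf */
};

/* Returns how many buffers one pass over the list would free. */
static unsigned count_frees(const struct item *items, size_t n)
{
	unsigned frees = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int free_buf = 0;	/* reset on every iteration */

		if (items[i].buf && !items[i].has_receiver)
			free_buf = 1;
		if (free_buf)
			frees++;
	}
	return frees;
}
```

With the flag scoped to one iteration, an item with a NULL buffer never
reaches the free path, and a value left over from an earlier item cannot
suppress a free that is due.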


From sean.hefty at intel.com  Thu Feb 26 14:41:06 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 14:41:06 -0800
Subject: [ofa-general] [PATCH 1/2] [ib-diag] ibsysstat: add support for WinOF
In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
Message-ID: <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
changes from v1: change (char *) casts to (uint8_t *)

 infiniband-diags/src/ibsysstat.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index cc1418d..da86d8e 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -183,7 +183,7 @@ static char *ibsystat_serv(void)
 
 		DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod);
 
-		size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS,
+		size = mk_reply(attr, (uint8_t *) mad + IB_VENDOR_RANGE2_DATA_OFFS,
 				sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS);
 
 		if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0)
@@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr)
 {
 	ib_rpc_t rpc = { 0 };
 	int fd, agent, timeout, len;
-	void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
+	void *data = (uint8_t *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
 
 	DEBUG("Sysstat ping..");
 
@@ -318,7 +318,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "oui", 'o', 1, NULL, "use specified OUI number" },
 		{ "Server", 'S', 0, NULL, "start in server mode" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "<dest lid|guid> [<op>]";
 

From sean.hefty at intel.com  Thu Feb 26 14:44:26 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 14:44:26 -0800
Subject: [ofa-general] [PATCH 2/2] [ib-diags] saquery: set correct pkey table
	field
In-Reply-To: <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>	<20090226101144.GB11192@sashak.voltaire.com>	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>	<20090226210211.GK14238@sashak.voltaire.com>
	<45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
Message-ID: <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>

port_num is incorrectly set instead of block_num

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
I will resubmit the changes for saquery to support winof.  I must have done
something wrong with my testing on that patch on linux, since I'm seeing
build warnings now.

 infiniband-diags/src/saquery.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index bcd1f61..3f508b9 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1267,7 +1267,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q,
 	memset(&pktr, 0, sizeof(pktr));
 	CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID);
 	CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT);
-	CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK);
+	CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK);
 
 	return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0,
 					comp_mask, &pktr, smkey,


From sean.hefty at intel.com  Thu Feb 26 15:10:04 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 26 Feb 2009 15:10:04 -0800
Subject: [ofa-general] [PATCH v2] [ib-diag] saquery: add support for WinOF
In-Reply-To: <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>	<20090226101144.GB11192@sashak.voltaire.com>	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>	<20090226210211.GK14238@sashak.voltaire.com>
	<45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
	<1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>
Message-ID: <BA7487E565E84667A1AAB40F6C990DFC@amr.corp.intel.com>

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Ok - that was quicker than I thought it would be...  this patch depends on
saquery: set correct pkey table field.

changes from v1:
  - use (uint8_t *) casts over (char *) casts
  - change initialization of zero_gid to use memset
  - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments
    are unsigned.  This is kind of confusing, but that's how it appears the
    macro is used.  It might be clearer if instead of passing -1 into the
    macro, that a SET_VAL macro be used instead.
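
For readers following the last item, the 16-bit case of the modified macro
behaves roughly like the function below. This is an illustrative stand-in:
the function name is invented, and the byte swap and component mask update
are left out.

```c
#include <stdint.h>

/* Stand-in for the 16-bit expansion of CHECK_AND_SET_VAL (assumed
 * simplification: no cl_hton16(), no comp_mask). Returns 1 if the
 * target was set. */
static int check_and_set16(int val, int comp_with, uint16_t *target)
{
	/* Compare as signed so that a default of -1 means "not given". */
	if ((int16_t) val > (int16_t) comp_with) {
		*target = (uint16_t) val;	/* store as unsigned */
		return 1;
	}
	return 0;
}
```

One thing to watch: on typical two's-complement compilers a 16-bit value
with the top bit set (0xC000, say) converts to a negative int16_t and would
not pass a "> 0" comparison, which may be part of why the mixed
signed/unsigned handling reads as confusing.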

 infiniband-diags/src/saquery.c |   77 ++++++++++++++++++++++------------------
 1 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 3f508b9..90ad512 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -37,20 +37,25 @@
  *
  */
 
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
 #include <unistd.h>
 #include <stdio.h>
 #include <arpa/inet.h>
 #include <ctype.h>
 #include <string.h>
 #include <errno.h>
+#include <assert.h>
 
 #define _GNU_SOURCE
 #include <getopt.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/iba/ib_types.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <iba/ib_types.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -170,7 +175,7 @@ recv_mad:
 	if (ibdebug > 1)
 		xdump(stdout, "SA Response:\n", mad, len);
 
-	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
+	method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F);
 	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
 	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
 	result.p_result_madw = mad;
@@ -189,12 +194,13 @@ recv_mad:
 static void *get_query_rec(void *mad, unsigned i)
 {
 	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
-	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
+	return (uint8_t *) mad + IB_SA_DATA_OFFS + i * (offset << 3);
 }
 
 static unsigned valid_gid(ib_gid_t *gid)
 {
-	ib_gid_t zero_gid = { };
+	ib_gid_t zero_gid;
+	memset(&zero_gid, 0, sizeof zero_gid);
 	return memcmp(&zero_gid, gid, sizeof(*gid));
 }
 
@@ -442,7 +448,7 @@ static void dump_multicast_member_record(void *data)
 	char gid_str2[INET6_ADDRSTRLEN];
 	ib_member_rec_t *p_mcmr = data;
 	uint16_t mlid = cl_ntoh16(p_mcmr->mlid);
-	int i = 0;
+	unsigned i = 0;
 	char *node_name = "<unknown>";
 
 	/* go through the node records searching for a port guid which matches
@@ -758,7 +764,7 @@ static void dump_one_mft_record(void *data)
 
 static void dump_results(struct query_res *r, void (*dump_func) (void *))
 {
-	int i;
+	unsigned i;
 	for (i = 0; i < r->result_cnt; i++) {
 		void *data = get_query_rec(r->p_result_madw, i);
 		dump_func(data);
@@ -768,7 +774,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *))
 static void return_mad(void)
 {
 	if (result.p_result_madw) {
-		free(result.p_result_madw - umad_size());
+		free((uint8_t *) result.p_result_madw - umad_size());
 		result.p_result_madw = NULL;
 	}
 }
@@ -839,7 +845,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid)
 {
 	ib_node_record_t *node_record = NULL;
 	ib_node_info_t *p_ni = NULL;
-	int i = 0, ret;
+	unsigned i;
+	int ret;
 
 	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
 	if (ret)
@@ -869,7 +876,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name)
 	if (isalpha(name[0]))
 		assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS);
 	else
-		rc_lid = atoi(name);
+		rc_lid = (uint16_t) atoi(name);
 	if (rc_lid == 0)
 		fprintf(stderr, "Failed to find lid for \"%s\"\n", name);
 	return rc_lid;
@@ -917,8 +924,8 @@ static int parse_lid_and_ports(bind_handle_t h,
 
 #define cl_hton8(x) (x)
 #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \
-	if (val > comp_with) { \
-		target = cl_hton##size(val); \
+	if ((int##size##_t) val > (int##size##_t) comp_with) { \
+		target = cl_hton##size((uint##size##_t) val); \
 		comp_mask |= IB_##name##_COMPMASK_##mask; \
 	}
 
@@ -951,7 +958,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask)
 
 static int print_node_records(bind_handle_t h)
 {
-	int i = 0, ret;
+	unsigned i;
+	int ret;
 
 	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
 	if (ret)
@@ -1089,7 +1097,7 @@ static int print_multicast_member_records(bind_handle_t h)
 
 return_mc:
 	if (mc_group_result.p_result_madw)
-		free(mc_group_result.p_result_madw - umad_size());
+		free((uint8_t *) mc_group_result.p_result_madw - umad_size());
 
 	return ret;
 }
@@ -1503,13 +1511,13 @@ static int process_opt(void *context, int ch, char *optarg)
 		query_type = IB_SA_ATTR_LINKRECORD;
 		break;
 	case 5:
-		p->slid = strtoul(optarg, NULL, 0);
+		p->slid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 6:
-		p->dlid = strtoul(optarg, NULL, 0);
+		p->dlid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 7:
-		p->mlid = strtoul(optarg, NULL, 0);
+		p->mlid = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 14:
 		if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0)
@@ -1534,7 +1542,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->numb_path = strtoul(optarg, NULL, 0);
 		break;
 	case 18:
-		p->pkey = strtoul(optarg, NULL, 0);
+		p->pkey = (uint16_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'Q':
 		p->qos_class = strtoul(optarg, NULL, 0);
@@ -1543,19 +1551,19 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->sl = strtoul(optarg, NULL, 0);
 		break;
 	case 'M':
-		p->mtu = strtoul(optarg, NULL, 0);
+		p->mtu = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'R':
-		p->rate = strtoul(optarg, NULL, 0);
+		p->rate = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 20:
-		p->pkt_life = strtoul(optarg, NULL, 0);
+		p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'q':
 		p->qkey = strtoul(optarg, NULL, 0);
 		break;
 	case 'T':
-		p->tclass = strtoul(optarg, NULL, 0);
+		p->tclass = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'F':
 		p->flow_label = strtoul(optarg, NULL, 0);
@@ -1564,10 +1572,10 @@ static int process_opt(void *context, int ch, char *optarg)
 		p->hop_limit = strtoul(optarg, NULL, 0);
 		break;
 	case 21:
-		p->scope = strtoul(optarg, NULL, 0);
+		p->scope = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'J':
-		p->join_state = strtoul(optarg, NULL, 0);
+		p->join_state = (uint8_t) strtoul(optarg, NULL, 0);
 		break;
 	case 'X':
 		p->proxy_join = strtoul(optarg, NULL, 0);
@@ -1582,14 +1590,7 @@ int main(int argc, char **argv)
 {
 	char usage_args[1024];
 	bind_handle_t h;
-	struct query_params params = {
-		.hop_limit = -1,
-		.reversible = -1,
-		.numb_path = -1,
-		.qos_class = -1,
-		.sl = -1,
-		.proxy_join = -1,
-	};
+	struct query_params params;
 	const struct query_cmd *q;
 	ib_api_status_t status;
 	int n;
@@ -1643,9 +1644,17 @@ int main(int argc, char **argv)
 		{ "scope", 21, 1, NULL, "Scope (MCMemberRecord)" },
 		{ "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" },
 		{ "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" },
-		{}
+		{ 0 }
 	};
 
+	memset(&params, 0, sizeof params);
+	params.hop_limit = -1;
+	params.reversible = -1;
+	params.numb_path = -1;
+	params.qos_class = -1;
+	params.sl = -1;
+	params.proxy_join = -1;
+
 	n = sprintf(usage_args, "[query-name] [<name> | <lid> | <guid>]\n"
 		    "\nSupported query names (and aliases):\n");
 	for (q = query_cmds; q->name; q++) {
@@ -1680,7 +1689,7 @@ int main(int argc, char **argv)
 
 	if (argc) {
 		if (node_print_desc == NAME_OF_LID) {
-			requested_lid = strtoul(argv[0], NULL, 0);
+			requested_lid = (uint16_t) strtoul(argv[0], NULL, 0);
 			requested_lid_flag++;
 		} else if (node_print_desc == NAME_OF_GUID) {
 			requested_guid = strtoul(argv[0], NULL, 0);


From cameron at harr.org  Thu Feb 26 16:18:44 2009
From: cameron at harr.org (Cameron Harr)
Date: Thu, 26 Feb 2009 17:18:44 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A6F25F.8060306@vlnb.net>
References: <48E386F6.5040502@fusionio.com>
	<48F79CF8.3010905@vlnb.net>
	<48FE6C84.7030300@harr.org>
	<48FEDA26.4080304@vlnb.net>
	<48FF2D1A.8000101@harr.org>
	<48FF5F42.2050902@vlnb.net>
	<48FF60D3.9020809@harr.org>
	<4901F14C.6000006@harr.org>
	<490210EE.2070000@vlnb.net>
	<49022553.1020804@harr.org>
	<490B45ED.3020203@vlnb.net>
	<4910A622.4050906@harr.org>
	<4911D827.10705@vlnb.net>
	<49121715.4040804@harr.org>
	<4912C684.5000505@vlnb.net>
	<491307C7.50008@harr.org>
	<49131A85.2010102@vlnb.net>
	<49189567.1010804@harr.org>
	<49258122.6040808@vlnb.net>
	<496687DA.6010707@harr.org>
	<496B98DF.4050305@vlnb.net>
	<496BD8CA.7050503@harr.org>
	<496C81E3.2050105@vlnb.net>
	<496CC493.3040207@harr.org>
	<496CD883.8040906@vlnb.net>
	<496CDFE0.2030601@harr.org>
	<4970F014.2030101@vlnb.net>
	<4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net>
	<49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net>
	<49A4812A.8050202@harr.org>
	<49A57256.2000005@harr.org>
	<49A6CF2B.4010002@harr.org>
	<49A6F25F.8060306@vlnb.net>
Message-ID: <49A73164.3010109@harr.org>

Vladislav Bolkhovitin wrote:
> Cameron Harr, on 02/26/2009 08:19 PM wrote:
>> Cameron Harr wrote:
>>> Cameron Harr wrote:
>>> I re-compiled and re-ran the tests and numbers are a little better 
>>> but performance still seems to have gone down from 673:
>>> Test 1:373751.66
>>> Test 2:371242.6067
>>> Test 3:347988.1467
>>> Test 4:378247.31
>>> Test 5:375616.53
>> I was curious and did a regression test with 673 and those numbers 
>> are now even worse, so I'll presume there is an issue on my system 
>> and not the SCST code:
>> Test 1:365204.3067
>> Test 2:364152.2067
>> Test 3:340665.7633
>> Test 4:369916.8133
>> Test 5:369093.5833
>
> It's known that any OS, including Linux, is getting "tired" under load 
> with time from boot, which leads to worse performance. I guess, you 
> can experience such effect.
>
> Check with r634. R635 has cache locality in data structures related 
> change, which intended to improve performance a bit, but might make it 
> worse instead.
>

This is with 634. It's pretty bad:
338316.44
329698.04
307972.7133
345682.4733
344165.08


From klakshman03 at hotmail.com  Thu Feb 26 20:26:38 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Fri, 27 Feb 2009 09:56:38 +0530
Subject: [ofa-general] RE: Problem in IB network without Switch
In-Reply-To: <200902261849.40448.jackm@dev.mellanox.co.il>
References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com>
	<200902261335.59927.jackm@dev.mellanox.co.il>
	<BAY101-W3170561A2230014F98DBC1B8AD0@phx.gbl>
	<200902261849.40448.jackm@dev.mellanox.co.il>
Message-ID: <BAY101-W3347574C6AC5AD300CE8BCB8AA0@phx.gbl>


Thank you, Jack.

I will update the firmware and let you know the status.

laxman

> From: jackm at dev.mellanox.co.il
> To: klakshman03 at hotmail.com
> Subject: Re: Problem in IB network without Switch
> Date: Thu, 26 Feb 2009 18:49:40 +0200
> CC: keshetti.mahesh at gmail.com; general at lists.openfabrics.org
> 
> You are running VERY old firmware (from 2004), and moreover, on one host
> you have 3.0.0, and on the other 3.1.0.
> 
> You need to upgrade your firmware.
> Contact your Mellanox FAE (support engineer) for instructions.
> 
> - Jack
> 
> >  Hi Jack,
> > 
> > Please find the output of ibstat on both the nodes, .
> > 
> > [root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed 
> > HCA Firmware Check ..................... FAIL
> >     REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917)
> > Host Driver Initialization ............. PASS
> > 
> > [root at mattool ~]# 
> > 
> > ************ IBSTAT output ******************
> > 
> > 
> > [root at mattool ~]# ibstat
> > CA 'mthca0'
> >         CA type: MT23108
> >         Number of ports: 2
> >         Firmware version: 3.1.0
> 
> > [root at compute-0-0 ~]# ibstat
> > CA 'mthca0'
> >         CA type: MT23108
> >         Number of ports: 2
> >         Firmware version: 3.0.0


From Jie.Cai at cs.anu.edu.au  Thu Feb 26 22:29:47 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 27 Feb 2009 17:29:47 +1100
Subject: [ofa-general] Bandwidth of performance with multirail IB
In-Reply-To: <200902240941.58634.cap@nsc.liu.se>
References: <20090223211155.730AFE28137@openfabrics.org>
	<49A378BC.5010806@cs.anu.edu.au>
	<200902240941.58634.cap@nsc.liu.se>
Message-ID: <49A7885B.3010005@cs.anu.edu.au>

Hi Peter,

A question on implementing multi-rail with uDAPL connections.

What I did was open 2 IAs (corresponding to the 2 ports on the HCA) on each
node, then create one EP for each IA and connect those EPs to the
corresponding EPs on the other node.

Data is then transferred over both EP connections.

I noticed that there is a MULTIPATH connection flag for dapl, but I did
not use it. What is it for?

Cheers,
Jie


-- 
Mr. Jie Cai


Peter Kjellstrom wrote:
> On Tuesday 24 February 2009, Jie Cai wrote:
>   
>> I have implemented a uDAPL program to measure the bandwidth on IB with
>> multirail connections.
>>
>> The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two
>> ports.
>>
>> The program utilize the two port on each node of cluster to build
>> multirail IB connections.
>>
>> The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which
>> is almost the same as single rail connections.
>>     
>
> Assuming you have a 2.5 GT/s pci-express x8 that speed is a result of the bus 
> not being able to keep up with the HCA. Since the bus is holding even a 
> single DDR IB port back you see no improvement with two ports.
>
> To fully drive a DDR IB port you need either 16x pci-express 2.5 GT/s or a 8x 
> 5 GT/s. For one QDR or two DDR you'll need even more...
>
> /Peter
>
>   
>> Does anyone have similar experience?
>>     
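
Peter's figures can be checked with a little arithmetic. The sketch below
assumes 8b/10b line coding (used by both PCI Express 1.x/2.0 and SDR/DDR
InfiniBand) and a 4X link width for the IB port, which the thread does not
state; the function name is made up for the example.

```c
/* Data rate in GB/s for a link that uses 8b/10b line coding, where
 * every 10 signalled bits carry 8 data bits. 'gt_per_s' is the per-lane
 * signalling rate in GT/s (or Gbit/s for InfiniBand). */
static double data_gbytes_8b10b(double gt_per_s, int lanes)
{
	double data_gbits = gt_per_s * lanes * 8.0 / 10.0;

	return data_gbits / 8.0;	/* bits to bytes */
}
```

Under these assumptions data_gbytes_8b10b(2.5, 8) gives 2 GB/s for the x8
2.5 GT/s bus, and data_gbytes_8b10b(5.0, 4) gives the same 2 GB/s for one
4X DDR port. The bus figure is before PCI Express packet overhead, so the
usable rate is lower than what a single port can deliver, while x16 at
2.5 GT/s or x8 at 5 GT/s gives 4 GB/s and leaves headroom.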


From Jie.Cai at cs.anu.edu.au  Thu Feb 26 22:49:09 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 27 Feb 2009 17:49:09 +1100
Subject: [ofa-general] configuration question: how to support multiple IB
	interfaces?
Message-ID: <49A78CE5.4000506@cs.anu.edu.au>

I have ConnectX dual-port HCAs installed in my system (with PCIe 2.0
support), and each port shows up as a separate interface in the ifconfig
output (ib0 and ib1).
Communication using Open MPI and uDAPL over the IB connection works fine.

However, I have a question on how to utilize both ports. Do I need to
configure the system specifically to drive both ports? (I have set
net.ipv4.conf.all.arp_ignore et al.)

I can't see better bandwidth with either Open MPI or uDAPL.

Does anyone have experience with this?

-- 
Mr. Jie Cai


From davem at davemloft.net  Fri Feb 27 00:01:50 2009
From: davem at davemloft.net (David Miller)
Date: Fri, 27 Feb 2009 00:01:50 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS),
	take 2
In-Reply-To: <c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
	<c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>
Message-ID: <20090227.000150.177646700.davem@davemloft.net>

From: Andrew Grover <andy.grover at gmail.com>
Date: Wed, 25 Feb 2009 10:43:27 -0800

> On Tue, Feb 24, 2009 at 11:28 PM, David Miller <davem at davemloft.net> wrote:
> > Furthermore the port you've choosen for the protocol is arbitrary, not
> > properly allocated with the appropriate standards committee, and
> > therefore could conflict with something other people are using.
> 
> I'm sure allocating the port won't be too big an issue.

Ok, I added the RDS code to the net-next-2.6 tree, changing
AF_RDS to be 21


From sashak at voltaire.com  Fri Feb 27 00:32:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 10:32:00 +0200
Subject: [ofa-general] Re: [PATCH v2] [ib-diag] saquery: add support for
	WinOF
In-Reply-To: <BA7487E565E84667A1AAB40F6C990DFC@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
	<1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>
	<BA7487E565E84667A1AAB40F6C990DFC@amr.corp.intel.com>
Message-ID: <20090227083200.GA7462@sashak.voltaire.com>

On 15:10 Thu 26 Feb     , Sean Hefty wrote:
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

All applied. Thanks.

>   - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments
>     are unsigned.  This is kind of confusing, but that's how it appears the
>     macro is used.  It might be clearer if instead of passing -1 into the
>     macro, that a SET_VAL macro be used instead.

What do you mean? Another macro?

Sasha


From sashak at voltaire.com  Fri Feb 27 00:36:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 10:36:02 +0200
Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support
	for WinOF
In-Reply-To: <20090226213033.GG5127@obsidianresearch.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<20090226213033.GG5127@obsidianresearch.com>
Message-ID: <20090227083602.GB7462@sashak.voltaire.com>

On 14:30 Thu 26 Feb     , Jason Gunthorpe wrote:
> 
> Math on void* pointers is a gcc extension,

Indeed. (I forgot about this a long time ago :)).

> I'm surprised you don't get
> warnings on linux - it is worth figuring out how to turn those on..

Gcc warns when '-pedantic' is used.
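
For reference, the portable form used in the patches looks like this in
isolation (a sketch; offset_ptr is just an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

/* Portable byte offset into an untyped buffer. ISO C does not define
 * arithmetic on void *; gcc accepts it as an extension and treats the
 * element size as 1, so cast to a byte pointer before adding. */
static void *offset_ptr(void *base, size_t offs)
{
	return (uint8_t *) base + offs;
}
```

Besides '-pedantic', gcc also has '-Wpointer-arith', which reports void
pointer arithmetic without enabling the rest of the pedantic warnings.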

Sasha


From sashak at voltaire.com  Fri Feb 27 01:08:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 11:08:45 +0200
Subject: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate unneeded
	umad2sim_dev num
In-Reply-To: <20090219174413.GA29805@comcast.net>
References: <20090219174413.GA29805@comcast.net>
Message-ID: <20090227090838.GC7462@sashak.voltaire.com>

Hi Hal,

On 12:44 Thu 19 Feb     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
> index e13e30a..aaa6260 100644
> --- a/umad2sim/umad2sim.c
> +++ b/umad2sim/umad2sim.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>   *
>   * This file is part of ibsim.
>   *
> @@ -77,7 +78,6 @@ struct ib_user_mad_reg_req {
>  
>  struct umad2sim_dev {
>  	int fd;
> -	unsigned num;

Wouldn't it be useful once more than one CA/host port is supported by
umad2sim?

Sasha

>  	char name[32];
>  	uint8_t port;
>  	struct sim_client sim_client;
> @@ -351,15 +351,13 @@ static int dev_sysfs_create(struct umad2sim_dev *dev)
>  	*str = '\0';
>  
>  	/* /sys/class/infiniband_mad/umad0/ */
> -	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir,
> -		 dev->num);
> +	snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, 0);
>  	make_path(path);
>  	file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name);
>  	file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port);
>  
>  	/* /sys/class/infiniband_mad/issm0/ */
> -	snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir,
> -		 dev->num);
> +	snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, 0);
>  	make_path(path);
>  	file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name);
>  	file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port);
> @@ -546,7 +544,7 @@ static int umad2sim_ioctl(struct umad2sim_dev *dev, unsigned long request,
>  	return -1;
>  }
>  
> -static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
> +static struct umad2sim_dev *umad2sim_dev_create(const char *name)
>  {
>  	struct umad2sim_dev *dev;
>  	unsigned i;
> @@ -558,7 +556,6 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
>  		return NULL;
>  	memset(dev, 0, sizeof(*dev));
>  
> -	dev->num = num;
>  	strncpy(dev->name, name, sizeof(dev->name) - 1);
>  
>  	if (sim_client_init(&dev->sim_client) < 0)
> @@ -574,9 +571,9 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name)
>  	dev_sysfs_create(dev);
>  
>  	snprintf(dev->umad_path, sizeof(dev->umad_path), "%s/%s%u",
> -		 umad_dev_dir, "umad", num);
> +		 umad_dev_dir, "umad", 0);
>  	snprintf(dev->issm_path, sizeof(dev->issm_path), "%s/%s%u",
> -		 umad_dev_dir, "issm", num);
> +		 umad_dev_dir, "issm", 0);
>  
>  	return dev;
>  
> @@ -646,7 +643,7 @@ static void umad2sim_init(void)
>  	DEBUG("umad2sim_init...\n");
>  	snprintf(umad2sim_sysfs_prefix, sizeof(umad2sim_sysfs_prefix),
>  		 "./sys-%d", getpid());
> -	devices[0] = umad2sim_dev_create(0, "ibsim0");
> +	devices[0] = umad2sim_dev_create("ibsim0");
>  	if (!devices[0]) {
>  		ERROR("cannot init umad2sim. Exit.\n");
>  		exit(-1);


From sashak at voltaire.com  Fri Feb 27 01:24:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 11:24:05 +0200
Subject: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr
	print_counters for better nodenames
In-Reply-To: <f0e08f230902260403o2e266802t43fb893f0dd6ade0@mail.gmail.com>
References: <20090219130653.GA29318@comcast.net>
	<20090226061551.GQ11192@sashak.voltaire.com>
	<f0e08f230902260403o2e266802t43fb893f0dd6ade0@mail.gmail.com>
Message-ID: <20090227092405.GD7462@sashak.voltaire.com>

On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
> On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> 
> [snip...]
> 
> > And in general I think it is better to use C-style comments - /* ... */,
> > in C code and not C++-style // ... .
> 
> Is this going to be enforced uniformly across OpenSM ?

I didn't think about it (there are not many '//' comments), but I try
not to introduce new ones.

Sasha


From vlad at lists.openfabrics.org  Fri Feb 27 03:17:29 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 27 Feb 2009 03:17:29 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090227-0200 daily build status
Message-ID: <20090227111729.38CDDE6101C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Fri Feb 27 03:33:18 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 06:33:18 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey
	table support to osm_get_all_port_attrs
In-Reply-To: <f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
Message-ID: <f0e08f230902270333q4565c467y5ef0d0b1590345f4@mail.gmail.com>

Sasha,

On Thu, Feb 26, 2009 at 7:03 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

[snip...]

>>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
>>> index 73a6274..503d7fa 100644
>>> --- a/opensm/opensm/main.c
>>> +++ b/opensm/opensm/main.c
>>> @@ -2,6 +2,7 @@
>>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
>>>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
>>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>>   *
>>>   * This software is available to you under a choice of one of two
>>>   * licenses.  You may choose to be licensed under the terms of the GNU
>>> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid)
>>>       uint32_t i, choice = 0;
>>>       ib_api_status_t status;
>>>
>>> +     for (i = 0; i < num_ports; i++) {
>>> +             attr_array[i].num_pkeys = 0;
>>> +             attr_array[i].p_pkey_table = NULL;
>>> +     }
>>> +
>>
>> Here and below. Just
>>
>>        memset(attr_array, 0, sizeof(attr_array));
>>
>> would be enough.
>
> Sure; next version.

The thought above is that it is more efficient to just initialize the
needed fields rather than the entire array which is not required.

-- Hal


From ms at diskware.net  Fri Feb 27 03:35:12 2009
From: ms at diskware.net (Martin Scholl)
Date: Fri, 27 Feb 2009 12:35:12 +0100
Subject: [ofa-general] RDS: add MSG_NOSIGNAL to rds_sendmsg?
Message-ID: <49A7CFF0.4090606@diskware.net>

Hello all,


[Although I would have liked to discuss this at rds-devel@, I am posting to
general@ because rds-devel@ has been broken for me for several days now.]

I just noticed MSG_NOSIGNAL is not part of the allowed set of msg flags
to rds_sendmsg(). Attached is a hopefully harmless and tiny fix for this.
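
Since the attachment was scrubbed from the archive, here is a rough userspace
sketch of the kind of change being described: widening an allowed-flags mask
so that MSG_NOSIGNAL is no longer rejected. The mask names and the
check_send_flags() helper are hypothetical and are not taken from the RDS
source or from the attached patch.

```c
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical masks for illustration only; the real flag set accepted by
 * rds_sendmsg() is not shown in this thread. */
#define ALLOWED_FLAGS_OLD (MSG_DONTWAIT)
#define ALLOWED_FLAGS_NEW (MSG_DONTWAIT | MSG_NOSIGNAL)

/* Return 0 if every bit in flags is permitted by allowed,
 * otherwise -EOPNOTSUPP. */
static int check_send_flags(int flags, int allowed)
{
	if (flags & ~allowed)
		return -EOPNOTSUPP;
	return 0;
}
```

With the old mask a caller passing MSG_NOSIGNAL is refused; with the new mask
the same call is accepted, and unrelated flags are still rejected.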


Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msg_nosignal.diff
Type: text/x-patch
Size: 550 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090227/e7c27083/attachment.bin>

From hal.rosenstock at gmail.com  Fri Feb 27 03:40:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 06:40:54 -0500
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support 
	to osm_get_all_port_attrs
In-Reply-To: <f0e08f230902261343k66e24b00t1bf5b4c228c13f53@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<20090226212538.GL14238@sashak.voltaire.com>
	<f0e08f230902261343k66e24b00t1bf5b4c228c13f53@mail.gmail.com>
Message-ID: <f0e08f230902270340p129c9c3ex43716cdfb35ebcab@mail.gmail.com>

On Thu, Feb 26, 2009 at 4:43 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> On Thu, Feb 26, 2009 at 4:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
>>> >> +                       r = IB_INSUFFICIENT_MEMORY;
>>> >> +                       OSM_LOG(p_vend->p_log,
>>> >> +                               OSM_LOG_ERROR,
>>> >> +                               "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n",
>>> >> +                               j,
>>> >> +                               ca.ports[j]->pkeys_size);
>>> >
>>> > Also should it be an error? May be it is just enough to fill requested
>>> > pkey entries?
>>>
>>> I agree that being more forgiving is better but then how would it be
>>> known if the pkeys are being truncated ?
>>
>> You could return a real pkeys_size value with table filled up to
>> provided size.
>>
>> Otherwise (in case of just an error) how an user could know which pkey
>> size to provide?

> The problem with that is that the user needs to remember how many he
> asked for originally. Not hard but just a detail that I expect will
> get lost.

Also, should I assume you don't care about the API inconsistency issue
mentioned, in that the user can't just request the first n ports but
only all ports?

-- Hal

> -- Hal
>
>> Sasha
>>


From hal.rosenstock at gmail.com  Fri Feb 27 03:44:42 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 06:44:42 -0500
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate
	unneeded umad2sim_dev num
In-Reply-To: <20090227090838.GC7462@sashak.voltaire.com>
References: <20090219174413.GA29805@comcast.net>
	<20090227090838.GC7462@sashak.voltaire.com>
Message-ID: <f0e08f230902270344q333e7c31yafcca34a97e49cca@mail.gmail.com>

Sasha,

On Fri, Feb 27, 2009 at 4:08 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 12:44 Thu 19 Feb     , Hal Rosenstock wrote:
>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> ---
>> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
>> index e13e30a..aaa6260 100644
>> --- a/umad2sim/umad2sim.c
>> +++ b/umad2sim/umad2sim.c
>> @@ -1,5 +1,6 @@
>>  /*
>>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved.
>>   *
>>   * This file is part of ibsim.
>>   *
>> @@ -77,7 +78,6 @@ struct ib_user_mad_reg_req {
>>
>>  struct umad2sim_dev {
>>       int fd;
>> -     unsigned num;
>
> Wouldn't it be useful when more than one CA/host ports will be supported
> using umad2sim?

Then shouldn't it be added at the time that that feature is supported
rather than have the currently unneeded initialization ?

-- Hal

> Sasha


From sashak at voltaire.com  Fri Feb 27 04:20:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 14:20:59 +0200
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table
	support to osm_get_all_port_attrs
In-Reply-To: <f0e08f230902270340p129c9c3ex43716cdfb35ebcab@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<20090226212538.GL14238@sashak.voltaire.com>
	<f0e08f230902261343k66e24b00t1bf5b4c228c13f53@mail.gmail.com>
	<f0e08f230902270340p129c9c3ex43716cdfb35ebcab@mail.gmail.com>
Message-ID: <20090227122059.GE7462@sashak.voltaire.com>

On 06:40 Fri 27 Feb     , Hal Rosenstock wrote:
> >>
> >> You could return a real pkeys_size value with table filled up to
> >> provided size.
> >>
> >> Otherwise (in case of just an error) how an user could know which pkey
> >> size to provide?
> 
> > The problem with that is that the user needs to remember how many he
> > asked for originally. Not hard but just a detail that I expect will
> > get lost.
> 
> Also, should I assume you don't care about the API inconsistency issue
> mentioned, in that the user can't just request the first n ports but
> only all ports?

Why not? The num_ports pointer is an in/out parameter. Could you explain what
you mean here by API inconsistency?
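
As a rough illustration of the in/out convention discussed here (fill the
table up to the size the caller provided, and hand back the real size), a
minimal sketch follows. get_pkeys() and its parameters are hypothetical and
are not the OpenSM vendor API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the OpenSM vendor API: copy up to *p_count
 * entries from src into out, then store the number of entries actually
 * available in *p_count. Returns 0 if everything fit, 1 if the copy was
 * truncated.
 */
static int get_pkeys(const uint16_t *src, size_t src_count,
		     uint16_t *out, size_t *p_count)
{
	size_t capacity = *p_count;
	size_t n = src_count < capacity ? src_count : capacity;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = src[i];

	*p_count = src_count;	/* report the real table size to the caller */
	return src_count > capacity;
}
```

A caller that passes a capacity of 0 learns the required size without any
copy. The caller does have to keep its original capacity in a separate
variable, which is the bookkeeping detail raised earlier in the thread.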

Sasha


From sashak at voltaire.com  Fri Feb 27 04:49:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 14:49:48 +0200
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table
	support to osm_get_all_port_attrs
In-Reply-To: <f0e08f230902270333q4565c467y5ef0d0b1590345f4@mail.gmail.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<f0e08f230902270333q4565c467y5ef0d0b1590345f4@mail.gmail.com>
Message-ID: <20090227124948.GF7462@sashak.voltaire.com>

On 06:33 Fri 27 Feb     , Hal Rosenstock wrote:
> >>
> >>        memset(attr_array, 0, sizeof(attr_array));
> >>
> >> would be enough.
> >
> > Sure; next version.
> 
> The thought above is that it is more efficient to just initialize the
> needed fields rather than the entire array which is not required.

I don't know for sure about this specific example, but normally memset()
is a heavily optimized function, so I would expect at least comparable
performance here.
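
For reference, the two variants under discussion look roughly like this. The
struct below is a hypothetical stand-in: only num_pkeys and p_pkey_table are
named in the thread, the other members are made up for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the port attribute record. */
struct port_attr {
	uint64_t port_guid;
	uint16_t lid;
	uint16_t num_pkeys;
	uint16_t *p_pkey_table;
};

/* Variant from the patch: touch only the fields that must start zeroed. */
static void init_fields(struct port_attr *a, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		a[i].num_pkeys = 0;
		a[i].p_pkey_table = NULL;
	}
}

/* Variant suggested in review: clear the whole array in one call. */
static void init_memset(struct port_attr *a, size_t n)
{
	memset(a, 0, n * sizeof(*a));
}
```

Two caveats apply to the memset form. sizeof(attr_array) gives the full size
only when attr_array is a real array in scope, not a pointer, which is why the
sketch passes the element count explicitly. Also, all-bits-zero is a null
pointer on mainstream platforms but the C standard does not guarantee it.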

Sasha


From sashak at voltaire.com  Fri Feb 27 04:50:22 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 14:50:22 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_node_info_rcv.c: create
	physp for the newly discovered port of the known node
In-Reply-To: <49A68976.6000404@dev.mellanox.co.il>
References: <49A68976.6000404@dev.mellanox.co.il>
Message-ID: <20090227125022.GG7462@sashak.voltaire.com>

On 14:22 Thu 26 Feb     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> [v2: adding CL_ASSERT() and changing comments]
> 
> This patch fixes bugzilla issue #1515.
> 
> The bug was discovered and analyzed by Line Holen.
> 
> Topology:
>                  |---------------|
>                  |      SW2      |
>                  |---------------|
>                    |x |y    |z |v
>               |----|  |     |  |----|
>               |       |     |       |
>               |  |----|     |----|  |
>               |  |               |  |
>              a| b|              c| d|
>       |---------------|     |---------------|
>       |       SW1     |     |     SW3       |
>       |---------------|     |---------------|
>           |                             |
>           |                             |
>        HCA with SM                      HCA
> 
> During the discovery:
> 
> SM sends NodeInfo request to SW1
> SM sends NodeInfo request to SW2 through link a->x
> SM discovers new node SW2:
>   - updates DR to SW2 to go through link a->x
>   - creates physp x
> SM sends NodeInfo request to SW2 through link b->y
> SM discovers a known node SW2
>   - DOES NOT create physp y
>   - updates DR to SW2 to go through link b->y
> 
> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with
> port y any more, leaving it uninitialized (no physp object for this port).
> 
> The fix is to create physp for the newly discovered port of the known
> switch node, same way as it is done for HCAs.
> I also added one log message for the case that showed the problem - when
> one of the link sides is uninitialized (no valid ports check). Perhaps
> this log message should be an error message instead?
> 
> Debugged-by: Line Holen <Line.Holen at Sun.COM>
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.
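
The fix described in the quoted commit message can be pictured with a toy
model: whenever a NodeInfo response arrives through a port, make sure a port
object exists for it, whether or not the node itself is already known. The
types and node_ensure_physp() below are hypothetical and far simpler than the
OpenSM structures.

```c
#include <stdlib.h>

#define MAX_PORTS 8	/* arbitrary bound for this sketch */

struct physp { int port_num; };				/* toy port object */
struct node  { struct physp *physp[MAX_PORTS]; };	/* toy node */

/*
 * Called for every NodeInfo response that reaches `node` via `port_num`,
 * for new and known nodes alike. Creates the port object if it does not
 * exist yet. Returns 0 on success, -1 on bad input or allocation failure.
 */
static int node_ensure_physp(struct node *node, int port_num)
{
	if (port_num < 0 || port_num >= MAX_PORTS)
		return -1;
	if (node->physp[port_num])
		return 0;	/* already initialized */
	node->physp[port_num] = calloc(1, sizeof(struct physp));
	if (!node->physp[port_num])
		return -1;
	node->physp[port_num]->port_num = port_num;
	return 0;
}
```

In the topology above, the second discovery of SW2 (through link b->y) would
call this for port y and leave the existing object for port x untouched.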

Sasha


From sashak at voltaire.com  Fri Feb 27 05:00:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 15:00:34 +0200
Subject: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate unneeded
	umad2sim_dev num
In-Reply-To: <f0e08f230902270344q333e7c31yafcca34a97e49cca@mail.gmail.com>
References: <20090219174413.GA29805@comcast.net>
	<20090227090838.GC7462@sashak.voltaire.com>
	<f0e08f230902270344q333e7c31yafcca34a97e49cca@mail.gmail.com>
Message-ID: <20090227130034.GH7462@sashak.voltaire.com>

On 06:44 Fri 27 Feb     , Hal Rosenstock wrote:
> 
> > Wouldn't it be useful when more than one CA/host ports will be supported
> > using umad2sim?
> 
> Then shouldn't it be added at the time that that feature is supported
> rather than have the currently unneeded initialization ?

It is a matter of clean interface - I prefer to keep it clean and not to
use a hardcoded device number.
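
To make the point concrete, a small sketch of building the device path from
the device's own number rather than from a literal 0. struct sim_dev and
sim_dev_set_path() are hypothetical, trimmed-down names for illustration and
are not the umad2sim code.

```c
#include <stdio.h>

/* Hypothetical, trimmed-down device record. */
struct sim_dev {
	unsigned num;
	char umad_path[64];
};

/* Build "<dir>/umad<num>" from the number stored in the device. */
static int sim_dev_set_path(struct sim_dev *dev, const char *dir, unsigned num)
{
	int n;

	dev->num = num;
	n = snprintf(dev->umad_path, sizeof(dev->umad_path), "%s/umad%u",
		     dir, dev->num);
	/* snprintf returns the length it wanted to write; detect truncation */
	return (n < 0 || (size_t)n >= sizeof(dev->umad_path)) ? -1 : 0;
}
```

With this shape, supporting a second simulated device later only means
passing a different number; the path-building code does not change.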

Sasha


From sashak at voltaire.com  Fri Feb 27 05:01:36 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 15:01:36 +0200
Subject: [ofa-general] [PATCH] ibsim: fix LocalPortNum in PortInfo response
Message-ID: <20090227130136.GI7462@sashak.voltaire.com>


Fix LocalPortNum encoding in PortInfo responses.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 ibsim/sim_mad.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c
index 6e08031..9253415 100644
--- a/ibsim/sim_mad.c
+++ b/ibsim/sim_mad.c
@@ -483,6 +483,7 @@ do_portinfo(Port * port, unsigned op, uint32_t portnum, uint8_t * data)
 
 	update_portinfo(p);
 	memcpy(data, p->portinfo, IB_SMP_DATA_SIZE);
+	mad_set_field(data, 0, IB_PORT_LOCAL_PORT_F, port->portnum);
 
 	return 0;
 }
-- 
1.6.1.2.319.gbd9e


From hal.rosenstock at gmail.com  Fri Feb 27 05:38:27 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 08:38:27 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey
	table support to osm_get_all_port_attrs
In-Reply-To: <20090227122059.GE7462@sashak.voltaire.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<20090226212538.GL14238@sashak.voltaire.com>
	<f0e08f230902261343k66e24b00t1bf5b4c228c13f53@mail.gmail.com>
	<f0e08f230902270340p129c9c3ex43716cdfb35ebcab@mail.gmail.com>
	<20090227122059.GE7462@sashak.voltaire.com>
Message-ID: <f0e08f230902270538g60109032yb2ef5df00f89f186@mail.gmail.com>

On Fri, Feb 27, 2009 at 7:20 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 06:40 Fri 27 Feb     , Hal Rosenstock wrote:
>> >>
>> >> You could return a real pkeys_size value with table filled up to
>> >> provided size.
>> >>
>> >> Otherwise (in case of just an error) how an user could know which pkey
>> >> size to provide?
>>
>> > The problem with that is that the user needs to remember how many he
>> > asked for originally. Not hard but just a detail that I expect will
>> > get lost.
>>
>> Also, should I assume you don't care about the API inconsistency issue
>> mentioned, in that the user can't just request the first n ports but
>> only all ports?
>
> Why not? num_ports pointer is in/out parameter. Could you explain what
> do you mean here by API inconsistency?

It's implementation rather than API. Not all the vendor
implementations support the semantic where num_ports is not 0 and less
than the total number of ports (and return insufficient memory for
this condition).

-- Hal

> Sasha


From hal.rosenstock at gmail.com  Fri Feb 27 08:39:00 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 11:39:00 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey
	table support to osm_get_all_port_attrs
In-Reply-To: <20090227124948.GF7462@sashak.voltaire.com>
References: <20090218153016.GD8489@comcast.net>
	<20090226070629.GU11192@sashak.voltaire.com>
	<f0e08f230902260403v737c3e91vb817e9638786ebc9@mail.gmail.com>
	<f0e08f230902270333q4565c467y5ef0d0b1590345f4@mail.gmail.com>
	<20090227124948.GF7462@sashak.voltaire.com>
Message-ID: <f0e08f230902270839l65f200aet7d2d9e8c9d103a0a@mail.gmail.com>

Sasha,

On Fri, Feb 27, 2009 at 7:49 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 06:33 Fri 27 Feb     , Hal Rosenstock wrote:
>> >>
>> >>        memset(attr_array, 0, sizeof(attr_array));
>> >>
>> >> would be enough.
>> >
>> > Sure; next version.
>>
>> The thought above is that it is more efficient to just initialize the
>> needed fields rather than the entire array which is not required.
>
> I don't know for sure about this specific example, but normally memset()
> is heavily optimized function so I would expect at least comparable
> performance here.

It's minor but memset is slower for this.

-- Hal

> Sasha
>


From andi at firstfloor.org  Fri Feb 27 09:08:34 2009
From: andi at firstfloor.org (Andi Kleen)
Date: Fri, 27 Feb 2009 18:08:34 +0100
Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> (Andy
	Grover's message of "Tue, 24 Feb 2009 17:30:17 -0800")
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
Message-ID: <87myc73izx.fsf@basil.nowhere.org>

Andy Grover <andy.grover at oracle.com> writes:

> This patchset against net-next adds support for RDS sockets. RDS is an
> Oracle-originated protocol used to send IPC datagrams (up to 1MB)
> reliably, and is used currently in Oracle RAC and Exadata products. 

Perhaps I missed it earlier, but what is the rationale for putting 
this as a socket type into the kernel? I assume they also work
directly as implemented in user space using raw sockets or similar, 
don't they?

-Andi

-- 
ak at linux.intel.com -- Speaking for myself only.


From sean.hefty at intel.com  Fri Feb 27 09:51:51 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 27 Feb 2009 09:51:51 -0800
Subject: [ofa-general] RE: [PATCH v2] [ib-diag] saquery: add support for
	WinOF
In-Reply-To: <20090227083200.GA7462@sashak.voltaire.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
	<1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>
	<BA7487E565E84667A1AAB40F6C990DFC@amr.corp.intel.com>
	<20090227083200.GA7462@sashak.voltaire.com>
Message-ID: <2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com>

>>   - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments
>>     are unsigned.  This is kind of confusing, but that's how it appears the
>>     macro is used.  It might be clearer if instead of passing -1 into the
>>     macro, that a SET_VAL macro be used instead.
>
>What do you mean? Another macro?

yes -- instead of passing -1 into CHECK_AND_SET_VAL as the value to compare
against, call a different macro that just sets the value, unless I'm
misunderstanding why -1 is passed in.  Then CHECK_AND_SET_VAL would do unsigned
comparisons.

I can submit a patch for this, but I wasn't completely sure of the intent of
using -1 as the compare value.
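
A rough sketch of the split being proposed, assuming -1 is passed today only
to force the assignment. Both macros below are hypothetical and use a
simplified argument list; they are not the saquery definitions.

```c
#include <stdint.h>

/* Hypothetical macros for illustration; not the saquery definitions. */

/* Unconditionally store val and mark the field as present in mask. */
#define SET_VAL(val, target, mask, bit) \
	do { (target) = (val); (mask) |= (bit); } while (0)

/* Store val only when it differs from the "unset" marker; both sides
 * are compared as unsigned values of the same width. */
#define CHECK_AND_SET_VAL(val, unset, target, mask, bit) \
	do { \
		if ((uint32_t)(val) != (uint32_t)(unset)) \
			SET_VAL((uint32_t)(val), target, mask, bit); \
	} while (0)
```

Call sites that always want the assignment would use SET_VAL directly, and
CHECK_AND_SET_VAL would never need a signed comparison.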

- Sean


From andy.grover at gmail.com  Fri Feb 27 10:21:30 2009
From: andy.grover at gmail.com (Andrew Grover)
Date: Fri, 27 Feb 2009 10:21:30 -0800
Subject: [ofa-general] ***SPAM*** Re: [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <20090227.000150.177646700.davem@davemloft.net>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<20090224.232814.227017310.davem@davemloft.net>
	<c0a09e5c0902251043x576e7066kb7aa76fc148a6f76@mail.gmail.com>
	<20090227.000150.177646700.davem@davemloft.net>
Message-ID: <c0a09e5c0902271021y3ba0f3b9r34e696e1e657cc61@mail.gmail.com>

On Fri, Feb 27, 2009 at 12:01 AM, David Miller <davem at davemloft.net> wrote:
> From: Andrew Grover <andy.grover at gmail.com>
> Date: Wed, 25 Feb 2009 10:43:27 -0800
>
>> On Tue, Feb 24, 2009 at 11:28 PM, David Miller <davem at davemloft.net> wrote:
>> > Furthermore the port you've chosen for the protocol is arbitrary, not
>> > properly allocated with the appropriate standards committee, and
>> > therefore could conflict with something other people are using.
>>
>> I'm sure allocating the port won't be too big an issue.
>
> Ok, I added the RDS code to the net-next-2.6 tree, changing
> AF_RDS to be 21

Thanks much!

-- Andy


From rdreier at cisco.com  Fri Feb 27 10:29:36 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Feb 2009 10:29:36 -0800
Subject: [ofa-general] Re: [PATCH] mlx4_core: Add device IDs for MT25458
	10GigE devices
In-Reply-To: <200902261238.26437.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 26 Feb 2009 12:38:26 +0200")
References: <200902261238.26437.jackm@dev.mellanox.co.il>
Message-ID: <adazlg7ohrj.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Fri Feb 27 10:31:00 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Feb 2009 10:31:00 -0800
Subject: [ofa-general] Re: [PATCH] ib/iser: remove hard setting of mtu
In-Reply-To: <Pine.LNX.4.64.0902261056440.26368@zuben.voltaire.com> (Or
	Gerlitz's message of "Thu, 26 Feb 2009 10:57:45 +0200 (IST)")
References: <Pine.LNX.4.64.0902261056440.26368@zuben.voltaire.com>
Message-ID: <adavdqvohp7.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Fri Feb 27 10:32:40 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Feb 2009 10:32:40 -0800
Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime
	manipulation
In-Reply-To: <71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com>
	(Ramachandra K.'s message of "Fri, 27 Feb 2009 00:39:27 +0530")
References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095>
	<2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com>
	<71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com>
Message-ID: <adar61johmf.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Fri Feb 27 10:36:37 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Feb 2009 10:36:37 -0800
Subject: [ofa-general] Re: [PATCH v2] IB/core: fix null pointer dereference
	in local_completions()
In-Reply-To: <1235687968.3948.218.camel@chromite.mv.qlogic.com> (Ralph
	Campbell's message of "Thu, 26 Feb 2009 14:39:28 -0800")
References: <1235687968.3948.218.camel@chromite.mv.qlogic.com>
Message-ID: <adamyc7ohfu.fsf@cisco.com>

thanks, applied


From hnrose at comcast.net  Fri Feb 27 10:46:53 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 27 Feb 2009 13:46:53 -0500
Subject: [ofa-general] [PATCH] opensm/infiniband-diags: Changes for C rather
	than C++ style comments
Message-ID: <20090227184653.GC15668@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
index 048efc7..0c30726 100644
--- a/infiniband-diags/src/grouping.c
+++ b/infiniband-diags/src/grouping.c
@@ -336,9 +336,9 @@ static void get_router_slot(Node *node, Port *spineport)
 		ch->slotnum = line_slot_2_sfb12[spineport->portnum];
 		/* this is a smart guess based on nodeguids order on sFB-12 module */
 		guessnum = spineport->node->nodeguid % 4;
-		// module 1 <--> remote anafa 3
-		// module 2 <--> remote anafa 2
-		// module 3 <--> remote anafa 1
+		/* module 1 <--> remote anafa 3 */
+		/* module 2 <--> remote anafa 2 */
+		/* module 3 <--> remote anafa 1 */
 		ch->anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2));
 	} else if (is_spine_2004(spineport->node)) {
 		ch->chassistype = ISR2004_CT;
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 6946fd7..2c5240a 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -870,8 +870,10 @@ void dump_ports_report ()
 	Node *node;
 	Port *port;
 
-	// If switch and LID == 0, search of other switch ports with
-	// valid LID and assign it to all ports of that switch
+	/*
+	 * If switch and LID == 0, search of other switch ports with
+	 * valid LID and assign it to all ports of that switch
+	 */
 	for (b = 0; b <= MAXHOPS; b++)
 		for (node = nodesdist[b]; node; node = node->dnext)
 			if (node->type == SWITCH_NODE) {
diff --git a/opensm/include/opensm/osm_console.h b/opensm/include/opensm/osm_console.h
index 3ea8fa5..acb36d9 100644
--- a/opensm/include/opensm/osm_console.h
+++ b/opensm/include/opensm/osm_console.h
@@ -45,7 +45,7 @@
 #endif				/* __cplusplus */
 
 BEGIN_C_DECLS
-// TODO replace p_osm
+/* TODO replace p_osm */
 void osm_console(osm_opensm_t * p_osm);
 END_C_DECLS
 #endif				/* _OSM_CONSOLE_H_ */
diff --git a/opensm/opensm/osm_console_io.c b/opensm/opensm/osm_console_io.c
index 3d3ece4..8953ab7 100644
--- a/opensm/opensm/osm_console_io.c
+++ b/opensm/opensm/osm_console_io.c
@@ -59,7 +59,7 @@
 
 static int is_local(char *str)
 {
-	// convenience - checks if just stdin/stdout
+	/* convenience - checks if just stdin/stdout */
 	if (str)
 		return (strcmp(str, OSM_LOCAL_CONSOLE) == 0);
 	return 0;
@@ -67,7 +67,7 @@ static int is_local(char *str)
 
 static int is_loopback(char *str)
 {
-	// convenience - checks if socket based connection
+	/* convenience - checks if socket based connection */
 	if (str)
 		return (strcmp(str, OSM_LOOPBACK_CONSOLE) == 0);
 	return 0;
@@ -75,7 +75,7 @@ static int is_loopback(char *str)
 
 static int is_remote(char *str)
 {
-	// convenience - checks if socket based connection
+	/* convenience - checks if socket based connection */
 	if (str)
 		return (strcmp(str, OSM_REMOTE_CONSOLE) == 0)
 		    || is_loopback(str);
@@ -84,7 +84,7 @@ static int is_remote(char *str)
 
 int is_console_enabled(osm_subn_opt_t * p_opt)
 {
-	// checks for a variety of types of consoles - default is off or 0
+	/* checks for a variety of types of consoles - default is off or 0 */
 	if (p_opt)
 		return (is_local(p_opt->console)
 			|| is_loopback(p_opt->console)
@@ -210,14 +210,14 @@ int osm_console_init(osm_subn_opt_t * opt, osm_console_t * p_oct, osm_log_t * p_
 /* clean up and release resources */
 void osm_console_exit(osm_console_t * p_oct, osm_log_t * p_log)
 {
-	// clean up and release resources, currently just close the socket
+	/* clean up and release resources, currently just close the socket */
 	osm_console_close(p_oct, p_log);
 }
 
 #ifdef ENABLE_OSM_CONSOLE_SOCKET
 int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log)
 {
-	// returns zero if opened fine, -1 otherwise
+	/* returns zero if opened fine, -1 otherwise */
 	char *p_line;
 	size_t len;
 	ssize_t n;
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 4e783bf..17611f7 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -679,7 +679,7 @@ static void free_lash_structures(lash_t * p_lash)
 
 	OSM_LOG_ENTER(p_log);
 
-	// free cdg_vertex_matrix
+	/* free cdg_vertex_matrix */
 	for (i = 0; i < p_lash->vl_min; i++) {
 		for (j = 0; j < num_switches; j++) {
 			for (k = 0; k < num_switches; k++)
@@ -695,7 +695,7 @@ static void free_lash_structures(lash_t * p_lash)
 	if (p_lash->cdg_vertex_matrix)
 		free(p_lash->cdg_vertex_matrix);
 
-	// free virtual_location
+	/* free virtual_location */
 	for (i = 0; i < num_switches; i++) {
 		for (j = 0; j < num_switches; j++) {
 			if (p_lash->virtual_location[i][j])
@@ -723,7 +723,7 @@ static int init_lash_structures(lash_t * p_lash)
 
 	OSM_LOG_ENTER(p_log);
 
-	// initialise cdg_vertex_matrix[num_switches][num_switches][num_switches]
+	/* initialise cdg_vertex_matrix[num_switches][num_switches][num_switches] */
 	p_lash->cdg_vertex_matrix =
 	    (cdg_vertex_t ****) malloc(vl_min * sizeof(cdg_vertex_t ****));
 	for (i = 0; i < vl_min; i++) {
@@ -749,8 +749,10 @@ static int init_lash_structures(lash_t * p_lash)
 		}
 	}
 
-	// initialise virtual_location[num_switches][num_switches][num_layers],
-	// default value = 0
+	/*
+	 * initialise virtual_location[num_switches][num_switches][num_layers],
+	 * default value = 0
+	 */
 	p_lash->virtual_location =
 	    (int ***)malloc(num_switches * sizeof(int ***));
 	if (p_lash->virtual_location == NULL)
@@ -775,7 +777,7 @@ static int init_lash_structures(lash_t * p_lash)
 		}
 	}
 
-	// initialise num_mst_in_lane[num_switches], default 0
+	/* initialise num_mst_in_lane[num_switches], default 0 */
 	p_lash->num_mst_in_lane = (int *)malloc(num_switches * sizeof(int));
 	if (p_lash->num_mst_in_lane == NULL)
 		goto Exit_Mem_Error;
@@ -997,7 +999,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
 
 	p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl);
 
-	// Go through each swtich individually
+	/* Go through each swtich individually */
 	while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) {
 		uint64_t current_guid;
 		switch_t *sw;
@@ -1051,7 +1053,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
 					dst_lash_switch_id,
 					physical_egress_port);
 			}
-		}		// for
+		}		/* for */
 		osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
 	}
 	OSM_LOG_EXIT(p_log);
@@ -1069,7 +1071,7 @@ static void osm_lash_process_switch(lash_t * p_lash, osm_switch_t * p_sw)
 	switch_a_lash_id = get_lash_id(p_sw);
 	port_count = osm_node_get_num_physp(p_sw->p_node);
 
-	// starting at port 1, ignoring management port on switch
+	/* starting at port 1, ignoring management port on switch */
 	for (i = 1; i < port_count; i++) {
 
 		p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i);
@@ -1148,7 +1150,7 @@ static int discover_network_properties(lash_t * p_lash)
 		return -1;
 	memset(p_lash->switches, 0, p_lash->num_switches * sizeof(switch_t *));
 
-	vl_min = 5;		// set to a high value
+	vl_min = 5;		/* set to a high value */
 
 	p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl);
 	while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) {
@@ -1163,7 +1165,7 @@ static int discover_network_properties(lash_t * p_lash)
 
 		port_count = osm_node_get_num_physp(p_sw->p_node);
 
-		// Note, ignoring port 0. management port
+		/* Note, ignoring port 0. management port */
 		for (i = 1; i < port_count; i++) {
 			osm_physp_t *p_current_physp =
 			    osm_node_get_physp_ptr(p_sw->p_node, i);
@@ -1178,8 +1180,8 @@ static int discover_network_properties(lash_t * p_lash)
 				if (port_vl_min && port_vl_min < vl_min)
 					vl_min = port_vl_min;
 			}
-		}		// for
-	}			// while
+		}		/* for */
+	}			/* while */
 
 	vl_min = 1 << (vl_min - 1);
 	if (vl_min > 15)
@@ -1219,7 +1221,7 @@ static int lash_process(void *context)
 
 	p_lash->balance_limit = 6;
 
-	// everything starts here
+	/* everything starts here */
 	lash_cleanup(p_lash);
 
 	return_status = discover_network_properties(p_lash);


From hnrose at comcast.net  Fri Feb 27 10:45:57 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 27 Feb 2009 13:45:57 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_perfmgr.c: In
	osm_perfmgr_shutdown, add missing cl_disp_unregister
Message-ID: <20090227184557.GB15668@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 6d325cb..f146fac 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -849,6 +849,7 @@ void osm_perfmgr_shutdown(osm_perfmgr_t * const pm)
 {
 	OSM_LOG_ENTER(pm->log);
 	cl_timer_stop(&pm->sweep_timer);
+	cl_disp_unregister(pm->pc_disp_h);
 	osm_perfmgr_mad_unbind(pm);
 	OSM_LOG_EXIT(pm->log);
 }


From hnrose at comcast.net  Fri Feb 27 10:44:50 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 27 Feb 2009 13:44:50 -0500
Subject: [ofa-general] [PATCHv2] Add pkey table support to
	osm_get_all_port_attrs
Message-ID: <20090227184450.GA15668@comcast.net>


Only supported in osm_vendor_ibumad.c (separate patch for other
vendor layers)
Also, update applications using this (osmtest, opensm)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
Changes from v1:
Only copy number of pkeys indicated
Also, don't indicate insufficient memory error if insufficient pkey space
supplied and always return number of pkeys that the port supports

Note: initialization prior to get_all_port_attrs call not changed
since it is faster this way

Other patch for other vendor layers still appropriate following this
ibutils patch to come 
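
For reference, the v2 semantics listed above (copy only as many pkeys as
the caller has room for, do not treat a short table as an error, and
always report how many pkeys the port supports) can be sketched in
isolation. This is an illustrative sketch rather than code from the
patch: port_info_t, attr_t and copy_pkeys() are hypothetical names,
byte-order conversion is left out, and bounding the copy by the port's
own table size is an extra precaution assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the umad port and port attribute structures. */
typedef struct {
	const uint16_t *pkeys;	/* pkey table reported by the port */
	unsigned pkeys_size;	/* number of entries in that table */
} port_info_t;

typedef struct {
	uint16_t *p_pkey_table;	/* caller-supplied buffer, may be NULL */
	unsigned num_pkeys;	/* in: buffer capacity, out: port's pkey count */
} attr_t;

static void copy_pkeys(attr_t *attr, const port_info_t *port)
{
	unsigned k, n;

	assert(attr && port);

	/* Copy no more than the caller asked for or the port actually has. */
	n = attr->num_pkeys;
	if (n > port->pkeys_size)
		n = port->pkeys_size;
	if (attr->p_pkey_table)
		for (k = 0; k < n; k++)
			attr->p_pkey_table[k] = port->pkeys[k];

	/* Always report the full count so the caller can size a retry. */
	attr->num_pkeys = port->pkeys_size;
}
```

A caller that passes num_pkeys = 0 and a NULL table, as the osmtest
changes in this patch do, would simply get the count back.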

diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c
index 734a860..7a578ea 100644
--- a/opensm/libvendor/osm_vendor_ibumad.c
+++ b/opensm/libvendor/osm_vendor_ibumad.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	umad_ca_t ca;
 	ib_port_attr_t *attr = p_attr_array;
 	unsigned done = 0;
-	int r, i, j;
+	int r, i, j, k;
 
 	OSM_LOG_ENTER(p_vend->p_log);
 
 	CL_ASSERT(p_vend && p_num_ports);
 
+	r = 0;
 	if (!*p_num_ports) {
 		r = IB_INVALID_PARAMETER;
 		OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: "
@@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	}
 
 	for (i = 0; i < p_vend->ca_count && !done; i++) {
-		/*
-		 * For each CA, retrieve the port guids
-		 */
+		/* For each CA, retrieve the port attributes */
 		if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) {
 			if (ca.node_type < 1 || ca.node_type > 3)
 				continue;
@@ -590,6 +590,12 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 				attr->port_num = ca.ports[j]->portnum;
 				attr->sm_lid = ca.ports[j]->sm_lid;
 				attr->link_state = ca.ports[j]->state;
+				if (attr->num_pkeys && attr->p_pkey_table) {
+					for (k = 0; k < attr->num_pkeys; k++)
+						attr->p_pkey_table[k] =
+							cl_hton16(ca.ports[j]->pkeys[k]);
+				}
+				attr->num_pkeys = ca.ports[j]->pkeys_size;
 				attr++;
 				if (attr - p_attr_array > *p_num_ports) {
 					done = 1;
@@ -601,7 +607,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend,
 	}
 
 	*p_num_ports = attr - p_attr_array;
-	r = 0;
 
 Exit:
 	OSM_LOG_EXIT(p_vend->p_log);
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 47fd658..1507fff 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c
index f87e33b..bc8999d 100644
--- a/opensm/osmtest/main.c
+++ b/opensm/osmtest/main.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt)
 	ib_port_attr_t attr_array[GUID_ARRAY_SIZE];
 	int i;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	   Call the transport layer for a list of local port
 	   GUID values.
@@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid)
 	ib_port_attr_t attr_array[GUID_ARRAY_SIZE];
 	int i;
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	   Call the transport layer for a list of local port
 	   GUID values.
diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c
index 32cfa01..bdfe42c 100644
--- a/opensm/osmtest/osmtest.c
+++ b/opensm/osmtest/osmtest.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt,
 	ib_api_status_t status;
 	uint32_t num_ports = GUID_ARRAY_SIZE;
 	ib_port_attr_t attr_array[GUID_ARRAY_SIZE];
+	int i;
 
 	OSM_LOG_ENTER(&p_osmt->log);
 
+	for (i = 0; i < num_ports; i++) {
+		attr_array[i].num_pkeys = 0;
+		attr_array[i].p_pkey_table = NULL;
+	}
+
 	/*
 	 * Call the transport layer for a list of local port
 	 * GUID values.


From hal.rosenstock at gmail.com  Fri Feb 27 10:53:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Feb 2009 13:53:19 -0500
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] opensm/console: Enhance
	perfmgr print_counters for better nodenames
In-Reply-To: <20090227092405.GD7462@sashak.voltaire.com>
References: <20090219130653.GA29318@comcast.net>
	<20090226061551.GQ11192@sashak.voltaire.com>
	<f0e08f230902260403o2e266802t43fb893f0dd6ade0@mail.gmail.com>
	<20090227092405.GD7462@sashak.voltaire.com>
Message-ID: <f0e08f230902271053q16bef315pd524f8722e741c84@mail.gmail.com>

On Fri, Feb 27, 2009 at 4:24 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
>> On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>>
>> [snip...]
>>
>> > And in general I think it is better to use C-style comments - /* ... */,
>> > in C code and not C++-style // ... .
>>
>> Is this going to be enforced uniformly across OpenSM ?
>
> I didn't think about it (there are no many '//' comments), but I try to
> not introduce a new ones.

Sent some patches relative to this. I'm not willing to take on
ib_types.h right now. Maybe someone else will.

-- Hal

>
> Sasha
>


From sashak at voltaire.com  Fri Feb 27 10:59:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Feb 2009 20:59:05 +0200
Subject: [ofa-general] Re: [PATCH v2] [ib-diag] saquery: add support for
	WinOF
In-Reply-To: <2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com>
References: <E89E915B9031495CB1B17A0ECEF08C82@amr.corp.intel.com>
	<D8B17F48E87548858AA57D0ADC66BDC7@amr.corp.intel.com>
	<20090226101144.GB11192@sashak.voltaire.com>
	<0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com>
	<20090226210211.GK14238@sashak.voltaire.com>
	<45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com>
	<1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com>
	<BA7487E565E84667A1AAB40F6C990DFC@amr.corp.intel.com>
	<20090227083200.GA7462@sashak.voltaire.com>
	<2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com>
Message-ID: <20090227185858.GJ7462@sashak.voltaire.com>

On 09:51 Fri 27 Feb     , Sean Hefty wrote:
> >>   - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments
> >>     are unsigned.  This is kind of confusing, but that's how it appears the
> >>     macro is used.  It might be clearer if instead of passing -1 into the
> >>     macro, that a SET_VAL macro be used instead.
> >
> >What do you mean? Another macro?
> 
> yes -- instead of passing -1 into CHECK_AND_SET_VAL as the value to compare
> against, call a different macro that just sets the value, unless I'm
> misunderstanding why -1 is passed in.  Then CHECK_AND_SET_VAL would do unsigned
> comparisons.
> 
> I can submit a patch for this, but I wasn't completely sure of the intent of
> using -1 as the compare value.

For some parameters (such as SL) "0" is a valid value and could be
specified using command line options, so I used -1 as the initial value to
mark such parameters as not requested for the query (so the corresponding
comp_mask bit is not set at all).

Sasha
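
A minimal sketch of the sentinel approach described above, assuming a
simplified query structure. The names query_t, COMP_SL and
set_sl_if_requested() are hypothetical and are not taken from saquery;
the point is only that -1 marks "not requested", so 0 remains a usable
value.

```c
#include <assert.h>
#include <stdint.h>

#define COMP_SL (1ULL << 0)	/* hypothetical component mask bit for SL */

typedef struct {
	uint8_t sl;
	uint64_t comp_mask;	/* which fields take part in the query */
} query_t;

/*
 * requested_sl stays at -1 unless the option was given on the command
 * line, so 0 is a valid request. The comparison is signed; the stored
 * value is unsigned.
 */
static void set_sl_if_requested(query_t *q, int requested_sl)
{
	assert(q);
	if (requested_sl != -1) {
		q->sl = (uint8_t) requested_sl;
		q->comp_mask |= COMP_SL;
	}
}
```

Splitting this into a plain "set" helper and an unsigned "check and set"
helper, as suggested earlier in the thread, would remove the mixed
signedness at the cost of a second macro.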


From neutronsharc at gmail.com  Fri Feb 27 11:01:50 2009
From: neutronsharc at gmail.com (neutron)
Date: Fri, 27 Feb 2009 14:01:50 -0500
Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ib_reg_phys_mr( ) results
	in crash
In-Reply-To: <499E6826.704@sun.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>
	<adavdr7z2be.fsf@cisco.com>
	<7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
	<499E6826.704@sun.com>
Message-ID: <7d5928b30902271101h589ad61cha59f626572a24802@mail.gmail.com>

It might be related to the new ConnectX card (with the mlx4_ib module).

I have now tried the same program on a machine with only an "mthca" card,
and it succeeds without any problems.

Thanks.


I remember someone on this list also reported a similar issue:
ib_reg_phys_mr( ) fails with the mlx4 module.


On Fri, Feb 20, 2009 at 3:21 AM, Liang Zhen <Zhen.Liang at sun.com> wrote:
> Hmm, I didn't see any problem in your code. Have you installed
> ofa_kernel_devel (kernel headers of  OFED) after installation of
> ofa_kernel_1_3_1?
>
> Regards
> Liang
>
> neutron:
>>
>> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version:
>> 2.6.18-53.1.14.el5,  ofed 1.3.1.
>>
>> The failed function call is like:
>>
>> {
>>
>> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
>>                &dma_addr, GFP_KERNEL);
>>
>> ctx->phy_buf[0].addr = dma_addr;
>> ctx->phy_buf[0].size = MAX_SIZE;
>> ctx->iovstart = (u64) ctx->send_buf;
>>
>> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n",
>>       ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart
>> );
>>
>> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1,
>>                        IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ
>>                         | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart));
>> }
>>
>> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf".
>>
>> Below is /var/log/messages output around the crash.
>> ----------------
>> Feb 19 12:50:22 wci30 kernel:  pd=ffff8101da3ddce0,
>> phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000
>>
>> Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer
>> dereference at 0000000000000000
>>  RIP:
>> Feb 19 12:50:22 wci30 kernel:  [<0000000000000000>]
>> _stext+0x7ffff000/0x1000
>> Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0
>> Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP
>> Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version
>> Feb 19 12:50:22 wci30 kernel: CPU 0
>> Feb 19 12:54:05 wci30 syslogd 1.4.1: restart.
>> Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>> Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5
>> (brewbuilder at hs20-bc2-3.build.redha
>> t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb
>> 19 07:18:46 EST 2008
>> Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet
>>
>> ====================
>> It's strange that the kernel doesn't print out the function call stack
>> before crashing.
>>
>> Any hints?  Thanks a lot!
>>
>> On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier <rdreier at cisco.com> wrote:
>>
>>>
>>>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments
>>>  > are valid.  But the system always crashes immediately after entering
>>>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!!
>>>
>>> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do
>>> you get an oops message?  If so that would be very important info for
>>> debugging this.
>>>
>>> - R.
>>>
>>>
>>
>
>


From vst at vlnb.net  Fri Feb 27 11:49:03 2009
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Fri, 27 Feb 2009 22:49:03 +0300
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A73164.3010109@harr.org>
References: <48E386F6.5040502@fusionio.com>	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vlnb.net>	<4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net>	<49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net>	<49A4812A.8050202@harr.org>
	<49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
	<49A6F25F.8060306@vlnb.net> <49A73164.3010109@harr.org>
Message-ID: <49A843AF.3010106@vlnb.net>

Cameron Harr, on 02/27/2009 03:18 AM wrote:
> Vladislav Bolkhovitin wrote:
>> Cameron Harr, on 02/26/2009 08:19 PM wrote:
>>> Cameron Harr wrote:
>>>> Cameron Harr wrote:
>>>> I re-compiled and re-ran the tests and numbers are a little better 
>>>> but performance still seems to have gone down from 673:
>>>> Test 1:373751.66
>>>> Test 2:371242.6067
>>>> Test 3:347988.1467
>>>> Test 4:378247.31
>>>> Test 5:375616.53
>>> I was curious and did a regression test with 673 and those numbers 
>>> are now even worse, so I'll presume there is an issue on my system 
>>> and not the SCST code:
>>> Test 1:365204.3067
>>> Test 2:364152.2067
>>> Test 3:340665.7633
>>> Test 4:369916.8133
>>> Test 5:369093.5833
>> It's known that any OS, including Linux, is getting "tired" under load 
>> with time from boot, which leads to worse performance. I guess, you 
>> can experience such effect.
>>
>> Check with r634. R635 has cache locality in data structures related 
>> change, which intended to improve performance a bit, but might make it 
>> worse instead.
>>
> 
> This is with 634. It's pretty bad:
> 338316.44
> 329698.04
> 307972.7133
> 345682.4733
> 344165.08

And 633 is better?

You are definitely suffering from the system "tiring" effect. So, to get
comparable results you should take measurements in a predefined state of
the system, for instance just after boot, and back to back, i.e. one
immediately after another.

Vlad


From cameron at harr.org  Fri Feb 27 11:56:31 2009
From: cameron at harr.org (Cameron Harr)
Date: Fri, 27 Feb 2009 12:56:31 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49A843AF.3010106@vlnb.net>
References: <48E386F6.5040502@fusionio.com>	<48FEDA26.4080304@vlnb.net>	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>	<48FF60D3.9020809@harr.org>	<4901F14C.6000006@harr.org>	<490210EE.2070000@vlnb.net>	<49022553.1020804@harr.org>	<490B45ED.3020203@vlnb.net>	<4910A622.4050906@harr.org>	<4911D827.10705@vlnb.net>	<49121715.4040804@harr.org>	<4912C684.5000505@vlnb.net>	<491307C7.50008@harr.org>	<49131A85.2010102@vlnb.net>	<49189567.1010804@harr.org>	<49258122.6040808@vlnb.net>	<496687DA.6010707@harr.org>	<496B98DF.4050305@vlnb.net>	<496BD8CA.7050503@harr.org>	<496C81E3.2050105@vlnb.net>	<496CC493.3040207@harr.org>	<496CD883.8040906@vlnb.net>	<496CDFE0.2030601@harr.org>	<4970F014.2030101@vlnb.net>	<4980B8DE.3060806@harr.org>
	<4995D1EE.4000807@vlnb.net>	<49A42BE9.4030603@harr.org>
	<49A43439.7080405@vlnb.net>	<49A4812A.8050202@harr.org>
	<49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
	<49A6F25F.8060306@vlnb.net> <49A73164.3010109@harr.org>
	<49A843AF.3010106@vlnb.net>
Message-ID: <49A8456F.6000908@harr.org>

Vladislav Bolkhovitin wrote:
>>
>> This is with 634. It's pretty bad:
>> 338316.44
>> 329698.04
>> 307972.7133
>> 345682.4733
>> 344165.08
>
> Definitely, you suffer from the system "tiring" effect. So, to get 
> comparable results you should do measurements in a predefined state of 
> the system, for instance just after boot, and in a row, i.e. one 
> immediately after one.
I think you're right, that the system is getting "tired." I think I am 
going to rest with the benchmarking for now though and just stick with 
the latest code in trunk, noting that my "Test 4" reliably produces the 
best results.
Cameron


From Ted.Kim at Sun.COM  Fri Feb 27 11:59:18 2009
From: Ted.Kim at Sun.COM (Ted H. Kim)
Date: Fri, 27 Feb 2009 11:59:18 -0800
Subject: [ofa-general] 4K MTU for ISR-9024D-M?
Message-ID: <49A84616.1000606@sun.com>

Folks,

Anyone know off hand if 4K MTU firmware/setting is available for an
ISR-9024D-M? Please reply to me directly. Just trying to save time,
before trying to navigate customer service.

-ted


 > Subject: RE: [ofa-general] Configuring a 4 KB InfniBand link MTU
 > From: Boris Shpolyansky <boris at mellanox.com>
 > Date: Fri, 09 Jan 2009 09:49:35 -0800
 > To: James Lentini <jlentini at netapp.com>, Hal Rosenstock <hal.rosenstock at gmail.com>
 > CC: general at lists.openfabrics.org
 >
 > James,
 >
 > Mellanox InfiniScale III switch chip does support 4K MTU as stated in
 > the product brief. However it requires special FW settings that might
 > or might not be available from/supported by particular switch system
 > vendor.
 >
 > Boris Shpolyansky
 > Sr. Member of Technical Staff, Applications
 >
 > Mellanox Technologies Inc.
 > 350 Oakmead Parkway
 > Sunnyvale, CA 94085
 > Tel.: (408) 916 0014
 > Fax: (408) 970 3403
 > Cell: (408) 834 9365
 > www.mellanox.com


-- 
Ted H. Kim
Sun Microsystems, Inc.                  ted.kim at sun.com
222 North Sepulveda Blvd., 10th Floor   (310) 341-1116
El Segundo, CA  90245                   (310) 341-1120 FAX


From jgunthorpe at obsidianresearch.com  Fri Feb 27 13:10:33 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 27 Feb 2009 14:10:33 -0700
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] opensm/console:
	Enhance perfmgr print_counters for better nodenames
In-Reply-To: <f0e08f230902271053q16bef315pd524f8722e741c84@mail.gmail.com>
References: <20090219130653.GA29318@comcast.net>
	<20090226061551.GQ11192@sashak.voltaire.com>
	<f0e08f230902260403o2e266802t43fb893f0dd6ade0@mail.gmail.com>
	<20090227092405.GD7462@sashak.voltaire.com>
	<f0e08f230902271053q16bef315pd524f8722e741c84@mail.gmail.com>
Message-ID: <20090227211033.GC16941@obsidianresearch.com>

On Fri, Feb 27, 2009 at 01:53:19PM -0500, Hal Rosenstock wrote:
> On Fri, Feb 27, 2009 at 4:24 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 07:03 Thu 26 Feb     , Hal Rosenstock wrote:
> >> On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >>
> >> [snip...]
> >>
> >> > And in general I think it is better to use C-style comments - /* ... */,
> >> > in C code and not C++-style // ... .
> >>
> >> Is this going to be enforced uniformly across OpenSM ?
> >
> > I didn't think about it (there are no many '//' comments), but I try to
> > not introduce a new ones.
> 
> Sent some patches relative to this. I'm not willing to take on
> ib_types.h right now. Maybe someone else will.

Is this really worth doing? // is included in C99 and many other
C99isms are already used in the source (well, until Sean removes
them.. :)

Jason


From hnrose at comcast.net  Fri Feb 27 13:18:51 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 27 Feb 2009 16:18:51 -0500
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_perfmgr.c: Improve
	assert in osm_pc_rcv_process
Message-ID: <20090227211851.GA25061@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 6d325cb..a74d35e 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -1106,6 +1106,9 @@ static void osm_pc_rcv_process(void *context, void *data)
 		"Processing received MAD status 0x%x context 0x%"
 		PRIx64 " port %u\n", p_mad->status, node_guid, port);
 
+	CL_ASSERT(p_mad->attr_id == IB_MAD_ATTR_PORT_CNTRS ||
+		  p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO);
+
 	/* Response could also be redirection (IBM eHCA PMA does this) */
 	if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) {
 		char gid_str[INET6_ADDRSTRLEN];
@@ -1165,8 +1168,6 @@ static void osm_pc_rcv_process(void *context, void *data)
 		goto Exit;
 	}
 
-	CL_ASSERT(p_mad->attr_id == IB_MAD_ATTR_PORT_CNTRS);
-
 	perfmgr_db_fill_err_read(wire_read, &err_reading);
 	/* FIXME separate query for extended counters if they are supported
 	 * on the port.


From ralph.campbell at qlogic.com  Fri Feb 27 13:38:11 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 27 Feb 2009 13:38:11 -0800
Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv before
	putting on lists
Message-ID: <1235770691.3948.229.camel@chromite.mv.qlogic.com>

There is a potential race in ib_register_mad_agent() where the struct
ib_mad_agent_private is not fully initialized before it is added
to the list of agents per IB port. This means the ib_mad_agent_private
could be seen before the refcount, spin locks, and linked lists
are initialized.  The fix is to initialize the structure earlier.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 735ad4e..dbcd285 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -301,6 +301,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	mad_agent_priv->agent.context = context;
 	mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp;
 	mad_agent_priv->agent.port_num = port_num;
+	spin_lock_init(&mad_agent_priv->lock);
+	INIT_LIST_HEAD(&mad_agent_priv->send_list);
+	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
+	INIT_LIST_HEAD(&mad_agent_priv->done_list);
+	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
+	INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends);
+	INIT_LIST_HEAD(&mad_agent_priv->local_list);
+	INIT_WORK(&mad_agent_priv->local_work, local_completions);
+	atomic_set(&mad_agent_priv->refcount, 1);
+	init_completion(&mad_agent_priv->comp);
 
 	spin_lock_irqsave(&port_priv->reg_lock, flags);
 	mad_agent_priv->agent.hi_tid = ++ib_mad_client_id;
@@ -350,17 +360,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list);
 	spin_unlock_irqrestore(&port_priv->reg_lock, flags);
 
-	spin_lock_init(&mad_agent_priv->lock);
-	INIT_LIST_HEAD(&mad_agent_priv->send_list);
-	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
-	INIT_LIST_HEAD(&mad_agent_priv->done_list);
-	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
-	INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends);
-	INIT_LIST_HEAD(&mad_agent_priv->local_list);
-	INIT_WORK(&mad_agent_priv->local_work, local_completions);
-	atomic_set(&mad_agent_priv->refcount, 1);
-	init_completion(&mad_agent_priv->comp);
-
 	return &mad_agent_priv->agent;
 
 error4:


From ralph.campbell at qlogic.com  Fri Feb 27 13:45:57 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 27 Feb 2009 13:45:57 -0800
Subject: [ofa-general] [PATCH] IB/core: ib_post_send_mad() returns zero but
	doesn't generate send completion
Message-ID: <1235771157.3948.233.camel@chromite.mv.qlogic.com>

If ib_post_send_mad() returns zero, it guarantees that there will be
a callback to the send_buf->mad_agent->send_handler() so that the
sender can call ib_free_send_mad(). Otherwise, the ib_mad_send_buf
will be leaked and the mad_agent reference count will never go to zero
and the IB device module cannot be unloaded.
The above can happen without this patch if process_mad() returns
(IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED).

If process_mad() returns IB_MAD_RESULT_SUCCESS and there is no agent
registered to receive the mad being sent, handle_outgoing_dr_smp()
returns zero which causes a MAD packet which is at the end of the
directed route to be incorrectly sent on the wire but doesn't cause
a hang since the HCA generates a send completion.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index dbcd285..62a99dc 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -742,9 +742,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
 		break;
 	case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED:
 		kmem_cache_free(ib_mad_cache, mad_priv);
-		kfree(local);
-		ret = 1;
-		goto out;
+		break;
 	case IB_MAD_RESULT_SUCCESS:
 		/* Treat like an incoming receive MAD */
 		port_priv = ib_get_mad_port(mad_agent_priv->agent.device,
@@ -755,10 +753,12 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
 						        &mad_priv->mad.mad);
 		}
 		if (!port_priv || !recv_mad_agent) {
+			/*
+			 * No receiving agent so drop packet and
+			 * generate send completion.
+			 */
 			kmem_cache_free(ib_mad_cache, mad_priv);
-			kfree(local);
-			ret = 0;
-			goto out;
+			break;
 		}
 		local->mad_priv = mad_priv;
 		local->recv_mad_agent = recv_mad_agent;


From sean.hefty at intel.com  Fri Feb 27 14:23:11 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 27 Feb 2009 14:23:11 -0800
Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv
	before	putting on lists
In-Reply-To: <1235770691.3948.229.camel@chromite.mv.qlogic.com>
References: <1235770691.3948.229.camel@chromite.mv.qlogic.com>
Message-ID: <41FB31A236F64B6685D188CF3CCDF995@amr.corp.intel.com>

>There is a potential race in ib_register_mad_agent() where the struct
>ib_mad_agent_private is not fully initialized before it is added
>to the list of agents per IB port. This means the ib_mad_agent_private
>could be seen before the refcount, spin locks, and linked lists
>are initialized.  The fix is to initialize the structure earlier.

This looks correct and needed to me.
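
The ordering issue can be shown with a small user-space analogue. This is
only a sketch under assumptions: agent_t, registry_t and register_agent()
are made-up names, and a pthread mutex stands in for the kernel's
reg_lock. Every field is set up before the object is linked into the
shared list, so anything that finds it there sees a usable refcount and
lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical private agent state, loosely modelled on the description. */
typedef struct agent {
	pthread_mutex_t lock;
	int refcount;
	struct agent *next;
} agent_t;

/* Hypothetical per-port registry protected by reg_lock. */
typedef struct {
	pthread_mutex_t reg_lock;
	agent_t *head;
} registry_t;

static void register_agent(registry_t *reg, agent_t *a)
{
	assert(reg && a);

	/* Finish all private initialization first ... */
	pthread_mutex_init(&a->lock, NULL);
	a->refcount = 1;

	/* ... and only then make the agent visible to other threads. */
	pthread_mutex_lock(&reg->reg_lock);
	a->next = reg->head;
	reg->head = a;
	pthread_mutex_unlock(&reg->reg_lock);
}
```

Doing the two steps in the opposite order would leave a window in which
another thread could walk the list and touch an uninitialized lock or
refcount.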


From rdreier at cisco.com  Fri Feb 27 14:44:45 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Feb 2009 14:44:45 -0800
Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv before
	putting on lists
In-Reply-To: <1235770691.3948.229.camel@chromite.mv.qlogic.com> (Ralph
	Campbell's message of "Fri, 27 Feb 2009 13:38:11 -0800")
References: <1235770691.3948.229.camel@chromite.mv.qlogic.com>
Message-ID: <adaprh3mrdu.fsf@cisco.com>

thanks, applied


From sumeet.lahorani at oracle.com  Fri Feb 27 15:36:14 2009
From: sumeet.lahorani at oracle.com (Sumeet Lahorani)
Date: Fri, 27 Feb 2009 15:36:14 -0800
Subject: [ofa-general] Measuring SDP throughput
Message-ID: <49A878EE.70908@oracle.com>


Hi,

Is there a tool to observe the SDP throughput while a workload is in
progress? I'm not looking for a tool such as qperf, which generates its
own workload.

We have Voltaire switches which give us overall throughput numbers
through the PortCounters.csv file, but these are not limited to just the
SDP traffic.

- Sumeet


From andy.grover at gmail.com  Fri Feb 27 17:53:19 2009
From: andy.grover at gmail.com (Andrew Grover)
Date: Fri, 27 Feb 2009 17:53:19 -0800
Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <87myc73izx.fsf@basil.nowhere.org>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
Message-ID: <c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>

On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen <andi at firstfloor.org> wrote:
>> This patchset against net-next adds support for RDS sockets. RDS is an
>> Oracle-originated protocol used to send IPC datagrams (up to 1MB)
>> reliably, and is used currently in Oracle RAC and Exadata products.
>
> Perhaps I missed it earlier, but what is the rationale for putting
> this as a socket type into the kernel? I assume they also work
> directly as implemented in user space using raw sockets or similar,
> don't they?

You want me to implement my fancy protocol in userspace???

Do I even get to write it in C or do I need to use Ruby?

Regards -- Andy


From andi at firstfloor.org  Fri Feb 27 21:56:08 2009
From: andi at firstfloor.org (Andi Kleen)
Date: Sat, 28 Feb 2009 06:56:08 +0100
Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2
In-Reply-To: <c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
	<c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
Message-ID: <20090228055608.GB26292@one.firstfloor.org>

On Fri, Feb 27, 2009 at 05:53:19PM -0800, Andrew Grover wrote:
> On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen <andi at firstfloor.org> wrote:
> >> This patchset against net-next adds support for RDS sockets. RDS is an
> >> Oracle-originated protocol used to send IPC datagrams (up to 1MB)
> >> reliably, and is used currently in Oracle RAC and Exadata products.
> >
> > Perhaps I missed it earlier, but what is the rationale for putting
> > this as a socket type into the kernel? I assume they also work
> > directly as implemented in user space using raw sockets or similar,
> > don't they?
> 
> You want me to implement my fancy protocol in userspace???

I just asked why you're putting it in kernel space.

> Do I even get to write it in C or do I need to use Ruby?

Well normally people who add new subsystems to the kernel explain
why they do that. Perhaps it's obvious to you, but at least to
me it isn't.

-Andi

-- 
ak at linux.intel.com -- Speaking for myself only.


From vlad at lists.openfabrics.org  Sat Feb 28 03:17:10 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 28 Feb 2009 03:17:10 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090228-0200 daily build status
Message-ID: <20090228111711.2FFCDE60FB5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From sashak at voltaire.com  Sat Feb 28 09:13:44 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:13:44 +0200
Subject: [ofa-general] [ANNOUNCE] management tarballs release
Message-ID: <20090228171344.GK7462@sashak.voltaire.com>

Hi,

There is a new release of the management (OpenSM and infiniband
diagnostics) tarballs available in:

http://www.openfabrics.org/downloads/management/

md5sum:

97b2609f5eaaf4320b39f44a50500b70  libibumad-1.3.1.tar.gz
e60b1c787d7cd2768967ca4766238210  libibmad-1.3.1.tar.gz
8c8c153f21d9f6cee51fc3d501c54fe7  opensm-3.3.1.tar.gz
6b6c87ed01291a2a3322b0ff696c5a11  infiniband-diags-1.5.1.tar.gz

All component versions are from recent master branch. Full change log is
below.

Sasha


Arlin Davis (3):
      libibmad: add os dependent definitions.
      libibmad: remove c99 definitions within the ib_mad_f structure
      libibmad: minor changes to source to allow portability to WinOF.

David McMillen (1):
      infiniband-diags/src/ibnetdiscover.c missing LID information on --ports

Eli Dorfman (1):
      opensm/osm_inform.c report IB traps to plugin

Eli Dorfman (Voltaire) (10):
      opensm/osm_subnet.c Fix memory leak for QOS string parameters.
      libibmad add PortXmitWait and CounterSelect2 to fields.
      opensm: Add new partition keyword for all hca, switches and routers
      docs update documenatation about new partition keywords
      infiniband-diags support PortXmitWait get and set
      opensm/osm_log.c save log_max_size in subnet opt in MB
      opensm/osm_subnet.c support subnet configuration rescan and update
      libibmad/src/dump.c fix dump functions for big endian machines
      opensm/osm_subnet.c enable log_max_size opt update
      opensm/osm_subnet.c fix parse functions for big endian machines

Hal Rosenstock (23):
      opensm/libvendor/osm_vendor_sa_api.h: Fix commentary typo
      opensm/osm_inform.c: Eliminate compile warning
      opensm/osm_perfmgr_db.h: Remove unused typedef
      opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak on error
      libibmad/(mad.h fields.c): Add support for PerfMgt ClassPortInfo
      opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters
      opensm/PerfMgr: Mainly cosmetic changes
      opensm/osm_node.h: Fix osm_node_get_num_physp description
      opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation
      opensm/doc/perf-manager-arch.txt: Fix some commentary typos
      opensm/PerfMgr: Add copyrights
      libibmad: lid print format changed to unsigned
      libibumad/umad.c: Change lid print format to unsigned
      infiniband-diags/perfquery: Change option name for extended counters
      opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec
      management/libibmad.txt: Remove madrpc_lock/unlock
      opensm/man/opensm.8.in: Indicate ROUTER_EXP obsoleted
      opensm/osm_console.c: Improve perfmgr print_counters error message
      infiniband-diags/smpdump.c: Fix usage examples
      infiniband-diags/smpdump.c: Release umad resources on exit
      opensm/console: Enhance perfmgr print_counters for better nodenames
      libibmad/fields.c: Dump LIDs as unsigned decimal
      infiniband-diags/saquery.c: Convert more LID prints to unsigned decimal

Ira Weiny (3):
      opensm/opensm/osm_console.c: move reporting of plugins to "status" command.
      OpenSM: update osmeventplugin example for the new TRAP event.
      libibmad: Use enum types for function parameters

Mike Heinz (1):
      opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM

Nicolas Morey Chaisemartin (4):
      Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments
      opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation
      opensm/osm_ucast_ftree.c Fixed bad init value for down port index
      opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps

Ralph Campbell (2):
      libibumad: get_ca() can call release_ca() with uninitialized data
      opensm: fix structure definition for trap 257-258

Robert Pearson (10):
      mesh analysis - skeleton
      mesh analysis - mesh_t data structure
      mesh analysis - node and link structures
      mesh analysis - matrix/polynomial routines
      mesh analysis - local geometry
      mesh analysis - mesh info table
      mesh analysis - induce global geometry
      mesh analysis - reorder links
      mesh analysis - lash preparation
      mesh analysis - integrate into lash core

Sasha Khapyorsky (111):
      opensm: remove some unused variables and funcs
      opensm/osm_ucast_mgr: indentation fix
      infiniband-diags/saquery: indentation fixes
      infiniband-diabs/saquery: unify SA queries processors
      infiniband-diags/saquery: separate queries and commands
      infiniband-diags/saquery: PortInfoRecord query
      infinabd-diags: convert type uint -> unsigned int
      opensm: remove unused header osm_pkey_mgr.h
      opensm/osm_sm.c: fix MC group creation in race condition
      opensm/osm_sa_mcmember_record: improve __cleanup_mgrp()
      opensm/multicast: remove some unused parameters.
      opensm/osm_subnet: consolidate some duplicated code
      opensm/event_plugin: link opensm with -rdynamic flag
      opensm/vendor: save some stack memory
      infiniband-diags/saquery: minor indentation fixes
      opensm/osm_subnet.c: indentation fixes
      opensm/man/opensm.8.in: add descrition for --do_mesh_analysis option
      opensm: add do_mesh_analysis configuration parameter
      opensm/mesh: fix memory leaks
      opensm/lash: fix memory leaks
      infiniband-diags/ibstat,smpdump: kill unused includes
      opensm/osm_mesh: make mesh_info static and const
      opensm/osm_mesh: simplify mesh node links and ports allocation
      opensm/lash: simplify some memory allocations
      opensm/opensm.spec: fix event plugin config options
      libibmad: remove hidden _set/_get_field*() API
      management: move sysfs()_* function to libibumad
      opensm: remove libibcommon build dependencies
      management: remove libibcommon dependencies
      libibmad: remove not needed header files inclusion
      libibmad: remove functions which use pthread
      infiniband-diags/perfquery: indentation fixes
      opensm: update LFTs when entering master
      opensm/osm_subnet.c: drop some unneeded braces
      opensm: invalidate routing cache when entering master state
      opensm/osm_subnet.c: fix warnings in subn_free_qos_options()
      infiniband-diags/perfquery.c: fix typo
      libibmad: cleanup mad.h include path
      libibmad: indentation fixes
      libibmad/fields.c: fix MAD MKey offset
      libibmad: use mad_set_field64() for mkey encoding
      infiniband-diags/Makefile.am: kill -rpath
      infiniband-diags/Makefile.am: merge CFLAGS
      infiniband-diags/Makefile.am: use common library
      infiniband-diags/ibdiag_common: cosmetic
      infiniband-diags/ibdiag_common: move get_build_version()
      infiniband-diags: remove duplicated ibdebug prototype
      infiniband-diags/smpdump.c: use common ib definitions
      infiniband-diags/ibdiag_common: cleanup argv0 prototype
      infiniband-diags/dump_lfts.sh: fix -D format parsing
      infiniband-diags/dump_mfts.sh: fix -D format parsing
      libibcommon: remove from the management tree
      infiniband-diags: command line option processing framework
      infiniband-diags: using common command line option processing
      infiniband-diags: remove argv0 global variable
      infiniband-diags: make get_build_version() static
      infiniband-diags: remove unneeded includes
      infiniband-diags/smpquery: usage improvement
      infiniband-diags/saquery: add lid parameter to NodeRecord query
      infiniband-diags/ibsysstat: use RMPP for client/server communication
      infiniband-diags/ibsysstat: backward compatibility fixes
      infiniband-diags/saquery: fix backward compatibility bug
      infiniband-diags/smpdump: fix SL value encoding
      infiniband-diags/saquery: fix encoding of SA queries
      infiniband-diags/saquery: cosmetic
      infiniband-diags/saquery: CHECK_AND_SET_VAL() macro
      infiniband-diags/saquery: adding query params
      infiniband-diags/saquery: more params for Path and MCMember Records
      infiniband-diags/saquery: merge PathRecord query functions
      opensm/osm_subnet.c: fix compile warnings
      opensm: fix port chooser
      opensm/main.c: indentation fixes in get_port_guid()
      opensm/osm_sw_info_rcv.c: cosmetic changes
      opensm/osm_perfmgr.c: kill some redundant tests
      infiniband-diags/common: use enum MAD_DEST as ibd_dest_type type
      opensm: rescan config file even in standby
      opensm/ib_types.h: cosmetic
      opensm/osm_subnet.c: indentation fixes
      opensm/osm_subnet.c: clean_val() remove trailing quotation
      opensm/osm_subnet.c: break matching when config parameter already found
      opensm/osm_ucast_ftree.c: cosmetic improvements
      opensm: avoid memory leaks on config parameters reloading
      opensm/qos_config: no invalid option message on default values
      opensm: sort port order for routing by switch loads
      opensm/ftree: cleanup ftree_sw_tbl_element_t use
      opensm/ftree: simplify root guids setup.
      opensm/ftree: make unsigned sw->down_port_groups_idx
      opensm/osm_helper.c: print port number as decimal
      libibmad/mad.h: define more SA attributed
      libibmad/fields.c: define SA SM_Key field details
      infiniband-diags/saquery: remove osm vendor layer
      infiniband-diags/saquery: fix types and some cleanup
      infiniband-diags: some code consolidation
      infiniabnd-diags/common: wrap debug macros with do {} while (0)
      opensm/console: dump_portguid command fixes
      opensm/console: dump_portguid - don't duplicate matched guids
      opensm/console/dump_portguid: minor improvements
      opensm: pre-scan command line for config file option
      opensm/osm_subnet.c: move parse and setup functions
      opensm: proper config file rescan
      opensm/osm_subnet: fix crash in qos string config parameters reloading
      opensm/main.c: remove enable_stack_dump() call
      opensm/osm_qos.c: cosmetic: remove empty line
      opensm/Makefile.am: remove osm_build_id.h junk file generation
      opensm/lid_mgr: fix duplicated lid assignment
      opensm/lid_mgr: simplify lmc_mask initialization
      opensm/sweep: add log message before lid assignment
      opensm/osm_lid_mgr.c: consolidate flows
      infiniband-diags/ibroute: fix warning
      opensm: OpenSM Release Notes for 3.3
      management: bump all package versions

Sean Hefty (17):
      sminfo: add support for WinOF
      vendstat: add support for WinOF
      ibaddr: add support for WinOF
      perfquery: add support for WinOF
      ibportstate: add support for WinOF
      ibstat: add support for WinOF
      smpdump: add support for WinOF
      ibping: add support for WinOF
      smpquery: add support for WinOF
      [ib-diag] ibnetdiscover: add support for WinOF
      [ib-diag] ibroute: add support for WinOF
      [ib-diag] ibtracert: add support for WinOF
      [ib-diag] ibsendtrap: add support for WinOF
      [ib-diag] mcm_rereg_test: add support for WinOF
      [ib-diag] ibsysstat: add support for WinOF
      [ib-diags] saquery: set correct pkey table field
      [ib-diag] saquery: add support for WinOF

Stan Smith (1):
      libibmad: add MAD_EXPORT to exported calls

Yevgeny Kliteynik (6):
      opensm/osm_ucast_ftree.c: fixing errors in comments
      opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active
      opensm/osm_ucast_ftree.c: fix full topology dump
      opensm/osm_sa.c: fixing SA MAD dump
      opensm/osm_state_mgr.c: small bug in scanning lid table
      opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node

hnrose at comcast.net (5):
      opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin
      opensm/osm_helper.c: Add port counters to __osm_disp_msg_str
      opensm/osm_console.c: Add list of SMs to status command
      opensm/osm_console.c: Eliminate some extraneous parentheses
      opensm/osm_console.c: Add missing command in help_perfmgr


From sashak at voltaire.com  Sat Feb 28 09:31:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:31:40 +0200
Subject: [ofa-general] [PATCH] opensm/osm_console.c: kill warning: defined
	but not used
Message-ID: <20090228173140.GM7462@sashak.voltaire.com>


Kill compile warning: osm_console.c:82: warning: 'name_token' defined
but not used

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_console.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index e1936fb..63c5ea8 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -78,10 +78,12 @@ static char *next_token(char **p_last)
 	return strtok_r(NULL, " \t\n\r", p_last);
 }
 
+#ifdef ENABLE_OSM_PERF_MGR
 static char *name_token(char **p_last)
 {
 	return strtok_r(NULL, "\t\n\r", p_last);
 }
+#endif
 
 static void help_command(FILE * out, int detail)
 {
-- 
1.6.1.2.319.gbd9e
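
For readers wondering why the console keeps two nearly identical
tokenizers: the only difference is the delimiter set passed to
strtok_r(). A minimal sketch, assuming a console line of the form
"<command> <subcommand> <name that may contain spaces>". The helper
split_command() and the sample line in the note below are hypothetical
and are not taken from osm_console.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Splits on spaces as well as tabs/newlines: yields one word per call. */
static char *next_token(char **p_last)
{
	return strtok_r(NULL, " \t\n\r", p_last);
}

/* Splits only on tabs/newlines, so a name containing spaces stays whole. */
static char *name_token(char **p_last)
{
	return strtok_r(NULL, "\t\n\r", p_last);
}

/*
 * Hypothetical helper (not in osm_console.c): tokenize 'line' in place.
 * Outputs are set to NULL when the corresponding field is missing.
 */
static void split_command(char *line, char **cmd, char **sub, char **name)
{
	char *last = NULL;

	assert(line && cmd && sub && name);
	*cmd = strtok_r(line, " \t\n\r", &last);
	*sub = *cmd ? next_token(&last) : NULL;
	*name = *sub ? name_token(&last) : NULL;
}
```

With the input "perfmgr print_counters node one\n" this yields the
command "perfmgr", the subcommand "print_counters" and the name
"node one". Since name_token() is only needed for that kind of lookup,
guarding it with the same #ifdef as its callers avoids the warning.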


From sashak at voltaire.com  Sat Feb 28 09:32:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:32:47 +0200
Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr: use single array for
	used_lids
Message-ID: <20090228173247.GN7462@sashak.voltaire.com>


Use a single array (instead of a ptr vector) for used_lids.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_lid_mgr.h |    4 +-
 opensm/opensm/osm_lid_mgr.c         |   60 +++++++++--------------------------
 2 files changed, 17 insertions(+), 47 deletions(-)

diff --git a/opensm/include/opensm/osm_lid_mgr.h b/opensm/include/opensm/osm_lid_mgr.h
index 714ba41..d6d1ab8 100644
--- a/opensm/include/opensm/osm_lid_mgr.h
+++ b/opensm/include/opensm/osm_lid_mgr.h
@@ -98,8 +98,8 @@ typedef struct osm_lid_mgr {
 	cl_plock_t *p_lock;
 	boolean_t send_set_reqs;
 	osm_db_domain_t *p_g2l;
-	cl_ptr_vector_t used_lids;
 	cl_qlist_t free_ranges;
+	uint8_t used_lids[IB_LID_UCAST_END_HO + 1];
 } osm_lid_mgr_t;
 /*
 * FIELDS
@@ -125,7 +125,7 @@ typedef struct osm_lid_mgr {
 *		Pointer to the database domain storing guid to lid mapping.
 *
 *	used_lids
-*		A vector the maps from the lid to its guid. keeps track of
+*		An array of used lids. keeps track of
 *		existing and non existing mapping of guid->lid
 *
 *	free_ranges
diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 63c3bb9..e527337 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -109,7 +109,6 @@ typedef struct osm_lid_mgr_range {
 void osm_lid_mgr_construct(IN osm_lid_mgr_t * const p_mgr)
 {
 	memset(p_mgr, 0, sizeof(*p_mgr));
-	cl_ptr_vector_construct(&p_mgr->used_lids);
 }
 
 /**********************************************************************
@@ -120,7 +119,6 @@ void osm_lid_mgr_destroy(IN osm_lid_mgr_t * const p_mgr)
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
-	cl_ptr_vector_destroy(&p_mgr->used_lids);
 	p_item = cl_qlist_remove_head(&p_mgr->free_ranges);
 	while (p_item != cl_qlist_end(&p_mgr->free_ranges)) {
 		free((osm_lid_mgr_range_t *) p_item);
@@ -188,11 +186,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr)
 			} else {
 				/* check if the lids were not previously assigned */
 				for (lid = min_lid; lid <= max_lid; lid++) {
-					if ((cl_ptr_vector_get_size
-					     (&p_mgr->used_lids) > lid)
-					    &&
-					    (cl_ptr_vector_get
-					     (&p_mgr->used_lids, lid))) {
+					if (p_mgr->used_lids[lid]) {
 						OSM_LOG(p_mgr->p_log,
 							OSM_LOG_ERROR, "ERR 0314: "
 							"0x%04x for guid:0x%016"
@@ -215,8 +209,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr)
 			} else {
 				/* mark it was visited */
 				for (lid = min_lid; lid <= max_lid; lid++)
-					cl_ptr_vector_set(&p_mgr->used_lids,
-							  lid, (void *)1);
+					p_mgr->used_lids[lid] = 1;
 			}
 		}		/* got a lid */
 		free(p_item);
@@ -252,7 +245,6 @@ osm_lid_mgr_init(IN osm_lid_mgr_t * const p_mgr, IN osm_sm_t *sm)
 		goto Exit;
 	}
 
-	cl_ptr_vector_init(&p_mgr->used_lids, 100, 40);
 	cl_qlist_init(&p_mgr->free_ranges);
 
 	/* we use the stored guid to lid table if not forced to reassign */
@@ -303,7 +295,6 @@ static uint16_t __osm_trim_lid(IN uint16_t lid)
 static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 {
 	cl_ptr_vector_t *p_discovered_vec = &p_mgr->p_subn->port_lid_tbl;
-	cl_ptr_vector_t *p_persistent_vec = &p_mgr->used_lids;
 	uint16_t max_defined_lid;
 	uint16_t max_persistent_lid;
 	uint16_t max_discovered_lid;
@@ -335,10 +326,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"Ignore guid2lid file when coming out of standby\n");
 			osm_db_clear(p_mgr->p_g2l);
-			for (lid = 0;
-			     lid < cl_ptr_vector_get_size(&p_mgr->used_lids);
-			     lid++)
-				cl_ptr_vector_set(p_persistent_vec, lid, NULL);
+			memset(p_mgr->used_lids, 0, sizeof(p_mgr->used_lids));
 		} else {
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"Honor current guid2lid file when coming out "
@@ -413,7 +401,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 					       cl_ntoh64
 					       (osm_port_get_guid(p_port)));
 			for (lid = db_min_lid; lid <= db_max_lid; lid++)
-				cl_ptr_vector_set(p_persistent_vec, lid, NULL);
+				p_mgr->used_lids[lid] = 0;
 		}
 	}
 
@@ -437,14 +425,11 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 	/* find the range of lids to scan */
 	max_discovered_lid =
 	    (uint16_t) cl_ptr_vector_get_size(p_discovered_vec);
-	max_persistent_lid =
-	    (uint16_t) cl_ptr_vector_get_size(p_persistent_vec);
+	max_persistent_lid = sizeof(p_mgr->used_lids) - 1;
 
 	/* but the vectors have one extra entry for lid=0 */
 	if (max_discovered_lid)
 		max_discovered_lid--;
-	if (max_persistent_lid)
-		max_persistent_lid--;
 
 	if (max_persistent_lid > max_discovered_lid)
 		max_defined_lid = max_persistent_lid;
@@ -454,8 +439,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 	for (lid = 1; lid <= max_defined_lid; lid++) {
 		is_free = TRUE;
 		/* first check to see if the lid is used by a persistent assignment */
-		if ((lid <= max_persistent_lid)
-		    && cl_ptr_vector_get(p_persistent_vec, lid)) {
+		if (lid <= max_persistent_lid && p_mgr->used_lids[lid]) {
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"0x%04x is not free as its mapped by the "
 				"persistent db\n", lid);
@@ -515,11 +499,9 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr)
 					for (req_lid = disc_min_lid + 1;
 					     req_lid <= disc_max_lid;
 					     req_lid++) {
-						if ((req_lid <=
-						     max_persistent_lid) &&
-						    cl_ptr_vector_get
-						    (p_persistent_vec,
-						     req_lid)) {
+						if (req_lid <=
+						    max_persistent_lid &&
+						    p_mgr->used_lids[req_lid]) {
 							OSM_LOG(p_mgr->p_log,
 								OSM_LOG_DEBUG,
 								"0x%04x is free as it was discovered "
@@ -604,28 +586,16 @@ __osm_lid_mgr_is_range_not_persistent(IN osm_lid_mgr_t * const p_mgr,
 				      IN const uint16_t num_lids)
 {
 	uint16_t i;
-	cl_status_t status;
-	osm_port_t *p_port;
 	const uint8_t start_lid = (uint8_t) (1 << p_mgr->p_subn->opt.lmc);
-	const cl_ptr_vector_t *const p_tbl = &p_mgr->used_lids;
 
 	if (lid < start_lid)
-		return (FALSE);
+		return FALSE;
 
-	for (i = lid; i < lid + num_lids; i++) {
-		status = cl_ptr_vector_at(p_tbl, i, (void *)&p_port);
-		if (status == CL_SUCCESS) {
-			if (p_port != NULL)
-				return (FALSE);
-		} else
-			/*
-			   We are out of range in the array.
-			   Consider all further entries "free".
-			 */
-			return (TRUE);
-	}
+	for (i = lid; i < lid + num_lids; i++)
+		if (p_mgr->used_lids[i])
+			return FALSE;
 
-	return (TRUE);
+	return TRUE;
 }
 
 /**********************************************************************
@@ -824,7 +794,7 @@ NewLidSet:
 	/* update the guid2lid db and used_lids */
 	osm_db_guid2lid_set(p_mgr->p_g2l, guid, *p_min_lid, *p_max_lid);
 	for (lid = *p_min_lid; lid <= *p_max_lid; lid++)
-		cl_ptr_vector_set(&p_mgr->used_lids, lid, (void *)1);
+		p_mgr->used_lids[lid] = 1;
 
 Exit:
 	/* make sure the assigned lids are marked in port_lid_tbl */
-- 
1.6.1.2.319.gbd9e
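
The idea of the patch, one byte per LID in a fixed-size array instead
of a growable pointer vector, can be sketched in isolation as follows.
This is only an illustration under stated assumptions: struct lid_map
and the lid_map_*() helpers are hypothetical names, LID_UCAST_END
stands in for IB_LID_UCAST_END_HO (assumed here to be 0xBFFF), and
treating an out-of-range LID as "not free" is a choice made for this
sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Highest unicast LID; stand-in for IB_LID_UCAST_END_HO (assumed 0xBFFF). */
#define LID_UCAST_END 0xBFFF

/* Hypothetical container: used[lid] != 0 means the LID is taken. */
struct lid_map {
	uint8_t used[LID_UCAST_END + 1];
};

/* Forget all assignments (what the memset() in the patch does). */
static void lid_map_clear(struct lid_map *m)
{
	memset(m->used, 0, sizeof(m->used));
}

/* Mark the inclusive range [min_lid, max_lid] as used. */
static void lid_map_mark(struct lid_map *m, uint16_t min_lid, uint16_t max_lid)
{
	uint32_t lid;

	assert(min_lid <= max_lid && max_lid <= LID_UCAST_END);
	for (lid = min_lid; lid <= max_lid; lid++)
		m->used[lid] = 1;
}

/* Return 1 if none of the num_lids LIDs starting at 'lid' is used. */
static int lid_map_range_is_free(const struct lid_map *m, uint16_t lid,
				 uint16_t num_lids)
{
	/* 32-bit arithmetic so that lid + num_lids cannot wrap around */
	uint32_t i, end = (uint32_t)lid + num_lids;

	for (i = lid; i < end; i++) {
		if (i > LID_UCAST_END)
			return 0;	/* sketch's choice: out of range is not free */
		if (m->used[i])		/* index by the loop counter, not by 'lid' */
			return 0;
	}
	return 1;
}
```

The array costs 48KB per manager, but every lookup becomes a plain
index with no size check against a vector. Note that the range check
has to index by the loop counter; testing only the first LID of the
range would report a range as free even when a later LID in it is
taken.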


From sashak at voltaire.com  Sat Feb 28 09:34:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:34:00 +0200
Subject: [ofa-general] [PATCH] opensm: initialize all switch ports
Message-ID: <20090228173400.GO7462@sashak.voltaire.com>


Initialize all switch ports when NodeInfo is received. This addresses the
issue described in 8a2d2ddee7, where a link could be left uninitialized
when SwitchInfo and PortInfo reception races during discovery, and also
slightly simplifies the OpenSM discovery process implementation.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_node.h  |    5 ++---
 opensm/opensm/osm_node.c          |   32 ++++++++------------------------
 opensm/opensm/osm_node_info_rcv.c |    4 ++--
 3 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h
index fec24ba..c7befff 100644
--- a/opensm/include/opensm/osm_node.h
+++ b/opensm/include/opensm/osm_node.h
@@ -443,9 +443,8 @@ osm_node_get_lmc(IN const osm_node_t * const p_node, IN const uint32_t port_num)
 *
 * SYNOPSIS
 */
-void
-osm_node_init_physp(IN osm_node_t * const p_node,
-		    IN const osm_madw_t * const p_madw);
+void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num,
+			 IN const osm_madw_t * const p_madw);
 /*
 * PARAMETERS
 *	p_node
diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c
index 07371a2..a97477a 100644
--- a/opensm/opensm/osm_node.c
+++ b/opensm/opensm/osm_node.c
@@ -51,20 +51,17 @@
 
 /**********************************************************************
  **********************************************************************/
-void
-osm_node_init_physp(IN osm_node_t * const p_node,
-		    IN const osm_madw_t * const p_madw)
+void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num,
+			 IN const osm_madw_t * const p_madw)
 {
 	ib_net64_t port_guid;
 	ib_smp_t *p_smp;
 	ib_node_info_t *p_ni;
-	uint8_t port_num;
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 
 	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
 	port_guid = p_ni->port_guid;
-	port_num = ib_node_info_get_local_port_num(p_ni);
 
 	CL_ASSERT(port_num < p_node->physp_tbl_size);
 
@@ -76,23 +73,6 @@ osm_node_init_physp(IN osm_node_t * const p_node,
 
 /**********************************************************************
  **********************************************************************/
-static void node_init_physp0(IN osm_node_t * const p_node,
-			     IN const osm_madw_t * const p_madw)
-{
-	ib_smp_t *p_smp;
-	ib_node_info_t *p_ni;
-
-	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
-
-	osm_physp_init(&p_node->physp_table[0],
-		       p_ni->port_guid, 0, p_node,
-		       osm_madw_get_bind_handle(p_madw),
-		       p_smp->hop_count, p_smp->initial_path);
-}
-
-/**********************************************************************
- **********************************************************************/
 osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw)
 {
 	osm_node_t *p_node;
@@ -133,9 +113,13 @@ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw)
 	for (i = 0; i < p_node->physp_tbl_size; i++)
 		osm_physp_construct(&p_node->physp_table[i]);
 
-	osm_node_init_physp(p_node, p_madw);
 	if (p_ni->node_type == IB_NODE_TYPE_SWITCH)
-		node_init_physp0(p_node, p_madw);
+		for (i = 0; i <= p_ni->num_ports; i++)
+			osm_node_init_physp(p_node, i, p_madw);
+	else
+		osm_node_init_physp(p_node,
+				    ib_node_info_get_local_port_num(p_ni),
+				    p_madw);
 	p_node->print_desc = strdup(OSM_NODE_DESC_UNKNOWN);
 
 	return (p_node);
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index a37630a..9de68f9 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -414,7 +414,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm,
 			"Creating new port object with GUID 0x%" PRIx64 "\n",
 			cl_ntoh64(p_ni->port_guid));
 
-		osm_node_init_physp(p_node, p_madw);
+		osm_node_init_physp(p_node, port_num, p_madw);
 
 		p_port = osm_port_new(p_ni, p_node);
 		if (p_port == NULL) {
@@ -545,7 +545,7 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 			PRIx64 ", port %u\n",
 			cl_ntoh64(osm_node_get_node_guid(p_node)),
 			port_num);
-		osm_node_init_physp(p_node, p_madw);
+		osm_node_init_physp(p_node, port_num, p_madw);
 	}
 
 	/*
-- 
1.6.1.2.319.gbd9e
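
The resulting initialization policy (every port of a switch at node
creation time, only the responding port for a CA or router) can be
sketched independently of the OpenSM types. All names below (struct
node, struct physp, node_new(), node_init_physp(), node_delete()) are
hypothetical simplifications, not the OpenSM API; the real code also
records the port GUID and the directed route path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum node_type { NODE_CA = 1, NODE_SWITCH = 2 };

/* Hypothetical, heavily reduced physical port object. */
struct physp {
	int valid;		/* set once the port has been initialized */
	uint8_t port_num;
};

struct node {
	enum node_type type;
	uint8_t num_ports;
	struct physp *physp_table;	/* num_ports + 1 entries, index 0 included */
};

/* Initialize a single port entry; the caller chooses which one. */
static void node_init_physp(struct node *n, uint8_t port_num)
{
	assert(port_num <= n->num_ports);
	n->physp_table[port_num].valid = 1;
	n->physp_table[port_num].port_num = port_num;
}

/* Create a node from what a NodeInfo reply would tell us. */
static struct node *node_new(enum node_type type, uint8_t num_ports,
			     uint8_t local_port)
{
	struct node *n = malloc(sizeof(*n));
	unsigned i;	/* wider than uint8_t so i <= 255 can terminate */

	if (!n)
		return NULL;
	n->physp_table = calloc((size_t)num_ports + 1, sizeof(*n->physp_table));
	if (!n->physp_table) {
		free(n);
		return NULL;
	}
	n->type = type;
	n->num_ports = num_ports;

	if (type == NODE_SWITCH)
		for (i = 0; i <= num_ports; i++)	/* port 0 included */
			node_init_physp(n, (uint8_t)i);
	else
		node_init_physp(n, local_port);
	return n;
}

static void node_delete(struct node *n)
{
	if (n) {
		free(n->physp_table);
		free(n);
	}
}
```

With all switch ports set up front, later handlers no longer need to
check whether a port object exists before touching it, so the order in
which SwitchInfo and PortInfo replies arrive stops mattering for
initialization.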


From sashak at voltaire.com  Sat Feb 28 09:35:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:35:09 +0200
Subject: [ofa-general] [PATCH] opensm: remove unneeded anymore physp
	initializations
In-Reply-To: <20090228173400.GO7462@sashak.voltaire.com>
References: <20090228173400.GO7462@sashak.voltaire.com>
Message-ID: <20090228173509.GP7462@sashak.voltaire.com>


Remove physical port initializations that are no longer needed: all ports
should already be initialized in osm_node_new(). Also add some debug
assertions (CL_ASSERT()).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_node_info_rcv.c |   28 +++-------------------------
 opensm/opensm/osm_port_info_rcv.c |   32 +++++++-------------------------
 2 files changed, 10 insertions(+), 50 deletions(-)

diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 9de68f9..ac86b9a 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -155,14 +155,9 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
 
 	/* When setting the link, ports on both
 	   sides of the link should be initialized */
-	if (!osm_node_link_has_valid_ports(p_node, port_num, p_neighbor_node,
-					   p_ni_context->port_num)) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
-			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
-		CL_ASSERT(0);
-		goto _exit;
-	}
+	CL_ASSERT(osm_node_link_has_valid_ports(p_node, port_num,
+						p_neighbor_node,
+						p_ni_context->port_num));
 
 	if (osm_node_link_exists(p_node, port_num,
 				 p_neighbor_node, p_ni_context->port_num)) {
@@ -529,25 +524,8 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 				     IN osm_node_t * const p_node,
 				     IN const osm_madw_t * const p_madw)
 {
-	ib_smp_t *p_smp;
-	ib_node_info_t *p_ni;
-	uint8_t port_num;
-
 	OSM_LOG_ENTER(sm->p_log);
 
-	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
-	port_num = ib_node_info_get_local_port_num(p_ni);
-
-	if (!osm_node_get_physp_ptr(p_node, port_num)) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-			"Creating physp for node GUID:0x%"
-			PRIx64 ", port %u\n",
-			cl_ntoh64(osm_node_get_node_guid(p_node)),
-			port_num);
-		osm_node_init_physp(p_node, port_num, p_madw);
-	}
-
 	/*
 	   If this switch has already been probed during this sweep,
 	   then don't bother reprobing it.
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 95ebdb4..654ede7 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -614,31 +614,13 @@ void osm_pi_rcv_process(IN void *context, IN void *data)
 
 		p_physp = osm_node_get_physp_ptr(p_node, port_num);
 
-		/*
-		   Determine if we encountered a new Physical Port.
-		   If so, initialize the new Physical Port then
-		   continue processing as normal.
-		 */
-		if (!p_physp) {
-			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-				"Initializing port number %u\n", port_num);
-			p_physp = &p_node->physp_table[port_num];
-			osm_physp_init(p_physp,
-				       port_guid,
-				       port_num,
-				       p_node,
-				       osm_madw_get_bind_handle(p_madw),
-				       p_smp->hop_count, p_smp->initial_path);
-		} else {
-			/*
-			   Update the directed route path to this port
-			   in case the old path is no longer usable.
-			 */
-			p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
-			osm_dr_path_init(p_dr_path,
-					 osm_madw_get_bind_handle(p_madw),
-					 p_smp->hop_count, p_smp->initial_path);
-		}
+		CL_ASSERT(p_physp);
+
+		/* Update the directed route path to this port
+		   in case the old path is no longer usable. */
+		p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
+		osm_dr_path_init(p_dr_path, osm_madw_get_bind_handle(p_madw),
+				 p_smp->hop_count, p_smp->initial_path);
 
 		/* if port just inited or reached INIT state (external reset)
 		   request update for port related tables */
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sat Feb 28 09:36:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:36:35 +0200
Subject: [ofa-general] [PATCH] opensm: PortInfo requests for discovered
	switches
Message-ID: <20090228173635.GQ7462@sashak.voltaire.com>


Request PortInfo for all switch ports as soon as the first NodeInfo is
received, without waiting for the SwitchInfo response. This simplifies
the subnet discovery flow and speeds it up.
Remove switch->discovery_count, which is no longer needed.
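
As a rough illustration of the new flow (not the patch itself; the type
and function names below are hypothetical stand-ins), the choice of
which ports to query can be sketched as follows. A switch gets a
PortInfo request for every port starting at port 0, while a CA or
router is only asked about the port that answered NodeInfo:

```c
/* Hypothetical stand-ins for the IB node type constants. */
enum sketch_node_type {
	SKETCH_NODE_CA = 1,
	SKETCH_NODE_SWITCH = 2,
	SKETCH_NODE_ROUTER = 3
};

struct sketch_port_range {
	unsigned first;	/* first port to query */
	unsigned end;	/* one past the last port to query */
};

/*
 * Pick the ports to request PortInfo for:
 *  - switch: every port, 0 .. num_physp - 1
 *  - CA or router: only the port the NodeInfo response arrived on
 */
static struct sketch_port_range
sketch_port_info_range(enum sketch_node_type type, unsigned local_port,
		       unsigned num_physp)
{
	struct sketch_port_range r;

	if (type == SKETCH_NODE_SWITCH) {
		r.first = 0;
		r.end = num_physp;
	} else {
		r.first = local_port;
		r.end = local_port + 1;
	}
	return r;
}
```

The caller would then loop from first to end and issue one request per
port, continuing the loop even if a single request fails.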

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_switch.h |    6 ---
 opensm/opensm/osm_node_info_rcv.c  |   83 ++++++++++++++----------------------
 opensm/opensm/osm_perfmgr.c        |    1 -
 opensm/opensm/osm_state_mgr.c      |    1 -
 opensm/opensm/osm_sw_info_rcv.c    |   71 ------------------------------
 5 files changed, 32 insertions(+), 130 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 6279727..3e3626b 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -103,7 +103,6 @@ typedef struct osm_switch {
 	uint8_t *lft;
 	uint8_t *new_lft;
 	osm_mcast_tbl_t mcast_tbl;
-	uint32_t discovery_count;
 	unsigned endport_links;
 	unsigned need_update;
 	void *priv;
@@ -145,11 +144,6 @@ typedef struct osm_switch {
 *	mcast_tbl
 *		Multicast forwarding table for this switch.
 *
-*	discovery_count
-*		The number of times this switch has been discovered
-*		during the current fabric sweep.  This number is reset
-*		to zero at the start of a sweep.
-*
 *	need_update
 *		When set indicates that switch was probably reset, so
 *		fwd tables and rest cached data should be flushed
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index ac86b9a..e40fc82 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -244,51 +244,43 @@ _exit:
 }
 
 /**********************************************************************
- The plock must be held before calling this function.
 **********************************************************************/
-static void
-__osm_ni_rcv_process_new_node(IN osm_sm_t * sm,
-			      IN osm_node_t * const p_node,
-			      IN const osm_madw_t * const p_madw)
+static void ni_rcv_get_port_info(IN osm_sm_t * sm, IN osm_node_t * node,
+				 IN const osm_madw_t * madw)
 {
-	ib_api_status_t status = IB_SUCCESS;
 	osm_madw_context_t context;
-	osm_physp_t *p_physp;
-	ib_node_info_t *p_ni;
-	ib_smp_t *p_smp;
-	uint8_t port_num;
+	osm_physp_t *physp;
+	ib_node_info_t *ni;
+	unsigned port, num_ports;
+	ib_api_status_t status;
 
-	OSM_LOG_ENTER(sm->p_log);
+	ni = ib_smp_get_payload_ptr(osm_madw_get_smp_ptr(madw));
 
-	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
-	port_num = ib_node_info_get_local_port_num(p_ni);
+	if (ni->node_type == IB_NODE_TYPE_SWITCH) {
+		port = 0;
+		num_ports = osm_node_get_num_physp(node);
+	} else {
+		port = ib_node_info_get_local_port_num(ni);
+		num_ports = port + 1;
+	}
 
-	/*
-	   Request PortInfo & NodeDescription attributes for the port
-	   that responded to the NodeInfo attribute.
-	   Because this is a channel adapter or router, we are
-	   not allowed to request PortInfo for the other ports.
-	   Set the context union properly, so the recipient
-	   knows which node & port are relevant.
-	 */
-	p_physp = osm_node_get_physp_ptr(p_node, port_num);
+	physp = osm_node_get_physp_ptr(node, port);
 
-	context.pi_context.node_guid = p_ni->node_guid;
-	context.pi_context.port_guid = p_ni->port_guid;
+	context.pi_context.node_guid = osm_node_get_node_guid(node);
+	context.pi_context.port_guid = osm_physp_get_port_guid(physp);
 	context.pi_context.set_method = FALSE;
 	context.pi_context.light_sweep = FALSE;
 	context.pi_context.active_transition = FALSE;
 
-	status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp),
-			     IB_MAD_ATTR_PORT_INFO,
-			     cl_hton32(port_num), CL_DISP_MSGID_NONE, &context);
-	if (status != IB_SUCCESS)
-		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D02: "
-			"Failure initiating PortInfo request (%s)\n",
-			ib_get_err_str(status));
-
-	OSM_LOG_EXIT(sm->p_log);
+	for (; port < num_ports; port++) {
+		status = osm_req_get(sm, osm_physp_get_dr_path_ptr(physp),
+				     IB_MAD_ATTR_PORT_INFO, cl_hton32(port),
+				     CL_DISP_MSGID_NONE, &context);
+		if (status != IB_SUCCESS)
+			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D02: "
+				"Failure initiating PortInfo request (%s)\n",
+				ib_get_err_str(status));
+	}
 }
 
 /**********************************************************************
@@ -359,7 +351,7 @@ __osm_ni_rcv_process_new_ca_or_router(IN osm_sm_t * sm,
 {
 	OSM_LOG_ENTER(sm->p_log);
 
-	__osm_ni_rcv_process_new_node(sm, p_node, p_madw);
+	ni_rcv_get_port_info(sm, p_node, p_madw);
 
 	/*
 	   A node guid of 0 is the corner case that indicates
@@ -384,10 +376,8 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm,
 	ib_smp_t *p_smp;
 	osm_port_t *p_port;
 	osm_port_t *p_port_check;
-	osm_madw_context_t context;
 	uint8_t port_num;
 	osm_physp_t *p_physp;
-	ib_api_status_t status;
 	osm_dr_path_t *p_dr_path;
 	osm_bind_handle_t h_bind;
 
@@ -461,19 +451,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm,
 				 p_smp->initial_path);
 	}
 
-	context.pi_context.node_guid = p_ni->node_guid;
-	context.pi_context.port_guid = p_ni->port_guid;
-	context.pi_context.set_method = FALSE;
-	context.pi_context.light_sweep = FALSE;
-
-	status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp),
-			     IB_MAD_ATTR_PORT_INFO,
-			     cl_hton32(port_num), CL_DISP_MSGID_NONE, &context);
-
-	if (status != IB_SUCCESS)
-		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D13: "
-			"Failure initiating PortInfo request (%s)\n",
-			ib_get_err_str(status));
+	ni_rcv_get_port_info(sm, p_node, p_madw);
 
 Exit:
 	OSM_LOG_EXIT(sm->p_log);
@@ -513,6 +491,9 @@ __osm_ni_rcv_process_switch(IN osm_sm_t * sm,
 			"Failure initiating SwitchInfo request (%s)\n",
 			ib_get_err_str(status));
 
+	if (p_node->discovery_count == 1)
+		ni_rcv_get_port_info(sm, p_node, p_madw);
+
 	OSM_LOG_EXIT(sm->p_log);
 }
 
@@ -536,7 +517,7 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 	 */
 	if (p_node->discovery_count == 1)
 		__osm_ni_rcv_process_switch(sm, p_node, p_madw);
-	else if (!p_node->sw || p_node->sw->discovery_count == 0) {
+	else if (!p_node->sw) {
 		/* we don't have the SwitchInfo - retry to get it */
 		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
 			"Retry to get SwitchInfo on node GUID:0x%" PRIx64 "\n",
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 6d325cb..58b5dc2 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -726,7 +726,6 @@ static void reset_port_count(cl_map_item_t * const p_map_item, void *cxt)
 static void reset_switch_count(cl_map_item_t * const p_map_item, void *cxt)
 {
 	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
-	p_sw->discovery_count = 0;
 	p_sw->need_update = 0;
 }
 
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index a1efd1a..0d7cf15 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -115,7 +115,6 @@ __osm_state_mgr_reset_switch_count(IN cl_map_item_t * const p_map_item,
 {
 	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
 
-	p_sw->discovery_count = 0;
 	p_sw->need_update = 1;
 }
 
diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index 751c6f4..2f2775a 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -55,53 +55,6 @@
 #include <opensm/osm_helper.h>
 #include <opensm/osm_opensm.h>
 
-/**********************************************************************
- The plock must be held before calling this function.
-**********************************************************************/
-static void si_rcv_get_port_info(IN osm_sm_t * sm, IN osm_switch_t * const p_sw)
-{
-	osm_madw_context_t context;
-	uint8_t port_num;
-	osm_physp_t *p_physp;
-	osm_node_t *p_node;
-	uint8_t num_ports;
-	ib_api_status_t status = IB_SUCCESS;
-
-	OSM_LOG_ENTER(sm->p_log);
-
-	CL_ASSERT(p_sw);
-
-	p_node = p_sw->p_node;
-
-	CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH);
-
-	/*
-	   Request PortInfo attribute for each port on the switch.
-	 */
-	p_physp = osm_node_get_physp_ptr(p_node, 0);
-
-	context.pi_context.node_guid = osm_node_get_node_guid(p_node);
-	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
-	context.pi_context.set_method = FALSE;
-	context.pi_context.light_sweep = FALSE;
-	context.pi_context.active_transition = FALSE;
-
-	num_ports = osm_node_get_num_physp(p_node);
-
-	for (port_num = 0; port_num < num_ports; port_num++) {
-		status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp),
-				     IB_MAD_ATTR_PORT_INFO, cl_hton32(port_num),
-				     CL_DISP_MSGID_NONE, &context);
-		if (status != IB_SUCCESS)
-			/* continue the loop despite the error */
-			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3602: "
-				"Failure initiating PortInfo request (%s)\n",
-				ib_get_err_str(status));
-	}
-
-	OSM_LOG_EXIT(sm->p_log);
-}
-
 #if 0
 /**********************************************************************
  The plock must be held before calling this function.
@@ -307,12 +260,6 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * const p_node,
 	   info we just received.
 	 */
 	osm_switch_set_switch_info(p_sw, p_si);
-	p_sw->discovery_count++;
-
-	/*
-	   Get the PortInfo attribute for every port.
-	 */
-	si_rcv_get_port_info(sm, p_sw);
 
 	/*
 	   Don't bother retrieving the current unicast and multicast tables
@@ -392,24 +339,6 @@ static boolean_t si_rcv_process_existing(IN osm_sm_t * sm,
 						     OSM_LOG_DEBUG);
 				is_change_detected = TRUE;
 			}
-		} else {
-			/*
-			   This is a heavy sweep.  Get information regardless
-			   of the state change bit.
-			 */
-			p_sw->discovery_count++;
-			OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-				"discovery_count is:%u\n",
-				p_sw->discovery_count);
-
-			/* If this is the first discovery - then get the port_info */
-			if (p_sw->discovery_count == 1)
-				si_rcv_get_port_info(sm, p_sw);
-			else
-				OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
-					"Not discovering again through switch:0x%"
-					PRIx64 "\n",
-					osm_node_get_node_guid(p_sw->p_node));
 		}
 	}
 
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sat Feb 28 09:37:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:37:17 +0200
Subject: [ofa-general] [PATCH] opensm: remove casting of
	ib_smp_get_payload_ptr()
Message-ID: <20090228173717.GR7462@sashak.voltaire.com>


ib_smp_get_payload_ptr() returns a void pointer, so the casts are not needed.
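
For readers less familiar with the rule being relied on: in C, a
void pointer converts implicitly to any object pointer type. A minimal
sketch, using hypothetical miniature structures rather than the real
ib_smp_t and ib_node_info_t layouts:

```c
#include <stdint.h>

/* Hypothetical miniature of a MAD with an opaque payload area. */
struct sketch_smp {
	uint8_t header[8];
	uint8_t data[16];
};

/* Hypothetical two-field attribute; not the real NodeInfo layout. */
struct sketch_node_info {
	uint8_t base_version;
	uint8_t node_type;
};

/* Returns the payload as void *, like the accessor named above. */
static void *sketch_get_payload_ptr(struct sketch_smp *smp)
{
	return smp->data;
}

static uint8_t sketch_node_type(struct sketch_smp *smp)
{
	/* No cast: C converts void * to any object pointer implicitly. */
	struct sketch_node_info *ni = sketch_get_payload_ptr(smp);

	return ni->node_type;
}
```

Note that this holds for C only; a C++ compiler would require the
explicit cast.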

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_lin_fwd_rcv.c   |    2 +-
 opensm/opensm/osm_mcast_fwd_rcv.c |    2 +-
 opensm/opensm/osm_node.c          |    4 ++--
 opensm/opensm/osm_node_desc_rcv.c |    2 +-
 opensm/opensm/osm_node_info_rcv.c |   10 +++++-----
 opensm/opensm/osm_pkey_rcv.c      |    2 +-
 opensm/opensm/osm_port_info_rcv.c |    4 ++--
 opensm/opensm/osm_slvl_map_rcv.c  |    2 +-
 opensm/opensm/osm_sw_info_rcv.c   |    6 +++---
 opensm/opensm/osm_switch.c        |    2 +-
 opensm/opensm/osm_vl_arb_rcv.c    |    2 +-
 11 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c
index c3d8633..2edb8d3 100644
--- a/opensm/opensm/osm_lin_fwd_rcv.c
+++ b/opensm/opensm/osm_lin_fwd_rcv.c
@@ -70,7 +70,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_block = (uint8_t *) ib_smp_get_payload_ptr(p_smp);
+	p_block = ib_smp_get_payload_ptr(p_smp);
 	block_num = cl_ntoh32(p_smp->attr_mod);
 
 	/*
diff --git a/opensm/opensm/osm_mcast_fwd_rcv.c b/opensm/opensm/osm_mcast_fwd_rcv.c
index 635c7da..f3d0183 100644
--- a/opensm/opensm/osm_mcast_fwd_rcv.c
+++ b/opensm/opensm/osm_mcast_fwd_rcv.c
@@ -77,7 +77,7 @@ void osm_mft_rcv_process(IN void *context, IN void *data)
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_block = (uint16_t *) ib_smp_get_payload_ptr(p_smp);
+	p_block = ib_smp_get_payload_ptr(p_smp);
 	block_num = cl_ntoh32(p_smp->attr_mod) & IB_MCAST_BLOCK_ID_MASK_HO;
 	position = (uint8_t) ((cl_ntoh32(p_smp->attr_mod) &
 			       IB_MCAST_POSITION_MASK_HO) >>
diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c
index a97477a..ee2fbed 100644
--- a/opensm/opensm/osm_node.c
+++ b/opensm/opensm/osm_node.c
@@ -60,7 +60,7 @@ void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num,
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 	port_guid = p_ni->port_guid;
 
 	CL_ASSERT(port_num < p_node->physp_tbl_size);
@@ -82,7 +82,7 @@ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw)
 	uint32_t size;
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 
 	/*
 	   The node object already contains one physical port object.
diff --git a/opensm/opensm/osm_node_desc_rcv.c b/opensm/opensm/osm_node_desc_rcv.c
index f6178b9..a79fa22 100644
--- a/opensm/opensm/osm_node_desc_rcv.c
+++ b/opensm/opensm/osm_node_desc_rcv.c
@@ -106,7 +106,7 @@ void osm_nd_rcv_process(IN void *context, IN void *data)
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_nd = (ib_node_desc_t *) ib_smp_get_payload_ptr(p_smp);
+	p_nd = ib_smp_get_payload_ptr(p_smp);
 
 	/*
 	   Acquire the node object and add the node description.
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index e40fc82..f5a5082 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -323,7 +323,7 @@ __osm_ni_rcv_get_node_desc(IN osm_sm_t * sm,
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 	port_num = ib_node_info_get_local_port_num(p_ni);
 
 	/*
@@ -384,7 +384,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm,
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 	port_num = ib_node_info_get_local_port_num(p_ni);
 	h_bind = osm_madw_get_bind_handle(p_madw);
 
@@ -573,7 +573,7 @@ __osm_ni_rcv_process_new(IN osm_sm_t * sm,
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 	p_ni_context = osm_madw_get_ni_context_ptr(p_madw);
 	port_num = ib_node_info_get_local_port_num(p_ni);
 
@@ -719,7 +719,7 @@ __osm_ni_rcv_process_existing(IN osm_sm_t * sm,
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 	p_ni_context = osm_madw_get_ni_context_ptr(p_madw);
 	port_num = ib_node_info_get_local_port_num(p_ni);
 
@@ -776,7 +776,7 @@ void osm_ni_rcv_process(IN void *context, IN void *data)
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_ni = ib_smp_get_payload_ptr(p_smp);
 
 	CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_NODE_INFO);
 
diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c
index 7061941..84d8ce1 100644
--- a/opensm/opensm/osm_pkey_rcv.c
+++ b/opensm/opensm/osm_pkey_rcv.c
@@ -77,7 +77,7 @@ void osm_pkey_rcv_process(IN void *context, IN void *data)
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 
 	p_context = osm_madw_get_pkey_context_ptr(p_madw);
-	p_pkey_tbl = (ib_pkey_table_t *) ib_smp_get_payload_ptr(p_smp);
+	p_pkey_tbl = ib_smp_get_payload_ptr(p_smp);
 
 	port_guid = p_context->port_guid;
 	node_guid = p_context->node_guid;
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 654ede7..3e39dff 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -473,7 +473,7 @@ osm_pi_rcv_process_set(IN osm_sm_t * sm, IN osm_node_t * const p_node,
 	port_guid = osm_physp_get_port_guid(p_physp);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_pi = (ib_port_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_pi = ib_smp_get_payload_ptr(p_smp);
 
 	/* check for error */
 	if (cl_ntoh16(p_smp->status) & 0x7fff) {
@@ -532,7 +532,7 @@ void osm_pi_rcv_process(IN void *context, IN void *data)
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 	p_context = osm_madw_get_pi_context_ptr(p_madw);
-	p_pi = (ib_port_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_pi = ib_smp_get_payload_ptr(p_smp);
 
 	CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_PORT_INFO);
 
diff --git a/opensm/opensm/osm_slvl_map_rcv.c b/opensm/opensm/osm_slvl_map_rcv.c
index e177345..b3f0a4c 100644
--- a/opensm/opensm/osm_slvl_map_rcv.c
+++ b/opensm/opensm/osm_slvl_map_rcv.c
@@ -82,7 +82,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 	p_context = osm_madw_get_slvl_context_ptr(p_madw);
-	p_slvl_tbl = (ib_slvl_table_t *) ib_smp_get_payload_ptr(p_smp);
+	p_slvl_tbl = ib_smp_get_payload_ptr(p_smp);
 
 	port_guid = p_context->port_guid;
 	node_guid = p_context->node_guid;
diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index 2f2775a..14df1fd 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -208,7 +208,7 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * const p_node,
 
 	p_sw_guid_tbl = &sm->p_subn->sw_guid_tbl;
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_si = ib_smp_get_payload_ptr(p_smp);
 
 	osm_dump_switch_info(sm->p_log, p_si, OSM_LOG_DEBUG);
 
@@ -302,7 +302,7 @@ static boolean_t si_rcv_process_existing(IN osm_sm_t * sm,
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_si = ib_smp_get_payload_ptr(p_smp);
 	p_si_context = osm_madw_get_si_context_ptr(p_madw);
 
 	if (p_si_context->set_method) {
@@ -365,7 +365,7 @@ void osm_si_rcv_process(IN void *context, IN void *data)
 	CL_ASSERT(p_madw);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_si = ib_smp_get_payload_ptr(p_smp);
 	p_context = osm_madw_get_si_context_ptr(p_madw);
 	node_guid = p_context->node_guid;
 
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index 9807791..6dde47c 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -87,7 +87,7 @@ osm_switch_init(IN osm_switch_t * const p_sw,
 	uint32_t port_num;
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
-	p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp);
+	p_si = ib_smp_get_payload_ptr(p_smp);
 	num_ports = osm_node_get_num_physp(p_node);
 
 	CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO);
diff --git a/opensm/opensm/osm_vl_arb_rcv.c b/opensm/opensm/osm_vl_arb_rcv.c
index ec04d67..89cf7b2 100644
--- a/opensm/opensm/osm_vl_arb_rcv.c
+++ b/opensm/opensm/osm_vl_arb_rcv.c
@@ -83,7 +83,7 @@ void osm_vla_rcv_process(IN void *context, IN void *data)
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 
 	p_context = osm_madw_get_vla_context_ptr(p_madw);
-	p_vla_tbl = (ib_vl_arb_table_t *) ib_smp_get_payload_ptr(p_smp);
+	p_vla_tbl = ib_smp_get_payload_ptr(p_smp);
 
 	port_guid = p_context->port_guid;
 	node_guid = p_context->node_guid;
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Sat Feb 28 11:19:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 21:19:21 +0200
Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm:   Added io_guid_file and
	max_reverse_hops options
In-Reply-To: <49953C48.3030203@ext.bull.net>
References: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>
	<49953C48.3030203@ext.bull.net>
Message-ID: <20090228191921.GA3936@sashak.voltaire.com>

On 10:24 Fri 13 Feb     , Nicolas Morey Chaisemartin wrote:
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied (I will push it to main stream tomorrow). Thanks.

All your patches are whitespace-mangled: the context (unchanged) lines
start with two spaces instead of one. I fixed it with
"sed -e 's/^  / /'", but please check your mailer.
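
To spell out what that substitution does to each line of the patch,
here is an equivalent sketch in C (the function name is hypothetical).
Only lines that begin with two spaces are touched; "+" and "-" lines
pass through unchanged:

```c
#include <string.h>

/*
 * In-place equivalent of the substitution s/^  / / for a single line:
 * if the line starts with two spaces, drop one of them.
 */
static void sketch_unmangle_line(char *line)
{
	if (line[0] == ' ' && line[1] == ' ')
		memmove(line, line + 1, strlen(line + 1) + 1);
}
```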

Also small note below.

> ---
> Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl 
> and wouldn't be read from the cached option file.
>
>  opensm/include/opensm/osm_subnet.h |    6 ++++++
>  opensm/opensm/main.c               |   26 +++++++++++++++++++++++++-
>  opensm/opensm/osm_subnet.c         |   14 ++++++++++++++
>  3 files changed, 45 insertions(+), 1 deletions(-)
>
> diff --git a/opensm/include/opensm/osm_subnet.h 
> b/opensm/include/opensm/osm_subnet.h
> index 8863e47..671b51f 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -190,6 +190,8 @@ typedef struct osm_subn_opt {
>  	char *lfts_file;
>  	char *root_guid_file;
>  	char *cn_guid_file;
> +	char *io_guid_file;
> +       uint16_t max_reverse_hops;

Why should max_reverse_hops be 16 bits long? In IB the maximum hop count
is 64.

(And of course, use tabs for indentation next time.)

Sasha


From sashak at voltaire.com  Sat Feb 28 11:22:22 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 21:22:22 +0200
Subject: [ofa-general] Re: [PATCH 2/3] opensm/osm_ucast_ftree.c: Added
	possible reverse hops for Ftree algorithm.
In-Reply-To: <4993E7CA.60103@ext.bull.net>
References: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
	<4993E7CA.60103@ext.bull.net>
Message-ID: <20090228192210.GB3936@sashak.voltaire.com>

On 10:11 Thu 12 Feb     , Nicolas Morey Chaisemartin wrote:
> This allows connectivity between nodes declared in the io_guid_file
> when they had none with the regular algorithm and a route can be found
> by doing fewer than max_reverse_hops reverse hops in the tree.
> This is meant for I/O and service nodes connected to the top switches
> of a fat tree, which need connectivity but no real bandwidth.
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Next time, please don't mix indentation changes and functional changes
in one patch (if the indentation fix is necessary, send it as a separate
patch).

Sasha


From sashak at voltaire.com  Sat Feb 28 11:23:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 21:23:00 +0200
Subject: [ofa-general] Re: [PATCH 3/3] Added documentation for io_guid_file
	and max_reverse_hop feature
In-Reply-To: <4993E7CE.3090908@ext.bull.net>
References: <cover.1234429755.git.nicolas.morey-chaisemartin@ext.bull.net>
	<4993E7CE.3090908@ext.bull.net>
Message-ID: <20090228192300.GC3936@sashak.voltaire.com>

On 10:11 Thu 12 Feb     , Nicolas Morey Chaisemartin wrote:
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From devel at morey-chaisemartin.com  Sat Feb 28 12:43:12 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Sat, 28 Feb 2009 21:43:12 +0100
Subject: Re: [ofa-general] Re: [PATCH 1/3 v2] opensm: Added
	io_guid_file and max_reverse_hops options
In-Reply-To: <20090228191921.GA3936@sashak.voltaire.com>
References: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>	<49953C48.3030203@ext.bull.net>
	<20090228191921.GA3936@sashak.voltaire.com>
Message-ID: <49A9A1E0.4050005@morey-chaisemartin.com>

Sasha Khapyorsky wrote:
> On 10:24 Fri 13 Feb     , Nicolas Morey Chaisemartin wrote:
>   
>> Signed-off-by: Nicolas Morey-Chaisemartin 
>> <nicolas.morey-chaisemartin at ext.bull.net>
>>     
>
> Applied (I will push it to main stream tomorrow). Thanks.
>
> All your patches are whitespace-mangled: the context (unchanged) lines
> start with two spaces instead of one. I fixed it with
> "sed -e 's/^  / /'", but please check your mailer.
>
> Also small note below.
>   
Thanks for applying and sorry for the indentation.
I tried to put my patches inline as Yevgeni advised me, but it seems
Thunderbird mangles them even though I write the git format-patch
output directly into a Thunderbird draft file.
I guess I'll stick to attachments from now on...
>   
>> ---
>> Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl 
>> and wouldn't be read from the cached option file.
>>
>>  opensm/include/opensm/osm_subnet.h |    6 ++++++
>>  opensm/opensm/main.c               |   26 +++++++++++++++++++++++++-
>>  opensm/opensm/osm_subnet.c         |   14 ++++++++++++++
>>  3 files changed, 45 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/include/opensm/osm_subnet.h 
>> b/opensm/include/opensm/osm_subnet.h
>> index 8863e47..671b51f 100644
>> --- a/opensm/include/opensm/osm_subnet.h
>> +++ b/opensm/include/opensm/osm_subnet.h
>> @@ -190,6 +190,8 @@ typedef struct osm_subn_opt {
>>  	char *lfts_file;
>>  	char *root_guid_file;
>>  	char *cn_guid_file;
>> +	char *io_guid_file;
>> +       uint16_t max_reverse_hops;
>>     
>
> Why should max_reverse_hops be 16 bits long? In IB the maximum hop count
> is 64.
>   

In the OpenSM fat-tree code the maximum tree height is 8, so except on
a really irregular topology max_reverse_hops should fit comfortably in
one byte. To be safe I chose 2 bytes so that it should never overflow.
Anyway, more than 2^16 reverse hops would be a really bad idea,
I guess.
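
Another way to rule out overflow, whatever the field width, would be to
validate the value when the option is parsed. The following is only a
sketch under assumptions: the function name and the bound macro are
hypothetical, and the limit of 64 is simply the IB hop limit quoted
above, not something the fat-tree code enforces today.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed bound, taken from the IB hop limit quoted above. */
#define SKETCH_MAX_REVERSE_HOPS 64

/*
 * Parse an option value into a uint8_t, rejecting anything that is not
 * a plain decimal number in 0..SKETCH_MAX_REVERSE_HOPS.
 * Returns 0 on success, -1 on error (*out is then left untouched).
 */
static int sketch_parse_reverse_hops(const char *str, uint8_t *out)
{
	char *end;
	unsigned long val;

	if (*str < '0' || *str > '9')
		return -1;	/* empty, signed, or leading whitespace */
	errno = 0;
	val = strtoul(str, &end, 10);
	if (errno || *end != '\0' || val > SKETCH_MAX_REVERSE_HOPS)
		return -1;
	*out = (uint8_t)val;
	return 0;
}
```

With a check like this, one byte is enough and an out-of-range value in
the options file is reported instead of being silently truncated.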


Nicolas


From andy.grover at gmail.com  Sat Feb 28 12:44:37 2009
From: andy.grover at gmail.com (Andrew Grover)
Date: Sat, 28 Feb 2009 12:44:37 -0800
Subject: Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <20090228055608.GB26292@one.firstfloor.org>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
	<c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
	<20090228055608.GB26292@one.firstfloor.org>
Message-ID: <c0a09e5c0902281244u356acbecn98f37ac3abd5ebc0@mail.gmail.com>

On Fri, Feb 27, 2009 at 9:56 PM, Andi Kleen <andi at firstfloor.org> wrote:
> On Fri, Feb 27, 2009 at 05:53:19PM -0800, Andrew Grover wrote:
>> On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen <andi at firstfloor.org> wrote:
>> >> This patchset against net-next adds support for RDS sockets. RDS is an
>> >> Oracle-originated protocol used to send IPC datagrams (up to 1MB)
>> >> reliably, and is used currently in Oracle RAC and Exadata products.
>> >
>> > Perhaps I missed it earlier, but what is the rationale for putting
>> > this as a socket type into the kernel? I assume they also work
>> > directly as implemented in user space using raw sockets or similar,
>> > don't they?
>>
>> You want me to implement my fancy protocol in userspace???
>
> I just asked why you're putting it in kernel space.
>
>> Do I even get to write it in C or do I need to use Ruby?
>
> Well normally people who add new subsystems to the kernel explain
> why they do that. Perhaps it's obvious to you, but at least to
> me it isn't.

Sure thing, sorry to be flippant :-)

The previous solution for IPC that Oracle was using was based on UDP,
which I think could be considered very close to using raw sockets --
each process is responsible for its own acks, retransmits, everything.
Doing this on a highly loaded machine resulted in a cascade where
performance got worse and worse. Moving this to kernel code made a big
difference.

Additionally, our interconnect is primarily Infiniband. It natively
implements a reliable datagram connection type so RDS leverages that.
RDS multiplexes all processes' traffic between two hosts over a single
IB connection. Since RDS is managing IB connections at the host level
(but based on socket traffic) this is also more naturally a fit for
kernel code.
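
To make the multiplexing point concrete, here is a toy model in C of
"one connection per remote host, shared by every local socket". It is
purely illustrative: the structure, the fixed-size table, and the names
are hypothetical and are not taken from the RDS patches.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CONNS 16

/* Hypothetical per-host connection record; not the RDS structure. */
struct sketch_conn {
	uint32_t peer_addr;	/* remote host address */
	unsigned users;		/* sockets currently sharing this connection */
	int in_use;
};

static struct sketch_conn sketch_conns[SKETCH_MAX_CONNS];

/*
 * Return the connection to peer_addr, creating it on first use.
 * Every socket that talks to the same host gets the same connection.
 */
static struct sketch_conn *sketch_conn_get(uint32_t peer_addr)
{
	struct sketch_conn *free_slot = NULL;
	size_t i;

	for (i = 0; i < SKETCH_MAX_CONNS; i++) {
		if (sketch_conns[i].in_use &&
		    sketch_conns[i].peer_addr == peer_addr) {
			sketch_conns[i].users++;
			return &sketch_conns[i];
		}
		if (!sketch_conns[i].in_use && !free_slot)
			free_slot = &sketch_conns[i];
	}
	if (!free_slot)
		return NULL;	/* table full */
	free_slot->peer_addr = peer_addr;
	free_slot->users = 1;
	free_slot->in_use = 1;
	return free_slot;
}
```

Because the table is keyed by host rather than by process, it has to
live somewhere all processes share, which is the argument for kernel
code made above.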

Regards -- Andy


From sashak at voltaire.com  Sat Feb 28 13:56:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 23:56:55 +0200
Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm:   Added io_guid_file
	and max_reverse_hops options
In-Reply-To: <49A9A1E0.4050005@morey-chaisemartin.com>
References: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>
	<49953C48.3030203@ext.bull.net>
	<20090228191921.GA3936@sashak.voltaire.com>
	<49A9A1E0.4050005@morey-chaisemartin.com>
Message-ID: <20090228215645.GD3936@sashak.voltaire.com>

On 21:43 Sat 28 Feb     , Nicolas Morey-Chaisemartin wrote:
> I tried to put my patches inline as Yevgeni advised me, but it seems
> Thunderbird mangles them even though I write the git format-patch
> output directly into a Thunderbird draft file.
> I guess I'll stick to attachments from now on...

Attached patches are not review-friendly. Look at the Thunderbird-related
section of:

http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches

Sasha


From andi at firstfloor.org  Sat Feb 28 14:36:53 2009
From: andi at firstfloor.org (Andi Kleen)
Date: Sat, 28 Feb 2009 23:36:53 +0100
Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2
In-Reply-To: <c0a09e5c0902281244u356acbecn98f37ac3abd5ebc0@mail.gmail.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
	<c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
	<20090228055608.GB26292@one.firstfloor.org>
	<c0a09e5c0902281244u356acbecn98f37ac3abd5ebc0@mail.gmail.com>
Message-ID: <20090228223653.GD26292@one.firstfloor.org>

> The previous solution for IPC that Oracle was using was based on UDP,
> which I think could be considered very close to using raw sockets --
> each process is responsible for its own acks, retransmits, everything.
> Doing this on a highly loaded machine resulted in a cascade where
> performance got worse and worse.

Could you describe that cascade in more detail? 

The problem was that the retransmits didn't have high enough priority? 

> Additionally, our interconnect is primarily Infiniband. It natively
> implements a reliable datagram connection type so RDS leverages that.

So perhaps it would make more sense to have a thin direct interface
to that IB service? Or perhaps it already exists? (I admit I don't know
the IB interfaces very well) 

-andi

-- 
ak at linux.intel.com -- Speaking for myself only.


From sashak at voltaire.com  Sat Feb 28 16:59:52 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Mar 2009 02:59:52 +0200
Subject: [ofa-general] Re: [PATCH 2/2] perfquery: add
	PortXmtDataSL/PortRcvDataSL read and reset
In-Reply-To: <Pine.LNX.4.64.0902261441010.29110@zuben.voltaire.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
	<Pine.LNX.4.64.0902261438200.29061@zuben.voltaire.com>
	<Pine.LNX.4.64.0902261441010.29110@zuben.voltaire.com>
Message-ID: <20090301005952.GE3936@sashak.voltaire.com>

Hi Or,

On 14:41 Thu 26 Feb     , Or Gerlitz wrote:
> 
> For some reason the Xmt SL help is printed twice, any idea why?

Yes. You added '-s' option, but 's' letter is used already by
ibdiag_common:

Usage: perfquery [options]  [<lid|guid> [[port] [reset_mask]]]

Options:
  --extended, -x          show extended port counters
  --all_ports, -a         show aggregated counters
  --loop_ports, -l        iterate through each port
  --reset_after_read, -r  reset counters after read
  --Reset_only, -R        only reset counters
  --Ca, -C <ca>           Ca name to use
  --Port, -P <port>       Ca port number to use
  --Lid, -L               use LID address argument
  --Guid, -G              use GUID address argument
  --timeout, -t <ms>      timeout in ms
  --sm_port, -s <lid>     SM port lid
  ^^^^^^^^^^^^^^^
  ...


You can mask it by passing 's' as part of the exclude string to
ibdiag_process_opts(). Or just find another, "free" letter for your
option.
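
A clash like this can be caught mechanically. Below is a minimal sketch,
not the ibdiag_common code: both option tables and the function name
count_short_opt_clashes() are made up for illustration, and only the
standard getopt_long() 'struct option' layout is assumed.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the options every diag tool inherits. */
const struct option common_opts[] = {
	{"timeout", required_argument, NULL, 't'},
	{"sm_port", required_argument, NULL, 's'},
	{0}
};

/* Hypothetical tool-specific options; 's' collides with --sm_port. */
const struct option tool_opts[] = {
	{"extended", no_argument, NULL, 'x'},
	{"xmtsl", no_argument, NULL, 's'},
	{0}
};

/* Return how many short letters appear in both tables. */
int count_short_opt_clashes(const struct option *a, const struct option *b)
{
	int clashes = 0;

	assert(a != NULL && b != NULL);
	for (const struct option *i = a; i->name; i++)
		for (const struct option *j = b; j->name; j++)
			if (i->val == j->val) {
				fprintf(stderr, "-%c is used by both --%s and --%s\n",
					i->val, i->name, j->name);
				clashes++;
			}
	return clashes;
}
```

Calling count_short_opt_clashes(common_opts, tool_opts) on the tables
above reports the single 's' clash.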

Sasha


From andy.grover at gmail.com  Sat Feb 28 16:58:25 2009
From: andy.grover at gmail.com (Andrew Grover)
Date: Sat, 28 Feb 2009 16:58:25 -0800
Subject: Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets
	(RDS), take 2
In-Reply-To: <20090228223653.GD26292@one.firstfloor.org>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
	<c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
	<20090228055608.GB26292@one.firstfloor.org>
	<c0a09e5c0902281244u356acbecn98f37ac3abd5ebc0@mail.gmail.com>
	<20090228223653.GD26292@one.firstfloor.org>
Message-ID: <c0a09e5c0902281658v48084cd8w8f95871395020e20@mail.gmail.com>

On Sat, Feb 28, 2009 at 2:36 PM, Andi Kleen <andi at firstfloor.org> wrote:
>> The previous solution for IPC that Oracle was using was based on UDP,
>> which I think could be considered very close to using raw sockets --
>> each process is responsible for its own acks, retransmits, everything.
>> Doing this on a highly loaded machine resulted in a cascade where
>> performance got worse and worse.
>
> Could you describe that cascade in more detail?
> The problem was that the retransmits didn't have high enough priority?

I think the gist of it is:

Higher load -> more time before a process runs -> rcvbuf overfills ->
ACKs dropped -> timeouts -> more retransmissions -> even higher load.

Things are fine until they hit a point where everything goes to hell.

>> Additionally, our interconnect is primarily Infiniband. It natively
>> implements a reliable datagram connection type so RDS leverages that.
> So perhaps it would make more sense to have a thin direct interface
> to that IB service? Or perhaps it already exists? (I admit I don't know
> the IB interfaces very well)

The most direct userspace API is uDAPL -- apps can create IB
connections (queue pairs) directly. This was tried but didn't work out
so well. A queue pair (QP) is a TX/RX ring -- a nontrivial amount of
memory. If each process needs a new QP to talk to every other process
then the number of RAM-hungry QPs becomes huge.

RDS is only slightly less direct -- apps don't create queue pairs,
they create RDS sockets. RDS uses only one QP for all traffic to each
remote node, so the number of QPs on a node is equal to the number of
remote nodes, as opposed to (number of local processes * number of
remote processes).
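
To put rough numbers on that, here is a small sketch of the arithmetic.
The function names and the cluster size used below are illustrative
assumptions, not measurements from any real deployment.

```c
#include <assert.h>

/* QPs one node needs when every local process opens its own QP to
 * every remote process (the per-process model). */
unsigned long qps_per_process_model(unsigned long local_procs,
				    unsigned long remote_procs)
{
	return local_procs * remote_procs;
}

/* QPs one node needs with one shared QP per remote node (the RDS
 * model): every node in the cluster except itself. */
unsigned long qps_rds_model(unsigned long nodes)
{
	assert(nodes >= 1);
	return nodes - 1;
}
```

For a hypothetical cluster of 16 nodes running 64 processes each, a node
has 15 * 64 = 960 remote processes to talk to, so the per-process model
needs 64 * 960 = 61440 QPs on that node, while the RDS model needs 15.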

Regards -- Andy


From andi at firstfloor.org  Sat Feb 28 17:50:20 2009
From: andi at firstfloor.org (Andi Kleen)
Date: Sun, 1 Mar 2009 02:50:20 +0100
Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2
In-Reply-To: <c0a09e5c0902281658v48084cd8w8f95871395020e20@mail.gmail.com>
References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com>
	<87myc73izx.fsf@basil.nowhere.org>
	<c0a09e5c0902271753n1a522647g6e7ac5465149c5a4@mail.gmail.com>
	<20090228055608.GB26292@one.firstfloor.org>
	<c0a09e5c0902281244u356acbecn98f37ac3abd5ebc0@mail.gmail.com>
	<20090228223653.GD26292@one.firstfloor.org>
	<c0a09e5c0902281658v48084cd8w8f95871395020e20@mail.gmail.com>
Message-ID: <20090301015020.GH26292@one.firstfloor.org>

> Higher load -> more time before a process runs -> rcvbuf overfills ->

How can the rcvbuf overfill if the sender doesn't run?

> ACKs dropped -> timeouts -> more retransmissions -> even higher load.
> 
> Things are fine until they hit a point where everything goes to hell.

-Andi


From sashak at voltaire.com  Sat Feb 28 23:00:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Mar 2009 09:00:20 +0200
Subject: [ofa-general] Re: [PATCH 1/10] libibmad: Clean up "new" interface
In-Reply-To: <20090219190525.322681b8.weiny2@llnl.gov>
References: <20090219190525.322681b8.weiny2@llnl.gov>
Message-ID: <20090301070013.GF3936@sashak.voltaire.com>

Hi Ira,

On 19:05 Thu 19 Feb     , Ira Weiny wrote:
> From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 18 Feb 2009 16:37:36 -0800
> Subject: [PATCH] libibmad: Clean up "new" interface

Please don't put the email header into the commit message body; it breaks
tools like 'git rebase'. At least put '>' before the first 'From '
line.

> 
>    type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *

Do you plan to expose 'struct ibmad_port' (I see later in the patches that
it goes into a libibmad internal header file)?
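
If the intent is to keep the struct opaque, the usual C pattern would
look roughly like the sketch below. All demo_* names are made up for
illustration and are not libibmad symbols; the single port_id field is
only an assumption about what an internal layout might hold.

```c
#include <assert.h>
#include <stdlib.h>

/* Would live in the public header: the type is declared, not defined,
 * so callers can hold a pointer but cannot touch the fields. */
struct demo_port;
struct demo_port *demo_open_port(int port_id);
int demo_portid(const struct demo_port *port);
void demo_close_port(struct demo_port *port);

/* Would live in an internal header: the actual layout. */
struct demo_port {
	int port_id;
};

struct demo_port *demo_open_port(int port_id)
{
	struct demo_port *p = malloc(sizeof(*p));

	if (p)
		p->port_id = port_id;
	return p;	/* NULL on allocation failure */
}

/* Accessor, so users never need the struct definition. */
int demo_portid(const struct demo_port *port)
{
	assert(port != NULL);
	return port->port_id;
}

void demo_close_port(struct demo_port *port)
{
	free(port);
}
```

With this split the compiler still type-checks every call that takes a
'struct demo_port *', which a plain 'void *' cannot do, and the layout
can change without breaking users.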

>    Create new mad_rpc_portid(struct ibmad_port *srcport) function
>       which mirrors madrpc_portid(void)
>    Mark all "old" functions with __attribute__ ((deprecated))

This generates a lot of warnings right now (even after applying the whole
patch series there are still deprecated usages in libibmad itself), and
this is not very good. I think our flow should go in the opposite
direction: first convert, then mark the functions deprecated.

For now, as a quick workaround, I can mask the deprecation with a macro:

#define DEPRECATED /* __attribute__ ((deprecated)) */

and we will uncomment this when everything in the tree is converted.
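
A variant that avoids editing the header later is to hide the attribute
behind a build-time switch. This is only a sketch: the switch name
DEMO_WARN_DEPRECATED and the demo_* functions are hypothetical, and the
attribute syntax assumes GCC or a compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: build with -DDEMO_WARN_DEPRECATED once the
 * tree is converted to get the warnings back. */
#ifdef DEMO_WARN_DEPRECATED
#define DEPRECATED __attribute__ ((deprecated))
#else
#define DEPRECATED /* __attribute__ ((deprecated)) */
#endif

/* "Old" call: marked through the macro, so it is silent by default. */
int demo_old_portid(void) DEPRECATED;
/* "New" call: takes its state explicitly and is never marked. */
int demo_new_portid(const int *port_id);

int demo_old_portid(void)
{
	return 0;
}

int demo_new_portid(const int *port_id)
{
	assert(port_id != NULL);
	return *port_id;
}
```

The attribute goes on the declaration only; callers of demo_old_portid()
compile silently until the switch is defined.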

Also, after looking over the patch series I see that all "original" function
names become deprecated and are replaced by their *_via() brothers. How do
you see the next step? Will we remove the old names and leave almost all API
calls with a then-useless _via suffix?

Sasha

> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
>  libibmad/src/gs.c                 |   19 +++---
>  libibmad/src/libibmad.map         |    1 +
>  libibmad/src/resolve.c            |   10 ++-
>  libibmad/src/rpc.c                |   29 ++++----
>  libibmad/src/sa.c                 |    4 +-
>  libibmad/src/smp.c                |    4 +-
>  7 files changed, 118 insertions(+), 88 deletions(-)
> 
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 1aaaa1b..80e38be 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
>  }
>  
>  /* rpc.c */
> -MAD_EXPORT int madrpc_portid(void);
> -MAD_EXPORT int madrpc_set_retries(int retries);
> -MAD_EXPORT int madrpc_set_timeout(int timeout);
> -void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata);
> -void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp,
> -		  void *data);
> +MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated));
> +void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata)
> +		__attribute__ ((deprecated));
> +void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data)
> +		__attribute__ ((deprecated));
>  MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes,
> -			    int num_classes);
> -void madrpc_save_mad(void *madbuf, int len);
> -MAD_EXPORT void madrpc_show_errors(int set);
> +			    int num_classes) __attribute__ ((deprecated));
> +void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated));
>  
> -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
> +/* New interface */
> +MAD_EXPORT void madrpc_show_errors(int set);
> +MAD_EXPORT int madrpc_set_retries(int retries);
> +MAD_EXPORT int madrpc_set_timeout(int timeout);
> +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
>  			int num_classes);
> -void mad_rpc_close_port(void *ibmad_port);
> -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> -	      void *payload, void *rcvdata);
> -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport,
> -		   ib_rmpp_hdr_t * rmpp, void *data);
> +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
> +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> +			void *payload, void *rcvdata);
> +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
> +			ib_rmpp_hdr_t * rmpp, void *data);
> +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
>  
>  /* smp.c */
>  MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid,
> -			      unsigned mod, unsigned timeout);
> +		      unsigned mod, unsigned timeout) __attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
> -			    unsigned mod, unsigned timeout);
> +		    unsigned mod, unsigned timeout) __attribute__ ((deprecated));
> +
> +/* smp.c new interface */
>  MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> -		       unsigned mod, unsigned timeout, const void *srcport);
> -uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> -		     unsigned timeout, const void *srcport);
> +		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> +		     unsigned timeout, const struct ibmad_port *srcport);
>  
>  /* sa.c */
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> -		 unsigned timeout);
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> +		 unsigned timeout) __attribute__ ((deprecated));
> +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id,
> +		void *buf) __attribute__ ((deprecated));
> +
> +/* sa.c new interface */
> +MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid,
>  		     ib_sa_call_t * sa, unsigned timeout);
> -MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);	/* returns lid */
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>  		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf);
> +	/* returns lid */
>  
>  /* resolve.c */
> -MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout);
> +MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
> +				__attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
> -			       ib_portid_t * sm_id, int timeout);
> +			       ib_portid_t * sm_id, int timeout)
> +				__attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
> -				     enum MAD_DEST dest, ib_portid_t * sm_id);
> +				     enum MAD_DEST dest, ib_portid_t * sm_id)
> +				__attribute__ ((deprecated));
>  MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum,
> -			       ibmad_gid_t * gid);
> -
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport);
> -int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -			ib_portid_t * sm_id, int timeout, const void *srcport);
> -int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
> +			       ibmad_gid_t * gid)
> +				__attribute__ ((deprecated));
> +
> +/* resolve.c new interface */
> +MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +			const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> +			ib_portid_t * sm_id, int timeout,
> +			const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>  			      enum MAD_DEST dest, ib_portid_t * sm_id,
> -			      const void *srcport);
> -int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -			const void *srcport);
> +			      const struct ibmad_port *srcport);
> +MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> +			const struct ibmad_port *srcport);
>  
>  /* gs.c */
>  MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest,
> -					     int port, unsigned timeout);
> +					     int port, unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest,
> -					   int port, unsigned timeout);
> +					   int port, unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest,
>  					   int port, unsigned mask,
> -					   unsigned timeout);
> +					   unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest,
> -					       int port, unsigned timeout);
> +					       int port, unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest,
>  					       int port, unsigned mask,
> -					       unsigned timeout);
> +					       unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest,
> -					       int port, unsigned timeout);
> +					       int port, unsigned timeout)
> +						__attribute__ ((deprecated));
>  MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest,
> -					      int port, unsigned timeout);
> +					      int port, unsigned timeout)
> +						__attribute__ ((deprecated));
>  
> -uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
> +/* gs.c new interface */
> +MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>  				      int port, unsigned timeout,
> -				      const void *srcport);
> -uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -				    unsigned timeout, const void *srcport);
> -uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
> +				      const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> +				    unsigned timeout, const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>  				    unsigned mask, unsigned timeout,
> -				    const void *srcport);
> -uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
> +				    const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned timeout,
> -					const void *srcport);
> -uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
> +					const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned mask,
> -					unsigned timeout, const void *srcport);
> -uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
> +					unsigned timeout,
> +					const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned timeout,
> -					const void *srcport);
> -uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
> +					const struct ibmad_port *srcport);
> +MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>  				       int port, unsigned timeout,
> -				       const void *srcport);
> +				       const struct ibmad_port *srcport);
>  /* dump.c */
>  MAD_EXPORT ib_mad_dump_fn
>      mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c
> index d2c4574..e302caf 100644
> --- a/libibmad/src/gs.c
> +++ b/libibmad/src/gs.c
> @@ -47,7 +47,7 @@
>  
>  static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port,
>  			      unsigned timeout, unsigned id,
> -			      const void *srcport)
> +			      const struct ibmad_port *srcport)
>  {
>  	ib_rpc_t rpc = { 0 };
>  	int lid = dest->lid;
> @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout,
>  
>  uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest,
>  				      int port, unsigned timeout,
> -				      const void *srcport)
> +				      const struct ibmad_port *srcport)
>  {
>  	return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO,
>  			     srcport);
> @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port,
>  }
>  
>  uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port,
> -				    unsigned timeout, const void *srcport)
> +				    unsigned timeout, const struct ibmad_port *srcport)
>  {
>  	return pma_query_via(rcvbuf, dest, port, timeout,
>  			     IB_GSI_PORT_COUNTERS, srcport);
> @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest,
>  				      int port, unsigned mask, unsigned timeout,
> -				      unsigned id, const void *srcport)
> +				      unsigned id, const struct ibmad_port *srcport)
>  {
>  	ib_rpc_t rpc = { 0 };
>  	int lid = dest->lid;
> @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port,
>  				    unsigned mask, unsigned timeout,
> -				    const void *srcport)
> +				    const struct ibmad_port *srcport)
>  {
>  	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>  				     IB_GSI_PORT_COUNTERS, srcport);
> @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned timeout,
> -					const void *srcport)
> +					const struct ibmad_port *srcport)
>  {
>  	return pma_query_via(rcvbuf, dest, port, timeout,
>  			     IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned mask,
> -					unsigned timeout, const void *srcport)
> +					unsigned timeout,
> +					const struct ibmad_port *srcport)
>  {
>  	return performance_reset_via(rcvbuf, dest, port, mask, timeout,
>  				     IB_GSI_PORT_COUNTERS_EXT, srcport);
> @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest,
>  					int port, unsigned timeout,
> -					const void *srcport)
> +					const struct ibmad_port *srcport)
>  {
>  	return pma_query_via(rcvbuf, dest, port, timeout,
>  			     IB_GSI_PORT_SAMPLES_CONTROL, srcport);
> @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port,
>  
>  uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest,
>  				       int port, unsigned timeout,
> -				       const void *srcport)
> +				       const struct ibmad_port *srcport)
>  {
>  	return pma_query_via(rcvbuf, dest, port, timeout,
>  			     IB_GSI_PORT_SAMPLES_RESULT, srcport);
> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> index f944d86..94d7762 100644
> --- a/libibmad/src/libibmad.map
> +++ b/libibmad/src/libibmad.map
> @@ -69,6 +69,7 @@ IBMAD_1.3 {
>  		mad_rpc_close_port;
>  		mad_rpc;
>  		mad_rpc_rmpp;
> +		mad_rpc_portid;
>  		madrpc;
>  		madrpc_def_timeout;
>  		madrpc_init;
> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> index 553949d..3291f43 100644
> --- a/libibmad/src/resolve.c
> +++ b/libibmad/src/resolve.c
> @@ -45,7 +45,8 @@
>  #undef DEBUG
>  #define DEBUG 	if (ibdebug)	IBWARN
>  
> -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport)
> +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
> +			const struct ibmad_port *srcport)
>  {
>  	ib_portid_t self = { 0 };
>  	uint8_t portinfo[64];
> @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
>  }
>  
>  int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
> -			ib_portid_t * sm_id, int timeout, const void *srcport)
> +			ib_portid_t * sm_id, int timeout,
> +			const struct ibmad_port *srcport)
>  {
>  	ib_portid_t sm_portid;
>  	char buf[IB_SA_DATA_SIZE] = { 0 };
> @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>  
>  int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str,
>  			      enum MAD_DEST dest_type, ib_portid_t * sm_id,
> -			      const void *srcport)
> +			      const struct ibmad_port *srcport)
>  {
>  	uint64_t guid;
>  	int lid;
> @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
>  }
>  
>  int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
> -			const void *srcport)
> +			const struct ibmad_port *srcport)
>  {
>  	ib_portid_t self = { 0 };
>  	uint8_t portinfo[64];
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index e811526..d47873b 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -100,6 +100,11 @@ int madrpc_portid(void)
>  	return mad_portid;
>  }
>  
> +int mad_rpc_portid(struct ibmad_port *srcport)
> +{
> +	return (srcport->port_id);
> +}
> +
>  static int
>  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>  	   int timeout)
> @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>  	return -1;
>  }
>  
> -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>  	      void *payload, void *rcvdata)
>  {
> -	const struct ibmad_port *p = port_id;
>  	int status, len;
>  	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>  
> @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>  	if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0)
>  		return 0;
>  
> -	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -			      p->class_agents[rpc->mgtclass],
> +	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +			      port->class_agents[rpc->mgtclass],
>  			      len, rpc->timeout)) < 0) {
>  		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>  		return 0;
> @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>  	return rcvdata;
>  }
>  
> -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
> +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport,
>  		   ib_rmpp_hdr_t * rmpp, void *data)
>  {
> -	const struct ibmad_port *p = port_id;
>  	int status, len;
>  	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
>  
> @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport,
>  	if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0)
>  		return 0;
>  
> -	if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf,
> -			      p->class_agents[rpc->mgtclass],
> +	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
> +			      port->class_agents[rpc->mgtclass],
>  			      len, rpc->timeout)) < 0) {
>  		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>  		return 0;
> @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes)
>  	}
>  }
>  
> -void *mad_rpc_open_port(char *dev_name, int dev_port,
> +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
>  			int *mgmt_classes, int num_classes)
>  {
>  	struct ibmad_port *p;
> @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port,
>  	return p;
>  }
>  
> -void mad_rpc_close_port(void *port_id)
> +void mad_rpc_close_port(struct ibmad_port *port)
>  {
> -	struct ibmad_port *p = port_id;
> -
> -	umad_close_port(p->port_id);
> -	free(p);
> +	umad_close_port(port->port_id);
> +	free(port);
>  }
>  
>  uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa,
> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c
> index 7403d4f..ddeb152 100644
> --- a/libibmad/src/sa.c
> +++ b/libibmad/src/sa.c
> @@ -44,7 +44,7 @@
>  #undef DEBUG
>  #define DEBUG 	if (ibdebug)	IBWARN
>  
> -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
> +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>  		     ib_sa_call_t * sa, unsigned timeout)
>  {
>  	ib_rpc_t rpc = { 0 };
> @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid,
>  			IB_PR_COMPMASK_SGID |\
>  			IB_PR_COMPMASK_NUMBPATH)
>  
> -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid,
> +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid,
>  		      ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf)
>  {
>  	int npath;
> diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c
> index fad263c..e5489b3 100644
> --- a/libibmad/src/smp.c
> +++ b/libibmad/src/smp.c
> @@ -45,7 +45,7 @@
>  #define DEBUG 	if (ibdebug)	IBWARN
>  
>  uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid,
> -		     unsigned mod, unsigned timeout, const void *srcport)
> +		     unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>  	ib_rpc_t rpc = { 0 };
>  
> @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid,
>  }
>  
>  uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid,
> -		       unsigned mod, unsigned timeout, const void *srcport)
> +		       unsigned mod, unsigned timeout, const struct ibmad_port *srcport)
>  {
>  	ib_rpc_t rpc = { 0 };
>  
> -- 
> 1.5.4.5
> 


From kliteyn at dev.mellanox.co.il  Sat Feb 28 23:16:52 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 01 Mar 2009 09:16:52 +0200
Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm:   Added io_guid_file
	and max_reverse_hops options
In-Reply-To: <20090228215645.GD3936@sashak.voltaire.com>
References: <cover.1234517001.git.nicolas.morey-chaisemartin@ext.bull.net>	<49953C48.3030203@ext.bull.net>	<20090228191921.GA3936@sashak.voltaire.com>	<49A9A1E0.4050005@morey-chaisemartin.com>
	<20090228215645.GD3936@sashak.voltaire.com>
Message-ID: <49AA3664.8090104@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 21:43 Sat 28 Feb     , Nicolas Morey-Chaisemartin wrote:
>> I tried to put my patches inline as Yevgeni advised me but it seems
>> thunderbird messes things up though I directly output git format-patch
>> into a thunderbird draft file.
>> I guess I'll stick to attachment from now on...
> 
> Attached patches are not friendly for reviewing. Look at Thunderbird
> related section of:
> 
> http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches

The Thunderbird section describes two options.
There's also a third option - the QuickText Thunderbird extension:
 
https://addons.mozilla.org/en-US/thunderbird/addon/640
 
With this extension you will get a new bar when composing mail.
Go to "Other"->"Insert file as text" and insert the patch.

-- Yevgeny

> Sasha


From sashak at voltaire.com  Sat Feb 28 23:26:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Mar 2009 09:26:31 +0200
Subject: [ofa-general] [PATCH 11/10] libibmad:infiniband-diags:
	deprecate madrpc_set_[retries|timeout]  WAS: [PATCH 1/10] libibmad:
	Clean up  "new" interface
In-Reply-To: <20090220143402.c3b23b0a.weiny2@llnl.gov>
References: <20090219190525.322681b8.weiny2@llnl.gov>
	<f0e08f230902200541x5869effbv64b2f782d5f9cdec@mail.gmail.com>
	<f0e08f230902201024t671ad122t2072c519b6d8f772@mail.gmail.com>
	<20090220143402.c3b23b0a.weiny2@llnl.gov>
Message-ID: <20090301072622.GG3936@sashak.voltaire.com>

On 14:34 Fri 20 Feb     , Ira Weiny wrote:
> On Fri, 20 Feb 2009 13:24:35 -0500
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> 
> > On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock
> > <hal.rosenstock at gmail.com> wrote:
> > > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > >> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001
> > >> From: Ira Weiny <weiny2 at llnl.gov>
> > >> Date: Wed, 18 Feb 2009 16:37:36 -0800
> > >> Subject: [PATCH] libibmad: Clean up "new" interface
> > >>
> > >>   type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> > >>   Create new mad_rpc_portid(struct ibmad_port *srcport) function
> > >>      which mirrors madrpc_portid(void)
> > >>   Mark all "old" functions with __attribute__ ((deprecated))
> > >>
> > >> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> > >> ---
> > >>  libibmad/include/infiniband/mad.h |  139 ++++++++++++++++++++++---------------
> > >>  libibmad/src/gs.c                 |   19 +++---
> > >>  libibmad/src/libibmad.map         |    1 +
> > >>  libibmad/src/resolve.c            |   10 ++-
> > >>  libibmad/src/rpc.c                |   29 ++++----
> > >>  libibmad/src/sa.c                 |    4 +-
> > >>  libibmad/src/smp.c                |    4 +-
> > >>  7 files changed, 118 insertions(+), 88 deletions(-)
> > >>
> > >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> > >> index 1aaaa1b..80e38be 100644
> > >> --- a/libibmad/include/infiniband/mad.h
> > >> +++ b/libibmad/include/infiniband/mad.h
> > >> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt)
> > >>  }
> > >>
> > >>  /* rpc.c */
> > >> -MAD_EXPORT int madrpc_portid(void);
> > >> -MAD_EXPORT int madrpc_set_retries(int retries);
> > >> -MAD_EXPORT int madrpc_set_timeout(int timeout);
> > 
> > retries and timeouts could also be made per ibmad_port struct basis
> > rather than one for all clients. Those two APIs would be deprecated in
> > favor of new ones (mad_rpc_set_retries/timeout).
> > 
> 
> Patch below.  (To be applied after the others.)
> 
> 
> >From d12b291041bdfe0d3bddecb7a71ee769a601fd83 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Fri, 20 Feb 2009 14:30:52 -0800
> Subject: [PATCH] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout]
> 
> 	replace with mad_rpc_set_[retries|timeout] which are per ibmad_port
> 	object
> 	Update all diags with new functions
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/src/ibaddr.c        |    1 +
>  infiniband-diags/src/ibdiag_common.c |    1 -
>  infiniband-diags/src/ibping.c        |    1 +
>  infiniband-diags/src/ibportstate.c   |    1 +
>  infiniband-diags/src/ibroute.c       |    1 +
>  infiniband-diags/src/ibsendtrap.c    |    1 +
>  infiniband-diags/src/ibsysstat.c     |    1 +
>  infiniband-diags/src/ibtracert.c     |    1 +
>  infiniband-diags/src/perfquery.c     |    1 +
>  infiniband-diags/src/saquery.c       |    1 +
>  infiniband-diags/src/sminfo.c        |    1 +
>  infiniband-diags/src/smpquery.c      |    1 +
>  infiniband-diags/src/vendstat.c      |    1 +
>  libibmad/include/infiniband/mad.h    |    6 ++++--
>  libibmad/src/libibmad.map            |    2 ++
>  libibmad/src/mad_internal.h          |    2 ++
>  libibmad/src/rpc.c                   |   29 ++++++++++++++++++++---------
>  17 files changed, 40 insertions(+), 12 deletions(-)
> 
> diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
> index bb22be9..e782b36 100644
> --- a/infiniband-diags/src/ibaddr.c
> +++ b/infiniband-diags/src/ibaddr.c
> @@ -142,6 +142,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (argc) {
>  		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
> index 609df69..38d6cd3 100644
> --- a/infiniband-diags/src/ibdiag_common.c
> +++ b/infiniband-diags/src/ibdiag_common.c
> @@ -175,7 +175,6 @@ static int process_opt(int ch, char *optarg)
>  		break;
>  	case 't':
>  		val = strtoul(optarg, 0, 0);
> -		madrpc_set_timeout(val);
>  		ibd_timeout = val;
>  		break;
>  	case 's':
> diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c
> index 901079f..28e3a64 100644
> --- a/infiniband-diags/src/ibping.c
> +++ b/infiniband-diags/src/ibping.c
> @@ -213,6 +213,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (server) {
>  		if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0)
> diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
> index 65c9ca1..deaad51 100644
> --- a/infiniband-diags/src/ibportstate.c
> +++ b/infiniband-diags/src/ibportstate.c
> @@ -228,6 +228,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
>  				ibd_sm_id, srcport) < 0)
> diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
> index 60bfdd8..07eddc4 100644
> --- a/infiniband-diags/src/ibroute.c
> +++ b/infiniband-diags/src/ibroute.c
> @@ -410,6 +410,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (!argc) {
>  		if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0)
> diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
> index 75120f0..916b537 100644
> --- a/infiniband-diags/src/ibsendtrap.c
> +++ b/infiniband-diags/src/ibsendtrap.c
> @@ -143,6 +143,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	rc = send_trap(trap_name);
>  	mad_rpc_close_port(srcport);
> diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
> index d7daa37..7e668e8 100644
> --- a/infiniband-diags/src/ibsysstat.c
> +++ b/infiniband-diags/src/ibsysstat.c
> @@ -339,6 +339,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (server) {
>  		if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0)
> diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
> index 1965aa0..87b5b17 100644
> --- a/infiniband-diags/src/ibtracert.c
> +++ b/infiniband-diags/src/ibtracert.c
> @@ -753,6 +753,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	node_name_map = open_node_name_map(node_name_map_file);
>  
> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
> index 2f104b8..3d89cc7 100644
> --- a/infiniband-diags/src/perfquery.c
> +++ b/infiniband-diags/src/perfquery.c
> @@ -389,6 +389,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (argc) {
>  		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
> index e6cbe50..43eff85 100644
> --- a/infiniband-diags/src/saquery.c
> +++ b/infiniband-diags/src/saquery.c
> @@ -1323,6 +1323,7 @@ static bind_handle_t get_bind_handle(void)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport);
>  	if (!handle.dport.lid)
> diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c
> index ebf6a47..0caa3f3 100644
> --- a/infiniband-diags/src/sminfo.c
> +++ b/infiniband-diags/src/sminfo.c
> @@ -118,6 +118,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	if (argc) {
>  		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c
> index 2ed1e65..dc6b685 100644
> --- a/infiniband-diags/src/smpquery.c
> +++ b/infiniband-diags/src/smpquery.c
> @@ -455,6 +455,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);
>  
>  	node_name_map = open_node_name_map(node_name_map_file);
>  
> diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c
> index d001a01..1c1c08f 100644
> --- a/infiniband-diags/src/vendstat.c
> +++ b/infiniband-diags/src/vendstat.c
> @@ -157,6 +157,7 @@ int main(int argc, char **argv)
>  	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
>  	if (!srcport)
>  		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
> +	mad_rpc_set_timeout(ibd_timeout, srcport);

Now you need to duplicate this single call across all the tools. To me
this looks like overkill. Wouldn't it be simpler to just read the global
ibd_timeout variable in rpc.c?

>  
>  	if (argc) {
>  		if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 5cf135e..cbd3049 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -693,8 +693,6 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport,
>  
>  /* New interface */
>  MAD_EXPORT void madrpc_show_errors(int set);
> -MAD_EXPORT int madrpc_set_retries(int retries);
> -MAD_EXPORT int madrpc_set_timeout(int timeout);
>  MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes,
>  			int num_classes);
>  MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport);
> @@ -703,6 +701,8 @@ MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_po
>  MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport,
>  			ib_rmpp_hdr_t * rmpp, void *data);
>  MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport);
> +MAD_EXPORT int mad_rpc_set_retries(int retries, struct ibmad_port *srcport);
> +MAD_EXPORT int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport);
>  
>  /* register.c */
>  MAD_EXPORT int mad_register_port_client(int port_id, int mgmt,
> @@ -761,6 +761,8 @@ static inline int mad_is_vendor_range2(int mgmt)
>  }
>  
>  /* rpc.c */
> +MAD_EXPORT int madrpc_set_retries(int retries) __attribute__ ((deprecated));
> +MAD_EXPORT int madrpc_set_timeout(int timeout) __attribute__ ((deprecated));
>  MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated));
>  void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata)
>  		__attribute__ ((deprecated));
> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
> index 0412027..f231485 100644
> --- a/libibmad/src/libibmad.map
> +++ b/libibmad/src/libibmad.map
> @@ -80,6 +80,8 @@ IBMAD_1.3 {
>  		madrpc_save_mad;
>  		madrpc_set_retries;
>  		madrpc_set_timeout;
> +		mad_rpc_set_retries;
> +		mad_rpc_set_timeout;
>  		madrpc_show_errors;
>  		ib_path_query;
>  		sa_call;
> diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h
> index 9afe7a9..3991cc3 100644
> --- a/libibmad/src/mad_internal.h
> +++ b/libibmad/src/mad_internal.h
> @@ -39,6 +39,8 @@
>  struct ibmad_port {
>  	int port_id;		/* file descriptor returned by umad_open() */
>  	int class_agents[MAX_CLASS];	/* class2agent mapper */
> +	int retries;
> +	int timeout_ms;
>  };
>  
>  #endif /* _MAD_INTERNAL_H_ */
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index 210f0c2..229020d 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -49,7 +49,7 @@ int ibdebug;
>  
>  static int mad_portid = -1;
>  static int iberrs;
> -
> +	int timeout;

Typo?

>  static int madrpc_retries = MAD_DEF_RETRIES;
>  static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS;
>  static void *save_mad;
> @@ -85,9 +85,17 @@ int madrpc_set_timeout(int timeout)
>  	return 0;
>  }
>  
> -int madrpc_def_timeout(void)
> +int mad_rpc_set_retries(int retries, struct ibmad_port *srcport)
> +{
> +	if (retries > 0)
> +		srcport->retries = retries;
> +	return srcport->retries;
> +}
> +
> +int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport)
>  {
> -	return def_madrpc_timeout;
> +	srcport->timeout_ms = timeout_ms;
> +	return 0;
>  }
>  
>  int madrpc_portid(void)
> @@ -102,14 +110,14 @@ int mad_rpc_portid(struct ibmad_port *srcport)
>  
>  static int
>  _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
> -	   int timeout)
> +	   int timeout, const struct ibmad_port *srcport)
>  {
>  	uint32_t trid;		/* only low 32 bits */
> -	int retries;
> +	int retries, max_retries;
>  	int length, status;
>  
>  	if (!timeout)
> -		timeout = def_madrpc_timeout;
> +		timeout = srcport ? srcport->timeout_ms : def_madrpc_timeout;

Now you have three timeouts: one in the rpc struct, another per port, and
the default one. Isn't that too much?
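
To make the precedence concrete, here is a minimal standalone sketch of
how the three values would be resolved with this patch applied. It is not
libibmad code: struct port_sketch and resolve_timeout() are hypothetical
stand-ins for struct ibmad_port and the check at the top of _do_madrpc(),
and the default is passed in as a parameter rather than read from
def_madrpc_timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the timeout_ms field added to struct ibmad_port. */
struct port_sketch {
	int timeout_ms;
};

/*
 * Mirrors the order used in the patched _do_madrpc():
 *   1. a non-zero per-call timeout (rpc->timeout) wins;
 *   2. otherwise the per-port value is used when a port is given;
 *   3. otherwise the library-wide default applies.
 */
static int resolve_timeout(int rpc_timeout, const struct port_sketch *port,
			   int def_timeout)
{
	if (rpc_timeout)
		return rpc_timeout;
	return port ? port->timeout_ms : def_timeout;
}
```

Note that a per-port value of 0 is returned as-is here, just as in the
patch, so the default is reached only when no port is passed at all.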

>  
>  	if (ibdebug > 1) {
>  		IBWARN(">>> sending: len %d pktsz %zu", len, umad_size() + len);
> @@ -125,7 +133,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
>  	trid =
>  	    (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F);
>  
> -	for (retries = 0; retries < madrpc_retries; retries++) {
> +	max_retries = srcport ? srcport->retries : madrpc_retries;
> +	for (retries = 0; retries < max_retries; retries++) {

Same with retries - it is hard for me to believe that any multithreaded
application will try to set up different retry values per port, for
different threads, "on the fly"... (rpc.c, with all its limited
functionality, would not be sufficient for that level of flexibility
anyway :)).

Sasha

>  		if (retries) {
>  			ERRS("retry %d (timeout %d ms)", retries, timeout);
>  		}
> @@ -178,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport
>  
>  	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
>  			      port->class_agents[rpc->mgtclass],
> -			      len, rpc->timeout)) < 0) {
> +			      len, rpc->timeout, port)) < 0) {
>  		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>  		return 0;
>  	}
> @@ -217,7 +226,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t *
>  
>  	if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf,
>  			      port->class_agents[rpc->mgtclass],
> -			      len, rpc->timeout)) < 0) {
> +			      len, rpc->timeout, port)) < 0) {
>  		IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport));
>  		return 0;
>  	}
> @@ -356,6 +365,8 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port,
>  	}
>  
>  	p->port_id = port_id;
> +	p->retries = MAD_DEF_RETRIES;
> +	p->timeout_ms = MAD_DEF_TIMEOUT_MS;
>  	return p;
>  }
>  
> -- 
> 1.5.4.5
> 
> 


From ogerlitz at Voltaire.com  Sat Feb 28 23:50:42 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Sun, 01 Mar 2009 09:50:42 +0200
Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL
	/	PortRcvDataSL support
In-Reply-To: <3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com>
References: <Pine.LNX.4.64.0902261436380.29061@zuben.voltaire.com>
	<3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com>
Message-ID: <49AA3E52.30804@Voltaire.com>

Sean Hefty wrote:
> Rather than continue to add more and more interfaces to the library, can we just
> export a couple of more generic calls?

Hi Sasha,

So how would you like to get this done? Should I just expose pma_query,
pma_query_via, performance_reset, etc. through mad.h?

Or.